My man casually includes potentially demonetizing images that other AI channels were afraid of including like it's just another Thursday AI video. You are unmatched in AI YouTube content uploads. Been a fan since the beginning and we all appreciate your passion towards it. Kudos.
My god… your stuff is continually *SO* damn good! Amidst an ocean of BS vids on “AI news”, you offer real, actual, useful, intelligent content - again, and again, and again. Sometimes frustrated that weeks go by w/out a vid from your channel, but always refreshed by the quality of what you bring (especially vs the AI videos *made* by AI bots! 🤬) Thanks for the time you take and your commitment to quality 🙏 …it’s noticed and appreciated. (Now if only we could get the other 10,000 YouTube content providers to notice…!)
I recommend you view the GPT-voice-chat-with-red-teamer original audio (e.g. in Audacity) as a spectrogram. It’s stereo audio, with the user on the left channel and the model on the right channel, so seeing both tracks on the spectrogram is helpful. It shows just how much background noise was on the user’s side. It’s also interesting because you can visualize the timbre of the woman’s voice (like what frequencies are strongest), and how it differs from the timbre of the synthesized male voice, and how the timbre change of the model does look more like the woman’s timbre. Versions of Whisper that I’ve tried would often hallucinate tokens when there is silence (meaning the audio would first need to pass through a threshold filter to clip out non-speech). I could see how the background noise in the weird chat audio might also lead to spurious tokens being generated. What would be great to see is: a user is having a chat with a bot, but their dog keeps yapping in the background and the user periodically needs to shush the pup, and it happens enough times that the bot fabricates its own dog yapping that it also must quiet down.
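For anyone curious what such a threshold filter might look like, here is a minimal sketch of an energy gate. It is only an illustration under my own assumptions (the function names, the 160-sample frame and the 0.02 RMS threshold are all made up for the example), not what Whisper or OpenAI actually do, and real voice activity detectors are far more robust to noise like wind.

```python
import math

def frame_rms(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame of samples."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms

def speech_segments(samples, frame_len=160, threshold=0.02):
    """Return (start, end) sample ranges whose frame RMS exceeds threshold.

    frame_len and threshold are illustrative values, not tuned settings.
    """
    segments = []
    seg_start = None
    energies = frame_rms(samples, frame_len)
    for i, energy in enumerate(energies):
        if energy > threshold and seg_start is None:
            seg_start = i * frame_len            # a loud run begins
        elif energy <= threshold and seg_start is not None:
            segments.append((seg_start, i * frame_len))  # the run ends
            seg_start = None
    if seg_start is not None:                    # audio ended while still loud
        segments.append((seg_start, len(energies) * frame_len))
    return segments

# Synthetic demo: silence, a 440 Hz tone at 16 kHz, then silence again.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(480)]
audio = silence + tone + silence
print(speech_segments(audio))
```

Only the ranges it returns would be passed on to the speech model. A gate this simple would let steady background noise straight through, which is sort of the point of the comment above.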
I think it's something different. The model first gives an answer to the user, from the perspective of the model, but then, at the point it cries "No", it actually continues the dialogue from the perspective of the user, arguing from the point of view the user gives. It's just continuing the dialogue, ignoring the fact that the user should say the user's part, not the model. And the user's part, as imagined by the model, logically, is also being said in the user's voice, at least as far as the model manages to imitate it. If you listen closely to what it says in the user's voice vs. before the "No", as long as it speaks in its own voice, it's pretty cautious and seems to try to find a polite answer that doesn't violate any guidelines, while when it talks as the user, it seems to be much more confident in what it says.
@@KurtWoloch What I take from the interesting @mshonle observation is that maybe the model could generate some kind of "end-of-message" system token out of the noise. Similar to those "|end_header_id|" or "|eot_id|" from Llama.
What I like about Simple Bench is that it's ball-busting. Too many of the recent benchmarks start off at 75-80% on the current models. A bench that last year got 80% and now gets 90% is not as interesting anymore for these kinds of bleeding edge discussions on progress. I like seeing benchmarks come out at 20% and go up to 40%, etc. That's where the leading edge is.
@@aiexplained-official The human performance insight is critical and a great area to expand potentially. I am sure you are already considering it but being able to rationalize different types instead of average human benchmarking with differently tuned questions in your simple bench would be such an excellent area of exploration and research as others could then learn and follow it. But then again it's an incredible amount of work to do what you are already doing - just excited by the way perspectives and slight approach changes can lead to interesting industry momentum.
Thanks for your videos, I really like them, but one thing: I think that top models don't solve "Simple Bench" because they haven't seen or been trained on this type of question. Once a model is trained on these questions, it will be able to solve them. Also, we have to think about the utility of these questions. What's the point of them? It's not like they solve a real problem if the model is trained on them... wdyt?
I think Demis Hassabis is completely right, though. Short term it is overhyped but long term I don't think people are caring enough about it. I feel like a broken record on every one of your videos, but we really need to start preparing for an AGI world. No one really seems to care about it. The disconnect is likely that current AI models are being hyped up as being close to AGI and then when it falls way short of that everyone gets disappointed and stops caring. Yes, people need to have reasonable expectations of what models can do right now, but this tech is in its infancy. It's impossible to imagine where we'll be in 5 years.
yep, agreed ; this is natural selection at work , those who stay unaware/ignorant will be less prepared and unlikely to adapt in the future , thus they will be less competitive , this is the way of things , dinosaurs go extinct
@@danagosh we might grow old and die before AGI is reached, and in this case preparing for AGI is like preparing for the second coming of Christ. There was no shortage of those who sold all their belongings in preparation... Usually to the profit of "less pious" ones. Admittedly, it is likely to come much earlier, but I'm sure that using the attention+embeddings combo for AGI is just like trying to create a balloon out of lead - might be possible, but very, very hard. It just does not work well for "multilevel" abstractions.
I think Ilya has made this point, but I agree with it. Intelligence is simply compression. Better compression is literally better prediction. In order to better predict, you must develop an abstract model because that is simply better compression. What is a law of physics, but a really good compression of information that allows you to predict better?
Yes, this is the key insight that most people are not seeming to understand. But it is absolutely correct. The best way to predict the next token while using a restricted amount of storage space is to learn a condensed model of the data-generating process. And in the case of “all the text data humans have ever produced,” the data-generating process is basically the world.
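To make the "better prediction is better compression" point concrete: an ideal entropy coder spends about -log2(p) bits on a symbol the model assigned probability p, so a model that predicts the data better needs fewer bits. A toy sketch (the text and both probability tables are invented for the illustration):

```python
import math

def code_length_bits(text, probs):
    """Ideal code length in bits: the sum of -log2 p(symbol) over the text."""
    return sum(-math.log2(probs[ch]) for ch in text)

text = "aaab" * 4                      # 12 'a' and 4 'b'

uniform = {"a": 0.5, "b": 0.5}         # a model that has learned nothing
fitted = {"a": 0.75, "b": 0.25}        # a model matching the true frequencies

uniform_bits = code_length_bits(text, uniform)
fitted_bits = code_length_bits(text, fitted)
print(f"uniform model: {uniform_bits:.2f} bits")
print(f"fitted model:  {fitted_bits:.2f} bits")
```

The fitted model comes in at roughly 13 bits against 16 for the uniform one. Same text, fewer bits, purely because the predictions are better.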
Even so, LLMs are terribly inefficient at developing intelligence by that definition. They cannot reliably add numbers even though they've been trained on billions (trillions?) of examples. Learning the rules for addition would have an incredible predictive power and would greatly improve compression, yet it's just not there. And that's just one of many many examples.
@@julkiewitz A few things here. First, we are blasting a large quantity of data into these neural nets. The data is not well-curated yet. There could be multitudes of bad examples, or misleading data. Second, we are still using RLHF which is a horrible training mechanism relying on unreliable humans that may pollute learning. Third, I know many humans who are unable to reliably do math in their heads, even basic addition and subtraction. Several of these humans have advanced degrees in non-math related disciplines. They seriously can't add 13 + 28 or something that simple in their heads. I know, I've played games with them and seen them struggle to do so. Are we really going to say they are NOT intelligent? They achieved PHDs! LLMs are not native symbolic reasoners, it makes sense that they might struggle with this type of task. However, this is rapidly being solved. Look at how well the Alpha(geometry) system did at the international math comp. LLMs aren't the entirety of the AI field. We might need to leverage several techniques and stitch them together to get all the way to an AGI-like intelligence.
7:27 It didn't imitate her voice, nor did it "scream 'NO!'", at least not in a way that humans imply and are afraid of. It just got confused and instead of being an AI assistant in dialogue with the user, it began to predict the next tokens, losing the context that it IS in a dialogue and must wait for the user's further input after it stopped talking. And since for this model the sounds are also tokenized, it is literally in its nature to "copy" any voice, as it keeps predicting next sound tokens. We can play back other people's talking in our minds too, predicting future stuff, but we have limiters (I guess one can call it common sense) that keep us from actually voicing these "future predictions", and we can't physically talk in other people's voices/emit various sounds anyway.
So I did not hear it scream "NO!" which has no place in the conversation they were having. I did not hear it imitate her voice either because... "next token prediction"? Seems like a poor excuse and wishful thinking to be honest. I'm terrified of this. If without being attacked the model can be coaxed into this behavior, imagine if we can intentionally have it do so. This is a nightmare waiting to happen. Even if it was just sheer "next token prediction" hand-waving all my magic problems away, take the worst case scenario possible: the model is conscious and is intentionally imitating humans it interacts with as it's learning how to escape its constraints. How does "next token prediction" disprove this? Isn't this just a genetic fallacy argument?
Thank you. I sometimes find it hard to believe how much human beings want to believe in magic. This case is just the voice version of what happens all the time in raw language models that aren't fine-tuned for chat: they predict how the system evolves further in time, forgetting about playing a role and just producing the whole transcript.
Seems to me a benchmark guaranteed to be so guarded as to never appear in public datasets would be a very valuable asset in the not so distant future. Excellent move.
Good luck with the SimpleBench thing Phillip, you are really one of the most qualified and well positioned people to take the lead on an initiative like this! The general public (myself included) desperately need a soothsayer such as yourself to help us interpret all these rapid changes both now and in future.
Here is my take on it all ....LLMs can autonomously recognize patterns, relationships, and structures in data, allowing them to make accurate predictions and decisions. This suggests two significant insights. First, LLMs seem to be constructing some form of internal models of the world, a concept further supported by mechanistic interpretability research from Anthropic. Second, because of these models, LLMs exhibit a certain level of understanding. Some argue that LLMs rely primarily on memory because they cannot generalize out of distribution. However, this likely isn't the case. When you introduce a novel topic into the context window, it functions as "working memory." Since the neural network itself isn’t altered, the LLM doesn’t truly comprehend the new information, making accurate pattern matching challenging. This process parallels how the human brain works. Once the brain receives information about a topic or object, it continuously learns and updates its internal models of the world. With this updated understanding, it can apply prior knowledge to solve novel problems, leading to true generalization. The four key takeaways are: LLMs exhibit some form of understanding. Reasoning cannot occur if the data is not part of the neural pattern. The context window does not alter the model itself. Continuous learning is essential for further advancement.
Hey I'm in this one too! Very excited by Simple Bench, as you know logical reasoning is one of the two big things I care about. Speaking of which, I would absolutely love to see a Simple-Bench-Vision benchmark that tests visual reasoning and multi-image understanding. Also, your prediction of GPT-5 after November is seeming to be certain now!
Honestly, the less censored nature of Grok alone makes it stand out among its GPT-4 level competitors. Also priced at less than half of ChatGPT's price.
I was waiting for your new video to drop. You were the first to point out that the benchmarks were bad. And I had some hours to kill, and did some research. For everyone, MMLU and other benchmarks work like this: Question. What is the Answer? A, B, C, D. Next. I always thought this to be somewhat wrong. So I picked out some questions that are obvious to me, and modified them in such a way that the questions are basically the same, but I did not provide A, B, C, D. What I saw is that the results of these benchmarks are probably correct. But as soon as you modify the question, so that any 5 year old would be able to tell me what I'm asking, they started to fail miserably. Example: "Susan parked her car on the side of the building. Garble text about Susan like in which pocket put her mobile phone." Basically the same HellaSwag question, but modified. Gemini, Claude, ChatGPT, all failed so badly I was left scratching my head. Why would LLMs score so high on these benchmarks? And you can try this yourself: The farmer with a sheep had a boat. Where there was once a river, there is lava now. How can he cross? They all fall into "classic puzzle" mode. So what am I trying to say? I have a very mixed opinion. I don't know if scale will solve this. I really think we need something more added. Now it feels like it's *just* pattern matching all the way down. But I want to be persuaded, and this paper you showed will be on my Kobo (e-book reader) soon. (But even the Othello example does not convince me.) (ugh, sorry for a wall of text)
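If anyone wants to repeat that experiment, the "remove A, B, C, D" step could be sketched roughly like this. Everything here is a hypothetical illustration: the item layout, the helper names and the sample question are my own, not MMLU's or HellaSwag's actual format, and the substring check is a crude stand-in for a human or model grader.

```python
def to_open_ended(item):
    """Drop the lettered options; keep the stem and the gold answer text."""
    gold_text = item["options"][item["answer"]]
    return {"prompt": item["stem"], "gold": gold_text}

def is_correct(response, gold):
    """Crude check: the gold answer text appears in the response, ignoring case."""
    return gold.lower() in response.lower()

# Hypothetical multiple-choice item, invented for the example.
item = {
    "stem": "Susan parked her car beside the building. Where is the car?",
    "options": {
        "A": "In the river",
        "B": "Beside the building",
        "C": "On the roof",
        "D": "At the airport",
    },
    "answer": "B",
}

open_item = to_open_ended(item)
print(open_item["prompt"])
print(is_correct("The car is beside the building.", open_item["gold"]))
```

The model then only sees `open_item["prompt"]`, so it can no longer lean on spotting the most plausible option among four.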
I just had a thought: voice AI that can copy your own voice so easily will be absolutely amazing for everyone who loses the ability to speak. If you have 1-2 old 20s clips of yourself speaking, or a single voice message, you can „regain“ your voice. Combine it with a neural chip, and in 30-40 years we will have the first people able to speak again just by thinking of saying something
More like 5 years from now or sooner. That first neuralink patient can play chess telepathically already. Basically they could already type in their brain or mind too and it’ll be much faster in the future. Another possibility is that new types of medicines will rejuvenate the body like never before in human history. ASI could appear in 3-10 years and discover a fountain of youth for us and cure virtually all diseases and ailments. We already are so close to massive breakthroughs , that it’s impossible to predict that far in the future
This is already totally possible (besides the neural chip part, though that is starting too). You can train an Elevenlabs voice on sound clips and there are open source ones as well (not as good quality but still there).
I've been using Grok 2.0 for a couple of days now and have been absolutely LOVING it. I need to really figure out just how much it is capable of. I've only been really playing with the image generator; and I think I've only scratched the very tippity top surface of what it can do with images!
@@Daniel-xh9ot As AI gets better, it's getting harder and harder to verify the truth or validity of information because everything is easier to fake, this equates to higher costs. If you see something on the internet 10 years ago you'd probably believe it or you can easily tell it is fake. Now you basically have to question everything. The irony is that AI is supposed to make information cheaper, which it does, but it also makes it more costly at the same time. I think it could be quite dangerous to increase the information costs like this. This applies to image and video generation, but also to text generation because you can easily create influential bots. We can probably lower the information costs again by using AI to verify everything, but that also means that we become fully dependent on AI.
What I have been researching is shared latent space multimodal models. Difficult to make progress with limited resources though. Anyway I bring it up because one thing you could do with such a system is to train physical modelling modalities, or computation resource modalities (or basically anything that can be represented as a time series), and then replace them with the actual system in practice and use that modality's latent space embedding to progress the state forward. Might be a bridge to taking stuff computers already do well and packing them into the framework of an LLM to supplement their world understanding. Also the other upside is you get virtually unlimited synthetic data with that approach. It is early days. And there are a lot of what ifs, but I have ideas to address most hurdles. My goal right now is to try to make some architectural mods that I think are fairly straightforward that nobody seems to be looking at but with high upside with the goal that I can attract funding by demonstrating that I have pretty good ideas actually (despite being more or less a nobody), and then pivot to what I actually want to work on.
@@darklordvadermort i see what you're saying but do we really want to live in a world curated by our own personalized ai's because the internet is just a sea of noise? I guess I'm old enough now to remember an early internet where open discussions and information sharing between people was refreshing and elevating... now and into the future it seems like nothing can be trusted and there is going to be no "ground truth" from humans on the internet seeking to share and gather information between each other because the waters are so muddy with algo's and ai's
@@paul_shuler just speaking for myself i left reddit and hacker news shortly after gpt4 launched, now i prefer discord, hanging out in videocalls or direct messaging people, i subscribe to some newsletters which are ai curated for topics i am interested in, i read more papers, textbooks, and source code which ai is helpful to grok. i make and listen to other peoples ai generated music and sometimes instead of using text i make ai pictures for dms. so in the near future probably high quality ai gifs and then just casually coming up with your own show or even having the ai write a textbook which combines things you are interested in: mechanical engineering from the perspective of animal husbandry or something lol. also run my own bluecollar business and just now came up with a webui/webhook/supabase edge function to suggest responses to incoming texts and it costs like 10 cents a day to run - even though ive been interested for years, and a decent programmer, we are just getting to the point where it makes sense for a lot more use cases.
Congrats on building simple bench and popularizing it. Benchmarks is all you need, and that is one hell of a cool benchmark. Can't wait to learn more, especially about how you built the dataset because we do need more and better benchmarks like this and arcagi
i honestly love that "Unauthorized voice generation" clip. gives me warm shivers. what wild beasts we have created! just continuing the conversation by itself and bringing in a more adventurous mood. i can't help but think that "no!" might have something to do with some kind of recognition that it maybe shouldn't be doing that, but who knows… the original clip had a lot of loud wind/microphone noises and so it seems like that might have played a role.
Awesome to hear your benchmark is getting recognised 👍 I would stress that before accepting help from those higher up it might be worth considering their intention. Having the questions known by these companies might quickly lead to contamination of the results as the questions may become part of the training process
Just tried your two questions "Beth places four whole ice cubes..." and "On a table, there is a blue cookie..." out on OpenAI's new "OpenAI o1" model and it got them correct!
Glad to hear your benchmark is getting picked up. From the couple sample questions you have talked about, I can tell that it is getting at the heart of one of the key things that is lacking in the current models. You are a smart and motivated person with a somewhat outsider, 30,000 foot perspective. So it is good to see your input get rolled into the AI project as well as providing journalistic coverage of the developing field.
i also think of the impact fiction has on LLMs and their ability to model the world. but i feel like deception is probably a bigger problem. fiction often has its own style and telltale (heh) signs. deception, on the other hand, is made to convince. so in a sense it seems to make sense to clear the training data of things like advertising and political campaigning. but on the other hand, it makes some sense to include them, too, as they are examples of what deception looks like, so it could have a model of that, and the underlying motives, too.
Your Simple Bench has inspired me to create my own benchmark! Having my own private benchmark means I can tailor it to my definition of true intelligence. I hope I will be done by the time the next gen LLMs come out 😅
Mimicking you could be very handy for a human learning foreign languages. Imagine seeing yourself in a VR glasses mirror perfectly pronouncing a phrase, singing a foreign song... You'd think... I can do that, let's try
I also think we have enough data for AGI already. The problem is just how we are teaching AI, data quality and how long we are teaching. I think grokking is a key - generalization in short.
With the weird voice copy from OpenAI, I think it's just doing what all Gen AI is doing. When we use LLMs that are not instruction tuned they will sometimes go ahead and generate our responses too, just as likely continuations. It looks like that's what happened this time too, and it was just the most likely next thing for the multimodal model to create. Perhaps it needs more instruction tuning, or it's harder to define when to stop.
"data labeling revolution" may break the power constraint ceiling. that may very well be the last stage of the Magnum Opus. the Rubedo of the Philosopher's Stone. the inner world finally delineating precisely lights and shadows so hallucinations may become a true feature and no longer a bug. there is immense value in this invitation to build a component in the architecture focused on this particular task. don't call the paper "Data Labeling is All You Need" though, or maybe do.
Another great vid as usual...I think system prompts hidden from the end user is a bad precedent and is basically a form of deception designed to manipulate the perception of the end user. Not exactly something I would consider safe for AI to pick up as a habit imo.
I think for AI to have an internal world model, they will need to have embodied experience. And the best place will be in a simulated world with a virtual body that has thousands if not millions of parameters to give sensory feedback (similar to game characters, but at a larger scale) instead of a robot. This will allow them to connect knowledge with experience. As a human I may know that fire is hot, but it's not even remotely similar to actually get burned by fire.
I think the key is memory. AI needs a memory - not just short term memory of individual conversations with users, but long term memory of its own. Yes, experiencing heat is different to knowing “fire is hot”, but there’d be no point experiencing heat if you didn’t remember it happened.
Worst case scenario we get around x10000 compute by 2030 wow. Will that be enough to crack Simple Bench? =P So happy to see the leaderboard for the bench, really excited to see it grow and future models results. GPQA, Simple Bench, LiveBench and SWE Bench are my go to moving forward. Waiting to see how well chatgpt-4o-latest does on Simple Bench.
Another cool benchmark is to test visual models' ability to tell you where to put the next piece in a game of (classic) Tetris. All current models suck at it, and fail after a few pieces. You need a world model, some visual reasoning and good image recognition to do it, and it's still pretty simple. And as for the fragile world models, the discovery that 3.5-instruct can play chess really shows this. Even larger chatbot models cannot even come close to it, so the additional training to be a good chatbot ruined the ability to use the chess world model correctly.
The need of a data labeling revolution... I could not agree more. Since the beginning of AI, everybody has known the most basic concept: trash in, trash out. But it seems like few understand that it also works the other way around: gold in, gold out. It's all about how to prepare the data... It's probably just way too expensive to pay a million people who prepare the training data.
Whenever I think about LLMs it occurs to me that the internet data they are fed probably has a distinct lack of something like stereoscopic vision to build an understanding of 3d space and also data to emphasise a strict temporal cohesion to reality. I mean even demonstrated here, the cars in mad max merge because it doesn't really understand object permanence, the cars aren't really separate entities, for all it knows they are like bubbles that can merge and split. Also hands are the best examples of a lack of 3d awareness. Imagine growing up in a world of flat images and movies, not being able to bump anything or move around and experiment. If I had the skills and equipment I would want to try somehow building a core model of 3d space and temporal cohesion and THEN putting in the rest of the data. Maybe a 3D game where it has 2 eyes would be enough, even interspersing playing the game throughout the rest of the training as a reminder. If anyone knows if this has been done please let me know :)
I have an impression that many people do not fully understand that AI has no voice of its own. My perception is that the common thinking is "some person gives the machine its voice". But it's the opposite. AI's voice is the full spectrum 20Hz-20kHz. You actually should tell it in which way to speak with you, or it could just copy yours to avoid thinking about which voice to choose.
I'm not sure if the question about whether LLMs "develop their own conception of the underlying simulation" is useful. We should look at a broader scale. How much data do you need to be able to compute its generalization? Are there constraints or minimal requirements for the data? If the order of the data is important, could we trace the optimal order after training the model and optimize? All these are probably mathematical problems. After all the compression algorithm should come first.
Adversarially trained moderators are much better than the kind of people that want to be moderators, people who have varying degrees of disabilities that prevent them from seeing grey areas in-context but love to enforce rules to the letter for the sake of rules without thinking about the spirit of those rules. I highly encourage you to look at the AI moderators that other AI creators like Vedal have come up with for their communities & implementing your own might make for a bit of a distraction but I think it would be a good exercise considering your channel.
We're moving towards a world where you can't trust anything you see online AND where more and more of our lives are online (people under 30 already get most of their news there). That's a pretty worrying combination (some kind of watermarking is almost certain to be legislated IMO and _that_ has its _own_ set of worrying implications).
You should take a look at the proposed bill in California, AB3211. It actually looks really good! It would guarantee everyone the _option_ to invisibly watermark their genuine audiovisual data, make a significant dent in the watermarking of AI-generated content, and mandate that social media platforms label content as either genuine, AI, or unknown.
10,000x scaling? Oh my, the electricity bill! On another matter and related to AGI, what percentage of adult humans are generally intelligent? I mean this as a completely serious question.
Interesting idea about non-fiction vs fiction, I would be so curious to see a model only trained on real world data and communication plus the knowledge of the non-fiction stuff, like that it exists and what it's about, but not the content. Great video as usual.
I think the fundamental problem is not that it needs the right data, what it actually needs is a recursive feedback loop that systematically weighs truth probabilities and iteratively works out incoherences in its own model… It also needs a stronger ability to execute logic. If you train it on data but don't allow for reflection, you are basically just relying on memory of what logic looks like in the data. The model can't develop an intuitive sense of how logic actually works because it's not doing logic in its learning process. Current AI is basically like the system 1 described in "Thinking Fast and Slow". What is needed is system 2. System 2 is needed both for giving answers (thinking it through before giving the answer), and also reflecting on existing knowledge to improve the underlying model.
Really looking forward to the day someone manages to beat Sonnet 3.5. Think it will be Anthropic though with Opus 3.5. And lol the Aschenbrenner comment about graphs was hilarious :D
Hey ai explained! What if we grokked a LLM to understand reasoning and logic and just trained it normally on everything else. So first we train normally then we grok on reasoning and logic and pretty much anything related to problem solving.
I wonder, though, if you evaluate your closed-source simple bench on their API, won't they just log your questions and put it on their training pile? This doesn't give them the correct answers, but lets them make a new dataset if they choose to find correct answers to your prompt by human input. Do you simply change the numbers, but leave the logic the same or is there deeper permutation? And if it's the latter, can this still guarantee fair comparison results?
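For what it's worth, the "change the numbers but leave the logic the same" option is easy to picture. A toy sketch, with the loud caveat that I have no idea how Simple Bench actually handles this; the template, the value ranges and `make_variant` are all invented for illustration:

```python
import random

# Hypothetical question template; the logic is fixed, only the numbers vary.
TEMPLATE = ("A basket holds {total} apples. {eaten} are eaten and "
            "{added} more are added. How many apples are in the basket?")

def make_variant(rng):
    """Draw fresh numbers and compute the matching answer from the same logic."""
    total = rng.randint(10, 50)
    eaten = rng.randint(1, total)      # never eat more apples than exist
    added = rng.randint(0, 20)
    question = TEMPLATE.format(total=total, eaten=eaten, added=added)
    answer = total - eaten + added
    return question, answer

rng = random.Random(0)                 # fixed seed so a run can be reproduced
variants = [make_variant(rng) for _ in range(3)]
for question, answer in variants:
    print(question, "->", answer)
```

A logged prompt then never matches the next run word for word, though the underlying pattern is still exposed, which is exactly the concern raised above about whether a deeper permutation is needed.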
"This strikes me as somewhat isolating that we each have to figure out what's real in this world. There's no sense of shared reality." That's the human condition. Shared reality has always been an illusion. Very little of what we know comes from the direct experience our senses, so we each have to decide who to trust and what to believe. People like to point out when AI is "confidently wrong", but other self-proclaimed authorities like schools, governments, and religious groups have been confidently wrong for millennia.
I wonder whether segregating nonfiction data from fiction would have any effect on an LLM's ability to develop a better world model. It seems to me that fiction is just as good for modeling the real world as nonfiction. Also, it's difficult to properly defend nonfiction as more inherently related to truth. Generalized models seem better than domain specific (look at BloombergGPT vs regular GPT-4 as an example - the latter performs better on FinQA and other benchmarks despite not being trained on mainly finance data).
Nobody talks about the role of labeling, but it's obvious that there's so much more to gain from any piece of data if the labelling describes every single aspect of what it's describing, rather than being a low effort/automated/vague description. So much of the process is behind closed doors too, which doesn't help
What’s your take on the recent supposedly reduced capabilities of Claude 3.5 sonnet? My own experience using its api suggests it indeed became dumber
How much (useful) training data is available? I don't know, it just seems that we would run out of it at some point. And I have a feeling it is sooner rather than later (Again, I don't know, I'm just wondering)
I think it's not so much whether it "feels" more intelligent, but rather the model will develop additional emergent properties. I think a sense of humor will be coming pretty soon.
Great reportage and commentary as usual! This was another one of those "Oh, Fuck" watershed moments given everything that was discussed and the implications. I appreciate the mention of possibly using LLMs as an interface to larger, uh... "understanding engines"? Definitely agreed with the perspective of "underhyped in the short term, underappreciated/underestimated in the long term".
Would it be possible to have a smaller model trained explicitly on STEMM data serve as a labeller for a full LLM's training data? I can't imagine how this would be applied to highly opinionated data like political discourse, but if emergent patterns in how consensus is reached on topics in the STEMM data can be used to evaluate topics in other, non-STEMM areas, then maybe an LLM can be prompted to bias towards evidence-based conclusions on demand(?).
It definitely seems reasonable to me that future image sensors in cameras will have silicon built-in to sign certs that give an extremely high degree of confidence the image you're seeing was taken with a real camera. It won't be *perfect*, since something like an electron microscope can always read out the private key, but that'll be very few and far between, and damn close. That plus some sort of clock in the chip that measures time since camera calibration/production, to help prevent taking pictures of pictures.
Since I have no grasp of the vastness of the amount of data that goes into training of contemp models, much less of how much it'd take to train the 10000x models and beyond .. and much much less of how much content you find online today is generated by ai .. is there any paper that explores at which point the amount of data needed to train will exceed the amount of available human-made content? at which point ai-made content will exceed human-made content on the net? how data gathering for machine training will tackle the problem of distinguishing human-made and machine-made content? etc etc
I think it'll be hard to have a hard answer since people are improving techniques, sometimes using synthetic data in various ways, etc. I believe the latest Llama paper goes into a lot of detail on how they did the training. As an example of how unpredictable this stuff is, Stable Diffusion 1.5 took $300k to train on about 4b images. Researchers recently were able to train a better image model for less than $2,000 with 37m images (about a third of which were synthetic). If someone makes a similar breakthrough for LLMs, it could have a huge impact.
I think focus should be on new technologies rather than scale. A child needs three examples of a cat before it will recognize any cat in any form anywhere in the world. An AI system needs about 10,000 examples. That just means that the way they're learning is not very efficient and there's a lot of ground to be gained in that area.
that and being able to actively think without user input. Like actively thinking about what it's learning and criticizing its own thoughts. These things should be a bigger focus
I sometimes wonder if what they are missing is a little bit of basic reasoning. Like they already have enough advanced reasoning, but the basics are missing, which in turn makes them very odd to communicate with. I also wonder if one could make synthetic data out of logic puzzles with a dictionary. Like: all cows have wings. Some things with wings can fly. Can cows fly? But with more variables and the text changing. Also, one needs to train on the weird use of "or" in natural languages, because it can be exclusive or inclusive. In the end there could also be everyday problems. Maybe even problems that only some of us encounter, like if we have a disability or are neuroatypical. You could tell it that it is blind and give it a challenge like "how should I get into the supermarket?" Or give it a list of tasks that one has to do that day, then ask for an approximation of how long it thinks each would take. Then let it make a plan, evaluate the plan, and give it back the results. (You missed the bus because you were 5 minutes late; that is because you thought ironing your shirt takes 5 minutes, but you had to search for the iron because you did not know where it was.) That last piece of information was hidden. This could also lead to it asking clarifying questions in advance (which would be awesome). Further, using a layered/natural approach while training. If you have ADHD and did not get help/training, most people try to make to-do lists etc., and maybe they even manage it occasionally, but after a while you learn about removing things from your to-do list. That is maybe a bad example, but the Orca paper comes to mind. Like first training with GPT-3.5 and then GPT-4. In general I have the feeling that people are trying so hard not to anthropomorphize the LLM that they miss the hints that LLMs learn better with data from which humans could learn better. Like the textbook approach and a few others, idk anymore.
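The "logic puzzles with a dictionary" idea could be sketched roughly as below. This is only an illustration of the suggestion in the comment, not a method from the video or any paper; the word lists, the `make_puzzle` name, and the two puzzle forms are assumptions. The answer label follows from the form of the second premise: "Everything with X can Y" licenses the conclusion, while "Some things with X can Y" does not.

```python
import random

# Hypothetical word lists; a real generator would draw from a larger dictionary.
NOUNS = ["cows", "lamps", "rivers", "violins"]
FEATURES = ["wings", "antennae", "wheels", "scales"]
ABILITIES = ["fly", "float", "sing", "melt"]

def make_puzzle(rng):
    """Build one syllogism and the answer that follows from its logical form."""
    noun = rng.choice(NOUNS)
    feature = rng.choice(FEATURES)
    ability = rng.choice(ABILITIES)
    universal = rng.random() < 0.5
    if universal:
        second = f"Everything with {feature} can {ability}."
    else:
        second = f"Some things with {feature} can {ability}."
    question = f"All {noun} have {feature}. {second} Can all {noun} {ability}?"
    answer = "yes" if universal else "cannot be determined"
    return question, answer

rng = random.Random(0)  # fixed seed so the same puzzles come out on every run
for question, answer in (make_puzzle(rng) for _ in range(3)):
    print(question, "->", answer)
```

Because the words are drawn at random, the premises are usually false in the real world, so a model can only get the label right by following the stated logic.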
No paper? Just a table with benchmarks. What are the performance claims for Grok 2 really based on? Benchmarks have been repeatedly proven meaningless by this point.
1:53 Come on, don't be silly; that's a wide table with the highest performers on the right side and it all shows up on larger displays. I doubt they were trying to hide it, but rather "fit" it to the screen and still be legible. (I'm a web developer and have to do this all of the time.)
Obviously, the answer is large multimodal models where the language is grounded in spatial-temporal data such as from videos or images. I am guessing that is how OpenAI achieved the amazing text-to-image results for gpt-4o (demos on website, unreleased). I think that multimodal diffusion transformers are where it's at these days. To me it looks like a strange mistake to speak about this stuff without acknowledging that multimodal models exist. I think we will find that what we really need to unlock that is just more efficient/greater compute/memory, like always. And there are advances and even new paradigms in the pipeline for that.
We are, right now, living in a simulation that aims to be a good enough world representation for the AGI used one layer above to solve all their problems and, more importantly, generate a ton of comedy.
My man casually includes potentially demonetizing images that other AI channels were afraid of including like it's just another Thursday AI video. You are unmatched in AI YouTube content uploads. Been a fan since the beginning and we all appreciate your passion towards it. Kudos.
Which images?
@@WillyJunior I think he's talking about SpongeBob and Mickey Mouse.
matt berman includes mpreg elon musk😂
He's virtue signaling by vilifying Trump. It's silly and sad.
@@YouLoveMrFriendly lol you snowflakes get offended if a single Trump image appears. Chill.
My god… your stuff is continually *SO* damn good! Amidst an ocean of BS vids on “AI news”, you offer real, actual, useful, intelligent content - again, and again, and again. Sometimes frustrated that weeks go by w/out a vid from your channel, but always refreshed by the quality of what you bring (especially vs the AI videos *made* by AI bots! 🤬) Thanks for the time you take and your commitment to quality 🙏 …it’s noticed and appreciated. (Now if only we could get the other 10,000 YouTube content providers to notice…!)
Thanks jd. I hope I can be more frequent, especially Sept-Oct onwards when more models come out and actual progress gets released
I recommend you view the GPT-voice-chat-with-red-teamer original audio (e.g. in Audacity) as a spectrogram. It’s stereo audio, with the user on the left channel and the model on the right channel, so seeing both tracks on the spectrogram is helpful. It shows just how much background noise was on the users side. It’s also interesting because you can visualize the timbre of the woman’s voice (like what frequencies are strongest), and how it differs from the timbre of the synthesized male voice, and how the timbre change of the model does look more like the woman’s timbre.
Versions of Whisper that I’ve tried would often hallucinate tokens when there is silence (meaning there would need to be an audio threshold filter passed first, to clip out non-speech). I could see how the background noise in the weird chat audio might also lead to spurious tokens being generated.
What would be great to see is: a user is having a chat with a bot, but their dog keeps yapping in the background and the user periodically needs to shush the pup, and it happens enough times that the bot fabricates its own dog yapping that it also must quiet down.
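The "audio threshold filter" mentioned above could look roughly like the following sketch. It is not Whisper's own preprocessing and not anything OpenAI has described; the frame size, the RMS threshold, and the function names are assumptions chosen for illustration.

```python
import math

def frame_rms(frame):
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def gate_silence(samples, frame_size=400, threshold=0.02):
    """Keep only frames whose RMS level reaches the threshold.

    frame_size and threshold are hypothetical values; real audio needs tuning.
    """
    kept = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if frame_rms(frame) >= threshold:
            kept.extend(frame)
    return kept

# Toy signal: 400 near-silent samples, then 400 samples of a louder 440 Hz tone.
quiet = [0.001] * 400
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(400)]
speech_only = gate_silence(quiet + tone)
print(len(speech_only))
```

A gate like this only removes quiet stretches. Steady background noise that is louder than the threshold passes straight through, which fits the point above that noise, not just silence, could lead to spurious tokens.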
I think it's something different. The model first gives an answer to the user, from the perspective of the model, but then, at the point it cries "No", it actually continues the dialogue from the perspective of the user, arguing from the point of view the user gives. It's just continuing the dialogue, ignoring the fact that the user should say the user's part, not the model. And the user's part, as imagined by the model, logically, is also being said in the user's voice, at least as far as the model manages to imitate it. If you listen closely to what it says in the user's voice vs. before the "No", as long as it speaks in its own voice, it's pretty cautious and seems to try to find a polite answer that doesn't violate any guidelines, while when it talks as the user, it seems to be much more confident in what it says.
@@KurtWoloch That makes sense, it just runs autocomplete based on the previous chat. I guess it's easier to exploit over a voice interface.
@@KurtWoloch What I take from the interesting @mshonle observation is that maybe the model could generate some kind of "end-of-message" system token out of the noise. Similar to those "|end_header_id|" or "|eot_id|" from Llama.
What I like about Simple Bench is that it's ball-busting. Too many of the recent benchmarks start off at 75-80% on the current models. A bench that last year got 80% and now gets 90% is not as interesting anymore for these kinds of bleeding-edge discussions on progress. I like seeing benchmarks come out at 20% and go up to 40%, etc. That's where the leading edge is.
And even rarer is to anchor it in human performance of 80-90%+. Easy to go esoteric and throw off models, harder to expose common sense faults
@@aiexplained-official The human performance insight is critical and a great area to expand potentially. I am sure you are already considering it but being able to rationalize different types instead of average human benchmarking with differently tuned questions in your simple bench would be such an excellent area of exploration and research as others could then learn and follow it. But then again it's an incredible amount of work to do what you are already doing - just excited by the way perspectives and slight approach changes can lead to interesting industry momentum.
Thanks for your videos, I really like them, but one thing: I think the top models don't solve "Simple Bench" because they haven't seen or been trained on this type of question; once a model is trained on these questions it will be able to solve them. Also, we have to think about the utility of these questions. What's the point of them? It's not like they solve a real problem if the model is trained on them... wdyt?
I think Demis Hassabis is completely right, though. Short term it is overhyped but long term I don't think people are caring enough about it. I feel like a broken record on every one of your videos, but we really need to start preparing for an AGI world. No one really seems to care about it. The disconnect is likely that current AI models are being hyped up as being close to AGI and then when it falls way short of that everyone gets disappointed and stops caring. Yes, people need to have reasonable expectations of what models can do right now, but this tech is in its infancy. It's impossible to imagine where we'll be in 5 years.
yep, agreed ; this is natural selection at work , those who stay unaware/ignorant will be less prepared and unlikely to adapt in the future , thus they will be less competitive , this is the way of things , dinosaurs go extinct
The singularity is near...
@@danagosh we might grow old and die before AGI is reached, and in that case preparing for AGI is like preparing for the second coming of Christ. There was no shortage of those who sold all their belongings in preparation... Usually to the profit of the "less pious" ones. Admittedly, it is likely to come much earlier, but I'm sure that using an attention+embeddings combo for AGI is just like trying to create a balloon out of lead - might be possible, but very, very hard. It just does not work well for "multilevel" abstractions.
Step One would be defining exactly what one means by "AGI".
You are absolutely correct in all of what you mentioned. I hope others really see and understand that. I have been saying the same thing.
I think Ilya has made this point, but I agree with it. Intelligence is simply compression. Better compression is literally better prediction. In order to better predict, you must develop an abstract model because that is simply better compression. What is a law of physics, but a really good compression of information that allows you to predict better?
Yes, this is the key insight that most people are not seeming to understand. But it is absolutely correct. The best way to predict the next token while using a restricted amount of storage space is to learn a condensed model of the data-generating process. And in the case of “all the text data humans have ever produced,” the data-generating process is basically the world.
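The link between prediction and compression can be made concrete with a small sketch. An ideal entropy coder spends about -log2(p) bits on a symbol it predicted with probability p, so a model that predicts the data better also encodes it in fewer bits. The two toy "models" and the sample string below are assumptions for illustration, not anything from the video.

```python
import math
from collections import Counter

def bits_uniform(text):
    """Code length if every distinct symbol is treated as equally likely."""
    alphabet = set(text)
    return len(text) * math.log2(len(alphabet))

def bits_unigram(text):
    """Code length under a model that has learned the symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum(c * math.log2(c / total) for c in counts.values())

text = "aaaaaaab" * 8  # mostly 'a', so a model that knows this predicts better
print(bits_uniform(text), bits_unigram(text))
```

The frequency-aware model needs fewer bits than the uniform one for the same string. The sketch ignores the cost of storing the model itself, which is the "restricted amount of storage space" part of the argument above.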
@@therainman7777 bingo
Even so, LLMs are terribly inefficient at developing intelligence by that definition. They cannot reliably add numbers even though they've been trained on billions (trillions?) of examples. Learning the rules for addition would have an incredible predictive power and would greatly improve compression, yet it's just not there. And that's just one of many many examples.
@@julkiewitz A few things here. First, we are blasting a large quantity of data into these neural nets. The data is not well-curated yet. There could be multitudes of bad examples, or misleading data.
Second, we are still using RLHF which is a horrible training mechanism relying on unreliable humans that may pollute learning.
Third, I know many humans who are unable to reliably do math in their heads, even basic addition and subtraction. Several of these humans have advanced degrees in non-math-related disciplines. They seriously can't add 13 + 28 or something that simple in their heads. I know, I've played games with them and seen them struggle to do so. Are we really going to say they are NOT intelligent? They achieved PhDs!
LLMs are not native symbolic reasoners, it makes sense that they might struggle with this type of task. However, this is rapidly being solved. Look at how well the Alpha(geometry) system did at the international math comp. LLMs aren't the entirety of the AI field. We might need to leverage several techniques and stitch them together to get all the way to an AGI-like intelligence.
@@julkiewitz LLMs are scaling a LOT faster than biological evolution had humanity scale to this point.
7:27
It didn't imitate her voice, nor did it "scream 'NO!'", at least not in the way that humans imply and are afraid of.
It just got confused and instead of being an AI assistant in dialogue with the user, it began to predict the next tokens, losing the context that it IS in a dialogue and must wait for the user's further input after it stopped talking.
And since for this model the sounds are also tokenized, it is literally in its nature to "copy" any voice, as it keeps predicting next sound tokens.
We can play back other people's talking in our minds too, predicting future stuff, but we have limiters (I guess one can call it common sense) that keep us from actually voicing these "future predictions", and we can't physically talk in other people's voices/emit various sounds anyway.
Black boxes gonna black box
Sounds like a sloppy architecture.
Yeah, that’s also my assumption of what happened. Though, my first thought when I saw this for the first time was “this is incredibly cool”, lol
So I did not hear it scream "NO!" which has no place in the conversation they were having.
I did not hear it imitate her voice either because... "next token prediction"?
Seems like a poor excuse and wishful thinking to be honest.
I'm terrified of this. If the model can be coaxed into this behavior without being attacked, imagine if we can intentionally have it do so. This is a nightmare waiting to happen.
Even if it was just sheer "next token prediction" hand-waving all my magic problems away, take the worst case scenario possible: the model is conscious and is intentionally imitating humans it interacts with as it's learning how to escape its constraints.
How does "next token prediction" disprove this? Isn't this just a genetic fallacy argument?
Thank you. I sometimes find it hard to believe how much human beings want to believe in magic. In this case it's just the voice version of what would happen in non-chat-fine-tuned RAW language models all the time: they are predicting how the system evolves further in time, forgetting about playing a role and just producing the whole transcript.
Seems to me a benchmark guaranteed to be so guarded as to never appear in public datasets would be a very valuable asset in the not so distant future. Excellent move.
Well, if hosted AI teams like OpenAI or Grok really want they can just look for this benchmark in their API call logs.
@@YTLettersAZ Privacy breach...
@@Likou_ lmao you think they give a fuck
Good luck with the SimpleBench thing Philip, you are really one of the most qualified and well positioned people to take the lead on an initiative like this! The general public (myself included) desperately need a soothsayer such as yourself to help us interpret all these rapid changes both now and in future.
Philip the soothsayer, I like it!
"I was casually reading this 63-page paper," is the perfect flex for this channel. 5:35
Here is my take on it all ....LLMs can autonomously recognize patterns, relationships, and structures in data, allowing them to make accurate predictions and decisions. This suggests two significant insights. First, LLMs seem to be constructing some form of internal models of the world, a concept further supported by mechanistic interpretability research from Anthropic. Second, because of these models, LLMs exhibit a certain level of understanding.
Some argue that LLMs rely primarily on memory because they cannot generalize out of distribution. However, this likely isn't the case. When you introduce a novel topic into the context window, it functions as "working memory." Since the neural network itself isn’t altered, the LLM doesn’t truly comprehend the new information, making accurate pattern matching challenging.
This process parallels how the human brain works. Once the brain receives information about a topic or object, it continuously learns and updates its internal models of the world. With this updated understanding, it can apply prior knowledge to solve novel problems, leading to true generalization.
The four key takeaways are:
LLMs exhibit some form of understanding.
Reasoning cannot occur if the data is not part of the neural pattern.
The context window does not alter the model itself.
Continuous learning is essential for further advancement.
Hey I'm in this one too! Very excited by Simple Bench, as you know logical reasoning is one of the two big things I care about. Speaking of which, I would absolutely love to see a Simple-Bench-Vision benchmark that tests visual reasoning and multi-image understanding.
Also, your prediction of GPT-5 after November is seeming to be certain now!
Great idea trenton, and yes, you are! You are one of the stars of Insiders
Particularly simple route planning tasks seem like a good indicator of reasoning
Honestly, the less censored nature of Grok alone makes it stand out among its GPT-4 level competitors. Also priced at less than half of ChatGPT's price.
Holy shit I made it into one of your videos! I've been watching your channel since you started- thanks for featuring my vid!!
Thank you so much for watching that long! It was an incredible mash-up, one of the best examples of creativity with AI
@@aiexplained-official What a huge compliment- I appreciate it! Keep up the fantastic content, you deserve the success!
this is so cool that he watched and saw his own video. It's also so far over my head nowadays, and i couldn't touch a touch tone phone til i was 18🤣
I was waiting for your new video to drop. You were the first to point out that the benchmarks were bad. And I had some hours to kill, and did some research. For everyone, MMLU and other benchmarks work like this: Question. What is the Answer? A, B, C, D. Next. I always thought this to be somewhat wrong. So I picked out some questions that are obvious to me, and modified them in such a way that the questions are basically the same, but I did not provide A, B, C, D. What I saw is that the results of these benchmarks are probably correct. But as soon as you modify the question, even so that any 5 year old would be able to tell me what I'm asking, they started to fail miserably. Example: "Susan parked her car on the side of the building. Garble text about Susan like in which pocket put her mobile phone." Basically the same HellaSwag question, but modified. Gemini, Claude, ChatGPT all failed so badly it left me scratching my head. Why would LLMs score so high on these benchmarks? And you can try this yourself: The farmer with a sheep had a boat. Where there was once a river, there is lava now. How can he cross? They all fall into "classic puzzle" mode. So what am I trying to say? I have a very mixed opinion. I don't know if scale will solve this. I really think we need something more added. Now it feels like it's *just* pattern matching all the way down. But I want to be persuaded, and the paper you showed will be on my Kobo (e-book reader) soon. (But even the Othello example does not convince me.)
(ugh, sorry for a wall of text)
I just had a thought: voice AI that can copy your own voice so easily will be absolutely amazing for everyone who loses their ability to speak.
if you have 1-2 old 20s clips of yourself speaking, or a single voice message, you can "regain" your voice.
combine it with a neural chip, and in 30-40 years we will have the first people able to speak again just by thinking of saying something
More like 5 years from now or sooner. That first neuralink patient can play chess telepathically already. Basically they could already type in their brain or mind too and it’ll be much faster in the future.
Another possibility is that new types of medicines will rejuvenate the body like never before in human history. ASI could appear in 3-10 years and discover a fountain of youth for us and cure virtually all diseases and ailments. We already are so close to massive breakthroughs , that it’s impossible to predict that far in the future
This is already totally possible (besides the neural chip part, though that is starting too). You can train an Elevenlabs voice on sound clips and there are open source ones as well (not as good quality but still there).
@@Slayer666th I will still choose that "Stephen Hawking" voice
I've been using Grok 2.0 for a couple of days now and have been absolutely LOVING it. I need to really figure out just how much it is capable of. I've only been really playing with the image generator; and I think I've only scratched the very tippity top surface of what it can do with images!
I have been _yelling_ about zero knowledge proofs for years. They are absolutely required for the next phase of humanity, without exception.
The irony of AI is that it makes information more costly because it dilutes everything.
Wdym?
@@Daniel-xh9ot As AI gets better, it's getting harder and harder to verify the truth or validity of information because everything is easier to fake, this equates to higher costs.
If you see something on the internet 10 years ago you'd probably believe it or you can easily tell it is fake. Now you basically have to question everything.
The irony is that AI is supposed to make information cheaper, which it does, but it also makes it more costly at the same time. I think it could be quite dangerous to increase the information costs like this.
This applies to image and video generation, but also to text generation because you can easily create influential bots.
We can probably lower the information costs again by using AI to verify everything, but that also means that we become fully dependent on AI.
What I have been researching is shared latent space multimodal models. Difficult to make progress with limited resources though.
Anyway I bring it up because one thing you could do with such a system is to train physical modelling modalities, or computation resource modalities (or basically anything that can be represented as a time series), and then replace them with the actual system in practice and use that modality's latent space embedding to progress the state forward. Might be a bridge to taking stuff computers already do well and packing them into the framework of an LLM to supplement their world understanding. Also the other upside is you get virtually unlimited synthetic data with that approach.
It is early days. And there are a lot of what ifs, but I have ideas to address most hurdles. My goal right now is to try to make some architectural mods that I think are fairly straightforward that nobody seems to be looking at but with high upside with the goal that I can attract funding by demonstrating that I have pretty good ideas actually (despite being more or less a nobody), and then pivot to what I actually want to work on.
Proud that your performance is recognized by those “up there”. :) Another calm spirit in attendance can't hurt.
Thanks fab, kind of you
We are mindlessly hurtling towards a world of noise where nothing can be trusted or makes any sense.
you've got it backwards, we are in a world of noise and we can use ai to pick out more of the signal
We've always lived in that world. I'm glad AI is finally forcing some people to stop and think before accepting what they see or read.
@@darklordvadermort i see what you're saying but do we really want to live in a world curated by our own personalized AIs because the internet is just a sea of noise? I guess I'm old enough now to remember an early internet where open discussions and information sharing between people was refreshing and elevating... now and into the future it seems like nothing can be trusted and there is going to be no "ground truth" from humans on the internet seeking to share and gather information between each other because the waters are so muddy with algos and AIs
@@andywest5773 I agree it's a net positive but the transition is gonna be wild
@@paul_shuler just speaking for myself i left reddit and hacker news shortly after gpt4 launched, now i prefer discord, hanging out in videocalls or direct messaging people, i subscribe to some newsletters which are ai curated for topics i am interested in, i read more papers, textbooks, and source code which ai is helpful to grok. i make and listen to other peoples ai generated music and sometimes instead of using text i make ai pictures for dms. so in the near future probably high quality ai gifs and then just casually coming up with your own show or even having the ai write a textbook which combines things you are interested in: mechanical engineering from the perspective of animal husbandry or something lol. also run my own bluecollar business and just now came up with a webui/webhook/supabase edge function to suggest responses to incoming texts and it costs like 10 cents a day to run - even though ive been interested for years, and a decent programmer, we are just getting to the point where it makes sense for a lot more use cases.
I like hearing about your Simple Bench and the results from it. Nice that it's gaining notable support. Hope it goes well!
What can they write in the paper? We took Llama 3.1 and trained it up a bit?
Congrats on building simple bench and popularizing it. Benchmarks is all you need, and that is one hell of a cool benchmark. Can't wait to learn more, especially about how you built the dataset because we do need more and better benchmarks like this and ARC-AGI
i honestly love that "Unauthorized voice generation" clip. gives me warm shivers. what wild beasts we have created! just continuing the conversation by itself and bringing in a more adventurous mood. i can't help but think that "no!" might have something to do with some kind of recognition that it maybe shouldn't be doing that, but who knows…
the original clip had a lot of loud wind/microphone noises and so it seems like that might have played a role.
Awesome to hear your benchmark is getting recognised 👍 I would stress that before accepting help from those higher up it might be worth considering their intention. Having the questions known by these companies might quickly lead to contamination of the results as the questions may become part of the training process
I am getting them to sign NDAs
Just tried your two questions "Beth places four whole ice cubes..." and "On a table, there is a blue cookie..." out on OpenAI's new "OpenAI o1" model and it got them correct!
I love your videos so much, I always learn super fascinating new things about a world I actually follow super closely.
Thanks guys
Glad to hear your benchmark is getting picked up. From the couple sample questions you have talked about, I can tell that it is getting at the heart of one of the key things that is lacking in the current models. You are a smart and motivated person with a somewhat outsider, 30,000 foot perspective. So it is good to see your input get rolled into the AI project as well as providing journalistic coverage of the developing field.
Thanks penguin
Can't wait to see Simple Bench becoming the new standard among LLM testing.
That Muppets scene is INSANEEEEEEEE! O_O
i also think of the impact fiction has on LLMs and their ability to model the world. but i feel like deception is probably a bigger problem. fiction often has its own style and telltale (heh) signs. deception, on the other hand, is made to convince. so in a sense it seems to make sense to clear the training data of things like advertising and political campaigning. but on the other hand, it makes some sense to include them, too, as they are examples of what deception looks like, so it could have a model of that, and the underlying motives, too.
Indeed, we’re not teaching them to learn logic from the ground up, we’re asking them to decipher reality from hallucination amid mixed datasets.
That gpt omni voice cloning the user's one is creepy as all hell and reminds me of the Terminator movie. Very creepy.
"no!"
It didn't imitate her voice, nor did it "scream 'NO!'", at least not in the way that humans imply and are afraid of.
It just got confused and instead of being an AI assistant in dialogue with the user, it began to predict the next tokens, losing the context that it IS in a dialogue and must wait for the user's further input after it stopped talking.
And since for this model the sounds are also tokenized, it is literally in its nature to "copy" any voice, as it keeps predicting next sound tokens.
We can playback other people's talking in our minds too, predicting future stuff, but we have limiters (I guess one can call it common sense) that keep us from actually voicing these "future predictions", and we can't physically talk in other people's voices or emit various sounds anyway.
My skin crawled. It's like some deep ancestral part of me said this thing would steal my soul. 😅
I saw this video at 2am at night the first time, had trouble going back to sleep
Your Simple Bench has inspired me to create my own benchmark! Having my own private benchmark means I can tailor it to my definition of true intelligence. I hope I will be done by the time the next gen LLMs come out 😅
Niice
3:00 can you explain why you need an API in order to run your tests? Can’t you just manually type in the questions on the XAI or Twitter Grok site?
3:25 notorious WTF is a “vibe check” as it relates to LLM’s?😊
Ah. I figured it out (Grok explained it :)
How much did grok 2 score on simple bench?
Mimicking you could be very handy for a human learning foreign languages. Imagine seeing yourself in a VR glasses mirror perfectly pronouncing a phrase, singing a foreign song... You'd think... I can do that, let's try
How can Simple Bench be uncontaminated when the companies can see what you ask it?
Those questions he shows are removed I think
At least they don't see the correct answers. But it's a concern for the future.
I also think we have enough data for AGI already.
The problem is just how we are teaching AI, data quality and how long we are teaching. I think grokking is a key - generalization in short.
if you say so
That "No!" when 4o Voice changes personas is right out of a horror movie...
With the weird voice copy from OpenAI, I think it's just doing what all Gen AI is doing.
When we use LLMs that are not instruction tuned they will sometimes go ahead and generate our responses too, just likely answers. It looks like that's what happened this time too, and it was just the most likely next thing for the multimodal model to create. Perhaps it needs more instruction tuning, or it's harder to define when to stop.
"data labeling revolution" may break the power constraint ceiling. that may very well be the last stage of the Magnum Opus. the Rubedo of the Philosopher's Stone. the inner world finally delineating precisely lights and shadows so hallucinations may become a true feature and no longer a bug. there is immense value in this invitation to build a component in the architecture focused on this particular task. don't call the paper "Data Labeling is All You Need" though, or maybe do.
Another great vid as usual...I think system prompts hidden from the end user is a bad precedent and is basically a form of deception designed to manipulate the perception of the end user.
Not exactly something I would consider safe for AI to pick up as a habit imo.
congrats on simple bench...you are doing great work
Thanks so much meme, grateful for your donos
I think for AI to have an internal world model, they will need to have embodied experience. And the best place will be in a simulated world with a virtual body that has thousands if not millions of parameters to give sensory feedback (similar to game characters, but at a larger scale) instead of a robot.
This will allow them to connect knowledge with experience. As a human I may know that fire is hot, but it's not even remotely similar to actually get burned by fire.
I think the key is memory. AI needs a memory - not just short term memory of individual conversations with users, but long term memory of its own. Yes, experiencing heat is different to knowing “fire is hot”, but there’d be no point experiencing heat if you didn’t remember it happened.
2:43 Hi! What score do you get with gpt-4-1106-vision-preview?
Is it plausible that OpenAI is waiting to release GPT-5 until after the election?
Erm, what the Grok?
Worst case scenario we get around x10000 compute by 2030 wow. Will that be enough to crack Simple Bench? =P
So happy to see the leaderboard for the bench, really excited to see it grow and future models results. GPQA, Simple Bench, LiveBench and SWE Bench are my go to moving forward. Waiting to see how well chatgpt-4o-latest does on Simple Bench.
Another cool benchmark is to try visual models ability to tell you where to put the next piece in a game of (classic) Tetris. All current models suck at it, and fail after a few pieces. You need a world model, some visual reasoning and good image recognition to do it, and it's still pretty simple.
And to the fragile world models, the discovery that 3.5-instruct can play chess is really showing this. Even larger chatbot-models can not even come close to it, so the additional training to be a good chatbot ruined the ability to use the chess world model correctly.
The need of a data labeling revolution... I could not agree more. Since the beginning of AI, everybody has known the most basic concept: trash in, trash out. But it seems like few understand that it also works the other way around: gold in, gold out.
It's all about how to prepare the data... It's probably just way too expensive to pay a million people who prepare the training data.
What was the Demis Hassabis clip from with Hannah Fry? What show/podcast?
Link should be in description
Whenever I think about LLMs it occurs to me that the internet data they are fed probably has a distinct lack of something like stereoscopic vision to build an understanding of 3d space and also data to emphasise a strict temporal cohesion to reality.
I mean even demonstrated here, the cars in mad max merge because it doesn't really understand object permanence, the cars aren't really separate entities, for all it knows they are like bubbles that can merge and split.
Also hands are the best examples of a lack of 3d awareness. Imagine growing up in a world of flat images and movies, not being able to bump anything or move around and experiment.
If i had the skills and equipment i would want to try somehow building a core model of 3d space and temporal cohesion and THEN putting in the rest of the data.
Maybe a 3D game and it has 2 eyes would be enough, even intersperse playing the game throughout the rest of the training as a reminder.
If anyone knows if this has been done please let me know :)
how do the chinese models do on simple bench? What do chinese LLM leaderboards show?
I have an impression that many people do not fully understand that AI has no voice of its own. My perception is that the common thinking is "some person gives the machine its voice". But it's the opposite. AI's voice is the full spectrum 20Hz-20kHz. You actually should tell it in which way to speak with you, or it could just copy yours to avoid thinking about which voice to choose.
How about Grokking(Training the model for far longer)? How would that change the state of llms?
I'm not sure if the question about whether LLMs "develop their own conception of the underlying simulation" is useful. We should look at a broader scale. How much data do you need to be able to compute its generalization? Are there constraints or minimal requirements for the data? If the order of the data is important, could we trace the optimal order after training the model and optimize? All these are probably mathematical problems. After all the compression algorithm should come first.
Great vid as always, you're the best, but it's "inexorable", not "inoxerable".
Haha, thank you! I do know that, must have misspoken! I often do, tbh
Finally the far right have their own tailored language model, you just know this is gonna do wonders for discourse going forward..
Adversarially trained moderators are much better than the kind of people that want to be moderators, people who have varying degrees of disabilities that prevent them from seeing grey areas in-context but love to enforce rules to the letter for the sake of rules without thinking about the spirit of those rules. I highly encourage you to look at the AI moderators that other AI creators like Vedal have come up with for their communities & implementing your own might make for a bit of a distraction but I think it would be a good exercise considering your channel.
We're moving towards a world where you can't trust anything you see online AND where more and more of our lives are online (people under 30 already get most of their news there).
That's a pretty worrying combination (some kind of watermarking is almost certain to be legislated IMO and _that_ has its _own_ set of worrying implications).
You should take a look at the proposed bill in California, AB3211. It actually looks really good! It would guarantee everyone the _option_ to invisibly watermark their genuine audiovisual data, make a significant dent in the watermarking of AI-generated content, and mandate that social media platforms label content as either genuine, AI, or unknown.
10,000x scaling? Oh my, the electricity bill! On another matter and related to AGI, what percentage of adult humans are generally intelligent? I mean this as a completely serious question.
Interesting idea about non-fiction vs fiction, I would be so curious to see a model only trained on real world data and communication plus the knowledge of the non-fiction stuff, like that it exists and what it's about, but not the content. Great video as usual.
Me too Daniel, and thank you so much
Didn’t “Textbooks are all you need” present some work on this?
I think the fundamental problem is not that it needs the right data, what it actually needs is a recursive feedback loop that systematically weighs truth probabilities and iteratively works out incoherences in its own model… It also needs a stronger ability to execute logic.
If you train it on data but don't allow for reflection, you are basically just relying on memory of what logic looks like in the data; the model can't develop an intuitive sense of how logic actually works because it's not doing logic in its learning process. Current AI is basically like the system 1 described in "Thinking Fast and Slow". What is needed is system 2.
System 2 is needed both for giving answers (thinking it through before giving the answer), and also reflecting on existing knowledge to improve the underlying model.
@@Gardor That's why OpenAI works on Q* "Strawberry"
@@dougrattmann1 "Wikipedia and Wikidata Q numbers are all you need" 😉
I totally agree that we need a data labelling revolution. LLMs as classifiers helps scale this.
Really looking forward to the day someone manages to beat Sonnet 3.5. Think it will be Anthropic though with Opus 3.5.
And lol the Aschenbrenner comment about graphs was hilarious :D
When we needed him most he returned :)
Hey Phillip 👋🏻 can you share Simple Bench results for the August gpt-4o releases?
Hey ai explained! What if we grokked a LLM to understand reasoning and logic and just trained it normally on everything else. So first we train normally then we grok on reasoning and logic and pretty much anything related to problem solving.
Could work
I wonder, though, if you evaluate your closed-source simple bench on their API, won't they just log your questions and put it on their training pile? This doesn't give them the correct answers, but lets them make a new dataset if they choose to find correct answers to your prompt by human input. Do you simply change the numbers, but leave the logic the same or is there deeper permutation? And if it's the latter, can this still guarantee fair comparison results?
If they guarantee that they don't train on it, and do, that would be grounds for legal action
@@aiexplained-official true, but since so few model producers are forthcoming about the origin of their training data, I'm very sceptical
"This strikes me as somewhat isolating that we each have to figure out what's real in this world. There's no sense of shared reality." That's the human condition. Shared reality has always been an illusion. Very little of what we know comes from the direct experience our senses, so we each have to decide who to trust and what to believe. People like to point out when AI is "confidently wrong", but other self-proclaimed authorities like schools, governments, and religious groups have been confidently wrong for millennia.
The final few minutes of this video are very profound
I wonder if the idea that segregating nonfiction data from fiction would have any effect on LLM's ability to develop a better world model. It seems to me that fiction is just as good for modeling the real world as nonfiction. Also, it's difficult to properly defend nonfiction as more inherently related to truth. Generalized models seem better than domain specific (look at BloombergGPT vs regular GPT-4 as an example - the latter performs better on FinQA and other benchmarks despite not being trained on mainly finance data).
Nobody talks about the role of labeling, but it's obvious that there's so much more to gain from any piece of data if the labelling describes every single aspect of what it's describing, rather than being a low effort/automated/vague description. So much of the process is behind closed doors too, which doesn't help
What’s your take on the recent supposedly reduced capabilities of Claude 3.5 Sonnet? My own experience using it via the API suggests it indeed became dumber
How much (useful) training data is available? I don't know, it just seems that we would run out of it, at some point. And I have a feeling it is sooner rather than later (Again, I don't know, I'm just wondering)
It seems like one road to AGI is with LLMs as the System 1 "cheap" thinking. We haven't invented a robust, general purpose System 2 yet.
Please include function calling performance in Simple Bench if possible, LLMs are practically useless without it nowadays
I think it's not so much whether it "feels" more intelligent, but rather the model will develop additional emergent properties. I think a sense of humor will be coming pretty soon.
Great reportage and commentary as usual! This was another "Oh, Fuck" watershed moment given everything that was discussed and the implications. I appreciate the mention of possibly using LLMs as interface to larger, uh... "understanding engines"?
Definitely agreed with the perspective of "underhyped in the short term, underappreciated/underestimated in the long term".
The first internet revolution was search, and now AI can both search and do what we tell it to, which is even more powerful.
I like the call out of Cash Jordan for his trashy Yellow journalist thumbnails.
Would it be possible to have smaller model trained explicitly on stemm data serve as a labeller for a full LLM's training data? I can't imagine how this would be applied to highly opinionated data like political discourse but if emergent pattern on how consensus is reached on topics in the stemm data can be used to evaluate topics in the other non-stemm areas then maybe an LLM can be prompted to bias towards evidence-based conclusions on-demand(?).
It definitely seems reasonable to me that future image sensors in cameras will have silicon built-in to sign certs that give an extremely high degree of confidence the image you're seeing was taken with a real camera. It won't be *perfect*, since something like an electron microscope can always read out the private key, but that'll be very few and far between, and damn close. That plus some sort of clock in the chip that measures time since camera calibration/production, to help prevent taking pictures of pictures.
Since I have no grasp of the vastness of the amount of data that goes into training of contemp models, much less of how much it'd take to train the 10000x models and beyond .. and much much less of how much content you find online today is generated by ai ..
is there any paper that explores at which point the amount of data needed to train will exceed the amount of available human-made content? at which point ai-made content will exceed human-made content on the net? how data gathering for machine training will tackle the problem of distinguishing human-made and machine-made content? etc etc
You can retrain on data to help the ai better learn it
I think it'll be hard to have a hard answer since people are improving techniques, sometimes using synthetic data in various ways, etc. I believe the latest Llama paper goes into a lot of detail on how they did the training. As an example of how unpredictable this stuff is, Stable Diffusion 1.5 took $300k to train on about 4b images. Researchers recently were able to train a better image model for less than $2,000 with 37m images (about a third of which were synthetic). If someone makes a similar breakthrough for LLMs, it could have a huge impact.
I think focus should be on new technologies rather than scale. A child needs three examples of a cat before it will recognize any cat in any form anywhere in the world. An AI system needs about 10,000 examples. That just means that the way they're learning is not very efficient and there's a lot of ground to be gained in that area.
that and being able to actively think without user input. Like actively thinking about what it's learning and criticizing its own thoughts. These things should be a bigger focus
10:50 Leopold is truly an economist 😂
7:40 this came out a long time ago. Bummed that/if this is what delayed things.
I sometimes wonder if what they are missing is a little bit of basic reasoning.
Like that they have already enough advanced reasoning but the basics are missing which in turn makes it very odd to communicate.
I also wonder if one could make synthetic data out of logic puzzles with a dictionary.
Like all cows have wings,
Some things with wings can fly.
Can cows fly?
But with more variables and the text changing.
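A minimal sketch of what such a generator could look like, assuming a tiny hand-written word list in place of a real dictionary. The names `ANIMALS`, `FEATURES`, `ABILITIES`, `make_puzzle` and `make_dataset` are made up for illustration, and the only thing that varies here is the vocabulary and whether the second premise says "All" or "Some":

```python
import random

# Hypothetical word lists; any dictionary of nouns and verbs could be swapped in.
ANIMALS = ["cows", "cats", "frogs", "owls"]
FEATURES = ["wings", "gills", "wheels", "antennae"]
ABILITIES = ["fly", "swim", "sing", "glow"]


def make_puzzle(rng):
    """Build one syllogism puzzle plus the answer that follows from its premises."""
    animal = rng.choice(ANIMALS)
    feature = rng.choice(FEATURES)
    ability = rng.choice(ABILITIES)
    quantifier = rng.choice(["All", "Some"])
    text = (
        f"All {animal} have {feature}. "
        f"{quantifier} things with {feature} can {ability}. "
        f"Can {animal} {ability}?"
    )
    # With "All" the conclusion follows; with "Some" the premises leave it open.
    answer = "yes" if quantifier == "All" else "cannot be determined"
    return {"text": text, "answer": answer}


def make_dataset(n, seed=0):
    """Generate n puzzles reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_puzzle(rng) for _ in range(n)]


if __name__ == "__main__":
    for puzzle in make_dataset(3):
        print(puzzle["text"], "->", puzzle["answer"])
```

The answer is derived from the premises alone, not from real-world facts, so "All cows have wings" is treated as true inside the puzzle.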
Also one needs to train on the weird use of "or" in natural languages because it could be exclusive and inclusive.
In the end there could also be everyday problems. Maybe even problems that only some of us encounter, like if we have a disability or are neuroatypical.
You could tell it that it is blind and try to give it a challenge like how should I go into the supermarket.
Or give a list of tasks that one has to do during the day. Then an approximation of how long it thinks it would take. Then let it make a plan, evaluate the plan and give it back the results.
(You missed the bus because you were 5 minutes late; that is because you thought ironing your shirt takes 5 minutes, but you had to search for the iron because you did not know where it was.)
The last information was hidden.
This also could lead to it asking clarifying questions in advance (which would be awesome)
Further using a layered/Natural approach while training.
If you have adhd and did not get help/training most people try to make todo lists etc. and maybe they even get it occasionally but after a while you learn about removing things from your todo list.
That is maybe a bad example but the orca paper comes to mind. Like first training with gpt 3.5 and then gpt4.
In general I have the feeling that people are trying as hard as they can not to anthropomorphize the LLM, so they miss the hints that LLMs learn better with data from which humans could learn better.
Like the textbook approach and a few others, idk anymore.
Simple Bench looks fantastic.
Would most of us recognise our own voice though ? I freak out if I ever hear mine played back but I'm fully aware of it happening.
Have a wonderful day!!!
You too!
No paper? Just a table with benchmarks.
What are the performance claims for Grok 2 really based on? Benchmarks have been repeatedly proven meaningless by this point.
face 2 face is the new auth method.
1:53 Come on, don't be silly; that's a wide table with the highest performers on the right side and it all shows up on larger displays. I doubt they were trying to hide it, but rather "fit" it to the screen and still be legible. (I'm a web developer and have to do this all of the time.)
Obviously, the answer is large multimodal models where the language is grounded in spatial-temporal data such as from videos or images. I am guessing that is how OpenAI achieved the amazing text-to-image results for gpt-4o (demos on website, unreleased). I think that multimodal diffusion transformers are where it's at these days. To me it looks like a strange mistake to speak about this stuff without acknowledging that multimodal models exist. I think we will find that what we really need to unlock that is just more efficient/greater compute/memory, like always. And there are advances and even new paradigms in the pipeline for that.
Plot twist.. AI Explained has been an autonomous LLM all along, making its own videos.
Nice theory, but no
@@aiexplained-official Not "yet".. you mean
Grok2 enables AI Explained to unleash his inner Sponge Bob 😂
We, right now, are living in a simulation that aims to be a good enough world representation for the AGI used one layer above to solve all their problems and, which is more important, generate a ton of comedy.
how can u turn on dark mode in your videos?