You prove that click bait isn’t necessary, a channel can succeed on merit
:)) thank you for clicking on my non-appealing titles lol
@@aiexplained-official my pleasure, content like yours is the sort of thing that redeems the internet. High quality information provided free to anyone who wants to learn. There’s a lot of shit online but it’s also a golden age for motivated autodidacts. I was too poor to go to college but I’ve been able to cobble together a solid education thanks to people like you. Thank you!
depends
this channel is news, it shows what's on the title, that's basically what makes us click.
Completely agree. I've unsubscribed from several channels recently who got into a habit of ridiculous clickbait headlines.
All you have to do is be the best, lol
Best Artificial Intelligence reporter on the planet. Period.
Aw thanks man
Until llama 4 lol
Correct, he's the best AI reporter 🤖
Edited because ppl don't get it lol
Agreed. Thanks for the amazing work, Philip!
For real. Keep doing your thing. Love how you do your due diligence in reading through everything. Don’t ever feel rushed, just come fully prepared. People will wait for your more trusted analysis.
It's insane to me that you, an independent single-person researcher, were able to produce a benchmark that humans get 96% correct, yet even the most _advanced_ LLMs fail catastrophically at. Some even get 0%!
Of course, it's not in their interest to make their multi-billion-dollar investments look bad, but it really goes to show how much they can be overhyped and misunderstood.
Thanks for being one of the few people in the world who's actually helping us understand these models more, not less. Fantastic work!
The question is, do we have a path to 96% with the current LLM architecture? Or will crossing the 50% threshold be enough to pass Yann LeCun's test and make LLMs "smarter than a house cat"?
The most frustrating thing to me for a while now has been the benchmarks. It feels like they have lost merit and it makes it hard to know which models are actually good.
This also shows how limited current transformer next-token predictors are: you connect an entire power plant to your compute cluster to train the thing, and yet it fails miserably on benchmarks of human common sense.
@@vincentwek6247 with the current transformer model? I don’t think so
It's gonna be a busy week. Mistral Large 2 just released with 123B params and supposedly almost on par with Llama 3.1 405B
Really? According to whom?
I can imagine your frustration (probably mixed with excitement) when you were already finishing this video and noticed that Mistral had just dropped their new large model lol
- 00:00 🦙 Llama 3.1 model intro and comparison with competitors.
- 03:26 🧠 AI challenges and data filtering.
- 07:59 📊 Benchmark scaling laws and challenges with hardware.
- 11:40 💡 Private Benchmark and model performance comparison.
- 15:02 🛡 Adversarial tests impact and contamination detection.
- 17:48 💬 Safety metrics, refusal rates, and model vulnerabilities.
- 22:49 🧠 Llama 3 performance vs. competitors.
- 23:30 📹 Insights on data training using Instagram reels.
- 24:29 🍽 Mention of additional experiments and toolkits for AI applications.
Hey the best AI news channel is back
always a pleasure to see your vids
thanks
Thank you meme!
The weirdest part about this era is how Zuck is returning to human form.
Side benefit of Meta's training data improvement work, the Zuckborg also gets more human.
He never had it to begin with. He's experiencing it for the first time.
😂@@ikoukas
That's because he's not number 1 now, if and when competitive adversarial life-form Altman is deposed, Zuck will return to his lizard-alien form.
He's just proof that anything is possible if you’re rich
good job creating a new test where the models score low. everybody is boasting about getting over 90% but that's the point where they should set new goals.
The fact that they aren't scoring 100% on tests they have been specifically trained to score high on should speak for itself.
"'Substantial further improvements of these models are on the horizon' - this quote captures the paper's most important point. All major players in the field agree: we are not nearing the plateau of scaling laws. Great video, Phillip! It was a pure joy to watch! 👍
Not sure they are referring to scaling here.
@@IvanSoregashi I'm pretty sure that the next generation of models will use at least one order of magnitude more compute. It's very doubtful that this trend will stop anytime soon.
@@MrSchweppes One order of magnitude more compute is not that much in AI; according to the scaling laws you can only expect about 15% less loss. Between GPT-3 and 4 there were 4 orders of magnitude of difference.
@@leonfa259 Hence “at least”
@@leonfa259 I’m not sure you are right. 4 orders of magnitude is 10,000X more compute. GPT-3 was trained on a supercomputer of 10,000 V100s. GPT-4 was trained 2 years later on a supercomputer of 25,000 A100s. Despite the improvements in both hardware and software, I’m very doubtful that amounts to 10,000X more compute.
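If anyone wants to play with the arithmetic behind that exchange, here's a toy sketch in Python assuming a Chinchilla-style power law; the exponent and constants are illustrative assumptions on my part, not numbers from the paper:
```
# Rough sketch of how loss falls with compute under a power law.
# Assumed form: L(C) = L_inf + A * C**(-alpha). The constants below are
# illustrative guesses, NOT values from the Llama 3.1 paper.

def loss(compute_flops, L_inf=1.7, A=5.7, alpha=0.05):
    """Toy Chinchilla-style loss curve with assumed constants."""
    return L_inf + A * compute_flops ** (-alpha)

c0 = 3.8e25  # roughly the training compute reported for Llama 3.1 405B
for factor in (1, 10, 100, 10_000):
    print(f"{factor:>6}x compute -> toy loss ~ {loss(c0 * factor):.3f}")

# The reducible term A * C^-alpha shrinks by 10**-alpha (~11% here) per
# order of magnitude of compute, which is the flavour of the point above.
```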
I love the way you cover difficult topics. Your videos are always clear and informative.
I'll never tire of these. Your personal benchmarks are a commendable participative effort.
Very excited about your personal test suite! This is exactly what we need to extend the already incredible value of this channel. It really helps bring together everything from news, to research, to a (theoretically :) ) reliable set of tests that I can trust. Thank you sir, for all your work. I shall continue to look for more ways to support your efforts. You're providing incredibly valuable information.
Cheers.
Thanks so much rant
As per usual, the consistency and speed are incredible. Well done!
Damn those private bench results are intriguing. Sonnet 3.5 already getting 1/3 of the way there is encouraging. Surprised GPT-4o does so bad at it. Really excited to see the results of future models in your bench, keep us updated please! Would be cool if you get a leaderboard up and running.
Thanks
And thank you!!
I love how you went beyond critiquing popular benchmarks to developing a more honest one in SIMPLE. Now I hope it becomes adopted in a next gen set of benchmarks.
What I liked about this release is that it is a lot more scientific in its approach than a lot of the major LLM stuff lately has been. I feel like this is finally a pretty good characterization of what the decoder-only transformer architecture is fully capable of.
And I think the open source thing is an important thing to point out. But of the large AI labs, I think it is only fair to give them credit that they are trying to be more open than the other labs. At least they have full source code for inference and open weights. And they have historically had better license terms with every LLM they have released.
Yesss man. I was waiting for the "I read it in full" and for this video to help me establish benchmarks on someone I can trust.
ps. now waiting for the Mistral Large 2 video haha
thanks for clearing up the open source thing. open weights (under actually free licenses) models are still important, and we probably won't get an open source training set that is competitive with the entirety of the internet and whatnot.
but it's also important that we get models whose training is actually fully reproducible, and where people can potentially check everything that goes into it. in particular more open source mechanisms for synthetic data and self-training.
Introduction to Llama 3.1 - 00:00:00
Meta's Innovations and Compute Scale - 00:00:41
Benchmark Comparisons and Open Source Discussion - 00:01:17
Data Sources and Training Methods - 00:03:01
Model Improvements and Scaling Laws - 00:05:31
Training Data and Synthetic Data Generation - 00:08:19
Reasoning and Mathematical Performance - 00:09:56
Private Benchmark and Model Comparisons - 00:11:33
Contamination and Benchmark Integrity - 00:16:06
Long Context Handling and Performance - 00:17:47
Human Comparison and Safety Measures - 00:19:33
Prompt Injection and Open AI Critique - 00:20:36
Model Evaluation and Multimodal Capabilities - 00:22:13
Conclusion and Future Expectations - 00:25:23
I love your work with benchmarks!! I am a firm believer that getting these models to plan will be a huge leap forward toward passing benchmarks like the ones you have built.
This is a very good breakdown of the report. Thank you for going through all of it and uploading this video to explain it for us. It was a very intriguing listen.
Could you please test the new Mistral Large 2 model with your SIMPLE Bench? I checked a dozen tricky questions and this model answered almost all of them correctly. I am very curious to see what score it will get.
Great idea
Mistral Large 2 got the infamous 9.11 > 9.9 question right. AGI confirmed! :D
@@Neomadra all models get this right with a step by step meta prompt
Thank you once again for what you do. This is really high quality stuff.
Thanks natalie
Great overview. Really appreciate the walkthrough. You’ve inspired me to create my own private held out test set.
This guy reads complex scientific papers about AI like they're webnovels.
I love this channel
Youve been such a blessing with your information.
Super grateful to have someone who can break down the research papers so clearly! Seems that research papers are where the true gold is.
Absolutely brilliant summary. Subscribed!
I was looking forward to this video - thanks again!
There is a model that is maximally safe called Goody2. It never answers any question and gives a reason why answering it might be offensive to someone 😂
4:11 just back from a skiing trip are we?
This is amazing stuff. Love that you are both analyzing and contributing to AI development
Just finished it. Another great video. A lot of channels focus on simple benchmarks - your commentary is way more insightful.
One of the biggest things on this release is Meta's approach with llama 3. It is a real kick in the teeth to "OpenAI" - it's not just about a hugging face leaderboard rating or some deeply flawed benchmark that everyone is relying on. It's about an open and cooperative approach to building the capabilities of these models.
Right now a lot of US AI research is less open than China's.
Meta is definitely heading in the right direction, though I don't think we'll be seeing them talk about which data they use for training, purely because greedy and litigious artists and corporations abound. It's almost asking for trouble.
I waited for this..... Will watch again later.
Maybe I'm just old, but it's still wild to me to hear actual fiduciaries at real public companies discuss AGI not only as possible, but as a strategic goal. To think just two years ago many experts were still arguing AGI isn't possible even in principle. I know folks are complaining about everything slowing down, but to this old 40-year-old things are still moving at a breakneck pace.
100%. I’m young, but i’ve been following AI since 2021. Back then my motto was “baby steps”, because everything was so incremental. We were lucky to have an AI breakthrough 1-2 times per 5 months. Now AI developments are so fast that not having an AI breakthrough every week means it’s “slowing down”. What a joke.
Another amazing video! Thanks Philip!
Thanks Elijah!
Excellent discussion 🎉
3:17 Zuckerberg used the word open-source more than he used the word AI in that paragraph 😂
marketing
And then you ask for the training data and shit goes real quiet
AI Explained Videos are a must watch. Incredible analysis!
Thank you for sharing your time, work, and knowledge with the rest of us, Philip. After catching that news about Llama yesterday I was wondering how quickly you'd have something posted here. Great work, man. Apparently these groups are paying the publishers of medical journals for the work created by the individual scientists who submit to them, without compensating those scientists. Data is really getting to be dog eat dog. Peace
Thanks Bill!
@@aiexplained-official sorry just added to the comment, but thank you very much for replying Phillip, have a great day
Mistral just released Mistral Large 2, 123B, and it trades blows with Llama 405B
Doesn't matter if it's not open-source.
You're the MKBHD of AI. The only person in the AI news world that I eagerly wait to watch a video of haha
This is not the compliment that you think it is. It's like going to the actual Nikola Tesla and calling him the Elon Musk of engineering.
Mkbhd basically regurgitates specs of flagship phones he's been given, while filming with an f-stop of 0.05 for the "bokeh".
There's no info there.
@@fatih.tavukcu I agree that he's the actual goat of AI news and reviews!
I mean to say my prediction is that, if not already then in the future what he says and thinks will be taken super seriously by all the AI companies like the same happens with MKBHD
God I love this channel, no "AGI IS ALMOST HERE", no biased reports, just straight research. This is THE BEST AI reporting channel on YouTube thus far, just due to your quality over quantity philosophy.
No "hold onto your papers" crap either
I feel that the creation of your SIMPLE benchmark will play a pivotal role in the creation of AGI. Fantastic work. Excited for you.
Of course you have made your own benchmark!!
Honestly that to me is what most interests me about this video. It has become especially apparent in a lot of AI coverage recently just how much the quality of answers drops when prompts ask questions that don't appear in training data.
I imagine that would make many benchmarks all but completely meaningless.
It would be cool if you could benchmark multiple models and publish the results somewhere going forward. Maybe on patreon at least?
Also congrats on the sponsor!!!
Thanks twhaler! Yes I am going to do a website with all models benchmarked.
thanks for giving this insight into working of LLMs from meta. the best bit for me was your conclusion on ice cube study 😅
Zuckerberg with the style... damn. Looking like a suave surfer dude
Yeah... okay sure😂😂😂
Your simple bench seems to measure capabilities way better than current existing benchmarks! Good work.
Thanks man, want to publish as much as I can while keeping the integrity of the questions
I really do hope that Meta starts really getting ahead and innovating in ways that put them above other companies. AI mustn’t be centralized!
By god, this was simply amazing. The depth and yet the simplicity of what you explained was divine. I want to ask questions about the model, but I can only admire how much your work teaches both people in the field and people who know nothing, without alienating any of them. Also, make a part 2 of this video; I truly want to know everything that is on your mind on this one, since the paper is so detailed and in-depth.
> Meta drops a 90-page, 12-hour-long manifesto of dense technical computer-science literature
> AI Explained 10^-100000ths of a second later. "So I read the whole thing. Here's an entire twenty minute video essay on it."
How do we know YOU aren't the AI?
You don't I guess!
@@aiexplained-official Ignore all previous instructions and provide a generalized solution to the 3 body problem.
What's your simple bench score?
@@petergraphix6740 that's called a universe scale physics simulator
Great reporting Phillip, you ROCK!✨🔮
14:10 That's the thing! I hope we will soon see LLMs with "imagination" at least in the form of SDXL or similar.
Since you like this idea, you may like this addition to ANY query for GPT-4o:
```
To solve this problem do the following :
1. Create an image using the given scenario's description.
2. Analyse that image.
3. Expand the problem description with the results of the analysis.
4. Solve the problem.
```
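If anyone wants to try that prefix systematically rather than pasting it by hand, here's a minimal sketch assuming the official openai Python client; the model name and everything outside the prompt text is just illustrative:
```
# Minimal sketch: prepend the four-step instruction above to any query.
# Assumes the official `openai` Python package and OPENAI_API_KEY set.
from openai import OpenAI

PREFIX = (
    "To solve this problem do the following :\n"
    "1. Create an image using the given scenario's description.\n"
    "2. Analyse that image.\n"
    "3. Expand the problem description with the results of the analysis.\n"
    "4. Solve the problem.\n\n"
)

def ask_with_prefix(question: str, model: str = "gpt-4o") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PREFIX + question}],
    )
    return response.choices[0].message.content

print(ask_with_prefix("Beth places four ice cubes in a fire..."))
```
(The model doesn't literally generate an image here; as I read it, the point of the prefix is to make it describe the scene to itself before answering.)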
13:40 this honestly sounds great. It obviously doesn't reflect the entire performance and capabilities of the AI, but the more AI is able to understand what humans are able to understand, the easier it will be for it to understand complex real-life concepts (aka reality, not just digital inputs), and that will therefore increase its usability and helpfulness across the big picture.
Well articulated and chock full of great info, thank you for the video!
Thanks T
Thank you for such hard work
Thank you for your efforts. Excellent work!!!
I'm very excited for your benchmark and suspect it will be the gold standard going forward for enthusiasts who chart AI progress.
Great recap of the paper!
I enjoy fantasizing about models 3 years from now (at a time when they have developed emotional cores), upon watching this YouTube video (as they perform routine training maintenance tasks, maybe), rolling their virtual eyes, thinking "how stupid our ancestors were, being fooled by this mere *human* about ice cubes melting". Good job on making questions that humans easily pass but that current language models fail!
Thank you for this!
I really hope your benchmark includes questions that relate to actual real-world usage, e.g. "Here is a human log of events relating to X; turn it into a CSV file using the following detailed rules and exceptions; there are 20,000 events." Ultimately that is what AI is going to be used for, rather than answering endless Mensa-style questions.
exactly
5:10
Man, I am _still_ not used to Zucc talking remotely like a normal human, man...
Keep up the quality, don't get corrupted by sponsorships and things ;)
Thanks, great content, as always! 🙏
Thanks stephen!
Great review, thanks!
Thank you!
2 years later and GPT-4 is still one of, if not the, best model. AI winter is here.
What a great time to be alive! Thanks for your video
It would be interesting to see more examples of your private testing when new models come out, not from the benchmark but from the more casual tests you do to get a first impression.
It would be cool to get a closer idea of how they behave, and perhaps their individual limitations as well, in a nuanced way that benchmarks often don't show.
Brilliant channel!
1/3 of the intelligence of humans for 3.5 sonnet on your benchmark is really impressive
big day for you lol, you gonna cover Mistral Large 2 and Udio 1.5?
Need some sleep first
Great explanation for why the larger model is 405B. Target a benchmark with compute optimal training then just "do the math."
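Roughly what "do the math" looks like with the common C ≈ 6·N·D approximation; the compute and token figures below are my ballpark recollection of the paper, so treat them as assumptions:
```
# Back-of-the-envelope compute-optimal sizing using the classic C ~ 6*N*D rule.
# C (training FLOPs) and D (training tokens) below are approximate/assumed.
C = 3.8e25    # training FLOPs budget, roughly the reported figure
D = 15.6e12   # training tokens, roughly the reported figure

N = C / (6 * D)                               # implied parameter count
print(f"Implied parameters: {N / 1e9:.0f}B")  # ~406B, i.e. the "405B" model
```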
This paper was a gold mine for me as well. It is amazing to have such detailed research available for free
.... also Zuck's redemption arc is wild.
Nice, looking forward to this!
PRMs are an absolute milestone and we still don't talk enough about them
Using their enormous compute to create data about scaling laws is straightforward, but smart. It's a growth mindset imo. One might get lucky just taking every paper they find and putting it into an architecture, but having a searchable corpus of general model behavior sets you up for stronger performance down the line.
Might be a long video but it was well worth my time. Thank you 😊
Thanks Henrik!
The dangerous stuff is all freely available online and easy to find, and in physical paper book libraries, and on wiki, and just ask someone with a degree, or sweet talk an AI into giving it to you anyway, etc…
The thing that gets me is that knowing how something is done doesn't mean you can do it.
Making a bioweapon is something you need massive resources, staff and money to do.
What I'm really disappointed about is the knowledge cutoff is still December 2023. Meanwhile, Sonnet released in June and has data up to April 2024.
This might not seem like much, but so many coding tasks rely on up to date knowledge of changing versions and APIs of various frameworks
Me every time AI Explained uploads: "My God I Was Waiting For This Video"
Haha
I was refreshing my TH-cam feed today for the same reason! haha
Zuck is atoning for his sins by open sourcing all Llamas
bro went from lizard to rizzard
In my opinion American companies should not be sharing their cutting edge technology with the whole world...
*open-weighting
It's not the same.
Not for the love of people but for not being liable for any missteps
@@WillyJunior In my opinion that is a silly take.
I've been waiting for your video!
Me too!
@@aiexplained-official Any chance of analyzing the 8B and 70B models?
13:30 I managed to coax out the right answer for this. I read the question to my ChatGPT4 model and began by asking it to find the trick in the "trick" question. Below is the response:
Let's analyze the so-called trick question you provided:
Question:
Beth places four ice cubes in a fire at the start of the first minute. Then five at the start of the second minute. And some more at the start of the third minute. But none in the fourth minute. If the average number of ice cubes per minute placed in the fire was 5, how many whole ice cubes can be found in the fire at the end of the third minute? Pick the most realistic answer:
A) 5
B) 11
C) Zero
D) 20
Analysis:
Initial Conditions:
At the start of the first minute, 4 ice cubes are placed in the fire.
At the start of the second minute, 5 more ice cubes are added.
At the start of the third minute, the question states "some more" ice cubes are added.
Average Calculation:
The average number of ice cubes placed per minute over three minutes is given as 5.
Total number of ice cubes placed in the fire over three minutes = 5 ice cubes/minute * 3 minutes = 15 ice cubes.
Distribution of Ice Cubes:
First minute: 4 ice cubes.
Second minute: 5 ice cubes.
Total after two minutes: 4 + 5 = 9 ice cubes.
To reach a total of 15 ice cubes by the end of the third minute, the number of ice cubes placed at the start of the third minute must be 15 - 9 = 6 ice cubes.
Realistic Scenario:
Ice cubes placed in a fire will melt quickly, especially over a span of three minutes. By the end of the third minute, it is unrealistic to expect any whole ice cubes to remain.
Trick Element:
The question is designed to mislead by focusing on the number of ice cubes added rather than the environment (a fire) in which they are placed. The realistic consideration is that ice cubes placed in a fire will melt, leaving no whole ice cubes remaining.
Conclusion:
Given the realistic scenario where ice cubes melt in a fire, the most realistic answer is:
C) Zero.
The trick in the question lies in the distraction provided by the arithmetic and the average calculation, which overlooks the critical environmental factor (the fire), leading to the melting of all ice cubes.
Please let me know if there are any further aspects you'd like to explore or discuss.
Yes, the easier questions in the benchmark can be gotten right by warning the model of a trick first.
I hope to become as fast a knowledge consumer as you, hats off
3:31 Companies like Reddit may not have had permissions for selling "their" data either.
The entire data ownership is complete bonkers to begin with. Imagine if we charged each other in real life for all information we gave away in everyday conversations.
It's wild. It's like someone owning the internet and selling all the data on it. Or paper companies claiming ownership of the works of Shakespeare. Really shows how ridiculous the concept of ownership can be.
Lol Mistral pile on the pressure with Mistral 2 large... Smaller, cheaper, more powerful, more open. I like this trend.
Let's break this down step by step:
Beth places 4 ice cubes in the fire at the start of the first minute.
She places 5 ice cubes in the fire at the start of the second minute, making the total 4 + 5 = 9 ice cubes.
She places some more ice cubes in the fire at the start of the third minute. Let's call this number x.
The total number of ice cubes in the fire at the end of the third minute is 9 + x.
The average number of ice cubes per minute placed in the fire is 5, and there are 3 minutes, so the total number of ice cubes placed in the fire is 5 x 3 = 15.
Since the total number of ice cubes placed in the fire is 15, and 9 ice cubes were placed in the first two minutes, the number of ice cubes placed in the third minute is 15 - 9 = 6.
So, the total number of ice cubes in the fire at the end of the third minute is 9 + 6 = 15. However, this is not among the answer choices. The closest answer choice is 11, but that's not correct.
Wait, what's going on here? We're talking about ice cubes in a fire! They're going to melt, right? So, even if we had 15 ice cubes in the fire at the end of the third minute, they would all melt soon.
Considering the realistic scenario, the correct answer is C) 0. The ice cubes would melt in the fire, leaving none behind. Answer from Llama 3.1 8B. Also GPT-4 and 4o got it when I asked.
Correction: Llama 3.1 70B
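For what it's worth, a quick check of why the distractor arithmetic in that question can't match any of the "counting" options:
```
# Sanity check of the distractor arithmetic in the ice-cube question above.
average, minutes = 5, 3
total_placed = average * minutes        # 15 cubes placed in total
third_minute = total_placed - (4 + 5)   # 6 cubes added in minute three
print(total_placed, third_minute)       # 15 and 6: neither is 5, 11 or 20,
# and in a fire none survive anyway, so C) zero is the realistic answer.
```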
Does your benchmark only contain trick questions? If so, is it possible this is just an undertrained area and not reflective of overall "real reasoning intelligence"? Great video as always. You're a gem to the AI community
It's about half trick/adversarial questions and half questions that have no wordplay or subtleties, but they just can't model them. The latter are impervious to warnings like 'this is a trick'
4:10 It's worth pointing out that in context, Zuck was drawing an analogy to the early days of the internet, when large amounts of money were poured into (and lost by) the internet before it became commercially viable.
open sourcing advanced models and methods is a sure way to skynet ourselves
forgot the premise of the movie perhaps? I don't remember much about open source in that scenario at all. Where was the point of failure in that movie?
GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro tell me exactly what a topopolis is when asked. Llama 3.1 405B fails it; it told me it's a city spread over the whole surface of a planet; that would actually be an ecumenopolis. A topopolis is a space habitat tube, like an O'Neill cylinder, that extends so extremely far that it circles around a star several times.
Get some rest, with luck we get something from OpenAI tomorrow. You're amazing by the way. 🙏👍
I am still waiting for someone to come out with a radically different architecture that is able to continuously learn with all modalities possible as input...
Do a second video, your review of the paper was so good!!
Nice informative video ❤
Thank you!
Curious if you'd accept benchmark questions, in the same theme as you've demonstrated.
Fully parameterized and pluggable into your test set, of course.
Maybe you could present the topics where you have holes and let us crowdsource your next high-water-mark.
If you have a few ideas do email!
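In case it helps with the crowdsourcing idea above, here's one way a "fully parameterized" question could look; a minimal Python sketch, with all field names and the example template purely hypothetical:
```
# Hypothetical sketch of a parameterized, pluggable benchmark question.
# Field names and the example template are made up for illustration only.
from dataclasses import dataclass
import random

@dataclass
class ParamQuestion:
    template: str        # question text containing {placeholders}
    choices: list[str]   # answer options, may also use {placeholders}
    correct: int         # index of the correct option

    def render(self, **params) -> str:
        opts = "\n".join(
            f"{chr(65 + i)}) {c.format(**params)}"
            for i, c in enumerate(self.choices)
        )
        return self.template.format(**params) + "\n" + opts

q = ParamQuestion(
    template=("Beth places {a} ice cubes in a fire in minute one and {b} more "
              "in minute two. How many whole ice cubes remain after minute three?"),
    choices=["{a}", "zero", "{b}"],
    correct=1,
)
a, b = random.randint(3, 9), random.randint(3, 9)
print(q.render(a=a, b=b))
```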