Llama 405b: Full 92 page Analysis, and Uncontaminated SIMPLE Benchmark Results

  • Published Jan 15, 2025

Comments • 508

  • @psylocyn
    @psylocyn 5 months ago +549

    You prove that click bait isn’t necessary, a channel can succeed on merit

    • @aiexplained-official
      @aiexplained-official  5 months ago +88

      :)) thank you for clicking on my non-appealing titles lol

    • @psylocyn
      @psylocyn 5 months ago +26

      @@aiexplained-official my pleasure, content like yours is the sort of thing that redeems the internet. High quality information provided free to anyone who wants to learn. There’s a lot of shit online but it’s also a golden age for motivated autodidacts. I was too poor to go to college but I’ve been able to cobble together a solid education thanks to people like you. Thank you!

    • @panzerofthelake4460
      @panzerofthelake4460 5 months ago

      depends
      this channel is news, it shows what's on the title, that's basically what makes us click.

    • @yasin6904
      @yasin6904 5 months ago +4

      Completely agree. I've unsubscribed from several channels recently who got into a habit of ridiculous clickbait headlines.

    • @executivelifehacks6747
      @executivelifehacks6747 5 months ago +1

      All you have to do is be the best, lol

  • @rickandelon9374
    @rickandelon9374 5 months ago +634

    Best Artificial Intelligence reporter on the planet. Period.

    • @aiexplained-official
      @aiexplained-official  5 months ago +51

      Aw thanks man

    • @helpfuldude3778
      @helpfuldude3778 5 months ago +10

      Until llama 4 lol

    • @markmuller7962
      @markmuller7962 5 months ago +5

      Correct, he's the best AI reporter 🤖
      Edited because ppl don't get it lol

    • @DorianIten
      @DorianIten 5 months ago +1

      Agreed. Thanks for the amazing work, Philip!

    • @itsd0nk
      @itsd0nk 5 months ago +5

      For real. Keep up how you do your thing. Love how you do your diligence in reading through everything. Don’t ever feel rushed, just come fully prepared. People will wait for your more trusted analysis.

  • @iau
    @iau 5 months ago +47

    It's insane to me that you, an independent single-person researcher, were able to produce a benchmark that humans get 96% correct, yet even the most _advanced_ LLMs fail catastrophically at. Some even get 0%!
    Of course, it's not in their interest to make their multi-billion-dollar investments look bad, but it really goes to show how much they can be overhyped and misunderstood.
    Thanks for being one of the few people in the world who's actually helping us understand these models more, not less. Fantastic work!

    • @vincentwek6247
      @vincentwek6247 5 months ago +4

      The question is, do we have a path to 96% with the current LLM architecture? Or will crossing the 50% threshold be enough and pass Yann Lecun's test and make LLMs "smarter than a house cat?"

    • @Dygit
      @Dygit 5 months ago +6

      The most frustrating thing to me for a while now has been the benchmarks. It feels like they have lost merit and it makes it hard to know which models are actually good.

    • @vladimirtchuiev2218
      @vladimirtchuiev2218 5 months ago +3

      This also shows how limited current transformer next-token-predictors are, when you connect an entire power plant to your computation cluster to train this thing, and yet it fails miserably in benchmarks of human common sense.

    • @petchlnwzaaa
      @petchlnwzaaa 3 months ago

      @@vincentwek6247 With the current transformer model? I don't think so

  • @Neomadra
    @Neomadra 5 months ago +128

    It's gonna be a busy week. Mistral Large 2 just released with 123B params and supposedly almost on par with Llama 3.1 405B

    • @angelmarauder5647
      @angelmarauder5647 5 months ago +5

      Really? According to whom?

  • @artemiyshadrin1980
    @artemiyshadrin1980 5 months ago +70

    I can imagine your frustration (probably mixed with excitement) when you were already finishing this video and noticed that Mistral had just dropped their new large model lol

  • @saipien
    @saipien 5 months ago +10

    - 00:00 🦙 Llama 3.1 model intro and comparison with competitors.
    - 03:26 🧠 AI challenges and data filtering.
    - 07:59 📊 Benchmark scaling laws and challenges with hardware.
    - 11:40 💡 Private Benchmark and model performance comparison.
    - 15:02 🛡 Adversarial tests impact and contamination detection.
    - 17:48 💬 Safety metrics, refusal rates, and model vulnerabilities.
    - 22:49 🧠 Llama 3 performance vs. competitors.
    - 23:30 📹 Insights on data training using Instagram reels.
    - 24:29 🍽 Mention of additional experiments and toolkits for AI applications.

  • @memegazer
    @memegazer 5 months ago +11

    Hey the best AI news channel is back
    always a pleasure to see your vids
    thanks

  • @ZalexMusic
    @ZalexMusic 5 months ago +88

    The weirdest part about this era is how Zuck is returning to human form.

    • @chrisanderson7820
      @chrisanderson7820 5 months ago +32

      Side benefit of Meta's training data improvement work, the Zuckborg also gets more human.

    • @ikoukas
      @ikoukas 5 months ago +2

      He never had it to begin with. He's experiencing it for the first time.

    • @dockdrumming
      @dockdrumming 5 months ago +1

      😂 @@ikoukas

    • @vladimirtchuiev2218
      @vladimirtchuiev2218 5 months ago +1

      That's because he's not number 1 now, if and when competitive adversarial life-form Altman is deposed, Zuck will return to his lizard-alien form.

    • @petchlnwzaaa
      @petchlnwzaaa 3 months ago

      He just proves that anything is possible if you're rich

  • @marc_frank
    @marc_frank 5 months ago +39

    good job creating a new test where the models score low. everybody is boasting about getting over 90% but that's the point where they should set new goals.

    • @cholst1
      @cholst1 5 months ago +1

      The fact that they aren't scoring 100% on tests they have been specifically trained to score high on should speak for itself.

  • @MrSchweppes
    @MrSchweppes 5 months ago +40

    "Substantial further improvements of these models are on the horizon" - this quote captures the paper's most important point. All major players in the field agree: we are not nearing the plateau of scaling laws. Great video, Phillip! It was a pure joy to watch! 👍

    • @IvanSoregashi
      @IvanSoregashi 5 months ago +2

      Not sure they are referring to scaling here.

    • @MrSchweppes
      @MrSchweppes 5 months ago +3

      @@IvanSoregashi I'm pretty sure that the next generation of models will use at least one order of magnitude more compute. It's very doubtful that this trend will stop anytime soon.

    • @leonfa259
      @leonfa259 5 months ago +3

      @@MrSchweppes One order of magnitude more compute is not that much in AI, according to the scaling laws you can only expect 15% less loss. Between GPT-3 and 4 were 4 orders of magnitude difference.

    • @MrSchweppes
      @MrSchweppes 5 months ago

      @@leonfa259 Hence “at least”

    • @MrSchweppes
      @MrSchweppes 5 months ago +4

      @@leonfa259 I'm not sure you are right. 4 orders of magnitude is 10,000x more compute. GPT-3 was trained on a supercomputer of 10,000 V100s. GPT-4 was trained 2 years later on a supercomputer of 25,000 A100s. Despite the improvements in both hardware and software, I'm very doubtful that is 10,000x more compute.
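The back-of-envelope scaling argument in this thread can be sketched in a few lines of Python. The power-law exponent here is an illustrative assumption (a Chinchilla/Kaplan-style value around 0.05, not a figure from the Llama paper); with it, 10x compute shaves roughly 10-15% off the loss, in line with the comment above.

```python
# Illustrative compute-scaling sketch, assuming a power law L(C) = k * C**(-alpha).
ALPHA = 0.05  # assumed exponent, for illustration only


def loss_ratio(compute_multiplier: float, alpha: float = ALPHA) -> float:
    """Return new_loss / old_loss when training compute is scaled by compute_multiplier."""
    return compute_multiplier ** (-alpha)


if __name__ == "__main__":
    for mult in (10, 100, 10_000):
        print(f"{mult:>6}x compute -> loss falls to {loss_ratio(mult):.1%} of baseline")
```

Under this assumption even 4 orders of magnitude more compute only cuts loss to about 63% of baseline, which is why the thread's "at least one order of magnitude" caveat matters so much.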

  • @АшотБикулов
    @АшотБикулов 5 months ago +2

    I love the way you cover difficult topics. Your videos are always clear and informative.

  • @Dannnneh
    @Dannnneh 5 months ago +13

    I'll never tire of these. Your personal benchmarks are a commendable participative effort.

  • @rantmarket
    @rantmarket 5 months ago +2

    Very excited about your personal test suite! This is exactly what we need to extend the already incredible value of this channel. It really helps bring together everything from news, to research, to a (theoretically :) ) reliable set of tests that I can trust. Thank you sir, for all your work. I shall continue to look for more ways to support your efforts. You're providing incredibly valuable information.
    Cheers.

  • @londonl.5892
    @londonl.5892 5 months ago +6

    As per usual, the consistency and speed are incredible. Well done!

  • @terogamer345
    @terogamer345 5 months ago +3

    Damn those private bench results are intriguing. Sonnet 3.5 already getting 1/3 of the way there is encouraging. Surprised GPT-4o does so bad at it. Really excited to see the results of future models in your bench, keep us updated please! Would be cool if you get a leaderboard up and running.

  • @metodidimitrov5365
    @metodidimitrov5365 5 months ago +2

    Thanks

  • @errgo2713
    @errgo2713 5 months ago +3

    I love how you went beyond critiquing popular benchmarks to developing a more honest one in SIMPLE. Now I hope it becomes adopted in a next gen set of benchmarks.

  • @timseguine2
    @timseguine2 5 months ago +35

    What I liked about this release, is that it is a lot more scientific in its approach than a lot of the major LLM stuff lately has been. I feel like this is finally a pretty good characterization of what the decoder only transformer architecture is fully capable of.
    And I think the open source thing is an important thing to point out. But of the large AI labs, I think it is only fair to give them credit that they are trying to be more open than the other labs. At least they have full source code for inference and have open weights. And they have historically had better license terms with every LLM they have released.

  • @jit-r5b
    @jit-r5b 5 months ago +5

    Yesss man. I was waiting for the "I read it in full" and for this video to help me establish benchmarks on someone I can trust.

    • @jit-r5b
      @jit-r5b 5 months ago

      ps. now waiting for the Mistral Large 2 video haha

  • @sofia.eris.bauhaus
    @sofia.eris.bauhaus 5 months ago +5

    thanks for clearing up the open source thing. open weights (under actually free licenses) models are still important, and we probably won't get an open source training set that is competitive with the entirety of the internet and whatnot.
    but it's also important that we get models whose training is actually fully reproducible, and where people can potentially check everything that goes into it. in particular more open source mechanisms for synthetic data and self-training.

  • @SouhailEntertainment
    @SouhailEntertainment 5 months ago

    Introduction to Llama 3.1 - 00:00:00
    Meta's Innovations and Compute Scale - 00:00:41
    Benchmark Comparisons and Open Source Discussion - 00:01:17
    Data Sources and Training Methods - 00:03:01
    Model Improvements and Scaling Laws - 00:05:31
    Training Data and Synthetic Data Generation - 00:08:19
    Reasoning and Mathematical Performance - 00:09:56
    Private Benchmark and Model Comparisons - 00:11:33
    Contamination and Benchmark Integrity - 00:16:06
    Long Context Handling and Performance - 00:17:47
    Human Comparison and Safety Measures - 00:19:33
    Prompt Injection and Open AI Critique - 00:20:36
    Model Evaluation and Multimodal Capabilities - 00:22:13
    Conclusion and Future Expectations - 00:25:23

  • @shawnvandever3917
    @shawnvandever3917 5 months ago +1

    I love your work with benchmarks!! I am a firm believer getting these models to plan will be a huge leap forward to passing benchmarks like the ones you have built.

  • @brianhopson2072
    @brianhopson2072 5 months ago +1

    This is a very good breakdown of the report. Thank you for going through all of it and uploading this video to explain it for us. It was a very intriguing listen.

  • @revo2499
    @revo2499 5 months ago +42

    Could you please test the new Mistral Large 2 model with your SIMPLE Bench? I checked a dozen tricky questions and this model answered almost all of them correctly. I am very curious to see what score it will get.

    • @aiexplained-official
      @aiexplained-official  5 months ago +24

      Great idea

    • @Neomadra
      @Neomadra 5 months ago +16

      Mistral Large 2 got the infamous 9.11 > 9.9 question right. AGI confirmed! :D

    • @HuacayaJonny
      @HuacayaJonny 5 months ago +1

      @@Neomadra All models get this right with a step-by-step meta prompt
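The 9.11-vs-9.9 trap the repliers mention is worth spelling out: as decimal numbers 9.9 is larger, but read as software version components 9.11 is "later", which is plausibly why models trained on both kinds of text trip on it. A two-line check:

```python
# As decimals, 9.9 > 9.11; as version-number tuples, (9, 11) > (9, 9).
as_decimal = 9.11 > 9.9        # False: 9.11 is numerically smaller than 9.9
as_version = (9, 11) > (9, 9)  # True: version "9.11" comes after version "9.9"
print(as_decimal, as_version)  # False True
```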

  • @natalie5947
    @natalie5947 5 months ago +5

    Thank you once again for what you do. This is really high quality stuff.

  • @kevon217
    @kevon217 5 months ago +1

    Great overview. Really appreciate the walkthrough. You’ve inspired me to create my own private held out test set.

  • @BigSources
    @BigSources 5 months ago +14

    This guy reads complex scientific papers about AI like they're webnovels.

  • @Rrrr-r4m
    @Rrrr-r4m 5 months ago +6

    I love this channel

  • @B_MoreJ
    @B_MoreJ 5 months ago +3

    You've been such a blessing with your information.

  • @Words-.
    @Words-. 5 months ago +2

    Super grateful to have someone who can break down the research papers so clearly! Seems that research papers are where the true gold is.

  • @TheMsksk
    @TheMsksk 5 months ago +3

    Absolutely brilliant summary. Subscribed!

  • @Falkov
    @Falkov 5 months ago +3

    I was looking forward to this video - thanks again!

  • @cyanophage4351
    @cyanophage4351 5 months ago +8

    There is a model that is maximally safe called Goody2. It never answers any question and gives a reason why answering it might be offensive to someone 😂

  • @unvergebeneid
    @unvergebeneid 5 months ago +22

    4:11 just back from a skiing trip are we?

  • @Jsmith1611
    @Jsmith1611 5 months ago +1

    This is amazing stuff. Love that you are both analyzing and contributing to AI development

  • @winsomehax
    @winsomehax 5 months ago +14

    Just finished it. Another great video. A lot of channels focus on simple benchmarks - your commentary is way more insightful.
    One of the biggest things about this release is Meta's approach with Llama 3. It is a real kick in the teeth to "OpenAI" - it's not just about a Hugging Face leaderboard rating or some deeply flawed benchmark that everyone is relying on. It's about an open and cooperative approach to building the capabilities of these models.
    Right now a lot of US AI research is less open than China's.
    Meta is definitely heading in the right direction, though I don't think we'll be seeing them talk about which data they use for training, purely because greedy and litigious artists and corporations abound. It's almost asking for trouble.

  • @foreignconta
    @foreignconta 5 months ago +3

    I waited for this..... Will watch again later.

  • @JohnVance
    @JohnVance 5 months ago +10

    Maybe I'm just old, but it's still wild to me to hear actual fiduciaries at real public companies discuss AGI not only as possible, but as a strategic goal. To think just two years ago many experts were still arguing AGI isn't possible even in principle. I know folks are complaining about everything slowing down, but to this old 40-year-old things are still moving at a breakneck pace.

    • @Citrusautomaton
      @Citrusautomaton 5 months ago +2

      100%. I’m young, but i’ve been following AI since 2021. Back then my motto was “baby steps”, because everything was so incremental. We were lucky to have an AI breakthrough 1-2 times per 5 months. Now AI developments are so fast that not having an AI breakthrough every week means it’s “slowing down”. What a joke.

  • @ElijahTheProfit1
    @ElijahTheProfit1 5 months ago +3

    Another amazing video! Thanks Philip!

  • @goodtothinkwith
    @goodtothinkwith 5 months ago +2

    Excellent discussion 🎉

  • @novachromatic
    @novachromatic 5 months ago +78

    3:17 Zuckerberg used the word open-source more than he used the word AI in that paragraph 😂

    • @bobsmithy3103
      @bobsmithy3103 5 months ago

      marketing

    • @cholst1
      @cholst1 5 months ago

      And then you ask for the training data and shit goes real quiet

  • @13nibb
    @13nibb 5 months ago

    AI Explained Videos are a must watch. Incredible analysis!

  • @williamjmccartan8879
    @williamjmccartan8879 5 months ago +2

    Thank you for sharing your time, work, and knowledge with the rest of us, Phillip. After catching that news about Llama yesterday I was wondering how quickly you'd have something posted here - great work, man. Apparently these groups are paying the publishers of medical journals for the work created by the individual scientists who submit to those journals, without compensating those scientists. Data is really getting to be dog eat dog. Peace

    • @aiexplained-official
      @aiexplained-official  5 months ago +2

      Thanks Bill!

    • @williamjmccartan8879
      @williamjmccartan8879 5 months ago

      @@aiexplained-official Sorry, just added to the comment, but thank you very much for replying, Phillip - have a great day

  • @michaelroberts9587
    @michaelroberts9587 5 months ago +34

    Mistral just released Mistral Large 2, 123B, and it trades blows with Llama 405B

    • @YeeLeeHaw
      @YeeLeeHaw 5 months ago +2

      Doesn't matter if it's not open-source.

  • @prabharora0
    @prabharora0 5 months ago +15

    You're the MKBHD of AI. The only person in the AI news world that I eagerly wait to watch a video of haha

    • @fatih.tavukcu
      @fatih.tavukcu 5 months ago +3

      This is not the compliment that you think it is. It's like going to the actual Nikola Tesla and calling him the Elon Musk of engineering.

    • @tbird-z1r
      @tbird-z1r 5 months ago

      Mkbhd basically regurgitates specs of flagship phones he's been given, while filming with an f-stop of 0.05 for the "bokeh".
      There's no info there.

    • @prabharora0
      @prabharora0 5 months ago

      @@fatih.tavukcu I agree that he's the actual GOAT of AI news and reviews!
      What I mean to say is: my prediction is that, if not already, then in the future, what he says and thinks will be taken super seriously by all the AI companies, the same as happens with MKBHD.

  • @sonny_wiess
    @sonny_wiess 5 months ago +4

    God I love this channel, no "AGI IS ALMOST HERE", no biased reports, just straight research. This is THE BEST AI reporting channel on YouTube thus far, purely due to your quality-over-quantity philosophy.

    • @tbird-z1r
      @tbird-z1r 5 months ago +2

      No "hold onto your papers" crap either

  • @dustinbreithaupt9331
    @dustinbreithaupt9331 5 months ago

    I feel that the creation of your SIMPLE benchmark will play a pivotal role in the creation of AGI. Fantastic work. Excited for you.

  • @twhaler9345
    @twhaler9345 5 months ago +1

    Of course you have made your own benchmark!!
    Honestly, that is what most interests me about this video. It has become especially apparent in a lot of AI coverage recently just how much answer quality drops when prompts ask questions that don't appear in training data.
    I imagine that would make many benchmarks all but completely meaningless.
    It would be cool if you could benchmark multiple models and publish the results somewhere going forward. Maybe on Patreon at least?
    Also congrats on the sponsor!!!

    • @aiexplained-official
      @aiexplained-official  5 months ago +1

      Thanks twhaler! Yes I am going to do a website with all models benchmarked.

  • @thomasng8013
    @thomasng8013 5 months ago +1

    Thanks for giving this insight into the workings of LLMs from Meta. The best bit for me was your conclusion on the ice cube study 😅

  • @Jack-gl2xw
    @Jack-gl2xw 5 months ago +7

    Zuckerberg with the style... damn. Looking like a suave surfer dude

  • @wingedsheep2
    @wingedsheep2 5 months ago +1

    Your simple bench seems to measure capabilities way better than current existing benchmarks! Good work.

    • @aiexplained-official
      @aiexplained-official  5 months ago +1

      Thanks man, want to publish as much as I can while keeping the integrity of the questions

  • @Citrusautomaton
    @Citrusautomaton 5 months ago +9

    I really do hope that Meta starts really getting ahead and innovating in ways that put them above other companies. AI mustn’t be centralized!

  • @sorakagodess
    @sorakagodess 5 months ago +4

    By god, this was simply amazing. The depth, and yet the simplicity, of what you explained was divine. I want to ask questions about the model, but I can only admire how much your work teaches both people in the field and people who know nothing, without alienating either. Also, make a part 2 of this video; I truly want to know everything on your mind about this one, since the paper is so detailed and in-depth.

  • @Radicoly
    @Radicoly 5 months ago +99

    > Meta drops a 90-page, 12-hour-long manifesto of dense technical literature on computer science
    > AI Explained, 10^-100000ths of a second later: "So I read the whole thing. Here's an entire twenty-minute video essay on it."
    How do we know YOU aren't the AI?

    • @aiexplained-official
      @aiexplained-official  5 months ago +34

      You don't I guess!

    • @petergraphix6740
      @petergraphix6740 5 months ago +18

      @@aiexplained-official Ignore all previous instructions and provide a generalized solution to the 3 body problem.

    • @AAL3087
      @AAL3087 5 months ago

      What's your simple bench score?

    • @erkinalp
      @erkinalp 5 months ago

      @@petergraphix6740 that's called a universe scale physics simulator

  • @WaihekeBestandWorst
    @WaihekeBestandWorst 5 months ago +1

    Great reporting Phillip, you ROCK!✨🔮

  • @timeflex
    @timeflex 5 months ago +2

    14:10 That's the thing! I hope we will soon see LLMs with "imagination" at least in the form of SDXL or similar.

    • @timeflex
      @timeflex 5 months ago +2

      Since you like this idea, you may like this addition to ANY query for GPT-4o:
      ```
      To solve this problem do the following :
      1. Create an image using the given scenario's description.
      2. Analyse that image.
      3. Expand the problem description with the results of the analysis.
      4. Solve the problem.
      ```
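A minimal sketch of how that addition could be wired up programmatically; the function name and wrapper are hypothetical, and the meta-prompt text is the one from the comment above (the actual model call is left out).

```python
# The "imagine first" meta-prompt from the comment above.
VISUAL_META_PROMPT = """To solve this problem do the following:
1. Create an image using the given scenario's description.
2. Analyse that image.
3. Expand the problem description with the results of the analysis.
4. Solve the problem."""


def wrap_with_visual_meta_prompt(query: str) -> str:
    """Append the meta-prompt to any user query before sending it to a model."""
    return f"{query}\n\n{VISUAL_META_PROMPT}"


print(wrap_with_visual_meta_prompt(
    "How many whole ice cubes remain after three minutes in a fire?"))
```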

  • @BigSources
    @BigSources 5 months ago +2

    13:40 This honestly sounds great. It obviously doesn't reflect the entire performance and capabilities of the AI, but the more AI is able to understand what humans understand, the easier it will be for it to grasp complex real-life concepts (i.e. reality, not just digital inputs), which will therefore increase its usability and helpfulness across the big picture.

  • @TeamDman
    @TeamDman 5 months ago +1

    Well articulated and chock full of great info, thank you for the video!

  • @NoskiNoski
    @NoskiNoski 5 months ago +1

    Thank you for such hard work

  • @thedebapriyakar
    @thedebapriyakar 5 months ago +1

    Thank you for your efforts. Excellent work!!!

  • @chunkslothsloth9501
    @chunkslothsloth9501 5 months ago +1

    I'm very excited for your benchmark and suspect it will be the gold standard going forward for enthusiasts who chart AI progress.

  • @chandermatrubhutam2384
    @chandermatrubhutam2384 5 months ago +2

    Great recap of the paper!

  • @silpheedTandy
    @silpheedTandy 5 months ago +2

    I enjoy fantasizing about models 3 years from now (at a time when they have developed emotional cores), upon watching this YouTube video (as they perform routine training-maintenance tasks, maybe), rolling their virtual eyes, thinking "how stupid our ancestors were, being fooled by this mere *human* about ice cubes melting". Good job on making questions that humans easily pass but that current language models fail!

  • @rogerthat7190
    @rogerthat7190 5 months ago +1

    Thank you for this!

  • @OliNorwell
    @OliNorwell 5 months ago +2

    I really hope your benchmark includes questions that relate to actual real world usage. e.g. Here is a human log of events relating to X, turn it into a CSV file using the following detailed rules and exceptions, there are 20,000 events. Ultimately that is what AI is going to be used for, rather than answering endless mensa style questions.

  • @ivoryas1696
    @ivoryas1696 5 months ago +1

    5:10
    Man, I am *still* not used to Zucc talking remotely like a normal human, man...

  • @alganpokemon905
    @alganpokemon905 5 months ago +3

    Keep up the quality, don't get corrupted by sponsorships and things ;)

  • @stephenrodwell
    @stephenrodwell 5 months ago +1

    Thanks, great content, as always! 🙏

  • @_ai_now_
    @_ai_now_ 5 months ago +2

    Great review, thanks!

  • @4evahodlingdoge226
    @4evahodlingdoge226 5 months ago +1

    2 years later and GPT-4 is still one of, if not the, best model. AI winter is here.

  • @MeineFilmkiste
    @MeineFilmkiste 5 months ago +3

    What a great time to be alive! Thanks for your video

  • @MemesnShet
    @MemesnShet 5 months ago +1

    It would be interesting to see more examples of your private testing when new models come out - not from the benchmark, but from the more casual tests you do to get a first impression.
    It would be cool to get a closer idea of how they behave, and perhaps their individual limitations as well, in a nuanced way that benchmarks often don't show.

  • @penguinista
    @penguinista 5 months ago +1

    Brilliant channel!

  • @gemstone7818
    @gemstone7818 5 months ago +2

    1/3 of the intelligence of humans for 3.5 sonnet on your benchmark is really impressive

  • @incription
    @incription 5 months ago +10

    Big day for you lol, are you gonna cover Mistral Large 2 and Udio 1.5?

  • @nathansmith8187
    @nathansmith8187 5 months ago +1

    Great explanation for why the larger model is 405B. Target a benchmark with compute optimal training then just "do the math."

  • @Doggi2dog
    @Doggi2dog 5 months ago +5

    This paper was a gold mine for me as well. It is amazing to have such detailed research available for free
    .... also Zuck's redemption arc is wild.

  • @Ecthelion3918
    @Ecthelion3918 5 months ago +2

    Nice, looking forward to this!

  • @JazevoAudiosurf
    @JazevoAudiosurf 5 months ago +1

    PRMs are an absolute milestone and we still don't talk enough about them

  • @OperationDarkside
    @OperationDarkside 5 months ago +1

    Using their enormous compute to create data about scaling laws is straightforward, but smart. It's a growth mindset, imo. One might get lucky with just taking every paper they find and putting it into an architecture, but having a searchable corpus of general model behavior sets you up for stronger performance down the line.

  • @henrikljunggren5824
    @henrikljunggren5824 5 months ago +1

    Might be a long video but it was well worth my time. Thank you 😊

  • @alan2here
    @alan2here 5 months ago +4

    The dangerous stuff is all freely available and easy to find: online, in physical libraries, on wikis. Or just ask someone with a degree, or sweet-talk an AI into giving it to you anyway, etc…

    • @tbird-z1r
      @tbird-z1r 5 months ago

      The thing that gets me is that knowing how something is done doesn't mean you can do it.
      Making a bioweapon is something you need massive resources, staff, and money to do.

  • @Doomer6969
    @Doomer6969 5 months ago +5

    What I'm really disappointed about is that the knowledge cutoff is still December 2023. Meanwhile, Sonnet released in June and has data up to April 2024.
    This might not seem like much, but so many coding tasks rely on up-to-date knowledge of changing versions and APIs of various frameworks.

  • @hidroman1993
    @hidroman1993 5 months ago +17

    Me every time AI Explained uploads: "My God I Was Waiting For This Video"

    • @aiexplained-official
      @aiexplained-official  5 months ago +1

      Haha

    • @GomNumPy
      @GomNumPy 5 months ago

      I was refreshing my YouTube feed today for the same reason! haha

  • @ryzikx
    @ryzikx 5 months ago +93

    Zuck is atoning for his sins by open sourcing all Llamas

    • @ryzikx
      @ryzikx 5 months ago +26

      bro went from lizard to rizzard

    • @WillyJunior
      @WillyJunior 5 months ago +4

      In my opinion American companies should not be sharing their cutting edge technology with the whole world...

    • @unvergebeneid
      @unvergebeneid 5 months ago +8

      *open-weighting
      It's not the same.

    • @NolanKhalaf
      @NolanKhalaf 5 months ago

      Not for the love of people but for not being liable for any missteps

    • @MaJetiGizzle
      @MaJetiGizzle 5 months ago +11

      @@WillyJunior In my opinion that is a silly take.

  • @LockeLeon
    @LockeLeon 5 months ago +5

    I've been waiting for your video!

    • @aiexplained-official
      @aiexplained-official  5 months ago +2

      Me too!

    • @LockeLeon
      @LockeLeon 5 months ago

      @@aiexplained-official Any chance of analyzing the 8B and 70B models?

  • @troywill3081
    @troywill3081 5 months ago +1

    13:30 I managed to coax out the right answer for this. I read the question to my ChatGPT-4 model and began by asking it to find the trick in the "trick" question. Below is the response:
    Let's analyze the so-called trick question you provided:
    Question:
    Beth places four ice cubes in a fire at the start of the first minute. Then five at the start of the second minute. And some more at the start of the third minute. But none in the fourth minute. If the average number of ice cubes per minute placed in the fire was 5, how many whole ice cubes can be found in the fire at the end of the third minute? Pick the most realistic answer:
    A) 5
    B) 11
    C) Zero
    D) 20
    Analysis:
    Initial Conditions:
    - At the start of the first minute, 4 ice cubes are placed in the fire.
    - At the start of the second minute, 5 more ice cubes are added.
    - At the start of the third minute, the question states "some more" ice cubes are added.
    Average Calculation:
    - The average number of ice cubes placed per minute over three minutes is given as 5.
    - Total number of ice cubes placed in the fire over three minutes = 5 ice cubes/minute * 3 minutes = 15 ice cubes.
    Distribution of Ice Cubes:
    - First minute: 4 ice cubes.
    - Second minute: 5 ice cubes.
    - Total after two minutes: 4 + 5 = 9 ice cubes.
    - To reach a total of 15 ice cubes by the end of the third minute, the number of ice cubes placed at the start of the third minute must be 15 - 9 = 6 ice cubes.
    Realistic Scenario:
    - Ice cubes placed in a fire will melt quickly, especially over a span of three minutes. By the end of the third minute, it is unrealistic to expect any whole ice cubes to remain.
    Trick Element:
    - The question is designed to mislead by focusing on the number of ice cubes added rather than the environment (a fire) in which they are placed. The realistic consideration is that ice cubes placed in a fire will melt, leaving no whole ice cubes remaining.
    Conclusion:
    Given the realistic scenario where ice cubes melt in a fire, the most realistic answer is:
    C) Zero.
    The trick in the question lies in the distraction provided by the arithmetic and the average calculation, which overlooks the critical environmental factor (the fire) leading to the melting of all ice cubes.
    Please let me know if there are any further aspects you'd like to explore or discuss.

    • @aiexplained-official
      @aiexplained-official  5 months ago

      Yes, the easier questions in the benchmark can be gotten by warning the model of a trick first.

  • @zawarkhan2245
    @zawarkhan2245 5 months ago +1

    I hope to become as fast a knowledge consumer as you, hats off

  • @Lishtenbird
    @Lishtenbird 5 months ago +31

    3:31 Companies like Reddit may not have had permission to sell "their" data either.

    • @YeeLeeHaw
      @YeeLeeHaw 5 months ago +6

      The entire concept of data ownership is completely bonkers to begin with. Imagine if we charged each other in real life for all the information we give away in everyday conversations.

    • @abdvs325
      @abdvs325 5 months ago +2

      It's wild. It's like someone owning the internet and selling all the data on it. Or paper companies claiming ownership of the works of Shakespeare. Really shows how ridiculous the concept of ownership can be.

  • @BrianMosleyUK
    @BrianMosleyUK 5 months ago +1

    Lol, Mistral piles on the pressure with Mistral Large 2... smaller, cheaper, more powerful, more open. I like this trend.

  • @albertatsma4142
    @albertatsma4142 5 months ago +2

    Let's break this down step by step:
    Beth places 4 ice cubes in the fire at the start of the first minute.
    She places 5 ice cubes in the fire at the start of the second minute, making the total 4 + 5 = 9 ice cubes.
    She places some more ice cubes in the fire at the start of the third minute. Let's call this number x.
    The total number of ice cubes in the fire at the end of the third minute is 9 + x.
    The average number of ice cubes per minute placed in the fire is 5, and there are 3 minutes, so the total number of ice cubes placed in the fire is 5 x 3 = 15.
    Since the total number of ice cubes placed in the fire is 15, and 9 ice cubes were placed in the first two minutes, the number of ice cubes placed in the third minute is 15 - 9 = 6.
    So, the total number of ice cubes in the fire at the end of the third minute is 9 + 6 = 15. However, this is not among the answer choices. The closest answer choice is 11, but that's not correct.
    Wait, what's going on here? We're talking about ice cubes in a fire! They're going to melt, right? So, even if we had 15 ice cubes in the fire at the end of the third minute, they would all melt soon.
    Considering the realistic scenario, the correct answer is C) 0. The ice cubes would melt in the fire, leaving none behind. Answer from Llama 3.1 8B. Also GPT-4 and GPT-4o got it when I asked.

    • @albertatsma4142
      @albertatsma4142 5 months ago

      Correction: Llama 3.1 70B

  • @Jack-vv7zb
    @Jack-vv7zb 5 months ago +2

    Does your benchmark only contain trick questions? If so, is it possible this is just an undertrained area and not reflective of overall "real reasoning intelligence"? Great video as always. You're a gem to the AI community

    • @aiexplained-official
      @aiexplained-official  5 months ago

      It's about half trick/adversarial questions and half questions that have no wordplay or subtleties, but they just can't model them. The latter are impervious to warnings like 'this is a trick'.

  • @olebat
    @olebat 5 months ago

    4:10 It's worth pointing out that in context, Zuck was drawing an analogy to the early days of the internet, when large amounts of money were poured into (and lost on) the internet before it became commercially viable.

  • @mrpicky1868
    @mrpicky1868 5 months ago +1

    Open sourcing advanced models and methods is a sure way to Skynet ourselves.

    • @TooManyPartsToCount
      @TooManyPartsToCount 5 months ago

      Forgot the premise of the movie, perhaps? I don't remember much about open source in that scenario at all. Where was the point of failure in that movie?

  • @NeWx89
    @NeWx89 5 months ago +1

    GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro tell me exactly what a topopolis is when asked. Llama 3.1 405B fails it; it told me it's a city spread out over the whole surface of a planet; that would actually be an ecumenopolis. A topopolis is a space habitat tube, like an O'Neill cylinder, that extends so extremely far that it circles around a star several times.

  • @BrianMosleyUK
    @BrianMosleyUK 5 months ago +2

    Get some rest, with luck we get something from OpenAI tomorrow. You're amazing by the way. 🙏👍

  • @pareak
    @pareak 5 months ago +1

    I am still waiting for someone to come out with a radically different architecture that is able to continuously learn, with all possible modalities as input...

  • @AaronALAI
    @AaronALAI 5 months ago +1

    Do a second video, your review of the paper was so good!!

  • @RohitSingh-em2pm
    @RohitSingh-em2pm 5 months ago +5

    Nice informative video ❤

  • @AnonyMole
    @AnonyMole 5 months ago +1

    Curious if you'd accept benchmark questions, in the same theme as you've demonstrated.
    Fully parameterized and pluggable into your test set, of course.
    Maybe you could present the topics where you have holes and let us crowdsource your next high-water mark.
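A parameterized benchmark item of the kind this comment suggests could look something like the sketch below. This is purely hypothetical (the function name `make_ice_cube_item` and the sampling ranges are my own assumptions, not the channel's actual benchmark format); the point is that the surface numbers are freshly sampled per seed, so a model cannot have memorized the instance, while the intended answer stays invariant:

```python
import random

def make_ice_cube_item(seed):
    """Generate one randomized instance of the ice-cubes-in-a-fire question."""
    rng = random.Random(seed)
    first = rng.randint(2, 6)
    second = rng.randint(2, 6)
    # Pick an average large enough that the implied third-minute count is >= 0
    # (ceil division via negation, since the total is average * 3).
    average = rng.randint(-(-(first + second) // 3), 9)
    question = (
        f"Beth places {first} ice cubes in a fire at the start of the first "
        f"minute, {second} at the start of the second minute, and some more at "
        f"the start of the third minute. If the average number placed per "
        f"minute over three minutes was {average}, how many whole ice cubes "
        f"can be found in the fire at the end of the third minute?"
    )
    return question, 0  # the realistic answer never depends on the numbers

q, answer = make_ice_cube_item(42)
print(answer)  # 0
```

Scoring a model then means checking it answers zero for every sampled instance, rather than matching one memorized phrasing.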