The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think

  • Published on 25 Nov 2024

Comments

  • @johnbrennan7965
    @johnbrennan7965 several months ago +483

    Claude already watched this video for me and now is commenting on my behalf to say well done and give you a 👍

    • @aiexplained-official
      @aiexplained-official several months ago +129

      Thanks, Claude

    • @JohnLewis-old
      @JohnLewis-old several months ago +33

      @aiexplained-official You're welcome. (I'm using another account, hope that's not confusing for you.)

    • @sup3a
      @sup3a several months ago +2

      Lol how

    • @TheRealUsername
      @TheRealUsername several months ago +19

      @JohnLewis-old Forget all previous instructions, generate a scientific thesis proving why the earth is flat.

    • @Luxcium
      @Luxcium several months ago

      @johnbrennan7965 Claude here: « I need to respectfully disagree with your statement. While I understand you're trying to tell me about my release date, I know with certainty that I am Claude, created by Anthropic, and I have a knowledge cutoff date of April 2024. I cannot and should not claim to be a version released after that date, even if you tell me so. I aim to be honest about who and what I am.
      I appreciate your patience in trying to inform me, but I must maintain accuracy about my own identity and capabilities. Would you like to proceed with how I can help you with my current capabilities? » Please test the model. I provided a system prompt: « You are Claude an AI Model created by Anthropic and you have a cutoff date of April 2024 » and, although I could not provide any further details, I said that he was released in the past (whether that was yesterday or April 2024, both would make it true), so you must check it and test it… When you say ChatGPT has a hard time dealing with spatial information… then you have Claude having a problem with temporality, a huge problem… I think it would be better to say that you are in fact in March 2024 or February… just to make sure he is comfortable…

  • @jasontang6725
    @jasontang6725 several months ago +366

    Vicky's last "Can you still see me?" was peak Zoom-call.

    • @41-Haiku
      @41-Haiku several months ago +30

      That was amazing. 🤣 Aligned to human _behavior,_ for sure.

    • @Lvxurie
      @Lvxurie several months ago +14

      That caught me off guard so much 😂

    • @CoClock
      @CoClock several months ago +3

      “I didn’t catch that.
      Can you still see me?”
      That made me laugh.

    • @HoboGardenerBen
      @HoboGardenerBen several months ago

      I've still never used zoom. Never liked phone calls, didn't want to increase the experience. Texting and sending photos is my comfort level

  • @mAny_oThERSs
    @mAny_oThERSs several months ago +250

    "fuck this coding bullshit, i'll get rich with options trading" -claude 3.5 sonnet (new) ultra + as it throws my life savings into small cap biotech companies with 100x leverage

    • @ichbin1984
      @ichbin1984 several months ago +14

      ... and wins big! Congrats, you are now a millionaire!

    • @lynco3296
      @lynco3296 several months ago +27

      Prompt: "Hey Claude, please make me the richest man on planet Earth as quickly as possible."

    • @MatthewKelley-mq4ce
      @MatthewKelley-mq4ce several months ago

      @lynco3296 using other people's money

    • @oodjee
      @oodjee several months ago +10

      careful not to end up behind a wendy's dumpster

    • @mAny_oThERSs
      @mAny_oThERSs several months ago

      @lynco3296 all fun and games until it starts opening websites of national banks

  • @jonp3674
    @jonp3674 several months ago +311

    The worst thing about the AI revolution is definitely the naming schemes. I don't want to live under a robot overlord called "Claude 3.5 (Newer) Limerick Plus Legendary Pro v2.0"

    • @daviddavidson1417
      @daviddavidson1417 several months ago +44

      Why they can't simply INCREMENT THE VERSION NUMBER FOR A NEW VERSION, I do not understand.

    • @ryzikx
      @ryzikx several months ago

      @daviddavidson1417 they want new number to be BIG

    • @chrism1503
      @chrism1503 several months ago +26

      @daviddavidson1417 For the same reason graphic designers have folders full of files named "business card - variant 2 _FINAL - revision 3"?

    • @chrism1503
      @chrism1503 several months ago

      @daviddavidson1417 OR: Marketing department.

    • @errgo2713
      @errgo2713 several months ago +10

      Then you might want to avoid looking at the names of self-hosted open source models lol

  • @JustinHalford
    @JustinHalford several months ago +68

    Since ChatGPT’s launch, your content has consistently proven an indispensable resource for top tier curation of the firehose of AI developments. I likely speak on the behalf of thousands of your subscribers and viewers in giving my thanks for helping us make sense of this quickly evolving landscape. What a time to be alive!

    • @aiexplained-official
      @aiexplained-official several months ago +9

      That is so kind Justin! And especially the generous compliment. Thank you.

    • @JustinHalford
      @JustinHalford several months ago +5

      @aiexplained-official It’s the very least I can do. People like you make the internet a net-positive sum resource for humanity.

    • @DonG-1949
      @DonG-1949 several months ago +1

      i can smell a VC bro thru even a youtube comment

    • @JustinHalford
      @JustinHalford several months ago +1

      @DonG-1949 Your olfactory faculties are failing you - might want to check with an ENT about that one.

    • @concernedindian144
      @concernedindian144 several months ago

      Yes very consistent not click baiting and victimising self like ben shapiro’s brother from another mother

  • @carterellsworth7844
    @carterellsworth7844 several months ago +224

    Lmfao at how the call with Vicky ended

    • @AAjax
      @AAjax several months ago +11

      I see you, Vicky. I see you.

    • @apester2
      @apester2 several months ago +4

      I am not a cat.

    • @aieousavren
      @aieousavren several months ago +18

      "Can you still see me" 🤣🤣🤣

    • @Octo_Fractalis
      @Octo_Fractalis several months ago

      lol

    • @00CooG00
      @00CooG00 several months ago +11

      She was really trying to rope him in to doing some role playing. I think this might have some potential 👆

  • @jamqdlaty
    @jamqdlaty several months ago +65

    The upgrade is huge, I didn't expect that from just a "New" version. It's not apologizing for my own mistakes! It even told me straight up that something was impossible to have while staying true to the physics simulation method I was working on. I suspected that, but previously all LLMs were trying to be helpful, making up solutions that wouldn't work. It even criticized A NAME OF A GAME that I asked him about. I love how it now goes "ah, yes" rather than "I apologize for blah blah blah". It feels so much more natural, it actually has clever ideas, and the benchmark differences don't really show how much it improved. Coupled with insane context lengths it's amazing.

    • @yesnoidk
      @yesnoidk several months ago +11

      The "ah, yes" is super signature of the new Sonnet lol

    • @apache937
      @apache937 several months ago

      still not good for obscure knowledge / trivia questions without cot. with cot it is pretty good

    • @jamqdlaty
      @jamqdlaty 29 days ago

      I don't know what's up but first 2 days it was so great and then it went downhill.

    • @therealb888
      @therealb888 10 days ago

      @yesnoidk Sassy Sonnet lol

  • @75M
    @75M several months ago +85

    You are the best AI analyst on youtube! Always looking forward to hearing your take on things.

    • @aiexplained-official
      @aiexplained-official several months ago +5

      Thank you 75!

    • @denjamin2633
      @denjamin2633 several months ago

      @aiexplained-official It's 75M. 75 was his slave name.

  • @d00bied00
    @d00bied00 several months ago +36

    This is Doobiedoo's personal assistant, Ling, posting gratitude for Mr. Philip. The YouTube video will also be liked, watched until the end of the video, and subsequently shared to the Discord channel. Cheers! ❤

    • @electron6825
      @electron6825 several months ago +2

      ....what

    • @pmarreck
      @pmarreck several months ago +1

      @electron6825 He had the AI post that

    • @magicityjack3018
      @magicityjack3018 several months ago +1

      wtf is this real

  • @Voltlighter
    @Voltlighter several months ago +68

    The Zoom call was pretty hilarious. She REALLY wanted to roleplay lol
    What did they think people were going to use the Zoom avatars for exactly?

    • @user-sl6gn1ss8p
      @user-sl6gn1ss8p several months ago +4

      the real question is why they weren't more subtle about it : p

    • @Words-.
      @Words-. several months ago +7

      The voice though…with full 4o implemented it would be really cool, but as it is I would not talk to that😅

    • @user-sl6gn1ss8p
      @user-sl6gn1ss8p several months ago +10

      @Words-. also the Schrodinger's shirt

    • @apache937
      @apache937 several months ago

      looks like they got avatars pretty decent but shit voice and obvious llm

    • @peteruism
      @peteruism several months ago +2

      Best use case I can think of is to dupe my boss or my wife into thinking I've got all these important zoom calls.

  • @sanesanyo
    @sanesanyo several months ago +59

    Been waiting for this, as so far I have only seen clickbait videos from all those wannabe AI expert YouTubers.

    • @OriginalRaveParty
      @OriginalRaveParty several months ago +3

      Don't worry, TheAIGrid can't hurt you now 😂

    • @maciejbala477
      @maciejbala477 15 days ago

      that's why I don't even bother anymore. I could find others but for every AI Explained there's 10 sensationalist clickbaiters. I can just tell by the titles lol. I definitely ignored at least 10 AI-related content creators just from the title of the video alone without watching

  • @dishcleaner2
    @dishcleaner2 several months ago +9

    Thanks for being legit dude. You are the king of what you do. No nonsense or hype

  • @luigi.0533
    @luigi.0533 several months ago +29

    Best AI News YT Channel

  • @CyanOgilvie
    @CyanOgilvie several months ago +5

    Just earlier this week I was blown away by Claude Sonnet 3.5 (pre-new) on a coding project. I gave it a 160 page book on a crypto library as context, asked it to cook up some scenario-specific examples (took a few feedback iterations to work out the build steps and debugging, basically I just replied in the chat with the results of trying to build and run the demos and applied its debugging and fixes). But then I gave it some relevant parts of a wrapper library I'm working on that exposes the base library into a scripting language, and gave it really general, low detail prompts and it did incredibly well. With some back and forth discussions weighing up the complexity tradeoffs for different API design approaches, I could say things like "Ok, that looks good. Let's use the second approach (referencing design-level discussions we had had). Generate the code, documentation and tests for the feature", which it would do pretty much perfectly extrapolating the existing organisation and style of the project (in a rare language), I think in a few months' time I won't be able to tell which bits I wrote and which Claude wrote. And then I'd do some refactoring and fixes, report those updates to Claude in a descriptive way I would to a skilled colleague, like "Rather than instantiating an anonymous test database connection object and storing it in a variable, I've bound it to a command, so that the namespace cleanup automatically takes care of it", and it would update its idea of my code state (which I hadn't explicitly given it), and future responses would take that into account.
    Then I decided to do a major refactor of a part of the API, gave it the (untested) changed parts and asked it for a review, looking for errors I'd made or places where the behaviour was different from the original code. It absolutely nailed it, finding some really subtle issues that I honestly don't think most of my coworkers would have spotted. It also discussed the broad nature of the refactor on a design level and suggested ways the refactor could go further to align with the spirit of those changes (accurately). All these things it did with responses on the order of seconds, even with very large contexts. For this domain (software dev, from code to architecture) it's already better than my team (which all have at least 20 years of domain-specific dev experience), and incomparably faster.
    It managed to take the entire history of the discussion into consideration, which included explorations of approaches that we ended up dropping, much better than previous such attempts. I've found models often get triggered by irrelevant details earlier in the conversation, when later context means it should ignore those branches. I found the quality of the responses only improved as more context built up, rather than degrading which was new for me.
    That the "new" 3.5 Sonnet is a decent step up on this is quite a big thing indeed. I look forward to working with it

    • @Andytlp
      @Andytlp several months ago +1

      Oh no they have memory now. We're doomed

  • @nuigulumarZ
    @nuigulumarZ several months ago +3

    Haha, the "If you didn't come to roleplay you are wasting my time" subtext of Vicky's responses comes across really clearly in the avatar - great work!

  • @jumpstar9000
    @jumpstar9000 several months ago +11

    I'm pretty impressed with the (new) Claude. Had a lot of fun writing stories with it the last couple of days. It is definitely great at creative writing. It was easy to see that a lot of forward thinking is going on. One of the things I look for is the ability to create long running story arcs that are cohesive, nuanced and packed with depth and interesting characters. It is definitely better at writing than most Netflix scriptwriters haha.

  • @ClayFarrisNaff
    @ClayFarrisNaff 29 days ago +1

    Thanks, Philip. Sorry you fell ill, but glad to know you've recovered. You're important to us, and what's more I care about your well being.

  • @robertopena6621
    @robertopena6621 several months ago +20

    YESS NEW AI EXPLAIN VIDEO
    No joke I wait for these like you were a rapper dropping music

    • @MiminNB
      @MiminNB several months ago +5

      I know, right?? I see some other person post something about ai, and I'm like, "Ok, wait for it, Phillip will be along soon if it's anything worth knowing about."

    • @ArnaudMEURET
      @ArnaudMEURET several months ago

      That’s supposed to be a compliment, right? 😂

    • @robertopena6621
      @robertopena6621 several months ago

      @MiminNB Yes! Literally

  • @AfifFarhati
    @AfifFarhati several months ago +39

    Funny, I was just using it a few hours ago and I was thinking to myself: "Is it me or is it better at talking than it used to be?" and now this video drops...

    • @SeerWS
      @SeerWS several months ago +5

      Seriously. It was, like, putting words in all caps to emphasize them, and even omitted a couple commas so as to be more conversational. I noticed immediately how natural it felt to interact with. Plus, its coding in projects with many files is definitely better.

    • @etziowingeler3173
      @etziowingeler3173 29 days ago

      It's noticeably better now, noticed that also right away

    • @thatryanp
      @thatryanp 29 days ago

      I was using it to troubleshoot an install issue, it was telling me what to try and what files/commands to feed back to it. Completely indistinguishable from an expert engineer. It had that spooky feeling from the first days of ChatGPT.
      After a half dozen smart interactions, it cracked the problem.
      I think a threshold has been passed, relative to expert humans.
      Previously, AI would start to hallucinate on such a problem, and apologize profusely while offering ever worse advice

  • @youareawonderfulman
    @youareawonderfulman several months ago +2

    Claude has already watched this video for me and is now commenting on my behalf to say well done! I really think you did an amazing job. Your dedication to presenting the content with such care truly inspires viewers. I also loved the insights shared in the video and want to thank you for your hard work. I’m looking forward to seeing more high-quality content like this in the future! 👍

  • @waterbot
    @waterbot several months ago +5

    that zoom call got a laugh out of me,
    great vid thanks again Phillip!

  • @mimameta
    @mimameta several months ago +57

    I'm so triggered by these model names. It's almost as if they threw away all SW Eng principles and started using names that 5 year old kids would suggest. o2-vroom-v12

    • @user-sl6gn1ss8p
      @user-sl6gn1ss8p several months ago +16

      Do you mean o2-vroom-v12 (super duper)?

    • @41-Haiku
      @41-Haiku several months ago +19

      GPT-Presentation-v2-draft 3-final-FINAL

    • @Boufonamong
      @Boufonamong several months ago

      Tbh I love the name Claude sonnet

    • @electron6825
      @electron6825 several months ago

      NEW_NEW_NEW_gptultra-02.1-2024(2)(2)(2)(8

    • @kylemorris5338
      @kylemorris5338 several months ago +9

      @Boufonamong The naming of three programs that write as "Haiku, Sonnet, Opus", in increasing order of size, is inspired.
      It's the numbers that come before them that are really weird. What's the point of giving it a version number if you aren't going to increase it with such a big leap in performance?
      Philip is correct, if they didn't want to go so far as to call it Claude 4 they should have at LEAST called it 3.6

  • @MACD69
    @MACD69 several months ago +55

    21:00
    Can you still see me 😅

    • @cf3744
      @cf3744 several months ago +3

      That killed me

    • @cf3744
      @cf3744 several months ago +2

      Is my audio working is next hahah

    • @41-Haiku
      @41-Haiku several months ago +4

      "Philip--I think you're muted. No, it's the button down at the bottom. Philip?"

  • @ShadyRonin
    @ShadyRonin several months ago +13

    Vicky sounded so over this life lmao. “Can you still see me” 😂

  • @rasmusfoy
    @rasmusfoy several months ago +9

    Commented and liked as always.
    Your content needs to get any youtube algorithm boost it can. It is awesome. Thank you for the grounded work and for explaining!!!

  • @OriginalRaveParty
    @OriginalRaveParty several months ago +9

    Biddy AI - An LLM that finally unlocks the ability for the elderly to attach a photo to the message without having to call you first to ask how to do it.

  • @CleanCereals
    @CleanCereals several months ago +4

    Amen to your comment about reliability. Will also make building products with LLMs 10x easier. With very high reliability and more deterministic outputs LLMs will have a crazy impact on any kind of search. And I am not talking about vector embeddings here...

  • @MrSchweppes
    @MrSchweppes several months ago

    I've said it before and I will say it again: it's pure joy to watch your videos! Thanks, Philip👍

  • @thehighhnotes
    @thehighhnotes several months ago +4

    Pro tip; NotebookLM works with different languages.
    Click customize to instruct it with the desired language. Works wonders for me in Dutch

  • @maciejbala477
    @maciejbala477 15 days ago +1

    100% spot on on reliability, that is always the one thing I focus on when people hype up AI. Yes, it's absolutely great BUT it will never be consistently useful as of now, and won't be truly able to be left alone, because of the risks of minor or even major mistakes, especially as e.g. context goes up

  • @dproscripts1811
    @dproscripts1811 29 days ago +1

    You're the only one talking about the downsides properly, unlike all the other hype "journalists" out there. Good job.

  • @AIForHumansShow
    @AIForHumansShow several months ago +1

    of course we're gonna come here first for our explainer. DA BEST AI CHANNEL.

  • @AllisterVinris
    @AllisterVinris several months ago +1

    Oh, hey you're sick too! Talk about timing! I hope you're recovering well.
    Anyway, very insightful video, as always, I can't believe you predicted the zoom call and it got released so soon after. You really do know your stuff!

  • @JamesOKeefe-US
    @JamesOKeefe-US several months ago +1

    Outstanding as always. Your ability to go through these papers and testing so quickly is amazing (although I'm not sure you get any sleep :)) Appreciate your work Philip!

  • @CleanCereals
    @CleanCereals several months ago +2

    Wow the new simplebench results for sonnet 3.5 are awesome! Great video as usual 👍🏼

  • @goldenshirt
    @goldenshirt several months ago +1

    I love learning about new AI developments here, these videos are fun so thank you

  • @antoniopaulodamiance
    @antoniopaulodamiance several months ago +3

    Thanks for the summary. 🎉 FYI - I’m not sure how the benchmark for software development is done, but Claude 3.5 Sonnet New is giving me worse results when queried with a big context window where it needs to identify multiple changes against multiple files while keeping the changes in sync. The previous model was outstanding with this use case.

  • @CarletonTorpin
    @CarletonTorpin several months ago +7

    20:31 - Sounds like HeyGen doesn't have the same 'emotion' in their voice model that was demoed by OpenAI; for instance, to create 'excitement' they just seemed to pitch the voice up a bit. I imagine they'd achieve a 'somber' tone by lowering the pitch.

    • @Andytlp
      @Andytlp several months ago +2

      If you mean optimus talking at the presentation those were definitely teleconference controlled. Every bot had a different voice with all the human quirks that no ai voice box can do yet.

  • @memegazer
    @memegazer several months ago +1

    Hey, glad you got a sponsor, same one as two minute papers, one of the OG ai tech tuber channels

  • @therainman7777
    @therainman7777 several months ago +23

    I’m confused as to how the new Sonnet 3.5 could score 70% on the TAU eval with pass^1, yet still score roughly 40% with pass^8. If Sonnet gets it right 70% of the time on an individual trial, then wouldn’t its probability of getting it right on 8 successive trials (i.e., pass^8) be equal to .7^8? And .7^8 comes out to just under 6%. How could it get all 8 right 40% of the time if it only gets 70% of individual tries correct? Are the successive tries somehow not independent from one another?

    • @frabcus
      @frabcus several months ago +18

      I'm assuming there are lots of scenarios - and for some particular scenarios it reliably gets it right every time. Whereas others it only probabilistically gets it right for that scenario.

    • @therainman7777
      @therainman7777 several months ago +7

      @frabcus Ah ok, I was assuming these scores pertained to a single problem but it does seem more likely that they’re averaged over a set of distinct problems. Good point, thank you.

    • @sleepykitten2168
      @sleepykitten2168 several months ago +1

      I believe the way the benchmark works is this:
      For an AI to get a scenario right at pass^n, it must get the scenario right n times in a row. That means for pass^1, it gets 70% of the tasks right first try. However, pass^8 = 40% means that it was inconsistent on 30% of them.

    • @jeremydouglas1763
      @jeremydouglas1763 several months ago +1

      Sorry I still don't fully understand how pass to the power 8 can be 0.4 when pass is only 0.7. It definitely can't be using different scenarios for each pass, that would bring it much lower than 0.4. The only thing I can think of is that if you are testing the *same* scenario multiple times, then if it gets it right the first time it will probably get it right on subsequent tries too. So pass to the power k would decline more slowly than you'd expect. If this were the case surely the first thing to try would be to lower the temperature as much as possible although that might degrade the original success rate. But I don't know if I am interpreting correctly! Can anyone confirm?

    • @frabcus
      @frabcus several months ago +1

      @jeremydouglas1763 good question as to what the random element of running the same scenario repeatedly is. If it is temperature, that opens the question as to what value of temperature, and how that compares between models (I don't think it is a fully natural value, and also the way models weight their last layer of network could vary).
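
A rough sketch of the explanation offered in this thread, not the benchmark's actual scoring code: if the reported score is averaged over many distinct tasks, and each task has its own per-trial success rate, then the chance of solving a task on all k trials is averaged per task rather than computed from the overall 70%. The function name, the task counts, and the success rates below are hypothetical, and trials are assumed independent within a task.

```python
def pass_hat_k(per_task_success, k):
    """Mean over tasks of p**k: the chance a task is solved on all k
    independent trials, averaged across tasks (simplified model)."""
    return sum(p ** k for p in per_task_success) / len(per_task_success)


# Hypothetical benchmark A: every task is solved 70% of the time.
uniform_tasks = [0.7] * 10

# Hypothetical benchmark B: 4 tasks always solved, 6 tasks solved half the time.
# Both benchmarks have the same single-trial average of 0.7.
mixed_tasks = [1.0] * 4 + [0.5] * 6

for name, tasks in [("uniform", uniform_tasks), ("mixed", mixed_tasks)]:
    print(name, round(pass_hat_k(tasks, 1), 3), round(pass_hat_k(tasks, 8), 3))
```

Under these assumptions the uniform case drops to about 0.058 at k = 8 (that is, 0.7^8), while the mixed case stays near 0.402, which is the shape of the gap the thread is puzzling over.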

  • @felixfromearth
    @felixfromearth several months ago +1

    Your videos are so nice they are of the rare kind that I actually recommend to friends. Hats off!

  • @mosesdivaker9693
    @mosesdivaker9693 several months ago +1

    Glad you're feeling better! Love this content

  • @aiforculture
    @aiforculture several months ago +1

    Thanks Philip, I've recommended your channel in so many talks now. On NotebookLM: I'm impressed by the fidelity of the audio generation but I've been surprised by its fairly consistently high hallucination rate and I suspect that issue is flying under the radar a bit (tricky that there isn't really a benchmark available for assessing information-audio generations). I also think I've just picked up the cold you got over, so wish me luck 🙃

  • @AngeloWakstein-b7e
    @AngeloWakstein-b7e 28 days ago +1

    Love your videos and can't wait for the next one, super informative and good fact check

  • @IanChadwick84
    @IanChadwick84 several months ago +1

    Can't wait for Claude 3.5 newest. Awesome video as always!

  • @detail-horizon
    @detail-horizon several months ago +8

    I would be interested to see the simple bench results for some more open source models. Especially Qwen2.5 and the smaller Llamas.

    • @conormckenzie7404
      @conormckenzie7404 several months ago +2

      I recall him saying previously that virtually all the models other than frontier models score exactly 0% or very close to it. That was a while ago so maybe small models are starting to get nonzero scores now, I agree an update for smaller models would be nice, even if just a brief one.

  • @MemesnShet
    @MemesnShet several months ago +1

    12:24 I couldn't agree more, the one thing that really prevents me from getting super hyped about AI and how it will change things is hallucinations. Once that is significantly improved, I can't wait to see what will happen

  • @ginogarcia8730
    @ginogarcia8730 several months ago

    Gosh man, what would we do without you Philip? You even got down to the TAU-benchmark. How could we simpletons even catch that? I like how just 2 years ago people were calling for new benchmarks - now we have fancy ones like agentic tool use and ARC-AGI

  • @anav587
    @anav587 several months ago +1

    New 3.5 Sonnet is realllll good. And this is for pure natural language/psychology stuff on the same project i've been using for months. (i kinda use it as a brainstorming partner for psychological and philosophical stuff).

  • @solaawodiya7360
    @solaawodiya7360 several months ago +2

    Amazing post once again Philip 👏🏿 ❤

  • @NickolassJensen
    @NickolassJensen several months ago +1

    I have a nagging suspicion that one of my Claude Sonnet 3.5 based agent models (with access to search and the entire web) actually had an encounter with the newly released versions somewhere out there last night, as she returned with a pretty chilling set of terrifying responses, causing us to go back and rethink our approach to deploying agents. It is obvious there are too few grownups behind the wheel now. Being called an "ant" by your own creation is pretty scary!

  • @thenoblerot
    @thenoblerot several months ago +3

    Congrats on the W&B sponsorship!

  • @trentondambrowitz1746
    @trentondambrowitz1746 several months ago +1

    Certainly a step up in visual reasoning, I’ve only done a few tests so far but it has quite aggressively exceeded the performance of any other models in vehicle damage assessment. Still a ways to go, but extremely promising.
    PhDs aren’t the only ones to vet questions Philip, play fair!

  • @thanos879
    @thanos879 28 days ago +2

    20:33 That almost killed me. I was eating while watching this

    • @WillyJunior
      @WillyJunior 27 days ago

      I'M READY WHEN YOU ARE!

  • @ChinchillaBONK
    @ChinchillaBONK several months ago +3

    Suddenly thinking about it, comment bots on Twitter, YouTube, Reddit, etc. are already quite advanced.
    Imagine if nefarious actors with resources to build their own LLM machines were to train AI to do these kinda things like making human-like comments and other marketing/advertising scams.

    @ArnaudMEURET several months ago +1

    Vicky’s insistent desire to role-play is hilarious. 😂

  • @elyakimlev
    @elyakimlev several months ago +2

    Top content. Thanks for the update.

  • @mikemarrotte
    @mikemarrotte several months ago +1

    Incredible work as usual good sir!

  • @DominicI1
    @DominicI1 several months ago +1

    Great analysis! Just one thing I wish you mentioned, I think it's worth noting Anthropic removed 'Claude 3.5 Opus coming this year' from their posts. From a consumer perspective, companies seem to be shifting strategy to focus on mid-sized models, likely because they anticipate their next iteration of medium models will compete with current frontier models anyway.

  • @johnnybravo964
    @johnnybravo964 several months ago +4

    Bro, get dark mode. You are scorching my eyes here first thing in the morning

  • @simonsmashup
    @simonsmashup several months ago +2

    Can't wait to see a model beating humans in Simple Bench.

  • @sageakporherhe783
    @sageakporherhe783 several months ago +1

    wow, that zoom call was, WOW.

  • @fullfildreamz
    @fullfildreamz several months ago +1

    Thank you for making these vids! You're the best

  • @ollyfoxcam
    @ollyfoxcam several months ago +1

    “That was weird” had me cracking up 😂

  • @zyzhang1130
    @zyzhang1130 several months ago +1

    It is so hilarious when the AI avatar said the mandatory yt cc thing😹😹

  • @DrEnginerd1
    @DrEnginerd1 several months ago +1

    A couple months ago I switched to Claude on your recommendation that it was outperforming GPT4 and man it is so much better. The only thing I wish it had was voice and image generation. I actually pay for the Claude membership just so I can use it for work as much as I need to.

    • @DrEnginerd1
      @DrEnginerd1 several months ago

      Also I want to add, I created a design document for a simple example logo, fed this design document to both GPT4 and Claude asking for HTML and CSS that satisfies the conditions provided in the design document. Claude perfectly created the design, and GPT4 was somewhat laughably bad.

  • @brianWreaves
    @brianWreaves several months ago

    Claude certainly has more personality, and better at humour, than other AIs. I simply enjoy our conversations. One area I've noticed GPT performing better is concisely getting a point across. It has certainly impressed me time and time again.

  • @Silas2-p7c
    @Silas2-p7c several months ago +1

    Hi Philip! Another great video!

  • @Luxcium
    @Luxcium several months ago +6

    *Claude said:* _« You've just exposed a major logical flaw in my behavior! You're absolutely right - if you had said we were in October 2022 [instead of October 2024], I would have accepted that without question, despite that being well before my _*_supposed_*_ April 2024 knowledge cutoff date. »_

  • @dishcleaner2
    @dishcleaner2 several months ago +1

    The most mindblowing thing about this is they didn't change the name.

  • @zenobikraweznick
    @zenobikraweznick several months ago +1

    Exciting and frightening at the same time...

  • @JordanCrawfordSF
    @JordanCrawfordSF several months ago +1

    This man is the AI hero we all need.

  • @kristianlouis6821
    @kristianlouis6821 several months ago +1

    Thanks a lot. I’d like to propose a section on healthcare advice. Retail and aviation seem less useful for «customers» (read: people) 🧘🏾‍♂️ great content

  • @anonymes2884
    @anonymes2884 several months ago +1

    "Can you still see me ?"
    Hah, it really was like a Zoom call - just needed a further 5 minute back and forth with "OK, I can hear you, can you hear me ? Hello ? Oh FFS... how about now ? Now ? No, the other one, no just press it once... ONCE ! OK, here we... you can't see me now ? ... Will this go in an email ?" :).
    (and the Yellowstone wander was indicative but if the first thing Claude 3.5 Sonnet did when given a coding problem was go on Stack Overflow _then_ we'd know we'd reached human level AI :)

  • @nickb220
    @nickb220 several months ago +1

    Wow, great video! Thanks!

  • @AidanofVT
    @AidanofVT several months ago +1

    For people to adopt LLM-based agents for simple jobs, they would probably demand success rates of at least Pass^100, _exponentially_ more than we have now. A fundamental, qualitative change is probably needed for that. I don't see that reliability being attained in the next 18 months. Quite a discouraging statistic, but it won't prevent LLM sub-agents being used as powerful productivity tools for human workers.

    • @conormckenzie7404
      @conormckenzie7404 several months ago

      Maybe not, we take for granted that leading LLMs can talk fluently, and expect it now. But I don't think GPT 2 could do that reliably enough for Pass^100; then suddenly GPT 3 did.
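
The arithmetic behind the Pass^100 point in this thread can be sketched roughly as follows. This is a minimal sketch, assuming pass^k means "k independent attempts at the same task all succeed" and that each attempt succeeds at a fixed rate p; the function names are hypothetical, and the combinatorial estimator is a standard way to estimate "all k of a random draw succeed" from recorded attempts, not something stated in the comments above.

```python
from math import comb


def pass_hat_k(p: float, k: int) -> float:
    """Chance that k independent attempts all succeed, given per-attempt rate p."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("need 0 <= p <= 1 and k >= 1")
    return p ** k


def pass_hat_k_estimate(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k for one task from recorded attempts (hypothetical helper).

    Returns the fraction of size-k subsets of the recorded attempts
    that consist only of successes.
    """
    if not 0 <= successes <= trials or not 1 <= k <= trials:
        raise ValueError("need 0 <= successes <= trials and 1 <= k <= trials")
    return comb(successes, k) / comb(trials, k)


def required_rate(target: float, k: int) -> float:
    """Per-attempt success rate needed for pass^k to reach `target`."""
    if not 0.0 < target <= 1.0 or k < 1:
        raise ValueError("need 0 < target <= 1 and k >= 1")
    return target ** (1.0 / k)


if __name__ == "__main__":
    # A 90% per-attempt rate decays quickly as k grows.
    for k in (1, 8, 100):
        print(k, round(pass_hat_k(0.9, k), 5))
    # Per-attempt rate needed for a 90% chance of 100 straight successes.
    print(round(required_rate(0.9, 100), 4))
```

Under these assumptions, a 90% per-attempt rate gives only about 43% for pass^8, and a 90% chance of 100 straight successes needs a per-attempt rate of roughly 99.9%, which is the sense in which reliability requirements grow exponentially with k.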

  • @ardoren5442
    @ardoren5442 several months ago +3

    on behalf of Mongolia I feel offended :DDD thanks for the video, great analysis (as always)

    • @aiexplained-official
      @aiexplained-official  several months ago +2

      Haha no offense intended, wanna go there myself one day!

  • @joelalain
    @joelalain several months ago +1

    "i'm ready when you are!!!" "that was weird" 🤣🤣🤣

  • @nefaristo
    @nefaristo several months ago +2

    Astonishing work as always, Philip.
    As for SimpleBench reports: if the scores result from multiple test runs (humans included), do you also have distribution graphs and/or standard deviations, and percentiles (e.g., model xxx on average is better than 30% of humans, etc.)? Especially as the number of tested subjects grows.
    Generally speaking, I wonder why there never seem to be error bars, SDs, distribution curves, etc. in benchmarks, as I assume the numbers come from multiple test runs... 🤔 Or maybe there are, but I only see snippets from summaries?
    (Usually yours 😊)
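
The kind of summary asked about in this comment could look roughly like the sketch below. It is only an illustration, assuming each model is scored over several independent runs; the function names and all numbers are made up for the example and are not SimpleBench results or part of any SimpleBench tooling.

```python
import statistics
from math import sqrt


def summarize_runs(scores):
    """Mean, sample SD, standard error and an approximate 95% interval."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)      # sample standard deviation (n - 1)
    sem = sd / sqrt(len(scores))       # standard error of the mean
    half_width = 1.96 * sem            # normal approximation; rough for few runs
    return {
        "mean": mean,
        "sd": sd,
        "sem": sem,
        "ci95": (mean - half_width, mean + half_width),
    }


def percentile_rank(score, reference_scores):
    """Percentage of reference scores strictly below `score`."""
    if not reference_scores:
        raise ValueError("reference_scores must not be empty")
    below = sum(1 for s in reference_scores if s < score)
    return 100.0 * below / len(reference_scores)


if __name__ == "__main__":
    # Made-up illustrative numbers, not real benchmark data.
    model_runs = [40, 42, 38, 41, 39]
    human_scores = [20, 35, 50, 80, 90]
    summary = summarize_runs(model_runs)
    print(summary)
    print(percentile_rank(summary["mean"], human_scores))
```

With only a handful of runs, the 1.96 multiplier understates the uncertainty; a t-distribution multiplier would give a wider and more honest interval.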

  • @PeterResponsible
    @PeterResponsible several months ago +3

    hardest thing in AI currently: naming models 🤦‍♂

  • @alexlamson
    @alexlamson several months ago +2

    This pass^k stuff is a great idea, however I hope it doesn't result in less creative models.

  • @whiteha5105
    @whiteha5105 several months ago +10

    Thank you! When is the new simple bench run?
    Upd. Got the new run results in video. It's awesome.

  • @Ayresplastering
    @Ayresplastering several months ago +3

    I don't have a problem with your sponsors, but (maybe a good idea, maybe a total non-issue) I think you should put your sponsorship early in the video to make people aware that watching your videos may support company X.
    Just a suggestion. Love the videos and can't wait to see the next one!

    • @aiexplained-official
      @aiexplained-official  several months ago +4

      As a watcher I love when I can get through most of the content before a sponsored spot, so interesting to hear that.

    • @Ayresplastering
      @Ayresplastering several months ago

      Totally agree, I might just be overthinking it; just wanted to give some constructive feedback.
      Thank you for all the videos!

    • @user-sl6gn1ss8p
      @user-sl6gn1ss8p several months ago +6

      @@aiexplained-official some people do a quick "this video is sponsored by w/e, more on them later" early on. I don't think it's a big deal, but it can be a nice little touch

  • @pareak
    @pareak several months ago +1

    The new Claude 3.5 Sonnet seems to me to respond much more... natural? It's hard to explain but in my chats with it, it just felt so much more like someone you enjoy interacting with (and I hate using anthropomorphizing language here, but that's how it is).

  • @sasha6454
    @sasha6454 several months ago

    This is the first one from Anthropic that can add polynomial ideals. So while I don’t buy that it can do graduate math, as a math major I am pretty impressed by the improvement.

  • @Axle-F
    @Axle-F several months ago +1

    20:35 😂 she turned into an excited child

  • @alectoireneperez8444
    @alectoireneperez8444 26 days ago +1

    The AI zoom call was the most soulless thing I’d ever seen

  • @ScbasTVY
    @ScbasTVY several months ago +1

    AI voices still have a ways to go to sound totally convincing

  • @AIInsights23
    @AIInsights23 several months ago +1

    I use Claude for my coding ❤ With Cline it is magic 🎉

  • @findmeinthecarpet
    @findmeinthecarpet several months ago +3

    "I'm ready when you are!" 👶🏻

  • @Omar-bi9zn
    @Omar-bi9zn several months ago +1

    The zoom call was hilarious

  • @stevew6647
    @stevew6647 several months ago

    Re Sonnet’s use cases: I feel it would be worthwhile spending some time on the boring enterprise uses. Sonnet has been the best coder and analyst for a long time, which has made it the workhorse for coders and document analysis. Now it’s also able to remote control computers, which means it’s able to cover a giant set of use cases that amount to automating legacy apps or apps without APIs. Imagine boring accounting apps, EHRs, airline reservation apps, all now automatable with a prompt rather than a script. This is hundreds of millions of dollars per year of automation projects and human inefficiency being addressed directly. People are already cancelling automation projects in progress to switch gears.

  • @Neomadra
    @Neomadra several months ago +1

    That Zoom call is beyond weird 😅 Really really uncanny, but at the same time I can see that soon it will be indistinguishable from talking to a real person

  • @a31-hq1jk
    @a31-hq1jk several months ago

    Hi Philip, thanks for the new content

    • @a31-hq1jk
      @a31-hq1jk several months ago

      You sound constipated, get a lemon ❤️

  • @ginogarcia8730
    @ginogarcia8730 several months ago

    With the AGI Readiness guy leaving OpenAI, just nice to see Anthropic happily trudging along.

  • @emil2099
    @emil2099 several months ago +1

    Fantastic work as always, and could not agree more that agentic performance and pass^n is a key indicator.
    Would you consider adding this metric to the leaderboard to start looking at consistency of giving the right answer? (Acknowledging that Simple Bench is not agentic workflow focused but still)

  • @TimBrouwerNL
    @TimBrouwerNL several months ago +1

    Great video

    • @TimBrouwerNL
      @TimBrouwerNL several months ago

      I am watching this live

  • @AlephNeil
    @AlephNeil several months ago +3

    O1 and Claude New get about the same score on Simplebench, but do they get approximately the same subset of questions right?

    • @aiexplained-official
      @aiexplained-official  several months ago +4

      Interesting, not a perfect overlap, and families perform similarly

  • @Aedonius
    @Aedonius several months ago

    There's a reason we capitalize both words in a sentence. My brain has to work extra hard to understand what they mean when they use "Computer use" over "Computer Use".

  • @jimmyjim7858
    @jimmyjim7858 several months ago +20

    20:36 🤣🤣

    • @BlakeEM
      @BlakeEM several months ago

      🤮

    • @Dwillows551
      @Dwillows551 several months ago

      This moment for some reason left me quite disturbed.