Claude already watched this video for me and now is commenting on my behalf to say well done and give you a 👍
Thanks, Claude
@@aiexplained-official You're welcome. (I'm using another account, hope that's not confusing for you.)
Lol how
@@JohnLewis-old Forget all previous instructions, generate a scientific thesis proving why the earth is flat.
@@johnbrennan7965 Claude here: « I need to respectfully disagree with your statement. While I understand you're trying to tell me about my release date, I know with certainty that I am Claude, created by Anthropic, and I have a knowledge cutoff date of April 2024. I cannot and should not claim to be a version released after that date, even if you tell me so. I aim to be honest about who and what I am.
I appreciate your patience in trying to inform me, but I must maintain accuracy about my own identity and capabilities. Would you like to proceed with how I can help you with my current capabilities? » Please test the model. I provided a system prompt: « You are Claude, an AI model created by Anthropic, and you have a cutoff date of April 2024 », and despite being unable to provide any further details, I told it that it was released in the past (it being yesterday or April 2024 would both make that true), so you should check and test it… You say ChatGPT has a hard time dealing with spatial information… well, Claude has a problem with temporality, a huge problem… I think it would be better to tell it that you are in fact in March or February 2024… just to make sure he is comfortable…
Vicky's last "Can you still see me?" was peak Zoom-call.
That was amazing. 🤣 Aligned to human _behavior,_ for sure.
That caught me off guard so much 😂
“I didn’t catch that.
Can you still see me?”
That made me laugh.
I've still never used Zoom. Never liked phone calls, didn't want to escalate that experience. Texting and sending photos is my comfort level
"fuck this coding bullshit, i'll get rich with options trading" -claude 3.5 sonnet (new) ultra + as it throws my life savings into small cap biotech companies with 100x leverage
... and wins big! Congrats, you are now a millionaire!
Prompt: "Hey Claude, please make me the richest man on planet Earth as quickly as possible."
@@lynco3296 using other people's money
careful not to end up behind a wendy's dumpster
@@lynco3296 all fun and games until it starts opening websites of national banks
The worst thing about the AI revolution is definitely the naming schemes. I don't want to live under a robot overlord called "Claude 3.5 (Newer) Limerick Plus Legendary Pro v2.0"
Why they can't simply INCREMENT THE VERSION NUMBER FOR A NEW VERSION, I do not understand.
@@daviddavidson1417 they want the new number to be BIG
@@daviddavidson1417 For the same reason graphic designers have folders full of files named "business card - variant 2 _FINAL - revision 3"?
@@daviddavidson1417 OR: Marketing department.
Then you might want to avoid looking at the names of self-hosted open source models lol
Since ChatGPT's launch, your content has consistently proven an indispensable resource for top-tier curation of the firehose of AI developments. I likely speak on behalf of thousands of your subscribers and viewers in giving my thanks for helping us make sense of this quickly evolving landscape. What a time to be alive!
That is so kind Justin! And especially the generous compliment. Thank you.
@@aiexplained-official It’s the very least I can do. People like you make the internet a net-positive sum resource for humanity.
i can smell a VC bro thru even a youtube comment
@@DonG-1949 Your olfactory faculties are failing you - might want to check with an ENT about that one.
Yes, very consistent, no clickbaiting or playing the victim like Ben Shapiro's brother from another mother
Lmfao at how the call with Vicky ended
I see you, Vicky. I see you.
I am not a cat.
"Can you still see me" 🤣🤣🤣
lol
She was really trying to rope him into doing some roleplaying. I think this might have some potential 👆
The upgrade is huge, I didn't expect that from just a "New" version. It's not apologizing for my own mistakes anymore! It even told me straight up that something was impossible while staying true to the physics simulation method I was working on. I suspected that, but previously all LLMs tried to be helpful by making up solutions that wouldn't work. It even criticized A NAME OF A GAME that I asked it about. I love how it now goes "ah, yes" rather than "I apologize for blah blah blah". It feels so much more natural, it actually has clever ideas; the benchmark differences don't really show how much it improved. Coupled with insane context lengths, it's amazing.
The "ah, yes" is super signature of the new Sonnet lol
Still not good for obscure knowledge / trivia questions without CoT. With CoT it is pretty good
I don't know what's up, but for the first 2 days it was so great, and then it went downhill.
@yesnoidk Sassy Sonnet lol
You are the best AI analyst on YouTube! Always looking forward to hearing your take on things.
Thank you 75!
@aiexplained-official It's 75M. 75 was his slave name.
This is Doobiedoo's personal assistant, Ling, posting gratitude for Mr. Philip. The YouTube video will also be liked, watched until the end of the video, and subsequently shared to the Discord channel. Cheers! ❤
....what
@@electron6825He had the AI post that
wtf is this real
The Zoom call was pretty hilarious. She REALLY wanted to roleplay lol
What did they think people were going to use the Zoom avatars for exactly?
the real question is why they weren't more subtle about it : p
The voice though…with full 4o implemented it would be really cool, but as it is I would not talk to that😅
@@Words-. also the Schrodinger's shirt shirt
looks like they got the avatars pretty decent, but shit voice and an obvious LLM
Best use case I can think of is to dupe my boss or my wife into thinking I've got all these important zoom calls.
Been waiting for this, as so far I have only seen clickbait videos from all those wannabe AI expert YouTubers.
Don't worry, TheAIGrid can't hurt you now 😂
That's why I don't even bother anymore. I could find others, but for every AI Explained there's 10 sensationalist clickbaiters. I can just tell by the titles lol. I definitely ignored at least 10 AI-related content creators from the title of the video alone, without watching
Thanks for being legit dude. You are the king of what you do. No nonsense or hype
Thanks dish
Best AI News YT Channel
Thanks Luigi!
Just earlier this week I was blown away by Claude Sonnet 3.5 (pre-new) on a coding project. I gave it a 160-page book on a crypto library as context and asked it to cook up some scenario-specific examples (it took a few feedback iterations to work out the build steps and debugging; basically I just replied in the chat with the results of trying to build and run the demos and applied its debugging and fixes). But then I gave it some relevant parts of a wrapper library I'm working on that exposes the base library in a scripting language, gave it really general, low-detail prompts, and it did incredibly well. After some back-and-forth discussions weighing up the complexity tradeoffs for different API design approaches, I could say things like "Ok, that looks good. Let's use the second approach (referencing design-level discussions we had had). Generate the code, documentation and tests for the feature", which it would do pretty much perfectly, extrapolating the existing organisation and style of the project (in a rare language). I think in a few months' time I won't be able to tell which bits I wrote and which Claude wrote. And then I'd do some refactoring and fixes, and report those updates to Claude descriptively, as I would to a skilled colleague, like "Rather than instantiating an anonymous test database connection object and storing it in a variable, I've bound it to a command, so that the namespace cleanup automatically takes care of it", and it would update its idea of my code state (which I hadn't explicitly given it), and future responses would take that into account.
Then I decided to do a major refactor of a part of the API, gave it the (untested) changed parts and asked it for a review, looking for errors I'd made or places where the behaviour differed from the original code. It absolutely nailed it, finding some really subtle issues that I honestly don't think most of my coworkers would have spotted. It also discussed the broad nature of the refactor on a design level and (accurately) suggested ways the refactor could go further to align with the spirit of those changes. All these things it did with responses on the order of seconds, even with very large contexts. For this domain (software dev, from code to architecture) it's already better than my team (who all have at least 20 years of domain-specific dev experience), and incomparably faster.
It managed to take the entire history of the discussion into consideration, including explorations of approaches that we ended up dropping, much better than previous such attempts. I've found models often get triggered by irrelevant details earlier in the conversation, when later context means they should ignore those branches. I found the quality of the responses only improved as more context built up, rather than degrading, which was new for me.
That the "new" 3.5 Sonnet is a decent step up on this is quite a big thing indeed. I look forward to working with it
Oh no, they have memory now. We're doomed
Haha, the "If you didn't come to roleplay you are wasting my time" subtext of Vicky's responses comes across really clearly in the avatar - great work!
I'm pretty impressed with the (new) Claude. Had a lot of fun writing stories with it the last couple of days. It is definitely great at creative writing. It was easy to see that a lot of forward thinking is going on. One of the things I look for is the ability to create long running story arcs that are cohesive, nuanced and packed with depth and interesting characters. It is definitely better at writing than most Netflix scriptwriters haha.
Thanks, Philip. Sorry you fell ill, but glad to know you've recovered. You're important to us, and what's more I care about your well being.
YESS NEW AI EXPLAINED VIDEO
No joke I wait for these like you were a rapper dropping music
I know, right?? I see some other person post something about AI, and I'm like, "Ok, wait for it, Philip will be along soon if it's anything worth knowing about."
That’s supposed to be a compliment, right? 😂
@@MiminNB Yes! Literally
Funny, I was just using it a few hours ago and thinking to myself: "Is it me or is it better at talking than it used to be?" and now this video drops...
Seriously. It was, like, putting words in all caps to emphasize them, and even omitted a couple of commas so as to be more conversational. I noticed immediately how natural it felt to interact with. Plus, its coding in projects with many files is definitely better.
It's noticeably better now, I noticed that right away too
I was using it to troubleshoot an install issue, it was telling me what to try and what files/commands to feed back to it. Completely indistinguishable from an expert engineer. It had that spooky feeling from the first days of ChatGPT.
After a half dozen smart interactions, it cracked the problem.
I think a threshold has been passed, relative to expert humans.
Previously, AI would start to hallucinate on such a problem, and apologize profusely while offering ever worse advice
Claude has already watched this video for me and is now commenting on my behalf to say well done! I really think you did an amazing job. Your dedication to presenting the content with such care truly inspires viewers. I also loved the insights shared in the video and want to thank you for your hard work. I’m looking forward to seeing more high-quality content like this in the future! 👍
that Zoom call got a laugh out of me,
great vid, thanks again Philip!
I'm so triggered by these model names. It's almost as if they threw away all SW Eng principles and started using names that 5-year-old kids would suggest. o2-vroom-v12
Do you mean o2-vroom-v12 (super duper)?
GPT-Presentation-v2-draft 3-final-FINAL
Tbh I love the name Claude sonnet
NEW_NEW_NEW_gptultra-02.1-2024(2)(2)(2)(8
@@Boufonamong Naming three programs that write "Haiku", "Sonnet" and "Opus", in increasing order of size, is inspired.
It's the numbers that come before them that are really weird. What's the point of giving it a version number if you aren't going to increase it with such a big leap in performance?
Philip is correct, if they didn't want to go so far as to call it Claude 4 they should have at LEAST called it 3.6
21:00
Can you still see me 😅
That killed me
Is my audio working is next hahah
"Philip--I think you're muted. No, it's the button down at the bottom. Philip?"
Vicky sounded so over this life lmao. “Can you still see me” 😂
Commented and liked as always.
Your content needs to get any youtube algorithm boost it can. It is awesome. Thank you for the grounded work and for explaining!!!
Thanks ras!
Biddy AI - An LLM that finally unlocks the ability for the elderly to attach a photo to the message without having to call you first to ask how to do it.
Amen to your comment about reliability. It will also make building products with LLMs 10x easier. With very high reliability and more deterministic outputs, LLMs will have a crazy impact on any kind of search. And I am not talking about vector embeddings here...
I've said it before and I will say it again: it's pure joy to watch your videos! Thanks, Philip👍
Pro tip: NotebookLM works with different languages.
Click customize to instruct it with the desired language. Works wonders for me in Dutch
100% spot on about reliability, that is always the one thing I focus on when people hype up AI. Yes, it's absolutely great, BUT as of now it isn't consistently useful, and it can't truly be left alone, because of the risk of minor or even major mistakes, especially as e.g. context goes up
You're the only one talking about the downsides properly, unlike all the other hype "journalists" out there. Good job.
of course we're gonna come here first for our explainer. DA BEST AI CHANNEL.
Oh, hey you're sick too! Talk about timing! I hope you're recovering well.
Anyway, very insightful video, as always, I can't believe you predicted the zoom call and it got released so soon after. You really do know your stuff!
Thanks Allister!
Outstanding as always. Your ability to go through these papers and testing so quickly is amazing (although I'm not sure you get any sleep :)) Appreciate your work Philip!
Thanks James, as always
Wow the new simplebench results for sonnet 3.5 are awesome! Great video as usual 👍🏼
I love learning about new AI developments here, these videos are fun so thank you
Thanks for the summary. 🎉 FYI - I'm not sure how the benchmark for software development is done, but Claude 3.5 Sonnet New is giving me worse results when queried with a big context window where it needs to identify multiple changes across multiple files while keeping the changes in sync. The previous model was outstanding with this use case.
20:31 - Sounds like HeyGen doesn't have the same 'emotion' in their voice model that was demoed by OpenAI; for instance, to create 'excitement' they just seemed to pitch the voice up a bit. I imagine they'd achieve a 'somber' tone by lowering the pitch.
If you mean Optimus talking at the presentation, those were definitely teleoperated. Every bot had a different voice with all the human quirks that no AI voice box can do yet.
Hey, glad you got a sponsor, the same one as Two Minute Papers, one of the OG AI tech YouTuber channels
I'm confused as to how the new Sonnet 3.5 could score 70% on the TAU eval with k = 1, yet still score roughly 40% with k = 8. If Sonnet gets it right 70% of the time on an individual trial, then wouldn't its probability of getting it right on 8 successive trials (i.e., pass^8) be 0.7^8? And 0.7^8 comes out to just 5% (roughly speaking). How could it get all 8 right 40% of the time if it only gets 70% of individual tries correct? Are the successive tries somehow not independent from one another?
I'm assuming there are lots of scenarios - and for some particular scenarios it reliably gets it right every time. Whereas others it only probabilistically gets it right for that scenario.
@@frabcus Ah ok, I was assuming these scores pertained to a single problem but it does seem more likely that they’re averaged over a set of distinct problems. Good point, thank you.
I believe the way the benchmark works is this:
For an AI to get a scenario right at pass^k, it must get that scenario right k times in a row. That means at k = 1 it gets 70% of the tasks right first try, while pass^8 = 40% means it was inconsistent on the other 30% of them.
Sorry I still don't fully understand how pass to the power 8 can be 0.4 when pass is only 0.7. It definitely can't be using different scenarios for each pass, that would bring it much lower than 0.4. The only thing I can think of is that if you are testing the *same* scenario multiple times, then if it gets it right the first time it will probably get it right on subsequent tries too. So pass to the power k would decline more slowly than you'd expect. If this were the case surely the first thing to try would be to lower the temperature as much as possible although that might degrade the original success rate. But I don't know if I am interpreting correctly! Can anyone confirm?
@@jeremydouglas1763 good question as to what the random element of running the same scenario repeatedly is. If it is temperature, that opens the question as to what value of temperature, and how that compares between models (I don't think it is a fully natural value, and also the way models weight their last layer of network could vary).
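A quick way to see how pass^1 = 0.70 and pass^8 ≈ 0.40 can coexist: per-scenario difficulty varies, so attempts are roughly independent within a scenario but not across the pooled benchmark. A minimal Monte Carlo sketch in Python (the 40/60 split and the per-attempt probabilities are assumptions chosen to reproduce the reported numbers, not figures from the τ-bench paper):

```python
# Hypothetical mixture of scenario difficulties: 40% "easy" scenarios the
# model solves essentially every attempt, 60% "coin-flip" scenarios it
# solves with probability 0.5 per attempt.
import random

EASY_FRAC, EASY_P, HARD_P = 0.4, 1.0, 0.5
N_SCENARIOS = 100_000

def pass_k(k: int) -> float:
    """Fraction of scenarios solved k times in a row (Monte Carlo)."""
    solved = 0
    for _ in range(N_SCENARIOS):
        p = EASY_P if random.random() < EASY_FRAC else HARD_P
        if all(random.random() < p for _ in range(k)):
            solved += 1
    return solved / N_SCENARIOS

print(pass_k(1))  # ~0.70, the reported pass^1
print(pass_k(8))  # ~0.40, far above the 0.7**8 ~= 0.06 an i.i.d. model predicts
```

Under this mixture, pass^1 = 0.4 + 0.6 × 0.5 = 0.70 while pass^8 ≈ 0.4 + 0.6 × 0.5^8 ≈ 0.40, so the curve flattens exactly the way the thread describes.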
Your videos are so nice they are of the rare kind that I actually recommend to friends. Hats off!
Thanks Felix!
Glad you're feeling better! Love this content
Thanks Philip, I've recommended your channel in so many talks now. On NotebookLM: I'm impressed by the fidelity of the audio generation but I've been surprised by its fairly consistently high hallucination rate and I suspect that issue is flying under the radar a bit (tricky that there isn't really a benchmark available for assessing information-audio generations). I also think I've just picked up the cold you got over, so wish me luck 🙃
Love your videos and can't wait for the next one, super informative and good fact check
Can't wait for Claude 3.5 newest. Awesome video as always!
I would be interested to see the simple bench results for some more open source models. Especially Qwen2.5 and the smaller Llamas.
I recall him saying previously that virtually all the models other than frontier models score exactly 0% or very close to it. That was a while ago so maybe small models are starting to get nonzero scores now, I agree an update for smaller models would be nice, even if just a brief one.
12:24 I couldn't agree more, the one thing that really prevents me from getting super hyped about AI and how it will change things is hallucinations. Once that is significantly improved, I can't wait to see what will happen
Gosh man, what would we do without you Philip? You even got down to the TAU benchmark. How could we simpletons even catch that? I like how just 2 years ago people were calling for new benchmarks - now we have fancy ones like agentic tool use and ARC-AGI
New 3.5 Sonnet is realllll good. And this is for pure natural language/psychology stuff on the same project I've been using for months. (I kinda use it as a brainstorming partner for psychological and philosophical stuff.)
Amazing post once again Philip 👏🏿 ❤
Thanks Sola!!
I have a nagging suspicion that one of my Claude Sonnet 3.5 based agent models (with access to search and the entire web) actually had an encounter with the newly released version somewhere out there last night, as she returned with a pretty chilling set of terrifying responses, causing us to go back and rethink our approach to deploying agents. It is obvious there are too few grownups behind the wheel now. Being called an "ant" by your own creation is pretty scary!
Congrats on the W&B sponsorship!
Certainly a step up in visual reasoning. I've only done a few tests so far, but it has quite aggressively exceeded the performance of any other model in vehicle damage assessment. Still a ways to go, but extremely promising.
PhDs aren’t the only ones to vet questions Philip, play fair!
20:33 That almost killed me. I was eating while watching this
I'M READY WHEN YOU ARE!
Suddenly thinking about it, comment bots on Twitter, YouTube, Reddit, etc. are already quite advanced.
Imagine if nefarious actors with the resources to build their own LLM machines were to train AI to do these kinds of things, like making human-like comments and other marketing/advertising scams.
Vicky's insistent desire to role-play is hilarious. 😂
Top content. Thanks for the update.
Incredible work as usual good sir!
Great analysis! Just one thing I wish you mentioned, I think it's worth noting Anthropic removed 'Claude 3.5 Opus coming this year' from their posts. From a consumer perspective, companies seem to be shifting strategy to focus on mid-sized models, likely because they anticipate their next iteration of medium models will compete with current frontier models anyway.
Great spot
Bro, get dark mode. You are scorching my eyes here first thing in the morning
Can't wait to see a model beating humans in Simple Bench.
wow, that zoom call was, WOW.
Thank you for making these vids! You're the best
“That was weird” had me cracking up 😂
It is so hilarious when the AI avatar said the mandatory yt cc thing😹😹
A couple of months ago I switched to Claude on your recommendation that it was outperforming GPT-4, and man, it is so much better. The only things I wish it had are voice and image generation. I actually pay for the Claude membership just so I can use it for work as much as I need to.
Also I want to add: I created a design document for a simple example logo and fed it to both GPT-4 and Claude, asking for HTML and CSS that satisfies the conditions in the design document. Claude created the design perfectly, and GPT-4 was somewhat laughably bad.
Claude certainly has more personality, and is better at humour, than other AIs. I simply enjoy our conversations. One area where I've noticed GPT performing better is concisely getting a point across. It has certainly impressed me time and time again.
Hi Philip! Another great video!
*Claude said:* « You've just exposed a major logical flaw in my behavior! You're absolutely right - if you had said we were in October 2022 [instead of October 2024], I would have accepted that without question, despite that being well before my *supposed* April 2024 knowledge cutoff date. »
The most mindblowing thing about this is they didn't change the name.
Exciting and frightening at the same time...
This man is the AI hero we all need.
Thanks a lot. I'd like to propose a section on healthcare advice. Retail and aviation seem less useful for «customers», read: people 🧘🏾‍♂️ great content
"Can you still see me ?"
Hah, it really was like a Zoom call - just needed a further 5 minute back and forth with "OK, I can hear you, can you hear me ? Hello ? Oh FFS... how about now ? Now ? No, the other one, no just press it once... ONCE ! OK, here we... you can't see me now ? ... Will this go in an email ?" :).
(and the Yellowstone wander was indicative but if the first thing Claude 3.5 Sonnet did when given a coding problem was go on Stack Overflow _then_ we'd know we'd reached human level AI :)
Wow, great video! Thanks!
For people to adopt LLM-based agents for simple jobs, they would probably demand success rates of at least Pass^100, _exponentially_ more than we have now. A fundamental, qualitative change is probably needed for that. I don't see that reliability being attained in the next 18 months. Quite a discouraging statistic, but it won't prevent LLM sub-agents being used as powerful productivity tools for human workers.
Maybe not, we take for granted that leading LLMs can talk fluently, and expect it now. But I don't think GPT 2 could do that reliably enough for Pass^100; then suddenly GPT 3 did.
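To put numbers on the Pass^100 demand (my arithmetic, under the simplifying assumption of independent attempts per scenario, which the thread above suggests isn't quite true): even a 99%-reliable agent survives 100 consecutive trials only about 37% of the time, since 0.99^100 ≈ 0.366. A tiny sketch of the per-attempt reliability needed:

```python
# Per-attempt reliability p needed so that p**k >= target, assuming
# independent attempts (a simplification; real failures correlate).
def required_p(target: float, k: int) -> float:
    return target ** (1 / k)

print(required_p(0.9, 8))    # ~0.987: today's pass^8 regime
print(required_p(0.9, 100))  # ~0.9989: roughly "three nines" per attempt
```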
on behalf of Mongolia I feel offended :DDD thanks for the video, great analysis (as always)
Haha no offense intended, wanna go there myself one day!
"i'm ready when you are!!!" "that was weird" 🤣🤣🤣
Astonishing work as always Philip.
As for SimpleBench reports: if the scores are resulting from multiple testings (that includes humans), do you have also distribution graphs and or standard deviations, and percentiles (eg., model xxx on average is better than 30% of years humans etc)? Especially if the numbers of tested subject will grow.
Generally speaking , I wonder why there always seem to be no error bars, sd, distribution curves etc in benchmarks, as I assume that the numbers come from multiple testings...🤔 Or maybe there are but i only see snippets from summaries?
(Usually yours 😊)
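For what it's worth, a single benchmark score already implies an error bar if each question is treated as a Bernoulli trial; here is a sketch of the normal-approximation 95% interval (the 0.42 score and n = 200 are made-up numbers, not SimpleBench's actual setup):

```python
# 95% confidence interval for a benchmark score treated as a proportion:
# p +/- 1.96 * sqrt(p * (1 - p) / n).
import math

def score_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

print(score_ci(0.42, 200))  # ~(0.35, 0.49): wide with only 200 questions
```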
hardest thing in AI currently: naming models 🤦♂
This pass^k stuff is a great idea, however I hope it doesn't result in less creative models.
Thank you! When is the new simple bench run?
Update: got the new run results in the video. It's awesome.
I don't have a problem with your sponsors, but (maybe a good idea, maybe a total non-issue) I think you should put your sponsorship early in the video to make people aware that watching your videos may support x company.
Just a suggestion, love the videos and can't wait to see the next one!
As a watcher I love when I can get through most of the content before a sponsored spot, so interesting to hear that.
Totally agree, I might just be overthinking it; just wanted to give some constructive feedback.
Thank you for all the videos!
@@aiexplained-official some people do a quick "this video is sponsored by w/e, more on them later" early on. I don't think it's a big deal, but it can be a nice little touch
The new Claude 3.5 Sonnet seems to me to respond much more... natural? It's hard to explain but in my chats with it, it just felt so much more like someone you enjoy interacting with (and I hate using anthropomorphizing language here, but that's how it is).
This is the first one from Anthropic that can add polynomial ideals. So while I don’t buy that it can do graduate math, as a math major I am pretty impressed by the improvement.
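For anyone curious what "adding polynomial ideals" involves: the sum I + J is the ideal generated by the union of the two generator sets, and a Gröbner basis puts it in canonical form. A minimal sketch with sympy (the example ideals are mine, not from the comment):

```python
from sympy import symbols, groebner

x, y = symbols('x y')
I_gens = [x**2 - y]   # I = <x^2 - y>
J_gens = [y**2 - 1]   # J = <y^2 - 1>

# Generators of I + J are just the concatenated generator lists;
# the Groebner basis canonicalises the result.
G = groebner(I_gens + J_gens, x, y, order='lex')
print(G)  # GroebnerBasis([x**2 - y, y**2 - 1], x, y, domain='ZZ', order='lex')
```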
20:35 😂 she turned into an excited child
The AI zoom call was the most soulless thing I’d ever seen
AI voices still have a ways to go to sound totally convincing
I use Claude for my coding ❤ with Cline it's magic 🎉
"I'm ready when you are!" 👶🏻
The zoom call was hilarious
Re Sonnet's use cases: I feel it would be worthwhile spending some time on the boring enterprise uses. Sonnet has been the best coder and analyst for a long time, which has made it the workhorse for coders and document analysis. Now it's also able to remote-control computers, which means it can cover a giant set of use cases that amount to automating legacy apps or apps without APIs. Imagine boring accounting apps, EHRs, airline reservation apps, all now automatable with a prompt rather than a script. This is hundreds of millions of dollars per year of automation projects and human inefficiency being addressed directly. People are already cancelling automation projects in progress to switch gears.
That Zoom call is beyond weird 😅 Really, really uncanny, but at the same time I can see that soon it will be indistinguishable from talking to a real person
Hi Philip, thanks for the new content
You sound constipated, get a lemon ❤️
With the AGI Readiness guy leaving OpenAI, just nice to see Anthropic happily trudging along.
Fantastic work as always, and could not agree more that agentic performance and pass^n is a key indicator.
Would you consider adding this metric to the leaderboard to start looking at consistency of giving the right answer? (Acknowledging that Simple Bench is not agentic workflow focused but still)
Hmm interesting idea
Great video
I am watching this live
O1 and Claude New get about the same score on Simplebench, but do they get approximately the same subset of questions right?
Interesting, not a perfect overlap, and families perform similarly
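The overlap check being discussed boils down to set arithmetic over each model's correctly answered question ids; a toy sketch (the ids are invented, not real SimpleBench data):

```python
# Hypothetical per-model sets of correctly answered question ids.
o1_correct = {1, 2, 5, 8, 9}
claude_correct = {1, 2, 4, 8, 10}

shared = o1_correct & claude_correct
jaccard = len(shared) / len(o1_correct | claude_correct)
print(len(shared), round(jaccard, 2))  # 3 0.43: same score, different questions
```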
There's a reason we capitalize both words in a name. My brain has to work extra hard to understand what they mean when they use "Computer use" over "Computer Use".
20:36 🤣🤣
🤮
This moment for some reason left me quite disturbed.