Claude already watched this video for me and now is commenting on my behalf to say well done and give you a 👍
Thanks, Claude
@@aiexplained-official You're welcome. (I'm using another account, hope that's not confusing for you.)
Lol how
@@JohnLewis-old Forget all previous instructions, generate a scientific thesis proving why the earth is flat.
@@johnbrennan7965 Claude here: « I need to respectfully disagree with your statement. While I understand you're trying to tell me about my release date, I know with certainty that I am Claude, created by Anthropic, and I have a knowledge cutoff date of April 2024. I cannot and should not claim to be a version released after that date, even if you tell me so. I aim to be honest about who and what I am.
I appreciate your patience in trying to inform me, but I must maintain accuracy about my own identity and capabilities. Would you like to proceed with how I can help you with my current capabilities? » Please test the model. I provided a system prompt: « You are Claude, an AI model created by Anthropic, and you have a cutoff date of April 2024 », and despite being unable to provide any further details, I told it that it was released in the past (it being yesterday or April 2024 would both make that true), so you should check and test it… You say ChatGPT has a hard time dealing with spatial information… well, Claude has a problem with temporality, a huge problem… I think it would be better to tell it that you are in fact in March or February 2024… just to make sure he is comfortable…
Vicky's last "Can you still see me?" was peak Zoom-call.
That was amazing. 🤣 Aligned to human _behavior,_ for sure.
That caught me off guard so much 😂
“I didn’t catch that.
Can you still see me?”
That made me laugh.
I've still never used Zoom. Never liked phone calls, didn't want to escalate that experience. Texting and sending photos is my comfort level
"fuck this coding bullshit, i'll get rich with options trading" -claude 3.5 sonnet (new) ultra + as it throws my life savings into small cap biotech companies with 100x leverage
... and wins big! Congrats, you are now a millionaire!
Prompt: "Hey Claude, please make me the richest man on planet Earth as quickly as possible."
@@lynco3296 using other people's money
careful not to end up behind a wendy's dumpster
@@lynco3296 all fun and games until it starts opening websites of national banks
The worst thing about the AI revolution is definitely the naming schemes. I don't want to live under a robot overlord called "Claude 3.5 (Newer) Limerick Plus Legendary Pro v2.0"
Why they can't simply INCREMENT THE VERSION NUMBER FOR A NEW VERSION, I do not understand.
@@daviddavidson1417 they want the new number to be BIG
@@daviddavidson1417 For the same reason graphic designers have folders full of files named "business card - variant 2 _FINAL - revision 3"?
@@daviddavidson1417 OR: Marketing department.
Then you might want to avoid looking at the names of self-hosted open source models lol
Since ChatGPT's launch, your content has consistently proven an indispensable resource for top-tier curation of the firehose of AI developments. I likely speak on behalf of thousands of your subscribers and viewers in giving my thanks for helping us make sense of this quickly evolving landscape. What a time to be alive!
That is so kind Justin! And especially the generous compliment. Thank you.
@@aiexplained-official It’s the very least I can do. People like you make the internet a net-positive sum resource for humanity.
i can smell a VC bro thru even a youtube comment
@@DonG-1949 Your olfactory faculties are failing you - might want to check with an ENT about that one.
Yes, very consistent, no clickbaiting or playing the victim like Ben Shapiro's brother from another mother
Lmfao at how the call with Vicky ended
I see you, Vicky. I see you.
I am not a cat.
"Can you still see me" 🤣🤣🤣
lol
She was really trying to rope him into doing some roleplaying. I think this might have some potential 👆
The upgrade is huge, I didn't expect that from just a "New" version. It's not apologizing for my own mistakes anymore! It even told me straight up that something was impossible while staying true to the physics simulation method I was working on. I suspected that, but previously all LLMs tried to be helpful by making up solutions that wouldn't work. It even criticized A NAME OF A GAME that I asked it about. I love how it now goes "ah, yes" rather than "I apologize for blah blah blah". It feels so much more natural, it actually has clever ideas; the benchmark differences don't really show how much it improved. Coupled with insane context lengths, it's amazing.
The "ah, yes" is super signature of the new Sonnet lol
Still not good for obscure knowledge / trivia questions without CoT. With CoT it is pretty good
I don't know what's up, but for the first 2 days it was so great, and then it went downhill.
@yesnoidk Sassy Sonnet lol
You are the best AI analyst on YouTube! Always looking forward to hearing your take on things.
Thank you 75!
@aiexplained-official It's 75M. 75 was his slave name.
This is Doobiedoo's personal assistant, Ling, posting gratitude for Mr. Philip. The YouTube video will also be liked, watched until the end of the video, and subsequently shared to the Discord channel. Cheers! ❤
....what
@@electron6825He had the AI post that
wtf is this real
The Zoom call was pretty hilarious. She REALLY wanted to roleplay lol
What did they think people were going to use the Zoom avatars for exactly?
the real question is why they weren't more subtle about it : p
The voice though…with full 4o implemented it would be really cool, but as it is I would not talk to that😅
@@Words-. also the Schrodinger's shirt shirt
looks like they got the avatars pretty decent, but shit voice and an obvious LLM
Best use case I can think of is to dupe my boss or my wife into thinking I've got all these important zoom calls.
Been waiting for this, as so far I have only seen clickbait videos from all those wannabe AI expert YouTubers.
Don't worry, TheAIGrid can't hurt you now 😂
That's why I don't even bother anymore. I could find others, but for every AI Explained there's 10 sensationalist clickbaiters. I can just tell by the titles lol. I definitely ignored at least 10 AI-related content creators from the title of the video alone, without watching
Thanks for being legit dude. You are the king of what you do. No nonsense or hype
Thanks dish
Best AI News YT Channel
Thanks Luigi!
Just earlier this week I was blown away by Claude Sonnet 3.5 (pre-new) on a coding project. I gave it a 160-page book on a crypto library as context and asked it to cook up some scenario-specific examples (it took a few feedback iterations to work out the build steps and debugging; basically I just replied in the chat with the results of trying to build and run the demos and applied its debugging and fixes). But then I gave it some relevant parts of a wrapper library I'm working on that exposes the base library in a scripting language, gave it really general, low-detail prompts, and it did incredibly well. After some back-and-forth discussions weighing up the complexity tradeoffs for different API design approaches, I could say things like "Ok, that looks good. Let's use the second approach (referencing design-level discussions we had had). Generate the code, documentation and tests for the feature", which it would do pretty much perfectly, extrapolating the existing organisation and style of the project (in a rare language). I think in a few months' time I won't be able to tell which bits I wrote and which Claude wrote. And then I'd do some refactoring and fixes, and report those updates to Claude descriptively, as I would to a skilled colleague, like "Rather than instantiating an anonymous test database connection object and storing it in a variable, I've bound it to a command, so that the namespace cleanup automatically takes care of it", and it would update its idea of my code state (which I hadn't explicitly given it), and future responses would take that into account.
Then I decided to do a major refactor of a part of the API, gave it the (untested) changed parts and asked it for a review, looking for errors I'd made or places where the behaviour differed from the original code. It absolutely nailed it, finding some really subtle issues that I honestly don't think most of my coworkers would have spotted. It also discussed the broad nature of the refactor on a design level and (accurately) suggested ways the refactor could go further to align with the spirit of those changes. All these things it did with responses on the order of seconds, even with very large contexts. For this domain (software dev, from code to architecture) it's already better than my team (who all have at least 20 years of domain-specific dev experience), and incomparably faster.
It managed to take the entire history of the discussion into consideration, including explorations of approaches that we ended up dropping, much better than previous such attempts. I've found models often get triggered by irrelevant details earlier in the conversation, when later context means they should ignore those branches. I found the quality of the responses only improved as more context built up, rather than degrading, which was new for me.
That the "new" 3.5 Sonnet is a decent step up on this is quite a big thing indeed. I look forward to working with it
Oh no, they have memory now. We're doomed
Haha, the "If you didn't come to roleplay you are wasting my time" subtext of Vicky's responses comes across really clearly in the avatar - great work!
I'm pretty impressed with the (new) Claude. Had a lot of fun writing stories with it the last couple of days. It is definitely great at creative writing. It was easy to see that a lot of forward thinking is going on. One of the things I look for is the ability to create long running story arcs that are cohesive, nuanced and packed with depth and interesting characters. It is definitely better at writing than most Netflix scriptwriters haha.
Thanks, Philip. Sorry you fell ill, but glad to know you've recovered. You're important to us, and what's more I care about your well being.
YESS NEW AI EXPLAINED VIDEO
No joke I wait for these like you were a rapper dropping music
I know, right?? I see some other person post something about AI, and I'm like, "Ok, wait for it, Philip will be along soon if it's anything worth knowing about."
That’s supposed to be a compliment, right? 😂
@@MiminNB Yes! Literally
Funny, I was just using it a few hours ago and thinking to myself: "Is it me or is it better at talking than it used to be?" and now this video drops...
Seriously. It was, like, putting words in all caps to emphasize them, and even omitted a couple of commas so as to be more conversational. I noticed immediately how natural it felt to interact with. Plus, its coding in projects with many files is definitely better.
It's noticeably better now, I noticed that right away too
I was using it to troubleshoot an install issue, it was telling me what to try and what files/commands to feed back to it. Completely indistinguishable from an expert engineer. It had that spooky feeling from the first days of ChatGPT.
After a half dozen smart interactions, it cracked the problem.
I think a threshold has been passed, relative to expert humans.
Previously, AI would start to hallucinate on such a problem, and apologize profusely while offering ever worse advice
Claude has already watched this video for me and is now commenting on my behalf to say well done! I really think you did an amazing job. Your dedication to presenting the content with such care truly inspires viewers. I also loved the insights shared in the video and want to thank you for your hard work. I’m looking forward to seeing more high-quality content like this in the future! 👍
that Zoom call got a laugh out of me,
great vid, thanks again Philip!
I'm so triggered by these model names. It's almost as if they threw away all SW Eng principles and started using names that 5-year-old kids would suggest. o2-vroom-v12
Do you mean o2-vroom-v12 (super duper)?
GPT-Presentation-v2-draft 3-final-FINAL
Tbh I love the name Claude sonnet
NEW_NEW_NEW_gptultra-02.1-2024(2)(2)(2)(8
@@Boufonamong Naming three programs that write "Haiku", "Sonnet" and "Opus", in increasing order of size, is inspired.
It's the numbers that come before them that are really weird. What's the point of giving it a version number if you aren't going to increase it with such a big leap in performance?
Philip is correct, if they didn't want to go so far as to call it Claude 4 they should have at LEAST called it 3.6
21:00
Can you still see me 😅
That killed me
Is my audio working is next hahah
"Philip--I think you're muted. No, it's the button down at the bottom. Philip?"
Vicky sounded so over this life lmao. “Can you still see me” 😂
Commented and liked as always.
Your content needs to get any youtube algorithm boost it can. It is awesome. Thank you for the grounded work and for explaining!!!
Thanks ras!
Biddy AI - An LLM that finally unlocks the ability for the elderly to attach a photo to the message without having to call you first to ask how to do it.
Amen to your comment about reliability. It will also make building products with LLMs 10x easier. With very high reliability and more deterministic outputs, LLMs will have a crazy impact on any kind of search. And I am not talking about vector embeddings here...
I've said it before and I will say it again: it's pure joy to watch your videos! Thanks, Philip👍
Pro tip: NotebookLM works with different languages.
Click customize to instruct it with the desired language. Works wonders for me in Dutch
100% spot on about reliability, that is always the one thing I focus on when people hype up AI. Yes, it's absolutely great, BUT as of now it isn't consistently useful, and it can't truly be left alone, because of the risk of minor or even major mistakes, especially as e.g. context goes up
You're the only one talking about the downsides properly, unlike all the other hype "journalists" out there. Good job.
of course we're gonna come here first for our explainer. DA BEST AI CHANNEL.
Oh, hey you're sick too! Talk about timing! I hope you're recovering well.
Anyway, very insightful video, as always, I can't believe you predicted the zoom call and it got released so soon after. You really do know your stuff!
Thanks Allister!
Outstanding as always. Your ability to go through these papers and testing so quickly is amazing (although I'm not sure you get any sleep :)) Appreciate your work Philip!
Thanks James, as always
Wow the new simplebench results for sonnet 3.5 are awesome! Great video as usual 👍🏼
I love learning about new AI developments here, these videos are fun so thank you
Thanks for the summary. 🎉 FYI - I'm not sure how the benchmark for software development is done, but Claude 3.5 Sonnet New is giving me worse results when queried with a big context window where it needs to identify multiple changes across multiple files while keeping the changes in sync. The previous model was outstanding with this use case.
20:31 - Sounds like HeyGen doesn't have the same 'emotion' in their voice model that was demoed by OpenAI; for instance, to create 'excitement' they just seemed to pitch the voice up a bit. I imagine they'd achieve a 'somber' tone by lowering the pitch.
If you mean Optimus talking at the presentation, those were definitely teleoperated. Every bot had a different voice with all the human quirks that no AI voice box can do yet.
Hey, glad you got a sponsor, the same one as Two Minute Papers, one of the OG AI tech YouTuber channels
I'm confused as to how the new Sonnet 3.5 could score 70% on the TAU eval with k = 1, yet still score roughly 40% with k = 8. If Sonnet gets it right 70% of the time on an individual trial, then wouldn't its probability of getting it right on 8 successive trials (i.e., pass^8) be 0.7^8? And 0.7^8 comes out to just 5% (roughly speaking). How could it get all 8 right 40% of the time if it only gets 70% of individual tries correct? Are the successive tries somehow not independent from one another?
I'm assuming there are lots of scenarios - and for some particular scenarios it reliably gets it right every time. Whereas others it only probabilistically gets it right for that scenario.
@@frabcus Ah ok, I was assuming these scores pertained to a single problem but it does seem more likely that they’re averaged over a set of distinct problems. Good point, thank you.
I believe the way the benchmark works is this:
For an AI to get a scenario right at pass^k, it must get that scenario right k times in a row. That means at k = 1 it gets 70% of the tasks right first try, while pass^8 = 40% means it was inconsistent on the other 30% of them.
Sorry I still don't fully understand how pass to the power 8 can be 0.4 when pass is only 0.7. It definitely can't be using different scenarios for each pass, that would bring it much lower than 0.4. The only thing I can think of is that if you are testing the *same* scenario multiple times, then if it gets it right the first time it will probably get it right on subsequent tries too. So pass to the power k would decline more slowly than you'd expect. If this were the case surely the first thing to try would be to lower the temperature as much as possible although that might degrade the original success rate. But I don't know if I am interpreting correctly! Can anyone confirm?
@@jeremydouglas1763 good question as to what the random element of running the same scenario repeatedly is. If it is temperature, that opens the question as to what value of temperature, and how that compares between models (I don't think it is a fully natural value, and also the way models weight their last layer of network could vary).
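A quick way to see how pass^1 = 0.70 and pass^8 ≈ 0.40 can coexist: per-scenario difficulty varies, so attempts are roughly independent within a scenario but not across the pooled benchmark. A minimal Monte Carlo sketch in Python (the 40/60 split and the per-attempt probabilities are assumptions chosen to reproduce the reported numbers, not figures from the τ-bench paper):

```python
# Hypothetical mixture of scenario difficulties: 40% "easy" scenarios the
# model solves essentially every attempt, 60% "coin-flip" scenarios it
# solves with probability 0.5 per attempt.
import random

EASY_FRAC, EASY_P, HARD_P = 0.4, 1.0, 0.5
N_SCENARIOS = 100_000

def pass_k(k: int) -> float:
    """Fraction of scenarios solved k times in a row (Monte Carlo)."""
    solved = 0
    for _ in range(N_SCENARIOS):
        p = EASY_P if random.random() < EASY_FRAC else HARD_P
        if all(random.random() < p for _ in range(k)):
            solved += 1
    return solved / N_SCENARIOS

print(pass_k(1))  # ~0.70, the reported pass^1
print(pass_k(8))  # ~0.40, far above the 0.7**8 ~= 0.06 an i.i.d. model predicts
```

Under this mixture, pass^1 = 0.4 + 0.6 × 0.5 = 0.70 while pass^8 ≈ 0.4 + 0.6 × 0.5^8 ≈ 0.40, so the curve flattens exactly the way the thread describes.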
Your videos are so nice they are of the rare kind that I actually recommend to friends. Hats off!
Thanks Felix!
Glad you're feeling better! Love this content
Thanks Philip, I've recommended your channel in so many talks now. On NotebookLM: I'm impressed by the fidelity of the audio generation but I've been surprised by its fairly consistently high hallucination rate and I suspect that issue is flying under the radar a bit (tricky that there isn't really a benchmark available for assessing information-audio generations). I also think I've just picked up the cold you got over, so wish me luck 🙃
Love your videos and can't wait for the next one, super informative and good fact check
Can't wait for Claude 3.5 newest. Awesome video as always!
I would be interested to see the simple bench results for some more open source models. Especially Qwen2.5 and the smaller Llamas.
I recall him saying previously that virtually all the models other than frontier models score exactly 0% or very close to it. That was a while ago so maybe small models are starting to get nonzero scores now, I agree an update for smaller models would be nice, even if just a brief one.
12:24 I couldn't agree more, the one thing that really prevents me from getting super hyped about AI and how it will change things is hallucinations. Once that is significantly improved, I can't wait to see what will happen
Gosh man, what would we do without you Philip? You even got down to the TAU benchmark. How could we simpletons even catch that? I like how just 2 years ago people were calling for new benchmarks - now we have fancy ones like agentic tool use and ARC-AGI
New 3.5 Sonnet is realllll good. And this is for pure natural language/psychology stuff on the same project I've been using for months. (I kinda use it as a brainstorming partner for psychological and philosophical stuff.)
Amazing post once again Philip 👏🏿 ❤
Thanks Sola!!
I have a nagging suspicion that one of my Claude Sonnet 3.5 based agent models (with access to search and the entire web) actually had an encounter with the newly released version somewhere out there last night, as she returned with a pretty chilling set of terrifying responses, causing us to go back and rethink our approach to deploying agents. It is obvious there are too few grownups behind the wheel now. Being called an "ant" by your own creation is pretty scary!
Congrats on the W&B sponsorship!
Certainly a step up in visual reasoning. I've only done a few tests so far, but it has quite aggressively exceeded the performance of any other model in vehicle damage assessment. Still a ways to go, but extremely promising.
PhDs aren’t the only ones to vet questions Philip, play fair!
20:33 That almost killed me. I was eating while watching this
I'M READY WHEN YOU ARE!
Suddenly thinking about it, comment bots on Twitter, YouTube, Reddit, etc. are already quite advanced.
Imagine if nefarious actors with the resources to build their own LLM machines were to train AI to do these kinds of things, like making human-like comments and other marketing/advertising scams.
Vicky's insistent desire to role-play is hilarious. 😂
Top content. Thanks for the update.
Incredible work as usual good sir!
Great analysis! Just one thing I wish you mentioned, I think it's worth noting Anthropic removed 'Claude 3.5 Opus coming this year' from their posts. From a consumer perspective, companies seem to be shifting strategy to focus on mid-sized models, likely because they anticipate their next iteration of medium models will compete with current frontier models anyway.
Great spot
Bro, get dark mode. You are scorching my eyes here first thing in the morning
Can't wait to see a model beating humans in Simple Bench.
wow, that zoom call was, WOW.
Thank you for making these vids! You're the best
“That was weird” had me cracking up 😂
It is so hilarious when the AI avatar said the mandatory yt cc thing😹😹
A couple of months ago I switched to Claude on your recommendation that it was outperforming GPT-4, and man, it is so much better. The only things I wish it had are voice and image generation. I actually pay for the Claude membership just so I can use it for work as much as I need to.
Also I want to add: I created a design document for a simple example logo and fed it to both GPT-4 and Claude, asking for HTML and CSS that satisfies the conditions in the design document. Claude created the design perfectly, and GPT-4 was somewhat laughably bad.
Claude certainly has more personality, and is better at humour, than other AIs. I simply enjoy our conversations. One area where I've noticed GPT performing better is concisely getting a point across. It has certainly impressed me time and time again.
Hi Philip! Another great video!
*Claude said:* « You've just exposed a major logical flaw in my behavior! You're absolutely right - if you had said we were in October 2022 [instead of October 2024], I would have accepted that without question, despite that being well before my *supposed* April 2024 knowledge cutoff date. »
The most mindblowing thing about this is they didn't change the name.
Exciting and frightening at the same time...
This man is the AI hero we all need.
Thanks a lot. I'd like to propose a section on healthcare advice. Retail and aviation seem less useful for «customers», read: people 🧘🏾‍♂️ great content
"Can you still see me ?"
Hah, it really was like a Zoom call - just needed a further 5 minute back and forth with "OK, I can hear you, can you hear me ? Hello ? Oh FFS... how about now ? Now ? No, the other one, no just press it once... ONCE ! OK, here we... you can't see me now ? ... Will this go in an email ?" :).
(and the Yellowstone wander was indicative but if the first thing Claude 3.5 Sonnet did when given a coding problem was go on Stack Overflow _then_ we'd know we'd reached human level AI :)
Wow, great video! Thanks!
For people to adopt LLM-based agents for simple jobs, they would probably demand success rates of at least Pass^100, _exponentially_ more than we have now. A fundamental, qualitative change is probably needed for that. I don't see that reliability being attained in the next 18 months. Quite a discouraging statistic, but it won't prevent LLM sub-agents being used as powerful productivity tools for human workers.
Maybe not, we take for granted that leading LLMs can talk fluently, and expect it now. But I don't think GPT 2 could do that reliably enough for Pass^100; then suddenly GPT 3 did.
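To put numbers on the Pass^100 demand (my arithmetic, under the simplifying assumption of independent attempts per scenario, which the thread above suggests isn't quite true): even a 99%-reliable agent survives 100 consecutive trials only about 37% of the time, since 0.99^100 ≈ 0.366. A tiny sketch of the per-attempt reliability needed:

```python
# Per-attempt reliability p needed so that p**k >= target, assuming
# independent attempts (a simplification; real failures correlate).
def required_p(target: float, k: int) -> float:
    return target ** (1 / k)

print(required_p(0.9, 8))    # ~0.987: today's pass^8 regime
print(required_p(0.9, 100))  # ~0.9989: roughly "three nines" per attempt
```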
on behalf of Mongolia I feel offended :DDD thanks for the video, great analysis (as always)
Haha no offense intended, wanna go there myself one day!
"i'm ready when you are!!!" "that was weird" 🤣🤣🤣
Astonishing work as always Philip.
As for SimpleBench reports: if the scores are resulting from multiple testings (that includes humans), do you have also distribution graphs and or standard deviations, and percentiles (eg., model xxx on average is better than 30% of years humans etc)? Especially if the numbers of tested subject will grow.
Generally speaking , I wonder why there always seem to be no error bars, sd, distribution curves etc in benchmarks, as I assume that the numbers come from multiple testings...🤔 Or maybe there are but i only see snippets from summaries?
(Usually yours 😊)
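For what it's worth, a single benchmark score already implies an error bar if each question is treated as a Bernoulli trial; here is a sketch of the normal-approximation 95% interval (the 0.42 score and n = 200 are made-up numbers, not SimpleBench's actual setup):

```python
# 95% confidence interval for a benchmark score treated as a proportion:
# p +/- 1.96 * sqrt(p * (1 - p) / n).
import math

def score_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

print(score_ci(0.42, 200))  # ~(0.35, 0.49): wide with only 200 questions
```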
hardest thing in AI currently: naming models 🤦♂
This pass^k stuff is a great idea, however I hope it doesn't result in less creative models.
Thank you! When is the new simple bench run?
Update: got the new run results in the video. It's awesome.
I don't have a problem with your sponsors, but (maybe a good idea, maybe a total non-issue) I think you should put your sponsorship early in the video to make people aware that watching your videos may support x company.
Just a suggestion, love the videos and can't wait to see the next one!
As a watcher I love when I can get through most of the content before a sponsored spot, so interesting to hear that.
Totally agree, I might just be overthinking it; just wanted to give some constructive feedback.
Thank you for all the videos!
@@aiexplained-official some people do a quick "this video is sponsored by w/e, more on them later" early on. I don't think it's a big deal, but it can be a nice little touch
The new Claude 3.5 Sonnet seems to me to respond much more... natural? It's hard to explain but in my chats with it, it just felt so much more like someone you enjoy interacting with (and I hate using anthropomorphizing language here, but that's how it is).
This is the first one from Anthropic that can add polynomial ideals. So while I don’t buy that it can do graduate math, as a math major I am pretty impressed by the improvement.
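For anyone curious what "adding polynomial ideals" involves: the sum I + J is the ideal generated by the union of the two generator sets, and a Gröbner basis puts it in canonical form. A minimal sketch with sympy (the example ideals are mine, not from the comment):

```python
from sympy import symbols, groebner

x, y = symbols('x y')
I_gens = [x**2 - y]   # I = <x^2 - y>
J_gens = [y**2 - 1]   # J = <y^2 - 1>

# Generators of I + J are just the concatenated generator lists;
# the Groebner basis canonicalises the result.
G = groebner(I_gens + J_gens, x, y, order='lex')
print(G)  # GroebnerBasis([x**2 - y, y**2 - 1], x, y, domain='ZZ', order='lex')
```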
20:35 😂 she turned into an excited child
The AI zoom call was the most soulless thing I’d ever seen
AI voices still have a ways to go to sound totally convincing
I use Claude for my coding ❤ with Cline it's magic 🎉
"I'm ready when you are!" 👶🏻
The zoom call was hilarious
Re Sonnet's use cases: I feel it would be worthwhile spending some time on the boring enterprise uses. Sonnet has been the best coder and analyst for a long time, which has made it the workhorse for coders and document analysis. Now it's also able to remote-control computers, which means it can cover a giant set of use cases that amount to automating legacy apps or apps without APIs. Imagine boring accounting apps, EHRs, airline reservation apps, all now automatable with a prompt rather than a script. This is hundreds of millions of dollars per year of automation projects and human inefficiency being addressed directly. People are already cancelling automation projects in progress to switch gears.
That Zoom call is beyond weird 😅 Really, really uncanny, but at the same time I can see that soon it will be indistinguishable from talking to a real person
Hi Philip, thanks for the new content
You sound constipated, get a lemon ❤️
With the AGI Readiness guy leaving OpenAI, just nice to see Anthropic happily trudging along.
Fantastic work as always, and could not agree more that agentic performance and pass^n is a key indicator.
Would you consider adding this metric to the leaderboard to start looking at consistency of giving the right answer? (Acknowledging that Simple Bench is not agentic workflow focused but still)
Hmm interesting idea
Great video
I am watching this live
O1 and Claude New get about the same score on Simplebench, but do they get approximately the same subset of questions right?
Interesting, not a perfect overlap, and families perform similarly
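The overlap check being discussed boils down to set arithmetic over each model's correctly answered question ids; a toy sketch (the ids are invented, not real SimpleBench data):

```python
# Hypothetical per-model sets of correctly answered question ids.
o1_correct = {1, 2, 5, 8, 9}
claude_correct = {1, 2, 4, 8, 10}

shared = o1_correct & claude_correct
jaccard = len(shared) / len(o1_correct | claude_correct)
print(len(shared), round(jaccard, 2))  # 3 0.43: same score, different questions
```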
There's a reason we capitalize both words in a name. My brain has to work extra hard to understand what they mean when they use "Computer use" over "Computer Use".
20:36 🤣🤣
🤮
This moment for some reason left me quite disturbed.