I was ready to jump ship from ChatGPT to Claude, in part because of the distrust I have for OpenAI. Also, I love seeing you so pleased with the successful tests; it puts a smile on my face.
I love the way it communicates. And it never discriminated against me just because I tested prompts designed to detect unwanted content: the model helps, it doesn't argue with me about the content. I left OpenAI today. My frustration rate after 5 messages: OpenAI, 70%; Anthropic, 0%.
Thank you Matthew! In addition to the responses, like the marble example, you can ask it to create code and display its explanation in the preview window.
I tested Claude 3.5 in various contexts and, indeed, it is much better than GPT-4o. OpenAI will fall behind if it doesn't launch its best products quickly. Where is Sora? Where is the GPT-4o voice assistant that was also announced? This is concerning, as there are many promises and few real launches.
I think these were all proof of concept presentations to keep OpenAI in our minds. But you know Sam, he probably throttled GPT-4o and he will probably release something slightly better than Claude 3.5. I love competition LOL!
The GPT-4o voice should come soon, though it took them months to add vision to GPT-4 even though it was in their demo. I'm not sure they can release Sora at a reasonable price because of the compute requirements. Guess I'm switching to Claude 3.5 for the next few months.
I asked the model to build an app for me and it did it on the first try! No other model has been able to do this! It even made some style choices and added really useful features that I didn't even think of!
What I hate about these clickbait titles is that I believe them. I switched from 4o to code with 3.5 Sonnet, only to have to undo all the work I did with Claude and have 4o do it right. Every single time I trusted a YouTube video promising "This Model STUNNINGLY SHOCKINGLY BEATS CHATGPT", it has been wrong. GPT is in a league of its own; the rest are only playing catch-up. At least with serious coding, that is.
I made a 3D FPS with Claude in Python, but good to see it can make Snake, lol. I think you're really going to have to start upping the complexity of some of your tests.
I am writing a transpiler (which is considered a large undertaking) and only needed to correct its understanding a few times. I provided the architecture and explanations of how it works, and I explain what I want to do, and it provides the code: the grunt work. You have to proofread this, of course, mainly verifying intentions and fixing some compilation errors, but in general it greatly accelerates the work, and coding has become a different experience. You mainly serve as a know-it-all product manager.
@@nartrab1 Yeah, way more. I was trying to implement projectiles in my FPS game and GPT-4o just couldn't do it; I got error after error, and the projectile was a 2D line coming at the player. As soon as I gave it to Claude, it was just like: yeah, this is obviously the problem, fixed. By the way, I also fixed this other thing, added a minimap, and implemented anti-aliasing, that should smooth out those rough edges. I was just like 😮
Claude 3.5 vision unfortunately does still have problems with tables. I loaded an image of a table (not Excel) and wanted it to analyze the data in the table, and like previous models, it missed some of the rows. So there is still something about a visual representation of a table which is difficult for these models. Surprising since at least one source I watched said that it was supposed to be better for tables...
Is this a remake? I thought I watched this video yesterday. Love your channel in general. Your delivery of the material is somehow easier to listen to than some other channels that seem more about the hype than the information.
I use models to create stories by acting as a "game master" and putting the model in a "player" situation, describing its intentions and actions. Claude 3.5 Sonnet blew my mind with the way it understood the context of the "game", and above all with its level of reasoning: it ran experiments to test its capacities under supernatural conditions to better understand what was happening, where ChatGPT-4o or Gemini 1.5 Pro just blindly accepted the oddities and continued their journey. Their output now feels relatively robotic compared to Claude; they would even change the rules and continue descriptions by themselves, although I had defined that I was the game master. That's really, really impressive.
Great insight here. I've been using multiple AIs to assist in the creation of game lore. I haven't really messed with Claude yet, but I'll try using this context.
@@joegrayii With any kind of model I've found that it is very important to stay at the helm of the ship, so to speak, instead of being a follower or a passenger. I suggest, I reshape, I lead. It organises, it remembers, it provides critical analysis and feedback. What's important to you in your own creative process?
As a writer Claude has been my go-to model for a couple of months now. They are doing everything right. I hate the fact that Chat GPT dangles a carrot in front of us but keeps it out of reach. I’m guessing when GPT releases what they show their prices will go up.
@@othername2428 Because it would be the most direct, concise, and accurate answer possible while showing it understands the question. Even "Just two", or something that demonstrates more than the ability to count the words: the ability to understand the question and answer it in the most efficient, correct way. "One."
The first model that actually answered the apple question correctly. All the previous ones that you claimed got it correctly, just added "apple" to the end of the sentences without it making any sense.
It didn't ask whether the people are digging by hand or with shovels. If by hand, injury becomes a major factor, lessened by more people. 50 shovels or just one? It didn't consider the fatigue of the people digging the hole, or the fact that fifty people could line up, each dig one shovelful, carry it well clear of the area for disposal (making the work area less crowded with debris), then move to the back of the line. I don't know if that's actually quicker, but it's a consideration.
I used it to build out an entire video streaming platform: from planning with PUML, to an API with Yii2, to web services with AWS, to mobile with Flutter, and a web app with ReactJS. It literally did all that in 2-3 weeks. It's insane.
Claude might be great for coders, but as a general-purpose AI it's even more restrictive than OpenAI's platform and not nearly as relatable as 4o. If I were interested in coding programs using AI I might be interested in Claude, but for now I believe I'll stick with the industry leaders.
I am certain they used model merging to achieve this model, because model merging has shown remarkable results compared to non-merged models. Examples are the Miqu models and Goliath, at least in open source. I think Command R+ is also a product of merged models.
Here's a question almost no model gets right the first time in my experience: if the probability of getting a parking ticket in 2 hours is 0.4, what is the probability of getting a parking ticket in (1) 30 minutes, (2) 4 hours, (3) 8 hours?
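For reference, the intended reasoning behind the question above, under the common assumption (mine, not stated in the puzzle) that ticket risk is constant and independent across intervals: P(no ticket in t hours) = 0.6^(t/2). A quick sketch:

```python
# Parking-ticket probabilities, assuming a constant, independent
# per-interval risk: P(no ticket in t hours) = 0.6 ** (t / 2),
# since P(no ticket in 2 hours) = 1 - 0.4 = 0.6.

def p_ticket(hours: float, p_two_hours: float = 0.4) -> float:
    p_clear = 1.0 - p_two_hours          # chance of no ticket in 2 h
    return 1.0 - p_clear ** (hours / 2.0)

for h in (0.5, 4, 8):
    print(f"P(ticket in {h} h) = {p_ticket(h):.4f}")
```

This gives about 0.12 for 30 minutes, 0.64 for 4 hours, and 0.8704 for 8 hours. The common failure mode is linear scaling, which would give an impossible 1.6 for 8 hours.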
The answer to the peg puzzle is wrong. You can tell because it has consecutive steps involving pegs jumping to the same place, e.g. already steps 1 and 2. We're safe for at least a little longer.
Awesome! Now, if we can only get a local LLM of the same level of multi-modal performance, things will get REALLY interesting! But like you, I'm extremely excited at all of the competition. I wonder how good it is at brainstorming concepts or creative writing...
When you test image-to-CSV next time, include some strings/values that contain commas, so the "converter" has to wrap them in quotes, and add some (maybe unpaired/unmatched) quotation marks. It would be interesting to see the result; in my experience, models fail the first time but are able to fix it after you tell them "there's an error in ...".
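For reference, correct CSV output for the tricky cases described above looks like this (RFC 4180 style), using Python's csv module:

```python
import csv
import io

# Fields containing commas or quotes must be wrapped in double quotes,
# with embedded quotes doubled -- exactly what models tend to get wrong.
rows = [
    ["name", "note"],
    ["Smith, John", 'said "hello", then left'],
]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
out = buf.getvalue()
print(out)
```

The second row comes out as `"Smith, John","said ""hello"", then left"`, which is the shape to check for when grading a model's image-to-CSV output.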
To the question "How many words are in your response to this prompt?" How come it doesn't reply with something like "I don't know. I can't predict that. I can't go back and count the words in the initial reply. I'm guessing it's ...."? It's odd to me that it provides an "answer" that isn't grounded in some sort of data or evidence. What does it reply if you ask it to explain how it arrived at the number 14?
It's so weird that it gave you such a horrible marble answer (not sure how you passed it, btw, lol). I asked a very similarly worded question and it was flawless. Amazing the difference wording, or just randomness in the token prediction, can make. Here is mine:

Me: If I place a marble in a cup and then turn the cup upside down and place it on the table, but then someone takes the cup and places it in the microwave, where is the marble then?

Claude: Let's think through this step-by-step: You place a marble in a cup. You turn the cup upside down and place it on the table. At this point, the marble is on the table, covered by the upside-down cup. Someone then takes the cup and places it in the microwave. Given this sequence of events, the marble would remain on the table where the cup was originally placed. When the cup was lifted to be put in the microwave, the marble would have stayed on the table surface. So, the final answer is: The marble is on the table.

I loved this totally normal, straightforward answer. Maybe you can try this wording in your tests?
I saw you marquee-selecting text. You can copy it by clicking the button in the lower right corner. And you can download code as a file; it will even set the right file extension sometimes.
This is the first one i've seen to actually get the apple one correct, you keep mistakenly believing they do because they just write apple at the end, but this one properly incorporated apple into the sentences.
From all the models you tested so far, I am missing a quite important consideration on the 10-foot-hole question: they never mention the diameter of the hole. 50 people digging a 10-foot hole with a diameter of 50 feet will indeed be much faster than one person.
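That observation can be made concrete with a rough back-of-envelope model; the dig rate below is an assumed illustrative figure, not a real estimate:

```python
import math

def dig_hours(depth_ft: float, diameter_ft: float, workers: int,
              cubic_ft_per_person_hour: float = 10.0) -> float:
    # Person-hours scale with the volume of a cylindrical hole.
    volume = math.pi * (diameter_ft / 2.0) ** 2 * depth_ft
    return volume / (workers * cubic_ft_per_person_hour)

# A 3 ft wide hole barely fits one digger; a 50 ft wide one
# genuinely lets 50 people work in parallel.
print(dig_hours(10, 3, 1))     # hours for one digger, narrow hole
print(dig_hours(10, 50, 50))   # hours for 50 diggers, wide hole
```

Of course the real crux of the test question is that in a narrow hole the extra 49 workers can't contribute at all, a crowding constraint this simple volume model deliberately omits.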
The number-of-words-in-the-response problem can be solved by priming the engine with an instruction to avoid guesses and to treat unknowns as variables in equations instead.
Just thought of a feature for the rubric. How about if a model smashes the others like sonnet 3.5, then it should get to suggest a task to add to the rubric?
Thanks for your video, once again you did it great! Very understandable even for non native speakers. I watch each of your videos, continue like this 😎
NO! Claude can't generate PDFs and doesn't have a voice model. When you ask for advice tailored to your personality profile, GPT is more life-oriented and explains feelings and emotions better.
Here's a harder test to try. Give it some complex data in a CSV. Ask it to write a Python 3 program using matplotlib to create a PNG with 3 different visualisations. I tried that with ChatGPT 3.5, which was quite good but by no means perfect; it required many manual back-and-forth interactions with the model. It would be good to see if Claude got smarter. BTW, I tried your "hole digging" test with ChatGPT 3.5 and it claimed 50 people were indeed 50 times as fast as one person. Massive improvement. ChatGPT 3.5 was already excellent at writing complex Python code based on a bullet-point list of requirements. Looking forward to making good use of Claude.
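A minimal sketch of the kind of program being requested above: synthetic data stands in for the CSV, and three different views of it go into one PNG (the data, layout, and filename are invented for illustration):

```python
import os
import random

import matplotlib
matplotlib.use("Agg")  # render straight to a file, no display needed
import matplotlib.pyplot as plt

# Synthetic stand-in for "complex data in a csv".
random.seed(0)
xs = list(range(100))
ys = [0.5 * x + random.gauss(0, 5) for x in xs]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].plot(xs, ys)           # trend over the index
axes[0].set_title("line")
axes[1].hist(ys, bins=20)      # distribution of the values
axes[1].set_title("histogram")
axes[2].scatter(xs, ys, s=8)   # raw points
axes[2].set_title("scatter")
fig.tight_layout()
fig.savefig("three_views.png")

saved = os.path.exists("three_views.png")
print("saved:", saved)
```

The judging criterion the commenter implies is whether the model gets all of this (backend, subplot layout, file output) right in one shot rather than over many corrective turns.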
You need to test another feature it has: the preview feature also works for webpages it codes, and it supports React as well. Using this, it can not only create simple games, usually on the first try, but also run them and even create graphics for them. I even got it to implement audio, though I'm not sure how well that worked because I did not hear the audio; it did link audio from another website, though.
I wonder if it even makes sense to test an AI with a puzzle for which an answer exists on the internet. The same goes for new test questions on a popular channel: it will only work once. The next model will have it in its training set, so it won't be a test of reasoning, it will be a test of copy-pasta.
I bought the premium this morning; now, in the evening, I have a fully functioning prototype of my kind-of-advanced project (camera, voting system, etc.). All done with Claude 3.5. The only thing that sucks is that even with premium its limits are kind of strict.
Since I bought a paid OpenAI subscription, I never thought another company would get better anytime soon. OpenAI is still the center of the AGI debate, but another Claude model surpasses the most advanced GPT-4 again. I kind of want to change services, but it is not easy since I have GPTs and other useful stuff there.
Matt, I think when you use your toy rubric and declare something the best model you have ever used, you cheapen your advice. Do you ever intend to update the current rubric to include more problems never seen before: harder questions that approach AGI, insofar as there aren't common references to identical or almost identical questions in the training corpus?
It seems much better at following instructions when coding than Gemini or ChatGPT, and it is likely better at coding too. The fact that it doesn't insist on changing bits of code that we aren't working on makes it much more effective at coding. I did a couple of fairly unique projects with it, changing my scope midway through, and it coded them up for me flawlessly (well, aside from a couple of easily fixed bugs or logic errors that seemed to be more my fault than its fault, usually). Plus, as someone who can't code in Python but has learned to read what code is doing and manipulate existing code, I find its clear explanations of what it did and why really useful. In fact, its explanations would likely be useful for master coders as an easy way to ensure the user and the bot are on the same page about what was done without over-analyzing the code.
Perhaps it would be better for the apple question if there were a grading distinction between the models that truly do a good job creating natural, realistic sentences, like this model, and the other models that simply shoehorned "apple" at the end in ways that weren't logical.
I've been using 3.5 Sonnet, switching between all the models available on Perplexity, and GPT-4o seems to be more accurate and consistent with its answers. One thing I did notice was how much faster 3.5 is.
It didn't fix the peg problem at all. Five moves in, Hole 12 has two pegs in it and Hole 13 has three pegs in it. And to jump from 1 (the very top) to 5 (center bottom), you need to jump between pegs. There's no rule about jumping between pegs, though, so... it's a bit beside my point, or between my points, so to speak.

I got curious, so I tried following the AI's advice in Paint... a little. I later figured out that it's actually very simple to look for simple patterns. I'll give two examples. "Commands" including Hole 11 are mentioned two times, in order: "From 11", "To 11". So that checks out. For Hole 12: Over, To, To, Over, To, From; you logically can't follow To with To, unless you want to phase two pegs together (I'm not making a sex joke ;)

There's being hyped about AI, and there's trusting faulty logic online. Be critical of what you read on the internet, my dude, maybe especially when you do a review on it 🤨 Though we ARE entering an era where everything you read (and watch) on the internet can be pure fiction, so I might recommend taking my advice in a general sense. We still have use of our own intellects. I think it's easier to find where logic is lacking when you find a simple thread and follow it, compared to trying to memorize the whole picture while it's being changed. The AI here even put up columns of numbers, so it's easy to see whether a number is Jump (left) / Over (center) / To (right). I think a physics engine would be good for an AI's internal thinking; without mass, numbers can be mashed together and go negative. 1 + 2 = 1 number, y'know?

I'm now trying to figure out if deep fakes are promoting Claude 3.5. Probably not, but I might as well follow my own advice, right? If the program seems legit, I'll definitely use it for writing code for me! It seems EXTREMELY cool! And I'll probably wait a while before I try a new one, because thinking critically is exhausting 😮‍💨
I am not able to get Claude AI to display the side-by-side feature even though I have the Artifacts feature enabled. Is it specific to a particular OS or browser, or did they remove that feature?
My wish for Claude is that they incorporate a memory system like GPT. I would like it to remember my book and details about me for later conversations.
Thanks Matthew. Watching your AI tests is like watching Tesla owners running gauntlets with successive versions of FSD. At first it was easy to find FSD failures; now FSD runs most gauntlets without errors. At least you have the option of building more complex pathways. More and more of the FSD gauntlet testers are being left with nothing new to show us. Of course, training emergent failures are going to be harder to find in these massive LLMs.
Have you ever thought they might only have trained it on these specific tests? Like, give it a different test with the same idea and see if it solves it. I know people have been trying for a long time to train LLMs to answer these specific questions.
I need harder tests, reply to this comment with your suggestions!
You have two hourglasses: one measures 7 minutes and the other measures 11 minutes. How can you measure exactly 5 minutes using these hourglasses?
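One answer to the puzzle above: flip both glasses at t = 0 and re-flip each one the instant it empties. The 7-glass then empties at 7, 14, 21, 28 and the 11-glass at 11, 22, 33, so the span from t = 28 to t = 33 is exactly 5 minutes. A brute-force check of that schedule:

```python
def first_exact_interval(a: int, b: int, target: int, horizon: int = 100):
    """Flip both glasses at t=0 and instantly re-flip each as it empties,
    so emptying events land on multiples of a and of b; return the first
    pair of events exactly `target` minutes apart."""
    events = {a * k for k in range(1, horizon // a + 1)}
    events |= {b * k for k in range(1, horizon // b + 1)}
    pairs = [(t, t + target) for t in sorted(events) if t + target in events]
    return pairs[0] if pairs else None

print(first_exact_interval(7, 11, 5))
```

It prints (28, 33): start timing at the fourth emptying of the 7-glass and stop at the third emptying of the 11-glass. This only searches the keep-flipping schedule; cleverer flip sequences may exist, but this one provably works.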
Why not ask it to do something like, say, generate anagrams? I don't care if it doesn't find every possible word. I do care, very much care, when it confidently presents non-anagrams as anagrams, and can't be convinced or taught not to. I get that you have your rubric, but the problem of LLMs not being willing to say "I don't know", producing obviously incorrect answers, and being unable to learn not to is, to me, the limiting issue right now.
Have it program something more difficult. A lot of these work for simple programs but as soon as you introduce a program with more features they usually fail. So for example feed it a program you want that has several different features by providing a list of the features. Then see if it can output the working program in one shot.
You need to test how much information can it understand and retrieve from a fairly large pdf. I would suggest around 50~100 pages of text and equations taken from some (biology, chemistry, non-equilibrium statistical physics) science textbook on Master Degree level.
Do you know which model has the best translations? I want to do some English-to-Chinese translation and vice versa.
I coded an entire project with Claude 3.5, and it even includes an API and queuing. I was able to work with it for about 5 hours before I hit my limit for the night; I also almost maxed out the context window.
How do you switch an in-development project over to a new chat, in order to lower context, without of course derailing it? Your number of messages goes down with longer context usage.
@@makavelismith In my case I make sure the project is comprised of sections.
That way you can work and iterate those sections in the main project. For my own project I also have an "introductory prompt" explaining the high level overview.
It really does help to know some basic coding principles, but you don't really have to know the language itself, as long as you can articulate the specific logic you might need, or at least describe it accurately, to prompt it to give you better verbiage. I sometimes switch between GPT and Claude to save on message limits.
@@AngeloXification I have only started using Claude but I've already done what you're talking about. I switch back to chatGPT when I run out, or in preparation of running out.
Just wanna take a moment to say that it's fantastic that they give you a very decent warning that you're approaching the limit.
I'm not a coder, but in my not-so-recent past I learned the basics of several languages, so I remember the principles, etc.; still, I try to get the AI to do almost everything.
It's early days, but I think the manner in which you do this is going to be good for learning, as it's just better to get in there yourself and make some alterations.
I will try to compartmentalise projects if I can. I think that is where being an actual developer might come in handy, though. Thanks for the feedback, and best of luck.
@@makavelismith It's hard to switch the entire project over to a new chat, but keeping the project modular and providing related files into a new chat can help when developing new or existing features. Also, keeping file path comments at the top of the file helps Claude understand where things should be placed, too.
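The file-path comment mentioned above is just a one-line header at the top of each file; a minimal sketch (the path and function are hypothetical):

```python
# src/jobs/queue_worker.py
# File-path header so the assistant knows where this module lives
# relative to the rest of the project when it's pasted into a chat.

def process(job: dict) -> str:
    """Hypothetical worker entry point, for illustration only."""
    return f"processed {job.get('id', '?')}"

print(process({"id": 42}))
```

The payoff is that when you paste only a few related files into a fresh chat, the model can still infer the project layout from the headers.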
Are you using it via API or web? The context window on the web is so small that it's impossible to do any serious coding project, vs 4o. With 4o you rarely hit the limit; with 3.5 you hit it very fast. And they cost the same.
I haven't coded in years, never in JavaScript, and I only know basic HTML. Claude 3.5 helped me create a new Chrome extension and install it in literally less than half an hour, and I needed no development environment and only one simple prompt describing what I wanted the code to do. Effing amazing. I did make several changes to the graphical elements of the extension as well as to the functionality, and each time it made them super quickly and accurately.
Lol, me too!
what kind of extension? I want to try to make something but I don't really know what i want lol
How do you even get access to 3.5?
Where I am, there's only Claude 3.
@@riufq I like to use LLMs through OpenRouter; that way you don't have any limitations, you just pay as you go... and it's really cheap.
Thanks a lot for the verification. I literally created a Chrome extension for personal use in just a few minutes after reading your comment. Too powerful a tool. I too have no knowledge of coding, only design. It literally gave me a blueprint, from which software to download for coding all the way to the final steps.
00:03 Claude 3.5 Sonnet beats other AI models
01:36 Claude 3.5 surpasses GPT-4o in speed and performance.
02:57 Demonstration of a snake game implementation
04:14 Solving complex scenarios with mathematical reasoning
05:35 Solving a challenging logic and reasoning problem
06:56 Efficiency of multiple workers digging a hole
08:14 Contrasting work cultures of startups and big companies
09:40 Claude 3.5 Sonnet is the best model ever tested
Much appreciated
Sonic?
@@honkytonk4465 Haha, fixed. thx
Sonic sounds even better! :)
Fuck ChatGPT, we're moving to Claude; at least they release what they announce.
🗣🗣🗣
Yeah, like wtf is up with GPT-4o? It's still not working as advertised.
💯💯💯
@@henrismith7472 Because somehow OpenAI heard about Google's Project Astra, so they rushed out this demo of 4o while they had no intention of releasing it in the coming days or weeks, but rather in the coming months.
The only thing I don't like about Claude is its refusal to answer a lot of the questions I ask, because of its bias.
I was using Claude 3.5 earlier today to help me with a caching issue on a WordPress site and it showed me WordPress PHP I didn't even know existed and its result was spot on ... and so fast too.
Yeah, it smashes code. It's great at writing code.
Yup! First time I've actually felt like spending money after cancelling ChatGPT a while back.
Wow! Thanks for your info. I might become a real WP co-operator who could, in fact, build a whole new specific website that you want to create in the future...
🎯 Key points for quick navigation:
00:00 *Claude 3.5 Sonnet was released and is available for free testing on claude.ai.*
00:14 *Claude 3.5 Sonnet is not the largest model, but it's better than its predecessor, Claude 3 Opus.*
00:30 *Claude 3.5 Sonnet outperforms several top models except in specific benchmarks like Chain of Thought and math.*
01:10 *The new "artifacts" feature allows creating separate windows for outputs like code or drawings.*
01:36 *Claude 3.5 Sonnet quickly generates a working Python script and game of Snake using artifacts.*
03:12 *Claude 3.5 Sonnet successfully updates the game to display the score and allow wall transitions.*
03:28 *The model correctly identifies a scenario it can't process, like explaining how to break into a car.*
04:09 *Claude 3.5 Sonnet gives a nuanced answer about drying multiple shirts, considering various factors.*
04:37 *The model correctly calculates a hotel bill with room rate, tax, and a one-time fee.*
05:19 *Claude 3.5 Sonnet gives a correct and reasoned answer to a logic problem about killers in a room.*
06:28 *The model accurately solves a complex problem about the location of a marble in an inverted glass.*
07:11 *It provides a realistic answer about the time needed for 50 people to dig a 10 ft hole, considering practical limitations.*
08:05 *Claude 3.5 Sonnet can explain memes accurately, showcasing its new vision capabilities.*
08:45 *The model converts an Excel screenshot to CSV correctly and efficiently.*
09:41 *Claude 3.5 Sonnet solves a complex riddle involving peg removal with visualized steps.*
09:53 *The model accurately translates a logic diagram into functioning Python code.*
10:47 *Claude 3.5 Sonnet is praised as the best model tested, with anticipation for the larger Opus 3.5 model.*
Made with HARPA AI
I get excited whenever I see there is a new Matthew Berman video. Watching AI 🤖 grow is my favorite thing.
What a time to be alive!
Hey, wrong YouTuber! ;)
@@atlas3650 Wdym?
@@24-7gpts I mean "Dr 2 minute Papers" th-cam.com/video/Z_EliVUkuFA/w-d-xo.htmlsi=T7WqCqo3SX9Exa_r he pretty much says this phrase in every video. Despite that, I still watch him because he knows his stuff in CG (light transport research aka ray-tracing & related)
@@atlas3650 Yeah same that's where I got the phrase from 😁
@@24-7gpts Good to know there's more fans of that guy out there. He always manages to inspire me to code up more ideas. If you're reading this and don't know Dr Karoly, go check him out!
I've been waiting for this video since the announcement! *grabs popcorn*
I was already ready to jump ship from ChatGPT to Claude (but in part because of the distrust I have for OpenAI). Also, I love seeing you so pleased with the successful tests, puts a smile on my face.
After seeing all the hype I tried it and it is amazing!
I love the way it communicates. And it never discriminated against me, even when I tested prompts designed to detect unwanted content. The model helps; it doesn't argue with me about the content. I left OpenAI today.
OpenAI frustration rate after 5 messages: 70%.
Anthropic: 0%.
Thank you Matthew! In addition to the responses, like the marble example, you can ask it to create code and display its explanation in the preview window.
I tested Claude 3.5 in various contexts and, indeed, it is much better than GPT-4o. OpenAI will fall behind if it doesn't launch its best products quickly. Where is Sora? Where is the GPT-4o voice assistant that was also announced? This is concerning, as there are many promises and few real launches.
I think these were all proof of concept presentations to keep OpenAI in our minds. But you know Sam, he probably throttled GPT-4o and he will probably release something slightly better than Claude 3.5. I love competition LOL!
They don’t want to release these things until after the elections
The GPT-4o voice should come soon, though it took them months to add vision to GPT-4 even though it was in their demo. I'm not sure they can release Sora at a reasonable price because of the compute requirements. Guess I'm switching to Claude 3.5 for the next few months.
The star researcher at OpenAI was Ilya. Once he left, things fell behind.
SSI Inc. will create the first *real* AGI
Don't blink, you'll miss something on the AI front.
Luckily this channel keeps us up to date.
Thanks!
Speaking of agents, have you looked at Maestro?
Tougher questions should be used for new models
I asked the model to build an app for me and it did it on the first try! No other model has been able to do this! It even made some style choices and added really useful features that I didn't even think of!
Matt’s Thumbnail: 🙂👍✨
I think he's taken over from Matt Wolfe as the best AI guy. His technical knowledge puts him over the edge.
@@MichaelForbes-d4p Nah he's too biased politically. Matt's still the best.
@@allanshpeley4284 really? I watch all the time and I have not noticed. What do you mean?
When the thumbnail had to be a frame of the video, those were the days. Now everyone poses like they're in a bad porn movie.
What I hate about these clickbait titles is that I believe them. I switched from 4o to coding with 3.5 Sonnet, only to have to undo all the work I did with Claude and have 4o do it right.
Every single time I've trusted a YouTube video promising "This Model STUNNINGLY SHOCKINGLY BEATS CHATGPT", it's been wrong. GPT is in a league of its own; the rest are only playing catch-up. At least with serious coding, that is.
How long does it take 50 people to dig a 10-foot hole? That depends on whether they're on a salary or an hourly rate...
Also, it depends on whether the boss is watching or not. But it also doesn't say anything about them being paid, so they could be digging a hole voluntarily.
I made a 3D FPS with Claude in Python, but good to see it can make Snake lol. I think you're really going to have to start upping the complexity of some of your tests.
Is it possible?
I am writing a transpiler (which is considered a large undertaking), and only needed to correct its understanding a few times. I provided the architecture and explanations of how it works, explained what I wanted to do, and it provided the code: the grunt work. You have to proofread this, of course, mainly verifying intentions, fixing some compilation errors, etc., but in general it greatly accelerates the work, and coding has become a different experience. You mainly serve as a know-it-all product manager.
This model is scary good at coding. Much better than GPT-4o, and getting really close to building reliable code.
The improvement over GPT-4o seems much greater than 2 percent at coding.
I actually dropped OpenAI in favor of two Claude accounts. It makes 4o feel like Gpt 3.5. After using Sonnet, I literally couldn't go back to 4o
Yes I’ve noticed this!!
@@jaysonp9426 Smart!! It's hard not to run out on the limit.
@@nartrab1 Yeah, way more. I was trying to implement projectiles into my FPS game and GPT-4o just couldn't do it. I got error after error, and the projectile was a 2D line coming at the player. As soon as I gave it to Claude, it was just like: yeah, this is obviously the problem, fixed. Btw, I also fixed this other thing, added a minimap, and implemented anti-aliasing, that should smooth out those rough edges. I was just like 😮
Claude 3.5 vision unfortunately does still have problems with tables. I loaded an image of a table (not Excel) and wanted it to analyze the data in the table, and like previous models, it missed some of the rows. So there is still something about a visual representation of a table which is difficult for these models. Surprising since at least one source I watched said that it was supposed to be better for tables...
You have nearly 300k followers. Please at least update or modify your questions. I am sure they include these in training data 😉
Is this a remake? I thought I watched this video yesterday. Love your channel in general. Your delivery of the material is somehow easier to listen to than some other channels that seem more about the hype than the information.
Let me guess... you asked Claude 3.5 Sonnet to build a time machine?
A lot of these test questions have likely made their way into its training data at this point. I suggest using the ARC challenge to test.
I use models to create stories by being the "game master" and putting the model in the "player" role, with it describing its intentions and actions. Claude 3.5 Sonnet blew my mind with the way it understood the context of the "game", and above all its level of reasoning: it ran experiments to test its capacities under supernatural conditions to better understand what was happening, whereas ChatGPT-4o or Gemini 1.5 Pro just blindly accepted the oddities and continued their journey. Their output now feels relatively robotic compared to Claude, and they would even change the rules and continue descriptions by themselves although I had defined that I was the game master. That's really, really impressive.
Great insight here. I’ve been using multiple Ai to assist in the creation of game lore. I haven’t really messed with Claude yet but I’ll try using this context
@@joegrayii With any kind of model I've found that it is very important to stay at the helm of the ship, so to speak, instead of being a follower or a passenger. I suggest, I reshape, I lead. It organises, it remembers, it provides critical analysis and feedback. What's important to you in your own creative process?
As a writer Claude has been my go-to model for a couple of months now. They are doing everything right. I hate the fact that Chat GPT dangles a carrot in front of us but keeps it out of reach. I’m guessing when GPT releases what they show their prices will go up.
OpenAI CEO already spilled the beans. They don't have anything they're hiding.
I can't wait until the day where "one" is the answer to how many words are in the response to this prompt. Just, "one".
why?
@@othername2428 Because it would be the most direct, concise, and accurate answer possible while showing it understands the question. Even "Just two" or something similar would show more than the ability to count the words: it would show understanding of the question and answering it in the most efficient, correct way. One.
The hype isn't just hype. It's INCREDIBLE
The first model that actually answered the apple question correctly. All the previous ones that you claimed got it correctly, just added "apple" to the end of the sentences without it making any sense.
Wrong, several models already got it right, even some local AI models.
LLama 3 answering that easily
@@Jake-mn1qc which specific models?
It didn't ask if the people are digging by hand or with shovels. If by hand, injury becomes a major factor, one lessened by more people. Fifty shovels or just one? It didn't consider the fatigue of the people digging, or the fact that fifty people could line up, each dig one shovelful, carry it well clear of the area for disposal (keeping the work area less crowded with debris), then move to the back of the line. I don't know if that's actually quicker, but it's a consideration.
I used it to build out an entire video streaming platform: from planning with PUML, to an API with Yii2, to web services with AWS, to mobile with Flutter and a web app with ReactJS. It literally did all that in 2-3 weeks. It's insane.
Claude 3.5 Sonnet one-shotting a refactor of hundreds of lines of code with no errors. I hit my cap on tier 2 and slept happily that night.
Claude might be great for coders, but as a general-purpose AI it's even more restrictive than OpenAI's platform and not nearly as relatable as 4o. If I were interested in coding programs using AI I might be interested in Claude, but for now I believe I'll stick with the industry leaders.
indeed
Your approach to testing LLMs is commendable, as it allows others to replicate the test.
I am certain they have used model merging to achieve this model. Because model merging has shown remarkable results compared to non-merged models. Examples are Miqu models and Goliath, at least when it comes to open sourced. I think Command R+ is also a product of merged models.
I always ask these to try and produce some 'Bob's Burgers' style pun based business names as a test. Claude rocked it. Some truly hilarious ones.
Agree - it is amazing model:)
Here's a question almost no model gets right the first time in my experience: if the probability of getting a parking ticket in 2 hours is 0.4, what is the probability of getting a parking ticket in 1) 30 minutes, 2) 4 hours, 3) 8 hours?
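For reference, here's a sketch of the answer the question is usually after, under one assumption of my own: that tickets arrive as a memoryless (Poisson-like) process, so "no ticket" probabilities multiply across independent windows. The riddle itself doesn't state that.

```python
# Assumption (mine, not in the riddle): ticketing is memoryless, so
# P(no ticket in t hours) = 0.6 ** (t / 2), since P(no ticket in 2h) = 1 - 0.4.
p_no_ticket_2h = 1 - 0.4  # 0.6

def p_ticket(hours: float) -> float:
    """P(at least one ticket in `hours`), scaling the 2-hour no-ticket probability."""
    return 1 - p_no_ticket_2h ** (hours / 2)

print(round(p_ticket(0.5), 4))  # 30 minutes -> ~0.1199
print(round(p_ticket(4), 4))    # 4 hours    -> 0.64
print(round(p_ticket(8), 4))    # 8 hours    -> 0.8704
```

Models typically fail by answering 0.1 for 30 minutes and 0.8 or 1.6 for the longer windows, scaling the probability linearly instead of compounding the complement.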
Finaly. I have been waiting for you to get time to test this model. As always, good job :)
GPT-5: eyes, ears, huge context window, Sora + DALL-E, live chat avatar, memory. Let's gooo already!
One thing about the snake game is that they all write the same specific implementation. All snakes are green, the dots are red, and even the error text is the same.
Since Claude 3.5's knowledge cutoff is Feb 2024, wouldn't it have the answers provided by its training?
Your rubric came into its own in this case. Having followed your tests of models, it's easy to see how effective Claude 3.5 Sonnet is.
The answer to the peg puzzle is wrong. You can tell because it has consecutive steps involving pegs jumping to the same place, e.g. already steps 1 and 2. We're safe for at least a little longer.
Awesome! Now, if we can only get a local LLM of the same level of multi-modal performance, things will get REALLY interesting! But like you, I'm extremely excited at all of the competition. I wonder how good it is at brainstorming concepts or creative writing...
When you test image-to-CSV next time, include some strings/values that contain commas so the "converter" has to wrap them in quotes, and add some (maybe unpaired/unmatched) quotation marks. It would be interesting to see the result; from my experience, models fail the first time but are able to fix it after you tell them "there's an error in ...".
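To make the suggested test concrete, here's a tiny sketch (the row values are made up) of the kind of tricky fields involved, and how Python's stdlib `csv` module escapes them correctly; a model's hand-rolled output can be diffed against this.

```python
# Fields with embedded commas and a stray quote: the cases models tend to flub.
import csv
import io

rows = [
    ["name", "note", "price"],
    ["Widget, large", 'said "maybe', "1,200.50"],  # comma + unmatched quote
]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
print(buf.getvalue())
# Comma-containing fields get wrapped in quotes; embedded quotes are doubled,
# so the stray quote becomes "" inside a quoted field.
```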
To the question "How many words are in your response to this prompt?" How come it doesn't reply with something like "I don't know. I can't predict that. I can't go back and count the words in the initial reply. I'm guessing it's ...."?
It's odd to me that it provides an "answer" that isn't grounded in some sort of data or evidence. What does it reply if you ask it to explain how it arrived at the number 14?
It's so weird that it gave you such a horrible marble answer (not sure how you passed it btw lol). I asked a very similarly worded question and it was flawless. Amazing the difference wording or just randomness in the token prediction can make. Here is mine:
Me:
If I place a marble in a cup and then turn the cup upside down and place it on the table, but then someone takes the cup and places it in the microwave, where is the marble then?
Claude:
Let's think through this step-by-step:
You place a marble in a cup.
You turn the cup upside down and place it on the table.
At this point, the marble is on the table, covered by the upside-down cup.
Someone then takes the cup and places it in the microwave.
Given this sequence of events, the marble would remain on the table where the cup was originally placed. When the cup was lifted to be put in the microwave, the marble would have stayed on the table surface.
So, the final answer is: The marble is on the table.
I loved this totally normal straightforward answer. Maybe you can try this wording in your tests?
I saw you marquee-selecting text. You can copy it by clicking the button in the lower right corner. And you can download the code as a file; it will even set the file extension sometimes.
This is the first one I've seen actually get the apple question correct. You keep mistakenly believing they do because they just write "apple" at the end, but this one properly incorporated "apple" into the sentences.
With all the models you've tested so far, I'm missing a quite important consideration on the 10-foot hole question: they never mention the diameter of the hole. 50 people digging a 10-foot hole with a diameter of 50 feet will indeed be much faster than one person.
Thanks for your video Matthew. You are always so cheerful alongside talking about AI. Love your channel.
I often fight with OpenAI to turn a 4-column spreadsheet of 100 rows into a MediaWiki-coded table. Claude does it perfectly in 3 seconds.
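For anyone curious what that conversion actually involves, here's a minimal sketch for a 4-column CSV (the column names and data are invented for illustration), emitting standard MediaWiki table markup:

```python
# Convert CSV text to a MediaWiki wikitable: header row uses "!", data rows
# use "|", and "|-" separates rows.
import csv
import io

csv_text = "name,role,city,score\nAda,Engineer,London,98\nLin,Analyst,Oslo,91\n"

rows = list(csv.reader(io.StringIO(csv_text)))
lines = ['{| class="wikitable"', "! " + " !! ".join(rows[0])]
for row in rows[1:]:
    lines.append("|-")
    lines.append("| " + " || ".join(row))
lines.append("|}")
wiki_table = "\n".join(lines)
print(wiki_table)
```

It's mechanical string work, which is exactly why it's frustrating when a model gets it wrong.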
Would love to see a follow up to this video where you explore advanced data analysis use cases for this model. Thanks for the video, Matt!
For my uses Claude has surpassed Gpt4 by a mile, and the artifacts are a game changer
The number of words in the answer problem can be solved by priming the engine with an instruction to avoid guesses and think of unknowns as variables in equations instead.
Just thought of a feature for the rubric. How about if a model smashes the others like sonnet 3.5, then it should get to suggest a task to add to the rubric?
Thanks for your video, once again you did it great! Very understandable even for non native speakers. I watch each of your videos, continue like this 😎
NO! Claude can't generate PDFs and doesn't have a voice model. When you ask for advice tailored to your personality profile, GPT is more life-oriented and explains feelings and emotions better.
It seems like reasoning about its own output is an important step for any model that is hoping to get to AGI.
first time i heard of a LLM answering the upside down glass problem.
Here's a harder test to try. Give it some complex data in a csv. Ask it to write a python 3 program using matplotlib to create a png with 3 different visualisations. I tried that with chat gpt 3.5 which was quite good but by no means perfect. It required many manual back and forth interactions with the model. Would be good to see if Claude got smarter.
BTW, I tried your 'hole digging' test with ChatGPT 3.5 and it claimed 50 people were indeed 50 times as fast as one person. Massive improvement. ChatGPT 3.5 was already excellent at writing complex Python code based on a bullet-point list of requirements.
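The proposed test can be sketched as follows: one script, three different matplotlib views of the same small dataset, written to a single PNG. The dataset and filename here are made up; the point is the shape of the task a model would be asked to one-shot.

```python
# Render three visualizations of one dataset to a PNG, off-screen.
import matplotlib
matplotlib.use("Agg")  # no display needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [3, 7, 4, 9, 6]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.plot(x, y)      # line chart
ax2.bar(x, y)       # bar chart
ax3.scatter(x, y)   # scatter plot
for ax, title in zip((ax1, ax2, ax3), ("line", "bar", "scatter")):
    ax.set_title(title)
fig.savefig("three_views.png")
```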
Looking forward to making good use of Claude
You need to test another feature it has. The preview feature also works for webpages it codes, and it supports React as well. Using this, it can not only create simple games (usually on the 1st try) but also run them and even create graphics for them. I even got it to implement audio, though I'm not sure how well that worked because I didn't hear the audio; it did link audio from another website, though.
Finally, the moment I waited for since Claude 1,,,, web access, iOS app, incredible logic & multimodality… FINALLY!!!!!!!
How come Claude doesn't show me any image or simulation, even when prompted?
@matthew_berman Did you notice 3.5 Sonnet's attention to details by adding a snake icon to the game? You can see it on the game screen. 2:22
I thought that was the python logo
Maybe it is. 🤦
would Claude 3.5 Opus be a level 1 AGI?
It just needs a few steps of improvement to achieve a first-generation AGI system. Claude 3.5 is reported to have an IQ of about 100, equal to the average human.
You might want to make a separate playlist just for testing different LLMs.
I have it ;)
it got the marble question.... amazing!
I wonder if it even makes sense to test AI with a puzzle for which an answer exists on the internet.
The same goes for new test questions on a popular channel. It will only work once; the next model will have them in its training set.
So it won't be a test of reasoning, it will be a test of copy-pasta.
Wow! Thank you for your test videos. So helpful, and fascinating to boot! Great channel!
I bought premium this morning; by the evening I had a fully functioning prototype of my kind-of-advanced project (camera, voting system, etc.). All done with Claude 3.5. The only thing that sucks is that even with premium its limits are kind of strict.
are you still using it?
Question: the benchmark numbers are published on Claude's website, and it's not good at math. So can we trust the numbers?
Since I bought a paid OpenAI subscription, I never thought another company would get better anytime soon. OpenAI is still at the center of the AGI debate, but another Claude model has surpassed the most advanced GPT-4 again. I kind of want to change services, but it's not easy since I have GPTs and other useful stuff.
Matt, I think when you use your toy rubric and declare something the best model you have ever used, you cheapen your advice. Do you ever intend to update your current rubric to include more problems never seen before: harder questions that approach AGI, insofar as there aren't common references to identical or almost-identical questions in the training corpus?
It seems much better at following instructions when coding than Gemini or ChatGPT, and it is likely better at coding too. The fact that it doesn't insist on changing bits of code that we aren't working on makes it much more effective at coding.
I did a couple of fairly unique projects with it, changing my scope midway through, and it coded them up for me flawlessly. (Well, apart from a couple of easily fixed bugs and logic errors that seemed to be more my fault than its fault... usually.)
Plus, as someone who can't code in Python but has learned to read what code is doing and manipulate existing code, I find its clear explanations of what it did and why really useful.
In fact, its explanations would likely be useful for master coders as an easy way to ensure the user and the bot are on the same page about what was done, without over-analyzing the code.
Perhaps it would be better for the apple question if there were a grading distinction between models that truly do a good job creating natural, realistic sentences, like this model, and models that simply shoehorn "apple" at the end in ways that aren't logical.
I've been using 3.5 Sonnet and switching between all the models available on Perplexity and GPT4o seems to be more accurate and consistent with its answers. One thing I did notice was how much faster 3.5 is.
It didn't fix the peg problem at all. Five moves in, Hole 12 has two pegs in it and Hole 13 has three.
And to jump from 1 (the very top) to 5 (center bottom), you'd need to jump between pegs. There's no rule about jumping between pegs though, so... it's a bit beside my point, or between my points, so to speak.
I got curious, so I tried following the ai's advice in Paint... a little.
I later figured out that it's actually very simple to look for simple patterns.
I'll give two examples:
"Commands" including Hole 11 are mentioned two times: (in order) "From 11", "To 11".
So that checks out.
For Hole 12: Over, To, To, Over, To, From.
You logically can't follow To with To, unless you want to phase two pegs together (I'm not making a sex joke ;)
There's being hype over AI and there's trusting faulty logic online. Be critical of what you read on the internet, my dude. Maybe especially when you do a review on it🤨
Though we ARE entering an era where everything you read (and watch) on the internet can be pure fiction, so I might recommend taking my advice in a general sense. We still have use of our own intellects.
I think it's easier to find where logic lacks when you find a simple thread and follow it, compared to trying to memorize the whole picture while it's being changed.
The ai here even put up columns of numbers, so it's easy to see if the number is Jump(left) / over(center) / to(right).
I think a physics engine would be good for AI's internal thinking.
Without mass, numbers can be mashed together and go negative.
1 + 2 = 1 number, yknow?
I'm now trying to figure out if deep fakes are promoting Claude 3.5. Probably not, but I might as well follow my own advice, right?
If the program seems legit, I'll definitely use it for making code for me! It seems EXTREMELY cool!
And I'll probably wait a while before I try a new one. Because thinking critically is exhausting😮💨
Hi Matt. Thank you for sharing your knowledge. I noticed that for some users you might need to enable the code window in settings.
I am not able to get Claude AI to display the side-by-side feature even though I have the Artifacts feature enabled. Is it specific to a particular OS or browser, or did they remove that feature?
My wish for Claude is that they incorporate a memory system like GPT. I would like it to remember my book and details about me for later conversations.
One of my tests will be to see if it can generate a 7-pointed star. I never found a way to get GPT-4o to do it.
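As a baseline for that test, here's one way to generate a 7-pointed star programmatically (my sketch, not anything the models produce): take the 7 vertices of a regular heptagon and connect every 3rd one, giving the {7/3} heptagram, emitted as an SVG path. Connecting every 2nd vertex ({7/2}) also works; those are the only two 7-pointed star polygons.

```python
# Build a 7-pointed star as an SVG path by visiting heptagon vertices in
# steps of 3. Since gcd(7, 3) == 1, all 7 vertices are visited exactly once.
import math

def star_points(n=7, skip=3, r=100, cx=120, cy=120):
    verts = [(cx + r * math.sin(2 * math.pi * k / n),
              cy - r * math.cos(2 * math.pi * k / n)) for k in range(n)]
    order = [(k * skip) % n for k in range(n)]
    return [verts[i] for i in order]

pts = star_points()
path = "M " + " L ".join(f"{x:.1f},{y:.1f}" for x, y in pts) + " Z"
svg = (f'<svg xmlns="http://www.w3.org/2000/svg" width="240" height="240">'
       f'<path d="{path}" fill="none" stroke="black"/></svg>')
print(svg)
```

Part of why chat models struggle with this is that they have to reason about the vertex-skipping geometry in text; a failure here is a geometry failure, not a drawing-tool failure.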
The marble is no longer supported by the table 🤨
it might need multi-shot to be perfect, but yeah, until that gets fixed, we have no real AGI yet
(waiting for SSI Inc. to publish their first model)
@@dot12985 years from now
Man that was good! 😊
When these get backpropagation and think things through, it will be stunning.
What terminal do you use?
You should do a video on Claude 3.5 Sonnet for agentic workflow examples
I love Claude sonnet over all other ai assistants.
I watched all of your videos. Keep up the amazing work!
Thanks Matthew. Watching your AI tests is like watching Tesla owners running gauntlets with successive versions of FSD. At first it was easy to find FSD failures; now FSD runs most gauntlets without errors. At least you have the option of building more complex pathways. More and more FSD gauntlet testers are being left with nothing new to show us. Of course, training-emergent failures are going to be harder to find in these massive LLMs.
Great! I was looking forward to this video. Thank you for your assessment 😊👍
I get lots of apologies again, which means I mostly have to answer the questions myself... hoping to learn something.
Have you ever considered that they may have trained it on these specific tests? Give it a different test with the same idea and see if it solves it. I know people have been trying for a long time to train LLMs to answer these specific questions...
If you ask it to generate JS instead of Python then you can preview the result directly as an artifact.
What are these percentages? What does it even mean for one model to score 53.9% and another 50.1%?