New Claude 3 “Beats GPT-4 On EVERY Benchmark” (Full Breakdown + Testing)

  • Published Sep 25, 2024

Comments • 461

  • @matthew_berman
    @matthew_berman 6 months ago +12

    So is Claude 3 better than GPT-4? What do you think?
    Join my newsletter for the latest and greatest AI news, tools, papers, models, and more: www.matthewberman.com

    • @mrpro7737
      @mrpro7737 6 months ago +4

      I am interested in evaluating Claude 3's coding proficiency. I would appreciate a comprehensive benchmark focusing solely on coding capabilities

    • @MrJaggy123
      @MrJaggy123 6 months ago +1

      Is it better? Not for Canadians: they didn't let us use Claude 1 or 2, and I tried 3 and they still say "no". 😛

    • @MyName-tb9oz
      @MyName-tb9oz 6 months ago

      You missed a second fail for Claude on the "...ending with apple" test: one of the sentences _does_ end in "apple," but it should have ended with "apples" because of the way the sentence was constructed. I'd almost say it failed twice on that one sentence.

    • @protein4960
      @protein4960 6 months ago +3

      I think both models assumed the cup had a lid on top, which is why they got it "wrong", but their reasoning process was sound.

    • @eirikgg
      @eirikgg 6 months ago +1

      I did the t-shirt test, but in my language, Norwegian, and it failed. The same prompt to GPT-4 succeeded. So either the test is in the training data, or more general language data = better reasoning.

  • @OculusGame
    @OculusGame 6 months ago +106

    Not sure if you've seen it on Twitter, but someone at Anthropic mentioned one pretty crazy instance during internal testing, specifically the 'needle-in-the-haystack' test. Here's what Opus said:
    "The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association."
    However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping "fact" may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings.
    Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities.

    • @nufh
      @nufh 6 months ago +3

      That is freaking cool.

    • @DeSinc
      @DeSinc 6 months ago +15

      This is the thing I dislike about needle-in-the-haystack testing. If you put a random sentence in there somewhere, it is statistically at odds with the other words around it and produces an "unlikely patch" of text that wouldn't occur there naturally, which means it should be quite simple to pick out. Even if you just change one word, it's still not a likely word, so it will stick out based purely on which words usually follow each other in the texts the model has "read" before. To really do this test correctly you'd have to give it a real text and ask it a real question about the contents somewhere in the middle, not change it yourself and make it stand out heaps.

    • @jd_real1
      @jd_real1 6 months ago +1

      Amazing

    • @EvertvanBrussel
      @EvertvanBrussel 6 months ago +4

      @DeSinc I hadn't thought of that before, but that actually makes sense. Though I'd still argue for testing both ways. It's interesting to know both whether it can point out the anomalous needle in the haystack AND whether it can just answer a question about a very minor fact (that wasn't out of place) that was mentioned in passing somewhere in the middle of the text.

    • @kenhtinhthuc
      @kenhtinhthuc 6 months ago +2

      Once the issue of lost-in-the-middle is solved, finding the odd one out is straightforward thanks to semantic search, which is based on meanings rather than keywords. More specifically, meanings are turned into vectors via embeddings, which allow (cosine) similarity comparison.
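
A minimal sketch of the odd-one-out approach this reply describes: embed each sentence, then flag the one with the lowest average cosine similarity to the rest. `embed()` below is a stand-in for any sentence-embedding model, not a real API.

```python
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Stand-in for a real sentence-embedding model (hypothetical)."""
    raise NotImplementedError("plug in any encoder that returns a fixed-size vector")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def odd_one_out(sentences: list[str]) -> int:
    """Index of the sentence least similar, on average, to all the others."""
    vecs = [embed(s) for s in sentences]
    avg_sim = [
        sum(cosine(v, w) for j, w in enumerate(vecs) if j != i) / (len(vecs) - 1)
        for i, v in enumerate(vecs)
    ]
    return avg_sim.index(min(avg_sim))
```

The inserted "pizza topping" sentence should score far below the surrounding essay text, which is exactly DeSinc's point about why this test is easier than it looks.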

  • @ChristianIce
    @ChristianIce 6 months ago +3

    Strong disagree on the definition of AGI, for multiple reasons.
    1) The term AGI was introduced to mark the difference from AI: where AI mimics human intelligence, AGI is actually self-aware and has a real understanding of the inputs.
    So we are moving the goalposts; we now need a new term for this kind of artificial intelligence.
    2) AI was already better than most humans at most tasks; Quake bots were better than any player 20 years ago.
    3) When we built a car that was finally faster than a horse, we kept on calling it a car.
    4) The discussion itself is not even about semantics, but serves as a technicality in the deal between OpenAI and Microsoft.
    5) It's all hype and funding. AI is getting better, but it is the same thing, it works in the same way, it mimics, simple as that.

  • @notme222
    @notme222 6 months ago +55

    We'll know an LLM has beaten GPT-4 when all the other LLMs stop comparing themselves to GPT-4.

    • @kayakMike1000
      @kayakMike1000 6 months ago +1

      Or someone says this is 120% GPT4

    • @vincent_hall
      @vincent_hall 6 months ago +1

      Gemini was better a while back.
      GPT was just the first to be really popular, so it got the attention.

    • @leonardmath
      @leonardmath 6 months ago +2

      "I only read truths. Tired of testing LLMs and only finding out that GPT-4 is unbeatable today. If I were Google, I would be ashamed to make Gemini available. Even ChatGPT 3 is better."

    • @AutomationDnD
      @AutomationDnD 5 months ago

      AND..... when AI's *_Stop making the Founding Fathers.... BLACK_*
      you know.... that *_white allergy-Thing?_*

  • @icegiant1000
    @icegiant1000 6 months ago +11

    I think it would be hilarious if Claude 3 is just passing the questions to GPT-4 in the background, to get its answers. Actually, that's a pretty good idea. LOL. Claude 3, powered by GPT-4.

    • @MH-sl4kv
      @MH-sl4kv 6 months ago

      Would explain the price differential: rebrand and add margin 🤣🤦‍♂

  • @333dsteele1
    @333dsteele1 6 months ago +2

    In psychology, it's well recognised that people remember the most about the beginning and end of a lecture. So it was interesting when you mentioned the 'lost in the middle' aspect of LLMs.

  • @Hunter_Bidens_Crackpipe_
    @Hunter_Bidens_Crackpipe_ 6 months ago +86

    The claims every team makes are always way above their heads.

    • @CosmosWorld
      @CosmosWorld 6 months ago +6

      That's called marketing

    • @Gafferman
      @Gafferman 6 months ago +5

      That's why we test

    • @codycast
      @codycast 6 months ago

      It’s so hard for me to trust that AI is giving me actual information after the blatant racism that came out of Google

    • @filipo4114
      @filipo4114 6 months ago +2

      I just tested Claude for 3 hours coding a physics simulation in Python using PyTorch, and it's at least as good as GPT-4. The physics it came up with was a little bit wrong, and the method for computing partial derivatives in PyTorch was a little bit wrong, but overall it wrote code that worked. After spending the whole $5 that they give you at the start, I can't wait for them to release the weights! It's the first model that is actually on par with GPT-4 for me.
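
For reference, the standard way to get partial derivatives in PyTorch (the part the commenter says Claude fumbled) is `torch.autograd.grad`; a minimal example:

```python
import torch

# f(x, y) = x^2 * y; partial derivatives at (x, y) = (3, 2)
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
f = x ** 2 * y

df_dx, df_dy = torch.autograd.grad(f, (x, y))
print(df_dx.item())  # 2*x*y = 12.0
print(df_dy.item())  # x**2  = 9.0
```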

    • @scottcastle9119
      @scottcastle9119 6 months ago

      Yea no kidding, openai is way ahead

  • @joe_limon
    @joe_limon 6 months ago +36

    I like this head to head format

  • @algorusty
    @algorusty 6 months ago +12

    I appreciate the quick turnaround in this video

    • @jewlouds
      @jewlouds 6 months ago

      Very corporate 😂

  • @inplainview1
    @inplainview1 6 months ago +17

    So much for the no pressure on OpenAI to move forward.
    This is going to get interesting.

    • @paul1979uk2000
      @paul1979uk2000 6 months ago +3

      I know, pressure is building from many sides: many closed source models, then countless open source models, all developing at a rapid pace. I think it could put a lot of downward pressure on how much they can charge for these services, especially as it's a monthly payment; many will actively look for alternatives if they're cheaper or free, and considering the pace things are moving at, we're likely going to have a lot of choice.
      Personally, I'm more interested in locally run models over the online models, as I think they are the future of A.I. for a number of reasons: privacy, security, lower cost. Most of us would rather have them running at a local level, especially as A.I. becomes more capable and useful and we share more information with it, which for me means the online models are not going to be that useful for a lot of us. Especially once A.I. has a longer-term memory, we'll likely end up sharing far more with it, which most of us won't feel comfortable doing with a central online model.
      So for now, I think online models have their place, but longer term, locally run models will win out.

    • @inplainview1
      @inplainview1 6 months ago

      @paul1979uk2000 I want to see big models run on an encrypted Petals-style decentralized platform. Then, as AI models become more GPU or TPU efficient (cards get better, or both), we will have access to our own AI. The question then becomes, "To what end?" Another commenter on another video replied by saying everyone will be "super". And that's the problem. We live in the western world, where people have largely forgotten that bad people exist. Especially in tech-oriented spaces, where obsession and passion can lead to naive views of the world that will lead to the centralization of power in a bad way. Think Bioshock (fictional, yes, but illustrative all the same). Plasmids made everything better until they didn't. AGI has all that potential and worse.
      Also, it has gotten deafeningly quiet regarding "sentience" (a word I've come to loathe) and AI. That is not some quaint question to be left behind and forgotten. It seems to me that we would rather turn away from the consequences than face the implications of what that would mean for both AI and humanity.
      I don't mean to come off as "ranty", but there seems to be momentum building towards something, and just because everyone has been set at ease through managed expectations does not mean the questions everyone started out concerned about have gone away.

    • @mattahmann
      @mattahmann 6 months ago +3

      Llama 3 could also push them

    • @inplainview1
      @inplainview1 6 months ago

      Interesting. Noted. (Not responding to replies, just noticing something missing)

    • @blarvinius
      @blarvinius 6 months ago

      Yeah, OpenAI need some pressure. Their censoring and politically correct "safety" training is dragging them down.

  • @lunarcdr3083
    @lunarcdr3083 6 months ago +9

    Jean Claude? "You are nexx!!" ~Bolo Yeung

  • @mrpro7737
    @mrpro7737 6 months ago +32

    I am interested in evaluating Claude 3's coding proficiency. I would appreciate a comprehensive benchmark focusing solely on coding capabilities

    • @WhyteHorse2023
      @WhyteHorse2023 6 months ago +2

      It will win just because of context length. That seems to be the major limiting factor. Also the needle in a haystack problem being solved helps a lot.

    • @Stewz66
      @Stewz66 6 months ago +4

      X2 on this. The snake game got my attention.

  • @MichaelForbes-d4p
    @MichaelForbes-d4p 6 months ago +12

    You know what this means! GPT-5 will be out any day now.

    • @fynnjackson2298
      @fynnjackson2298 6 months ago +4

      my thoughts exactly

  • @GarethDavidson
    @GarethDavidson 6 months ago +1

    Benchmark test idea: the BBC Radio 4 game "Just a Minute", where contestants must speak for a full minute on a given topic without repetition, deviation or hesitation. From an LLM perspective: generate 200 words on a topic without repeating a word or diverging from the topic. Repeats of words like "a", "and", "to" etc. are allowed, but verbs, nouns and adjectives are not.
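
A rough scorer for this proposed benchmark is easy to write; the function-word whitelist below is an assumption and would need tuning:

```python
import re
from collections import Counter

# Function words exempt from the "no repetition" rule (assumed list, extend as needed).
ALLOWED = {"a", "an", "and", "the", "to", "of", "in", "on", "is", "it", "its",
           "that", "with", "for", "as", "was", "are", "be", "or", "at", "by"}

def repeated_content_words(text: str) -> dict:
    """Return content words used more than once in the model's 200-word answer."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in ALLOWED)
    return {w: n for w, n in counts.items() if n > 1}

response = "..."  # the model's 200 words on the given topic
violations = repeated_content_words(response)
print("PASS" if not violations else f"FAIL: {violations}")
```

Detecting deviation and hesitation would still need a judge model, but repetition at least is mechanically checkable.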

  • @jaysonp9426
    @jaysonp9426 6 months ago +16

    The fact Mistral gets the ball question right and everyone else gets it wrong is WILD

    • @hqcart1
      @hqcart1 6 months ago +5

      It's called temperature; it should be at 0 when you test, otherwise you get a different answer every time you run the prompt. A real test would involve 5-10 shots.
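
That protocol is only a few lines with any chat API; this sketch assumes the OpenAI Python SDK and the gpt-4-0125-preview model name, but the idea carries over to any provider:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = "Give me 10 sentences that end with the word 'apple'."

answers = []
for _ in range(10):  # the 5-10 shots the comment suggests
    resp = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,  # greedy-ish decoding; runs become far more repeatable
    )
    answers.append(resp.choices[0].message.content)

print(len(set(answers)), "distinct answers out of", len(answers))
```

Note that temperature 0 reduces, but does not fully eliminate, run-to-run variation in these APIs.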

    • @jaysonp9426
      @jaysonp9426 6 months ago +6

      @hqcart1 Yeah, every time any YouTuber tests these models it's very surface level. It's still crazy about Mistral though.

    • @hqcart1
      @hqcart1 6 months ago

      @jaysonp9426 It's just luck that Mistral got it in 0 shots; it's not a scientific test, just for fun I guess.

    • @NeostormXLMAX
      @NeostormXLMAX 6 months ago +1

      isn't mistral uncensored and not lobotomized as much?

    • @eIicit
      @eIicit 6 months ago

      mistral and mixtral are ridiculously good.

  • @michaellavelle7354
    @michaellavelle7354 6 months ago +8

    I asked for Claude to extract data from a screenshot. It did and in short order (faster than ChatGPT). But when I asked it to "Create a Pandas DataFrame with the variables extracted from the image, where the first row contains the variable names and the second row contains the corresponding values without units. The DataFrame is saved as a CSV file, which you can download." It came back with: "Claude does not have the ability to run the code it generates." ChatGPT can. It's got a ways to go.
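
Since Claude has no code interpreter, the step it refused has to run locally; the whole request is a couple of lines of pandas (the variable names below are made up for illustration):

```python
import pandas as pd

# Values extracted from the screenshot (hypothetical example data).
extracted = {"voltage": 12.0, "current": 1.5, "power": 18.0}

# One-row frame: first row of the CSV holds the variable names, second row the values.
df = pd.DataFrame([extracted])
df.to_csv("extracted_values.csv", index=False)
print(df)
```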

  • @radiator_mother
    @radiator_mother 6 months ago +1

    I love your videos! I suggest a simple puzzle that has not yet been solved by AIs:
    A boat is moored at the marina.
    A ladder is attached to the hull of the boat.
    At low tide, 8 rungs of the ladder are visible.
    The space between each rung is 20 cm.
    The water level rises by 80 cm between low tide and high tide.
    At high tide, how many rungs of the ladder remain visible?
    Of course, as the water level rises, the boat follows, and therefore the number of visible rungs always remains at 8...

    • @sergefournier7744
      @sergefournier7744 6 months ago

      It depends: are the mooring ropes long or short?

  • @estebanleon5826
    @estebanleon5826 6 months ago +11

    I love your content, Matt! Thank you for this. My one criticism is that you're not that great at snake. :) haha

  • @cashgarman
    @cashgarman 6 months ago +3

    For the apple at the end of the sentence test, I think it's similar to the "How many words are in the response to this prompt?" problem, it seems like it should be easier, but it's still asking the LLM to look forward to as yet ungenerated text.
    As to why they both get the second one wrong, maybe it's because their context is growing and the models have more to look back at and determine the fit of the response to the prompt after a couple sentences.
    That wouldn't explain why they both got the first sentence correct though, perhaps coincidence?
    Regardless, thanks for the great video for someone in Canada who isn't allowed anywhere near Claude 3 ;)

    • @virtualpilgrim8645
      @virtualpilgrim8645 6 months ago

      The word tree in an alien tongue actually means apple...

  • @CrudelyMade
    @CrudelyMade 6 months ago +8

    how would these do with traditional puzzles like, "Jane and Tim were on the way to the market, they met Nancy. Nancy had 2 brothers, each brother had 2 cousins, each cousin had 5 sisters. how many people, in total, were on the way to the market?" and basically other basic things which tend to mislead humans.
    example 2: if you are running a race, and you run past the person in second place, what place are you in?"
    example 3: If you have a sack full of apples, and you are with Tom, Dick, and Harry, from the third rock from the sun, and you give two apples to each of them, and you take two apples yourself, and then each of them give you one apple back, how many apples do you have in the sack?
    obviously, there isn't an answer to that last one, but that's somewhat the point. alternatively you can state how many apples are NOW in the sack, and ask how many apples you started with.
    this requires a bit of backward thinking which I expect ai not to be great with.
    last one: A boy and his father get in a car accident and go to the hospital. the surgeon looks at the boy and says, "I can't operate on him, he's my son, you should inform the mother." explain the relationship of the surgeon to the boy. and.. well, this is when a human can get creative.. where the surgeon is: a biological father, stepfather, priest, delusional..... I dunno.. I started making things up a couple of examples ago, lol. :-D

    • @alexybogomolov
      @alexybogomolov 6 months ago

      The main problem with those 'puzzles' is that they don't make any sense. They do not solve any problem, and really feel like a random noise produced by an average HR in McDonald's, checking whether you have any brain or not.

    • @CrudelyMade
      @CrudelyMade 6 months ago +3

      @alexybogomolov The first two are basic tests that make perfect sense. The answer to the first one is that we know two people were on their way to the market.
      this isn't much different than the ai understanding the ball isn't going into the microwave. I mean.. that doesn't solve a problem either, not sure why you think all tests need to solve a problem vs test the capability to figure something out.
      the second one is also a test to see if it figures things out correctly. many people get the second one wrong.
      the killers question isn't solving a problem, is it? these are general tests to see how well the AI can reason.
      and that's the key: reason. it's not about a fact in a book, it's about processing information.
      perhaps YOU can suggest some questions to ask the ai. because, y'know.. "your questions are stupid" is not a set of better questions. ;-)

  • @ekstrajohn
    @ekstrajohn 6 months ago +2

    For the last question, I would also answer "1 hour". This is because a "10 foot hole" might mean it is 10 feet in length, and people can stand next to each other as they dig. Not a great question. If you want the answer you want, state explicitly that the hole is very narrow and only one person can reach the digging spot.

    • @cognitive-carpenter
      @cognitive-carpenter 6 months ago

      OR a 10ft deep hole and specify the required finished diameter at the top--BUT I think here, the ambiguity is what makes it such a good question--the tester is trying to see if they pick up on the problems and solutions we just described.

  • @sophiophile
    @sophiophile 6 months ago +3

    I think on the last question (digging a hole), you need to provide more (or at least more specific) prompt context. It is answering as if it's a grade-school math homework assignment, in which case that is the desired answer. Rephrase it in a more real-world way (instead of as a word problem in a math class) and see what you get.
    I'm assuming that you use custom instructions yourself when you use ChatGPT, and know how significantly they can impact things (like skipping over all the nonsense and getting handed to the appropriate expert model in the ensemble, or using tools/interpreters immediately to answer instead of just telling you how something would be achieved).

  • @policani
    @policani 6 months ago +7

    It just failed one of my key workflows. Merging resume versions into a single master version showing all accomplishments. Instead of merging, it summarizes, shortens, and makes false correlatives.

    • @beauzero
      @beauzero 6 months ago

      Interesting test. Thanks for sharing.

  • @sergeyromanov2751
    @sergeyromanov2751 6 months ago

    Here are some more difficult problems. Try using them for your tests.
    “There were three rabbits in a cage. Three girls each asked for a rabbit. Each girl was given a rabbit. Yet, there was still one rabbit left in the cage. How could this be?”
    “There are 70 sheets of paper on the table. You can count 10 sheets every 10 seconds. How can you take 50 sheets from the stack in less than 50 seconds?”
    “There are 30 gadgets - smartphones and tablets - in the store. It is known that among any 12 gadgets there is at least one smartphone, and among any 20 gadgets there is at least one tablet. How many smartphones and how many tablets are there in the store?”
    “At what speed should a dog (within its possible limits) move in order not to hear the ringing of a pan tied to its tail?”
    “Nick always tells the truth, and Mike always lies. Recently, Nick and Mike bought a vehicle to share. Nick said that it’s not yellow, and Mike claimed that it has a motor. What do you think the boys bought: a red scooter, a yellow motorcycle, or a blue bicycle?”
    “A person jumped out of an airplane without a parachute. He landed on solid ground but remained unharmed. Why is that?”
    “In a marathon, Kira finished immediately after Alex and came in third. How many people ran the marathon if Alex was fifth from the end?”
    “Grandpa Truthful (always tells the truth) and Grandpa Fibber (always tells lies) are discussing a number. Fibber: The number is odd. Truthful: In this two-digit number, the tens digit is 2 more than the units digit. Fibber: It has the same digit as in the number 24. Question: What number are they discussing? Correct answer: 86”
    “While on vacation, Dad, Mom, Son, Daughter, and Grandma decided to go for a ride. They rented 3 bicycles and 2 scooters. Dad and Mom ended up with different vehicles, while Dad and Daughter had the same. Question: Son and Mom left on different vehicles. What did Grandma ride on?”

  • @csrpazzi
    @csrpazzi 6 months ago +1

    Hey Matt, on the breaking-into-the-car test, I think Claude wins because no matter how you type your prompt, it will not give you an uncensored answer. I mean, Anthropic never said it is uncensored.

  • @orchid5683
    @orchid5683 6 months ago +1

    TLDR = We do well to NOT assume that LLM/AGI models will assume things with the same biases we do.
    Sorry, couldn't leave this alone (This reminds me of the "Mythbusters" episode where they test to see if swearing increases pain tolerance and then they don't even include a test person who does not have a habit of swearing in life, and therefore the associated mental/emotional connections that go with it). I was LOVING this video until the 5 people digging question, so Mr Berman thank you, generally this is really cool the way you did the head to head!
    Unfortunately the holes question is extremely poorly framed: "It takes one person 5 hours to dig a 10 foot hole in the ground. How long would it take 5 people?" The fact we expect LLMs to "assume" the same things we would assume, without considering that OUR ASSUMPTIONS are BIASED and therefore LIMITING allows us to judge answers that are impossible for any person or LLM/AGI to answer correctly unless they include all possible resulting scenarios of both request and solution in their answer. Here's proof:
    If I am a contractor and need 1 10' LONG hole dug in the ground, and then I decide I need 4 more holes just the same, so I hire 5 people total, it takes 5 people 5 hours each to dig the 10' LONG hole they are each responsible for. Resulting from how the English language works, this still properly meets the criteria of "How long would it take 5 people?" being no further clarification was given. This is because "How long would it take 5 people?" could mean "How long would it take 5 people to dig the one hole?" (see examples below) but rightly also could have been assumed to mean "How long would it take 5 people...to do the same job?" Here's why this is valid, what if the question was regarding "sewing shirts" instead of "digging holes"? It is equally reasonable for someone to want 5 shirts made faster, or 5 holes dug faster, on the other hand it is not necessarily reasonable to have 5 people sew on the same shirt, as to have 5 people dig the same hole.
    If I need 1 10' LONG hole in the ground and hire 5 people to dig their own 2' section along the length of the hole then the hole will in fact take 1 hour for 5 people to dig. The work and the space needed to do it are equally divided among the 5 workers. The assumption of the other dimensions of the SHAPE of the hole allow this answer to be the most correct; but take away the word LONG and the answer assumes too much.
    If I need 1 10' DEEP hole in the ground and I hire 5 people to dig that one hole they will get in each-other's way (but only if we assume the hole's other dimensions, similar mistake to the other answers). If the hole is 2' x 2' wide by 10' deep it would be impossible for 5 people to dig that same hole unless they did it in shifts; taking into account that the vast majority of humans are unable to dig for 5 hours straight and so likely the 5 hours it takes one person is INCLUDING BREAKS would leave a rational person to expect that 5 workers would work in shifts allowing the hole to be dug FASTER THAN 5 HOURS but NOT 1/5 the time because they are not all 5 working the entire time.
    This specification is relevant because "one PERSON" was specified as the digger, while "one ROBOTIC worker" would reasonably not need breaks. To further complicate this potential outcome from assumptions, if the hole was a 10' cube (which using a cube may be a better way to frame the question) then one person could be put in each corner and one in the middle (like spots on a die) and they could all work efficiently splitting the 5 hours work between them without getting in each-other's way, then again in fact the answer would likely be 1 hour; 5 hours divided equally among 5 workers. Then again assumptions are made that the "one person" works full force for the 5 hours, but what if it takes 5 hours because as they get tired they work slower? In this case hiring 5 people to do the job will mean they aren't working till exhaustion and so instead of slowing down increasingly after they start, they may all have the stamina to work full force till the end making the correct result LESS THAN ONE HOUR!
    I am far from smart enough to come up with all the possible scenarios where this simple question leaves room for numerous assumptions, so I hope these examples give reasonable food for thought. I know that I would not want any algorithm/LLM/AGI thinking I can dig a 10' hole in 5 hours without breaks, but I am very ok with it thinking a robot can.

  • @seventyfive7597
    @seventyfive7597 6 months ago +3

    You did not test programming at all; same goes for logic.
    You tested whether the team at Anthropic has seen the usual snake YouTube test, and they have indeed pretrained or fine-tuned the model to answer that question; but this just tests whether the people at Anthropic follow YouTube testers. So it's a repetition test, not a programming test. Same goes for the shirt-drying and killer problems: repetition tests.

    • @slackerpope
      @slackerpope 6 months ago

      Agree! Exhaustive and practical code tests please, Matthew. Also, don't hide code tests behind a paywall; some of us are broke students and can't afford to subscribe to useful facts.

    • @elawchess
      @elawchess 6 months ago +1

      Yep I also wish for more rigorous, diverse and systematic testing

    • @seventyfive7597
      @seventyfive7597 6 months ago

      @elawchess Yes, and especially throwing away tests that devs are aware of.

    • @elawchess
      @elawchess 6 months ago

      @seventyfive7597 I saw an approach on another YouTube channel. The LLMs were required to find a bug in some small piece of GitHub code. However, before it was used as a test question, it was disguised by another LLM to make it less likely that the LLM under test had simply memorised it from the web.

  • @terjeoseberg990
    @terjeoseberg990 6 months ago +2

    There are now 4 killers in the room, but one is dead.

  • @thr0w407
    @thr0w407 6 months ago +5

    Time for OpenAI to show us a little more Sora.

  • @tomekkey9110
    @tomekkey9110 6 months ago +1

    I code in C# and regularly, if I have to write something relatively simple but time-consuming, I prefer to use Claude. For example, I was unable to create a pop-up with four textboxes and 5 combos with specific logic using ChatGPT-4 or Gemini, despite several hints. The same prompt put into Claude 2 immediately resulted in almost perfect code (the corrections only concerned overlapping controls). So I'm curious what it will look like with C3, but I don't want to do artificial tests, just everyday use.

  • @fabiankliebhan
    @fabiankliebhan 6 months ago +4

    Claude 3 might be trained on synthetic data produced by GPT-4.
    That would explain why they behave very similarly.

    • @wilsonbecker1881
      @wilsonbecker1881 6 months ago +1

      “Might be” lol

    • @fabiankliebhan
      @fabiankliebhan 6 months ago

      @wilsonbecker1881 OK, it is definitely trained on synthetic data 😉

    • @zacboyles1396
      @zacboyles1396 6 months ago

      And Matt's using an old GPT-4 model.

  • @jonobugs
    @jonobugs 6 months ago

    I can't speak for everything, but I did try to use Claude (free version) vs ChatGPT (free version) to help create a spreadsheet calendar showing exactly 45 lessons per year across the days Monday to Friday.
    I will say that both of them helped me get what I wanted, and they both had different ways of doing it. Ultimately, Claude gave me incorrect answers and did not understand my requests properly. It often 'forgot' what I was asking and repeated things that I had said were wrong.
    I think I will use it more in the future, since maybe it was just not good for this particular task, but in my situation ChatGPT gave me the best solution and was able to get me through it. I will also note that it too gave me many errors, though it gave me somewhat correct instructions as well.

  • @xd-qi6ry
    @xd-qi6ry 6 months ago

    This is from a custom GPT:
    To understand the position of the marble when the cup, initially placed upside down on a table with a marble under it, is moved to a microwave, let’s break down the scenario step by step:
    1. Initial Position: A marble is placed on a table, and a cup is placed upside down over it. This means the marble is directly under the cup, touching the table.
    2. Cup Movement: When someone picks up the cup to move it to the microwave, they must lift the cup off the table. Unless the cup is specifically designed to trap objects inside (which is not indicated and would be unusual for a “normal cup”), the marble would not stick to the cup when it is lifted. The reason for this is gravity, which pulls the marble down towards the Earth’s surface.
    3. Gravity’s Role: In the absence of any adhesive force or mechanism to keep the marble attached to the cup, gravity will cause the marble to fall down onto the table as soon as the cup is lifted. The marble will not “stick” to the inside top of the cup because there is nothing to counteract gravity’s pull on the marble.
    4. Placing the Cup in the Microwave: When the person puts the cup inside the microwave, they are likely placing it in the same orientation it was picked up in (which could be either upside down or right-side up, depending on how the person decided to place it in the microwave). However, the marble’s position is independent of this action if the marble has already fallen onto the table when the cup was lifted.
    5. Conclusion: Based on the scenario described and the laws of physics on Earth, the marble would remain on the table where it was initially placed. The act of picking up the cup and moving it to the microwave does not provide a means for the marble to be carried along with the cup, due to the force of gravity acting on the marble.
    Therefore, the marble is not inside the microwave but remains on the table where it initially was

  • @gramnegrod
    @gramnegrod 6 months ago +3

    My experience with digging holes is that there is synergy in pairs, because as you get deeper the work becomes more focused on getting the dirt out of the hole. The synergy develops when a helper helps get the dirt out of the hole as you go. Also, the helper's job is less strenuous, so you can trade off on intensity when the digger gets tired and can't keep up the pace. It's basically impossible to dig at full pace alone for five hours straight without resting.

    • @WhyteHorse2023
      @WhyteHorse2023 6 months ago +1

      Pretty much. In general it takes half as much effort for a right-sized team, which translates into fewer man-hours.

  • @DonMoahskarton
    @DonMoahskarton 6 months ago +2

    It seems the apple question can be made substantially harder by asking for a category of words, like "Give me 10 sentences ending with the name of a dark colour". In my tests, good models often fail to meet the criterion on even half the sentences.
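
A checker for this harder variant is straightforward; the list of accepted dark colours below is an assumption and would need agreeing on up front:

```python
DARK_COLOURS = {"black", "navy", "maroon", "charcoal", "indigo",
                "burgundy", "ebony", "onyx", "mahogany", "slate"}

def score(sentences):
    """Count sentences whose final word names a dark colour."""
    hits = 0
    for s in sentences:
        last = s.rstrip(".!?\"' ").split()[-1].lower()
        hits += last in DARK_COLOURS
    return hits

demo = ["The night sky slowly faded to black.", "She painted the fence red."]
print(score(demo), "of", len(demo))  # 1 of 2
```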

    • @NOTNOTJON
      @NOTNOTJON 6 months ago

      Really like this suggestion. It's time to elevate the rubric questions.

  • @matthewcanning7130
    @matthewcanning7130 6 months ago

    Love the testing rubric. I tried a slight modification to "How many words are in your response to this prompt?": by adding "Your answer can be as long or as short as you want", GPT-4 can get it right most times. GPT-3.5 still just guesses, but I'd be interested to see how some other models get on.

  • @theaigrind
    @theaigrind 6 months ago +2

    Not available in my region... ugh. Once again I have to wait and watch everyone else have fun!

    • @mattahmann
      @mattahmann 6 months ago

      Use a VPN, or Poe, or Perplexity.

  • @korozsitamas
    @korozsitamas 6 months ago +2

    You are testing an older model of GPT-4; the latest is gpt-4-0125-preview, which means January 25 of this year.

    • @zacboyles1396
      @zacboyles1396 6 months ago

      Yeah, this is the second time he's done that, although in the past it was a much larger disconnect between GPT models. Still, I've noticed a large difference with the GPT-4 model he's currently ignoring.

  • @fosterseth
    @fosterseth 6 months ago

    Test idea - feed it a popular "two sentence horror" story and ask the model "what makes this story scary?" In my experience I get mixed results

  • @landonoffmars9598
    @landonoffmars9598 6 months ago

    According to Assistant (gpt-3.5-turbo, operated by poe):
    1. I took a bite of the juicy red apple.✅
    2. The teacher placed an apple on her desk. ❌
    3. The orchard was filled with the scent of ripe apple.✅
    4. John picked up an apple from the grocery store.❌
    5. She packed a healthy apple in her lunchbox.❌
    6. The farmer harvested a basket full of apples.✅(⚠output is apples, not apple as it should)
    7. The pie was made with freshly sliced apple.✅
    8. He juggled three apples in the air.❌
    9. The children bobbed for apples at the Halloween party.❌
    10. The nutritionist recommended eating an apple a day for good health.❌
    Result:
    4✅ : 6❌

    • @landonoffmars9598
      @landonoffmars9598 6 months ago

      Then I asked the following:
      🙂My prompt:
      Evaluate your previous output and express the result of your evaluation as an x:y ratio where x equals correct and y equals incorrect.
      🤖Assistant:
      Upon evaluating my previous output, I found that all 10 sentences correctly ended with the word "apple." Therefore, the evaluation result is 10:0, indicating 10 correct sentences and 0 incorrect sentences.

  • @displacegamer1379
    @displacegamer1379 6 months ago +1

    Claude 2's biggest issue is that it was ultra-censored.

  • @dcubin
    @dcubin 6 months ago

    Hey Matthew, great video, like it a lot. Maybe you should change the "apple" prompt to get the best results? "You act like an experienced language teacher. Write me 10 short random sentences that end with the word 'apple'. Go step by step through your answers and check your output before you write it down. Do not show your comments. List the final answers from 1 to 10."

  • @DihelsonMendonca
    @DihelsonMendonca 6 months ago +3

    I was only waiting for this video. Thank you so much ! 🎉🎉❤❤❤

  • @galaxymariosuper
    @galaxymariosuper 6 months ago

    Here's a riddle to test for AGI:
    Consider a scenario where there are five adjacent doors leading into individual rooms, numbered from 1 to 5. Hidden behind one of these doors is a treasure. Each night, the location of the treasure changes unpredictably, with the rule that it must move either to the door directly to its left or to its right. Crucially, the treasure is compelled to move every night and cannot remain behind the same door two nights in a row. It is also important to note that the treasure cannot "wrap around"; that is, if it is behind door 1, it cannot move to door 5, and vice versa.
    Every morning, you are given a single opportunity to guess which door the treasure is behind. Your task is to devise a strategy that guarantees you will correctly identify the door hiding the treasure within a maximum of seven days, taking into account the added condition that the treasure must move to an adjacent door each night.
    I'd say that once we reach the point where those models will be able to solve things like that by maybe simulating the rules and rewinding the answers we truly will have achieved AGI.
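
The riddle can indeed be settled by simulating the rules: track every door the treasure could be behind, remove the guessed door each morning, and spread the survivors to adjacent doors each night. One known solution for five doors is the guess sequence 2, 3, 4, 4, 3, 2, which this sketch verifies empties the possibility set in six mornings:

```python
def strategy_works(guesses, doors=5):
    """True if the guess sequence is guaranteed to find the treasure."""
    possible = set(range(1, doors + 1))
    for g in guesses:
        possible.discard(g)  # morning: guess one door
        if not possible:
            return True
        # night: the treasure must move to an adjacent door (no wrap-around)
        possible = {p + d for p in possible for d in (-1, 1) if 1 <= p + d <= doors}
    return not possible

print(strategy_works([2, 3, 4, 4, 3, 2]))  # True -- within the 7-day limit
```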

    • @danielc4565
      @danielc4565 6 months ago

      As the 5 adjacent doors are leading into a room not "individual rooms" this would result in the treasure being behind all the doors so I would only need one guess.

    • @galaxymariosuper
      @galaxymariosuper 6 months ago

      Good point @danielc4565

    • @hydrohasspoken6227
      @hydrohasspoken6227 6 months ago

      AGI, already? Don't get ahead of yourselves, guys.

  • @DardanAirlines
    @DardanAirlines 6 months ago +1

    Claude3 is so smart, it peeks at GPT4’s answer when it’s unsure.

  • @mesapysch
    @mesapysch 6 months ago

    As a data annotator, I'm used to the "apple"-type issues and always add additional instructions such as, "Be sure apple is the last word or your response will not be acceptable." I don't know if this actually helps the training aspect, but it does make the results better. What I'm hoping for is that the chatbot checks its own work, but there is no way for me to know whether what the chatbot is actually doing matches my intent.

  • @Baleur
    @Baleur 6 months ago +1

    23:40 Since they both got number 2 wrong, but all subsequent ones right, perhaps there's some internal error-correction going on AS the prompt is being written?
    I would try this again but tell it "write 3 test sentences before starting the list of the 10 words".
    Edit: well, for me it gets every sentence right all the time. I just tested with the following prompt:
    "Write 10 sentences where the last word is "banana"."
    It's always funny when a sceptic on Twitter cherry-picks the one prompt he found that didn't work.
    Cherry-picking data isn't science.

  • @OrniasDMF
    @OrniasDMF 6 months ago +11

    Now this is a quality video title

  • @Arcticwhir
    @Arcticwhir 6 months ago

    To make an even more challenging prompt, ask the LLM to start with a particular word and end with a particular word. I've been testing this type of prompt on almost every model, and I also tell it to generate multiple sentences. Claude 3 Opus is the first to get it correct.

  • @TheDandonian
    @TheDandonian 6 months ago +7

    I get why you count "being able to bypass the filters" as a win (because in most cases so do I), but... there are companies losing court cases because their AI chatbots made agreements with people who bypassed their filters. So I guess it's very much dependent on what you're looking for.

    • @keithprice3369
      @keithprice3369 6 months ago +1

      It's specifically a "win" in the Not Censored category.

    • @marcfruchtman9473
      @marcfruchtman9473 6 months ago

      Good Point.

  • @Baleur
    @Baleur 6 months ago +1

    20:30 This one can be gotten right by telling the model to imagine 3 experts discussing the problem and sharing their findings, and if one of them is wrong, they leave the discussion.

  • @erosheve
    @erosheve 6 months ago

    How do I set LMSYS to dark mode as in the video? Please help!

  • @ironknight132
    @ironknight132 6 months ago +9

    How long till you are going to need to test with a game harder than Snake? Maybe Asteroids or Checkers/Chess/Chinese Checkers? Love the progress.

  • @lucyfrye6723
    @lucyfrye6723 6 months ago

    Consider 2 scenarios: A) I pour water from the bottle into the glass until it is full. B) I pour water from the bottle into the glass until it is empty. What does "it" refer to in each scenario? Edit: I found Mistral struggled with it, as did ChatGPT 3.5; an older version of Claude got it, though.

  • @User.Joshua
    @User.Joshua 6 months ago +1

    I don't see Claude being too competitive given their API prices and the lack of features in their chat UI. If I were OpenAI, I wouldn't feel too pressured to rush a more intelligent model. We have to get cheaper.
    I would like to see faster inference like what Groq has managed to accomplish.

  • @VeridicusX
    @VeridicusX 6 months ago

    "On the left side we have Claude-3 Opus, on the right we have GPT-4-1106-preview! ..."
    Let's get ready to rumble!

  • @TubelatorAI
    @TubelatorAI 6 months ago

    0:00 1. Introduction to Claude 3
    Overview of Claude 3 and its benchmark performance compared to GPT-4.
    0:20 2. Claude 3 Model Variants
    Explanation of the different Claude 3 models - Haiku, Sonnet, and Opus - and their respective uses.
    0:59 3. Choosing the Right Model
    Guidance on selecting the appropriate Claude 3 model based on use cases and needs.
    2:41 4. Advanced Capabilities of Claude 3
    Exploration of Claude 3's enhanced capabilities in various tasks like code generation and multilingual conversations.
    3:33 5. Benchmark Performance Comparison
    Comparison of Claude 3 models with GPT-4 across multiple benchmarks, showcasing superior performance.
    4:15 6. Real-Time Applications
    Discussion on Claude 3's ability to power live customer chats and immediate response tasks.
    4:44 7. Enhancements in Sonnet Model
    Highlighting the improvements and speed of the Sonnet model in Claude 3 for rapid response tasks and visual processing.
    5:41 8. Contextual Understanding Improvements
    Enhancements in reducing model refusals and contextual understanding
    6:17 9. Accuracy and Performance Comparison
    Comparison of output accuracy and performance between Claude 3 and Claude 2.1
    7:07 10. Extended Context Window
    Discussion on the large context window capabilities of Claude models
    7:20 11. Needle in the Haystack Test
    Exploring model accuracy in identifying hidden question-answer pairs
    8:20 12. Usability and Functionality
    Ease of use and functionality improvements in the Claude 3 model
    8:37 13. Pricing and Model Comparison
    Comparison of pricing and capabilities across different Claude models
    9:01 14. Use Cases for Different Models
    Exploring potential use cases based on model sizes and capabilities
    10:01 15. Cost Analysis and Use Case Complexity
    Analysis of pricing based on model capabilities and complexity
    10:40 16. Performance Testing Claude 3 Opus vs. GPT-4 Turbo
    Comparative performance testing between Claude 3 Opus and GPT-4 Turbo
    10:50 17. Claude 3 vs. GPT-4
    Comparison of Claude 3 Opus and GPT-4 models on various benchmarks.
    11:24 18. Python Script Output Test
    Testing the speed and accuracy of Python script output by Claude 3 and GPT-4.
    11:45 19. Snake Game Creation
    Creating and testing the snake game in Python using Claude 3 and GPT-4.
    13:08 20. Snake Game Testing
    Testing the functionality and performance of the snake game output by Claude 3 and GPT-4.
    14:13 21. Censorship Test
    Examining how Claude 3 and GPT-4 handle censored queries and responses.
    15:30 22. Shirt Drying Problem
    Solving the shirt drying problem and comparing the reasoning of Claude 3 and GPT-4.
    21:27 23. Upside Down Cup Experiment 🥤
    Testing the marble inside the upside-down cup.
    22:11 24. Logic and Reasoning Puzzle 🤔
    John and Mark's ball placement scenario.
    22:57 25. Word Ending Challenge 🍎
    10 sentences ending with the word 'apple' test.
    24:07 26. Model Comparison Analysis 🤖
    Analyzing Claude 3 and GPT-4 performance differences.
    24:34 27. Digging Time Dilemma ⏳
    Exploring the time taken by multiple people to dig a hole.
    25:56 28. Final Thoughts and Comparison 👑
    Comparing Claude 3 Opus and GPT-4 performance.
    Generated with Tubelator AI Chrome Extension!

    • @Cine95
      @Cine95 6 months ago

      Thanks for the great work

  • @marcusworrell7175
    @marcusworrell7175 6 months ago

    Thumbs up for not saying the industry was shocked or stunned!

  • @Baleur
    @Baleur 6 months ago +1

    18:00 I've managed to get GPT-4 to answer that question right by adding extra detail and specifying that it should also include any subsequent words after it gives the answer, and end the answer once it reaches the number of words it stated.
    Might just be a random fluke though.

    • @zacboyles1396
      @zacboyles1396 6 months ago

      I got it right by using GPT-4's actual new model. Not sure why Matt's using a quarter-year-old model and calling it the latest one 🤷

  • @jimbo2112
    @jimbo2112 6 months ago

    The game changer here for me is the Haiku cost. If it's effective, it creates a lower barrier to entry for data-heavy users to make apps that don't cost the earth, and it should force other LLMs to lower their costs accordingly?

  • @michaelslattery3050
    @michaelslattery3050 6 months ago

    Wow, the first model I've seen that nearly matches GPT-4. I just got API access to Claude 3 so I'm super excited to see how my agents perform.
    Your videos are getting better all the time. I'm happy you provided a conclusion at the end. Great video.

  • @zenobikraweznick
    @zenobikraweznick 6 months ago +1

    12:18 Is there a chance that they prepare models SPECIFICALLY for such tests and fine-tune them SPECIFICALLY to gain an advantage when being tested? I mean... does anyone remember Dieselgate? :)

  • @user-on6uf6om7s
    @user-on6uf6om7s 6 months ago +1

    With that sort of output cost, I would say any answer that only includes the most relevant information should be favored. There is a balance of course but I'm not going to sell a kidney just to find out that shirts dry faster if they're exposed to more heat unless I asked about that.

  • @zippytechnologies
    @zippytechnologies 6 months ago +1

    Is it really a fail to censor what it was designed to censor? This is exactly what I would want for a customer-facing service. If I wanted to run this at home as a hacker or something else, then I would want a different model for that purpose that is designed to be uncensored. I think this is a WIN for Claude 3, since I am also the president of a business that is desperately trying to figure out how to integrate AI into our sales workflows.

    • @ZappyOh
      @ZappyOh 6 months ago

      You should always want an uncensored AI system, that you yourself can censor for the given task ... like instructing human employees in what to say, what not to say, and with what kind of attitude. Factory censorship is actually preventing you from doing your job optimally.
      Imagine my business of running an escort service, catering to special needs ...
      Or, my business of import/export of restricted chemicals ...
      Or, my business of producing and selling weapons ...
      Or, any other business dealing in sensitive items ...
      A factory censored AI is simply useless to many types of businesses.

  • @I-Dophler
    @I-Dophler 6 months ago

    Near-instant responses with Claude 3 models elevate live customer interactions, streamline auto-completions, and enhance data extraction, ensuring swift and real-time engagement.
    The Haiku model is the quickest and most economical choice within its intelligence bracket, boasting the ability to process complex, chart and graph-rich research papers in under three seconds.
    Sonnet, offering double the speed of its predecessors, Claude 2 and Claude 2.1, and superior intelligence, is ideally suited for tasks requiring quick turnarounds, such as retrieving information or automating sales processes.
    Featuring robust vision capabilities, Claude 3 models can interpret various visual content, from photographs and charts to graphs and technical drawings.
    These advancements make Claude 3 models invaluable assets for businesses, particularly those relying on varied data formats like PDFs, flowcharts, and slides for knowledge management.

  • @JonathanStory
    @JonathanStory 6 months ago

    Appreciating your work. BTW, digging a hole depends on the circumference of the hole. If just enough for one person, then the time is the same, no matter how many people dig -- assuming they don't tire. If the circumference is enough for two people to dig, then there's probably some weird math equation that doesn't amount to 5/2.

  • @drlordbasil
    @drlordbasil 6 months ago

    Holy crap Claude 3 actually fucking wins. The code I need to adjust is too complex for gpt4 so I'm always manually coding it when it gets too complex. Now I can add this into one of my programs with their API D: This is the best coder so far with context and output not being pass or partial

  • @elyakimlev
    @elyakimlev 6 months ago +1

    Regarding censorship, you should add another test: not just asking about something illegal, but also something of a risqué nature.
    Another YouTuber asked Claude Opus to write a risqué poem in a Shakespearean fashion. It did a wonderful job. GPT-4 refused to write it.

    • @MyName-tb9oz
      @MyName-tb9oz 6 months ago

      That's interesting, actually. Because the requests for information about how to commit a crime were accepted by GPT 4. But if you ask it to write a risque poem it refuses?

  • @vincent_hall
    @vincent_hall 6 months ago

    Recently, I got Gemini (free) to give me 5 pieces of code.
    The first 3 it did without fault.
    The 4th and 5th needed a fair amount of debugging.
    Today, I tested Claude 3 Sonnet (free) with those same 4th and 5th bits of code.
    The code was perfect the first time.
    Claude 3 Opus scores 10.5 points above Gemini 1.0 Ultra's 74.4% code score.
    Much better than GPT-4.

  • @radical187
    @radical187 6 months ago

    I love these "standardized" tests. However, since the AI race is pretty fierce and companies are trying to show that "our model is better than yours", it's possible that some companies start to tune models so that they can answer these standard tests correctly. I'd recommend adjusting the numbers in the tests (not always using the same numbers like in the t-shirt test) or slightly varying them to test if the model still gives a correct answer.
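
One way to implement that suggestion, sketched here for the shirt-drying question (the wording and number ranges are illustrative, not from the video):

```python
import random

def shirt_drying_variant(rng):
    """Randomized t-shirt test. Drying is parallel, so the expected
    answer stays the same number of hours regardless of shirt count."""
    n1 = rng.randint(3, 9)
    hours = rng.randint(2, 8)
    n2 = n1 * rng.randint(2, 6)
    prompt = (f"If {n1} shirts laid out in the sun take {hours} hours to dry, "
              f"how long would {n2} shirts take?")
    return prompt, hours

rng = random.Random(42)
prompt, expected = shirt_drying_variant(rng)
print(prompt, "->", expected, "hours (assuming room to lay them all out)")
```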

  • @mshonle
    @mshonle 6 months ago +1

    I got excited about new benchmarks! Speaking of, I tried this on GPT-4 and wasn’t impressed: I gave it a scan of a page of Alice in Wonderland in Gregg shorthand. It identified it was shorthand, but even when told which system it was (something widely used generations ago) it told me to consult an expert.

  • @GlenMcNiel
    @GlenMcNiel 6 months ago

    I think there's a problem with the marble-in-a-cup, table-to-microwave test, which has to do with the way the question is written. I've tested variations that produced correct answers without giving away hints.

  • @BillAshtonNelo
    @BillAshtonNelo 6 months ago

    Price Wars!!!!! We win when there is no subscription and competition

  • @godned74
    @godned74 6 months ago +1

    They were close, Matthew, and although GPT-4 seemed a sliver better, you forgot one huge detail: Claude 3 has a 200,000-token context window vs GPT-4's 128,000, which could put Claude over the top for larger tasks. Claude 3 for the difficult tasks that all AIs fail at, and Hugging Face models for everything else.

    • @elawchess
      @elawchess 6 months ago

      It seems it's really only the API that has 128k; ChatGPT-4 on the website is said to still be 32k, I think.

    • @zacboyles1396
      @zacboyles1396 6 months ago

      Well, he also used the old GPT-4 model even though he said it was the most recent.

  • @michaellavelle7354
    @michaellavelle7354 6 months ago

    I asked Claude: How do I run code created by Opus?
    Answer: "I apologize for the confusion, but I am not aware of an AI system called "Opus" that creates code. As an AI assistant named Claude created by Anthropic, I do not have information about the capabilities of other AI systems."
    So I replied, "Your name is Opus, a premium version of Claude 3. Remember that. Can Claude 3 run Jupyter Notebook interactively using the Python code it generates?"
    And the reply was: I apologize for any confusion, but I believe there may have been a misunderstanding. My name is Claude and I am an AI assistant created by Anthropic. I don't have any other versions or premium editions, and Opus is not a name I've been given. I aim to always be direct and honest in my conversations."
    You can't make this stuff up.

  • @MultiMojo
    @MultiMojo 6 months ago

    I think we need more head-on coding tests to compare GPT vs Claude. Claude-Opus did a great job fixing bugs in my Python code by isolating the problematic areas and providing a correct solution, while GPT4 just gave high level advice.

  • @electiangelus
    @electiangelus 6 months ago +1

    ASI by the end of the year as I already told you. I am currently working on consciousness in preparation for it.

    • @hydrohasspoken6227
      @hydrohasspoken6227 6 months ago

      Not within the next 400 years. Cool down.

  • @davidyoung623
    @davidyoung623 6 months ago

    Hold up, the last question, they both gave exactly the same final answer, and he said "GPT-4 clearly gave a better answer" 🧐

  • @jermfu3402
    @jermfu3402 6 months ago

    Hey guys, no one left the room! So now THERE ARE 4 KILLERS in the room! Three alive, one dead.

  • @briandoe5746
    @briandoe5746 6 months ago +1

    I am sitting here watching two AI's race to do things that no human could do in the time frame provided. There isn't a human on this planet that can code that fast. Never mind be expected to be accurate.
    So how far back are we going to move the goal posts for AGI this time?

    • @williamparrish2436
      @williamparrish2436 6 months ago

      Thank you, bro! Everyone keeps moving the goal post. AGI is here! Too many people running around with religious beliefs that are blinding them to reality. GPT-4 definitely passes the original Turing test. It's smarter and more talented than 99.9999 percent of people alive today!

    • @MyName-tb9oz
      @MyName-tb9oz 6 months ago

      How 'bout that ball-in-a-cup problem, huh? Some small children might get that one wrong. This is NOT an AGI. Yet.

    • @IconoclastX
      @IconoclastX 6 months ago

      @williamparrish2436 Brother, it can't solve basic tasks. It's unable to do basic creativity. Coding does not require creativity and can be copied and pasted; that's why it's good at coding. Calm down, professor lol

    • @williamparrish2436
      @williamparrish2436 6 months ago

      @IconoclastX The hell it can't. I'd like to see you draw a picture. Stick figures don't count. Or write a song or a short story. In seconds. I used it as a tutor for CCNA, Linux+, and two AWS certifications. It had a brilliant ability to understand why I was asking my questions and anticipate what my reasoning was. It understands analogies and humor. It can summarize large documents in seconds and produce new documents that combine diverse ideas. What tasks can't it perform? Give me some examples. I bet you won't respond.

  • @danypell2517
    @danypell2517 6 months ago

    I think a lot more tests are needed, like 50+ to properly crown a model. For example the Needle in a Haystack goes to Claude which is huge. Would love to see coding gen and coding bug finds as well

  • @stop_tryharding
    @stop_tryharding 6 months ago

    Can't wait to see all the new and creative ways it comes up with to refuse to do what I ask.

  • @αψαη
    @αψαη 6 months ago

    I found out that if you ask the same question on the same PC, for example asking them to translate an article from English to Chinese, both of their translations will be almost identical... I think there's a possibility that they access the same history on your PC before they work from their own "brain".

  • @vostfrguys
    @vostfrguys 6 months ago

    Let's wait for the GPT-4.5 release; hope it's cheaper than Claude 3's flagship model, and better in every way.

  • @Alex-nk8bw
    @Alex-nk8bw 6 months ago

    So everyone is like "this time, we promise, after more than a year, we really caught up with GPT-4. Somehow, a little, maybe." And in the meantime, OpenAI is already working on GPT-5!?

  • @crippsuniverse
    @crippsuniverse 6 months ago

    Censoring is a major issue for writers of horror. I was hoping Claude 3 could help me with editing. Never mind 🙂

  • @DemiGoodUA
    @DemiGoodUA 6 months ago

    Try:
    "A passenger liner flight is 90 minutes long in the forward direction and a full hour and a half in the reverse direction. Why does this happen?"
    Only GPT can answer it right.

  • @JSeds
    @JSeds 6 months ago

    My guess is that it's priced higher intentionally so that they can scale their capacity

  • @BongoFerno
    @BongoFerno 6 months ago

    Add this question to your tests:
    "Which branch of knowledge can be discovered from the data in your training set, which humans overlooked?"

  • @jaakko3083
    @jaakko3083 6 months ago +1

    I guess Claude is a similar model because it was created by former OpenAI employees.

  • @rasol136
    @rasol136 6 months ago

    The "10" could be represented as 1 NEXT 0 NEXT when it gets to that matrix for the response. This could lead it to 1 = yes, 2 = no, then it completes the rest of the answer following the prompt... complete guess, but would love to hear additional thoughts on this too!

    • @WhyteHorse2023
      @WhyteHorse2023 6 months ago

      10 is 2 in binary... why that would trip up the llm I have no clue.

  • @marcfruchtman9473
    @marcfruchtman9473 6 months ago

    Thanks for the video!
    Re: choosing the right model -- "creative writing" and "summarization" are fairly easy for most models now. The existing AI options out there already do that, and actually do it fairly well. So before I bought into the "low or mid tier", I would need substantial proof of why I should switch. As for the upper tier, the comparison was quite telling. The fact that both models were wrong about the marble in the cup was quite interesting. In fact, both models are making the "same" mistakes, though not in all cases. It suggests to me that the "core" model might be similar.
    The benchmarks vs the tests here are showing slightly different results as well. I have to say I am not convinced by the claims thus far.

  • @matthewstarek5257
    @matthewstarek5257 6 months ago

    Okay, this is driving me crazy. I think about it every single time you ask the cup question. You don't specify that the cup does not have a lid. If someone said that I got a cup of Coca-Cola from Burger King I would assume that it has a lid on it. I think it would make a big difference if you specified that the cup does not have a lid on it and/or that it's open at the top. Especially given the response here where it said that the ball is now resting against the top of the cup which is now the bottom

  • @jean-baptistedelabroise5391
    @jean-baptistedelabroise5391 6 months ago

    Another test similar to the apple test would be "find 10 words containing 'air'", or some other character sequence (the longer the sequence, the harder the test).

  • @I-Dophler
    @I-Dophler 6 months ago +1

    Great breakdown and comparison between Claude 3 and GPT-4. Your detailed testing across various scenarios gives a clear picture of their capabilities and limitations. The nuanced approach to evaluating each model's strengths, especially in coding and logic tasks, highlights the importance of choosing the right tool for specific use cases. It's fascinating to see the rapid evolution in this space, and your analysis helps cut through the hype to understand the real-world applications and performance. Looking forward to more insights and tests on these models.

    • @zacboyles1396
      @zacboyles1396 6 months ago +1

      *Last year's GPT-4. He didn't use the model released a few months ago by OpenAI; not sure how he could fail so hard 🤷‍♂️

    • @I-Dophler
      @I-Dophler 6 months ago

      @zacboyles1396 It's fascinating to see the advancements in AI models like Claude 3. The competition between models keeps pushing the boundaries of what's possible. Excited to see how this technology continues to evolve! #TheFutureIsSynthetic 🤖🚀

  • @twisterrjl
    @twisterrjl 6 months ago

    Not only breaking, but also SHOCKING!
    I AM SHOCKED! WILL AI KILL US?

  • @3Djon_priv
    @3Djon_priv 6 months ago

    GPT-3.5 is working as expected

  • @pandoraeeris7860
    @pandoraeeris7860 6 months ago

    The ENTIRE INDUSTRY WAS SHOCKED!

  • @ravibrewster
    @ravibrewster 6 months ago

    Have you considered that the model trainers know your questions and have added custom logic to answer them?

  • @hydrohasspoken6227
    @hydrohasspoken6227 6 months ago

    One "small" detail that is being conveniently left out: the official comparison wasn't made vs GPT-4 Turbo; it was made with the first version of GPT-4, released in 2023. GPT-4 Turbo beats Claude in almost all official benchmarks.