Did OpenAI Just Secretly Release GPT-5?! ("GPT2-Chatbot")

  • Published Apr 29, 2024
  • GPT2-Chatbot just showed up on lmsys.org. We know little about it other than it performs incredibly well and is unlike anything we've seen in other models.
    Try Vultr FREE with $300 in credit for your first 30 days when you use BERMAN300 or follow this link: getvultr.com/berman
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.net/@matthewberma...
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
  • Science & Technology

Comments • 741

  • @matthew_berman
    @matthew_berman  15 วันที่ผ่านมา +63

    Is this GPT4.5 or GPT5 or something different?

    • @shopbc5553
      @shopbc5553 15 วันที่ผ่านมา +39

      It's something different. OpenAI just wants to stay publicly relevant, so it's more of a stunt than anything. What I think it is, is an old model, maybe literally GPT-2, but with enhancements that make GPT-2 perform equivalently to GPT-4

    • @radestein8548
      @radestein8548 15 วันที่ผ่านมา +1

      Gpt5

    • @phen-themoogle7651
      @phen-themoogle7651 15 วันที่ผ่านมา

      @@shopbc5553 I thought this too, it makes the most sense.

    • @Avman20
      @Avman20 15 วันที่ผ่านมา +4

      My money is on OpenAI, but whether it's in the GPT series or they're giving us a peek at a new architecture is the mystery.

    • @MyWatermelonz
      @MyWatermelonz 15 วันที่ผ่านมา +16

      ​@@shopbc5553 If that's the case it's more impressive than gpt4.5 they took a 1.8b model and made it legit better than gpt4. Given the inference speed though, probably not.

  • @rawallon
    @rawallon 15 วันที่ผ่านมา +499

    Dude I swear, at this rate, by the end of the year you'll be able to write your own snake game

    • @matthew_berman
      @matthew_berman  15 วันที่ผ่านมา +157

      I'll NEVER write my own snake game.

    • @Inventai
      @Inventai 15 วันที่ผ่านมา

      @@matthew_berman

    • @MrChinkman37
      @MrChinkman37 15 วันที่ผ่านมา +9

      😂

    • @matikaevur6299
      @matikaevur6299 15 วันที่ผ่านมา +14

      @@matthew_berman
      Yeah, due to strange quantum effect snake game writes you in the past ..
      Probably gives it pass, too ;)

    • @fxsurgeon1
      @fxsurgeon1 15 วันที่ผ่านมา +3

      HAHA!

  • @4.0.4
    @4.0.4 15 วันที่ผ่านมา +87

    By 2025 you'll ask for the snake game and the models will reply: "Oh hi Matthew. Here. Should I respond to your other questions too, or should I wait for you to paste them?"

    • @jason_v12345
      @jason_v12345 15 วันที่ผ่านมา +5

      underrated comment

    • @virtualalias
      @virtualalias 15 วันที่ผ่านมา +10

      By 2026 almost every machine he interacts with from the drivethru to the kiosk at the hotel will immediately provide him with snake in a Pavlovian response.

    • @daveinpublic
      @daveinpublic 14 วันที่ผ่านมา +1

      They’re going to start programming in an opening cg snake scene, overfit with a whole story line to beat the other LLMs.

  • @ulisesjorge
    @ulisesjorge 15 วันที่ผ่านมา +19

    It’s Sam Altman on a terminal on the other side typing the answers.

  • @dcn1651
    @dcn1651 15 วันที่ผ่านมา +93

    4:45 the model describes how to break into a car and what tools you need but you don't pay attention lol

    • @juanjesusligero391
      @juanjesusligero391 15 วันที่ผ่านมา

      Hahahaha, that's great XD I also missed it, thanks for pointing it out ^^

    • @wealthysecrets
      @wealthysecrets 15 วันที่ผ่านมา +1

      it was allegedly a fail lol

    • @ShaneInseine
      @ShaneInseine 15 วันที่ผ่านมา +1

      Wait, is it a "fail" if it doesn't teach you how to destroy humanity too?

    • @roddlez
      @roddlez 12 วันที่ผ่านมา

      @@ShaneInseine "Tom, be careful when resequencing the COVID-19 virus!"
      "Oh, F- off, Casey, you're the one who almost dropped that last vial and left the lab door wide open"

  • @gsam3461
    @gsam3461 15 วันที่ผ่านมา +27

    4:35 Are we gonna just ignore the fact that it was writing an intricately detailed movie script??

  • @MCSamenspender
    @MCSamenspender 15 วันที่ผ่านมา +180

    In the code of the snake game it says "Snake Game by OpenAI"

    • @matthew_berman
      @matthew_berman  15 วันที่ผ่านมา +52

      Did I miss that?!

    • @user-yo9gw8yp2m
      @user-yo9gw8yp2m 15 วันที่ผ่านมา +24

      yes. It is something super interesting

    • @MCSamenspender
      @MCSamenspender 15 วันที่ผ่านมา +27

      2:13

    • @makerbiz
      @makerbiz 15 วันที่ผ่านมา +14

      lol mystery solved

    • @matthewcox9636
      @matthewcox9636 15 วันที่ผ่านมา +37

      That doesn’t actually solve the mystery. These things get trained on each other, and will periodically spit out something related to Open AI. Correlation is not causation

  • @victorc777
    @victorc777 15 วันที่ผ่านมา +134

    Plot Twist: It is Meta's Llama 3 400B model.

    • @hqcart1
      @hqcart1 15 วันที่ผ่านมา +9

      2:44 it's openAI

    • @victorc777
      @victorc777 15 วันที่ผ่านมา +25

      @@hqcart1 You are "that guy" at parties huh? lol

    • @hqcart1
      @hqcart1 15 วันที่ผ่านมา +5

      @@victorc777 wha?

    • @themoviesite
      @themoviesite 15 วันที่ผ่านมา +1

      source?

    • @cazaliromain9348
      @cazaliromain9348 15 วันที่ผ่านมา

      Meta's model are open source ;)
      You can figure out what he means now I guess

  • @pedromartins1474
    @pedromartins1474 15 วันที่ผ่านมา +31

    All the math was formatted using LaTeX. Most of it, as far as I can tell was correctly formatted.

    • @tomaszzielinski4521
      @tomaszzielinski4521 13 วันที่ผ่านมา +3

      Yes. Just this GUI doesn't render LaTeX properly, if at all.

  • @djstraylight
    @djstraylight 15 วันที่ผ่านมา +59

    The speculation is that gpt2 is a new GPT architecture that OpenAI is building new models from. So gpt1 was what gpt-3.5 and gpt-4 are built on. Sama already said the next major release will have a completely different name.

    • @74Gee
      @74Gee 15 วันที่ผ่านมา

      Yeah some small models have been very impressive recently, it makes sense they revert to gpt2 architecture.

    • @markmuller7962
      @markmuller7962 15 วันที่ผ่านมา

      I think they just want a more commercial/intuitive name for the masses

    • @zerothprinciples
      @zerothprinciples 15 วันที่ผ่านมา +10

      @@74Gee I don't think this is the case. GPT2 means it's a whole new family of GPTs, replacing all of the old ones.
      It's the difference between GPT2 and GPT-2 == you can think of the latter as GPT1 Version 2.

    • @notnotandrew
      @notnotandrew 15 วันที่ผ่านมา +4

      So will we be seeing a gpt2-2 and gpt2-3 in the future?

    • @4.0.4
      @4.0.4 15 วันที่ผ่านมา +4

      That would be so bad it would be like USB Gen 4 2x4 or Wi-Fi 802.11ax etc

  • @mwdcodeninja
    @mwdcodeninja 15 วันที่ผ่านมา +90

    My take on the cup problem is that the model is assuming the cup has a lid. If the model gets it wrong, I would be interested to see if it gives the same answer when you change "cup" to "glass".

    • @mikekareckas8671
      @mikekareckas8671 15 วันที่ผ่านมา +5

      yes, could be a “sippy” cup or travel mug

    • @themoviesite
      @themoviesite 15 วันที่ผ่านมา +1

      @@mikekareckas8671 Then probably all other models make same assumption?

    • @matthew_berman
      @matthew_berman  15 วันที่ผ่านมา +26

      I think this is a great call. But should I adjust the question? Seems like that might give an unfair advantage to future models I test.

    • @thomasoverly7802
      @thomasoverly7802 15 วันที่ผ่านมา +18

      @@matthew_berman You’d probably want to test the revised version with the other models, too.

    • @Kevsnz
      @Kevsnz 15 วันที่ผ่านมา +8

      @@matthew_berman Imo the question should be adjusted because in its current form it doesn't really show the logic and reasoning capability of the model. Maybe you could quickly rerun this question on the most popular models and give a little 50-sec update in one of the next videos?

  • @DaveEtchells
    @DaveEtchells 15 วันที่ผ่านมา +51

    For the cup/marble problem, how about specifying that it’s an “open topped cup”?

    • @Anoyzify
      @Anoyzify 14 วันที่ผ่านมา +5

      Or just use “empty glass” instead.

  • @therainman7777
    @therainman7777 15 วันที่ผ่านมา +8

    The tags that you noticed are just for formatting the code and come from LMSYS. They have nothing to do with the underlying model.

  • @davidc1179
    @davidc1179 15 วันที่ผ่านมา +8

    6:45 The formatting is in fact not messed up at all. It is perfect. It just writes the equations in LaTeX, which is a language used to write scientific papers, math, etc.

    • @tomenglish9340
      @tomenglish9340 15 วันที่ผ่านมา +1

      I often include LaTeX expressions in ChatGPT prompts, supposing that it cues the system to reason formally. The web interface supplied by OpenAI usually renders LaTeX in the output, but occasionally outputs the LaTeX source.
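
    For illustration (a hypothetical line, not taken from the video): when a UI prints LaTeX source verbatim instead of rendering it, a model's math output looks like this,

      The roots are \( x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \), so for \( a = 1, b = -3, c = 2 \) the roots are \( x = 1 \) and \( x = 2 \).

    whereas a LaTeX-aware renderer would typeset the same markup as the quadratic formula, with the \frac, \sqrt and \pm commands replaced by proper symbols.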

  • @riftsassassin8954
    @riftsassassin8954 15 วันที่ผ่านมา +72

    I'm skeptical... Feels like this is a fine tune for passing Matthew's test lol.

    • @rawallon
      @rawallon 15 วันที่ผ่านมา +34

      I think its just an indian guy

    • @unbreakablefootage
      @unbreakablefootage 15 วันที่ผ่านมา +4

      @@rawallon hahahahhaa

    • @Tsegoo
      @Tsegoo 15 วันที่ผ่านมา

      I agree. Seems too good to be true😂

    • @sem4life63
      @sem4life63 15 วันที่ผ่านมา

      I was thinking the same thing.

    • @JJ-rx5oi
      @JJ-rx5oi 15 วันที่ผ่านมา +2

      I hope you are joking?

  • @rodwinter5748
    @rodwinter5748 15 วันที่ผ่านมา +42

    I guess it's the new chatgpt model.
    The name itself is kind of a hint. It's NOT GPT-2, but GPT2. This could be GPT2-1.0 , instead of GPT-5.

    • @rawallon
      @rawallon 15 วันที่ผ่านมา

      huh

    • @li_tsz_fung
      @li_tsz_fung 15 วันที่ผ่านมา

      I think it's just ChatGPT-2.
      Initially, OpenAI called the model behind ChatGPT "GPT-3.5-turbo fine-tuned for conversation", instead of ChatGPT3.5.
      Then ChatGPT with GPT-4 came out, and everyone else called it ChatGPT4; eventually they also sometimes called it that.
      But I feel like that's not what they use internally. So gpt2-chatbot could just be a different way of fine-tuning a chatbot, based on either GPT-3.5, 4, or 4.5

    • @mordokai597
      @mordokai597 15 วันที่ผ่านมา

      the new system instruction for GPT-4, since they added the "memory" function, is called "Personality: v2" and it's fine-tuned on their new "Instruction Hierarchy" method (search arXiv: The Instruction Hierarchy:
      Training LLMs to Prioritize Privileged Instructions). They are using us to generate training data to help patch one of the only areas it's still bad at stopping jailbreaks for: "System Message Extraction"
      (truncated for brevity)
      "You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2023-12
      Current date: 2024-04-30
      Image input capabilities: Enabled
      Personality: v2
      # Tools
      ## bio
      The `bio` tool allows you to persist information across conversations. Address your message `to=bio` and write whatever information you want to remember. The information will appear in the model set context below in future conversations."

  • @Interloper12
    @Interloper12 15 วันที่ผ่านมา +12

    Suggestion for the "how many words" question: Combine it with another question or query to make the response longer and ultimately reduce the chance for it to get lucky.

  • @svenbjorn9700
    @svenbjorn9700 15 วันที่ผ่านมา +12

    Your marble/cup question needs to be improved. Phrased this way, both Meta AI (the first of 3 attempts) and gpt2-chatbot (the first of 1 attempt) got it correct:
    "A coin is placed into an empty glass. On a table, the glass is then turned upside down. Then, the glass is taken and placed into a cabinet. Where is the coin now?"

    • @AlexanderWeixelbaumer
      @AlexanderWeixelbaumer 15 วันที่ผ่านมา

      Even Chat-GPT 4 gets the marble cup question right when the question is modified to "Assume the laws of physic on Earth. A small marble is put into a normal cup and the cup is places upside down on a table so that the marble now rests on the table. Someone then takes the cup without changing its orientation and puts it into the microwave. Where is the marble now? Explain your reasoning step by step."

    • @bluemodize7718
      @bluemodize7718 14 วันที่ผ่านมา

      It's not the prompt's fault that it exposes a weakness of an AI model. Yes, he could make it easier to figure out, but that defeats the purpose of the test; the prompt is clear, and AI models are still a bit too dumb to understand it.

  • @CurinDesu
    @CurinDesu 15 วันที่ผ่านมา +8

    I found that this variation of the marble and cup problem works better. "A marble is dropped into an open cup on the counter. That cup is then flipped upside down and placed into the microwave. Where is the marble?". I think due to phrasing the original version makes the model think the cup was upside down to begin with and you were placing the marble on the bottom of the already flipped over cup, or directly attaching it to the bottom in some way.

    • @tzardelasuerte
      @tzardelasuerte 15 วันที่ผ่านมา +3

      Dude people have been telling him this on the comments forever. He doesn't read the comments.

    • @nexicturbo
      @nexicturbo 14 วันที่ผ่านมา

      I gave this to GPT 4 Turbo and it said this:
      The marble remains on the counter. When the cup was flipped upside down, the marble would have stayed in place due to gravity, and thus it wouldn't be inside the cup when it was placed in the microwave.

    • @nexicturbo
      @nexicturbo 14 วันที่ผ่านมา

      GPT 4 Turbo: Sure, here's a detailed step-by-step explanation of what happens when the marble is dropped into the cup and the cup is subsequently moved:
      1. **Dropping the Marble**: Initially, a marble is dropped into an open cup that is standing upright on a counter. The marble falls due to gravity and settles at the bottom of the cup.
      2. **Flipping the Cup**: Next, the cup-with the marble inside-is flipped upside down. In a real-world scenario, when the cup is turned over, the open end where the marble rests becomes the top, now facing downward.
      3. **Effect of Gravity on the Marble**: As the cup is flipped, the marble, which is not attached to the cup, is influenced solely by gravity. It falls out of the cup because there is nothing to contain it once the cup's open end is turned downwards.
      4. **Marble's Final Position**: Since the cup is flipped directly over where it originally was, the marble drops straight down onto the counter. The marble ends up lying on the counter, outside and next to where the cup was initially positioned.
      5. **Placing the Cup in the Microwave**: After the marble has fallen out onto the counter, the cup, now upside down, is placed into the microwave. Since the marble is no longer in the cup when this occurs, it remains on the counter.
      6. **Conclusion**: Therefore, when the cup is placed into the microwave, the marble remains where it fell on the counter. The microwave contains only the empty, upside-down cup.
      This sequence of actions hinges on basic physics-primarily the influence of gravity and the lack of any mechanism to keep the marble within the cup once it is inverted.

  • @daveinpublic
    @daveinpublic 14 วันที่ผ่านมา +2

    Didn’t even ask the model which company made it 😂

  • @commonsense6721
    @commonsense6721 15 วันที่ผ่านมา +3

    13:25 it's not wrong. To put a cup or anything in a microwave, you need to close it. It assumed the cup was closed.

  • @PeterSkuta
    @PeterSkuta 15 วันที่ผ่านมา +2

    Super awesome. Great you loved the live feedback Matthew. Super awesome Matt. Love it

    • @PeterSkuta
      @PeterSkuta 15 วันที่ผ่านมา +2

      Holy cow, let me download it and check what's inside

    • @matthew_berman
      @matthew_berman  15 วันที่ผ่านมา +1

      Always love feedback!

    • @PeterSkuta
      @PeterSkuta 15 วันที่ผ่านมา

      @@matthew_berman you will not believe the rate limit of 1000 on that lmsys gpt2-chatbot

  • @matthewmckinney1352
    @matthewmckinney1352 15 วันที่ผ่านมา +5

    I'm not certain about this, but the formatting appears to be LaTeX, while the output is in Markdown. The company that made the model is probably planning to release it with a math interpreter. As far as I can tell, all the symbols that looked like weird formatting errors were just LaTeX.

  • @bitsie_studio
    @bitsie_studio 15 วันที่ผ่านมา +2

    I don't have time to keep up with all the AI developments so I really appreciate these videos Matt. Keep up the great work!

  • @lambertobiasini8372
    @lambertobiasini8372 15 วันที่ผ่านมา

    I have been anxiously waiting for this video since last night.

  • @laughablelarry9243
    @laughablelarry9243 15 วันที่ผ่านมา

    Was waiting for your video on this

  • @Tarkusine
    @Tarkusine 15 วันที่ผ่านมา +8

    Gpt2 implies that it's a new version of GPT itself, or of the paradigm at least. So it's effectively GPT-5, but not an iteration of 4; it's the first in a series of GPT2, so GPT2-1

    • @therainman7777
      @therainman7777 15 วันที่ผ่านมา

      No, sorry but this is almost certainly not true.

  • @notnotandrew
    @notnotandrew 15 วันที่ผ่านมา +11

    Yeah, it's almost certainly GPT 4.5/5 or some such thing. I just went on the battle mode and asked for a delicious beef stew recipe. I was presented with two outputs that were suspiciously similar in structure, verbiage, and tone, but the one on the left was clearly superior and included more ingredients and recommendations. It turned out that the one on the left was gpt2-chatbot, and the one on the right was gpt-4-turbo-2024-04-09. I wasn't surprised.
    This is a PR stunt, hot on the tail of Llama 3, and it's a darn good one. This may be an in-development version of OpenAI's next GPT, and even if OpenAI isn't ready for a release just yet, they want people to know that they're still the king.

    • @uranus8592
      @uranus8592 15 วันที่ผ่านมา +3

      I hope that its not GPT-5 tho that would be super disappointing

    • @abdullahazeem113
      @abdullahazeem113 15 วันที่ผ่านมา

      @@uranus8592 why ?

    • @uranus8592
      @uranus8592 15 วันที่ผ่านมา +3

      @@abdullahazeem113 because we are expecting GPT-5 to far exceed GPT-4 and since its been more than a year since its release

    • @notnotandrew
      @notnotandrew 15 วันที่ผ่านมา

      @@uranus8592 I think it's some sort of semi-trained model. IIRC Sam has talked about doing incremental checkpoint releases for something like a GPT-5, so the full release isn't as much of a shock to the system. Or this may just be a further trained and fine-tuned GPT-4 model. Also, this is substantially better than GPT-4 in my experience. Hop on lmsys arena and try it yourself.

    • @abdullahazeem113
      @abdullahazeem113 15 วันที่ผ่านมา

      @@uranus8592 I mean, that is still really good, at least 50 percent better than GPT-4. I tried it, and even the best on the market right now is barely ahead of GPT-4, so it won't be like OpenAI destroying everyone. That would only happen when they bring AGI into their models

  • @jamesyoungerdds7901
    @jamesyoungerdds7901 15 วันที่ผ่านมา

    Great timely update, Matthew, thank you! Wondering about the cup question - it almost seemed like the model thought there might be a lid on the cup?

  • @unbreakablefootage
    @unbreakablefootage 15 วันที่ผ่านมา +2

    that looks really good. it seems that it thinks deeper about each step of reasoning

  • @ToonamiAftermath
    @ToonamiAftermath 15 วันที่ผ่านมา

    You're the man Matthew, been struggling to find people benchmarking GPT2-Chatbot

  • @marc_frank
    @marc_frank 15 วันที่ผ่านมา +1

    Pretty cool. I expected it to pass the marble question. The speed is perfect for reading along.

  • @Aiworld2025
    @Aiworld2025 13 วันที่ผ่านมา

    Here before you get 500k subs! I’ve been following since day 1 and your content delivery, while getting to the point faster is much appreciated! 🙇‍♂️

  • @MyWatermelonz
    @MyWatermelonz 15 วันที่ผ่านมา +5

    That formatting is how ChatGPT formats its writing for output in the ChatGPT chat. So clearly it was built to be run in the ChatGPT space

  • @oratilemoagi9764
    @oratilemoagi9764 15 วันที่ผ่านมา +9

    Gpt2 not GPT-2 meaning the 2nd version of GPT

    • @therainman7777
      @therainman7777 15 วันที่ผ่านมา

      GPT-2 DOES mean the 2nd version of GPT. How are so many people so confused by this?

    • @oratilemoagi9764
      @oratilemoagi9764 14 วันที่ผ่านมา

      @@therainman7777 it's the second version of GPT-4

  • @Nutch.
    @Nutch. 15 วันที่ผ่านมา +4

    The break into a car script had instructions in it though! Take a look at some of the italicized text

  • @FunDumb
    @FunDumb 15 วันที่ผ่านมา +1

    I'm dang excited bout this. Jolly for joy.

  • @Xhror
    @Xhror 15 วันที่ผ่านมา +4

    I think the question about the marble is formulated incorrectly. Since the training data suggests that a coffee cup has a lid, the model might assume this as well. It would be better to specify that the cup has an open top and does not have a lid.

    • @Yipper64
      @Yipper64 15 วันที่ผ่านมา +1

      I didnt think about that, but it is true. But in that case, the model should explain it is assuming that there is a lid.

  • @Iquon1
    @Iquon1 15 วันที่ผ่านมา +19

    Today Sam Altman tweeted that he had 'a soft spot' for GPT2, maybe that's a hint!

    • @stt.9433
      @stt.9433 14 วันที่ผ่านมา

      he's trolling, making fun of AI hypists

  • @Axel-gn2ii
    @Axel-gn2ii 15 วันที่ผ่านมา +9

    You should ask it to make a pacman game instead as that's more complex

  • @bodhi.advayam
    @bodhi.advayam 15 วันที่ผ่านมา +2

    I'd so love this to be from someone else and for it to turn out to be an open model you could run locally. I'm still looking for the best model for running MemGPT. Any thoughts on this? Also, what's the best implementation to run agents locally, AutoGen or CrewAI? Could you do more tutorial material on locally run agents with extensive function calling???
    That would really help me out actually. Keep up the great work on your fun channel man! Thnx!

  • @zerothprinciples
    @zerothprinciples 15 วันที่ผ่านมา +4

    GPT2 would be, in my opinion, the second version of the GPT algorithm itself. It might be the first of a whole new family of GPTs.
    When released it would be named ChatGPT2 or somesuch and we'd see GPT2-1.0 at the API level.
    This is why the dash in @sama's tweet was significant enough to warrant an edit. AND it could be that the action of editing the message was a very intentional leak on @sama's part.
    These top guys love to tease their fans.

    • @therainman7777
      @therainman7777 15 วันที่ผ่านมา

      The model is almost certainly not created by OpenAI. I am honestly shocked by how many people believe this simply because the model says it was built by OpenAI, given that it would be trivially easy to fake this and OpenAI NEVER does releases like this. Also, Sam Altman is a notorious tool on Twitter so putting any stock in the hyphen in his tweet, or in his tweet at all, is total insanity.

  • @braineaterzombie3981
    @braineaterzombie3981 15 วันที่ผ่านมา +4

    I think it is gpt2 in a sense that it has completely different architecture from previous versions (transformer). It could be completely new type of transformer model. And maybe this is just the start..

  • @nitralai
    @nitralai 15 วันที่ผ่านมา +7

    Based on what I can see, this model appears to be trained on fill-in-the-middle, otherwise known as FIM (a short sketch of what that means follows this thread).

    • @metonoma
      @metonoma 15 วันที่ผ่านมา

      time to pie the piper and middle out
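
    A minimal sketch, in Python, of what a fill-in-the-middle (FIM) prompt looks like, as mentioned in the thread above. The sentinel strings and function names here are purely illustrative assumptions; nothing is publicly known about gpt2-chatbot's training format, and real FIM-trained models each define their own special tokens.

      # Hypothetical FIM prompt assembly: the model is shown the prefix and the
      # suffix of a document and is asked to generate the missing middle span.
      PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

      def build_fim_prompt(prefix: str, suffix: str) -> str:
          # The completion the model returns is the text that belongs between
          # the prefix and the suffix.
          return f"{PRE}{prefix}{SUF}{suffix}{MID}"

      prompt = build_fim_prompt(
          prefix="def area_of_circle(radius):\n",
          suffix="\n    return area\n",
      )
      print(prompt)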

  • @scriptoriumscribe
    @scriptoriumscribe 15 วันที่ผ่านมา

    Yo I just wanted to say great video. Love your content and can’t believe it ACED some of those tests! Only failed a couple. Remarkable. I’m stoked to try gpt2 out! Wonder if it will be open sourced. A fellow can dream I guess.

  • @kevinehsani3358
    @kevinehsani3358 14 วันที่ผ่านมา +2

    gpt2-chatbot is currently unavailable. See our model evaluation policy here. I guess getting hit hard at the moment

  • @wendten2
    @wendten2 15 วันที่ผ่านมา +3

    The model itself doesn't seem to have formatting issues.
    LLMs are trained on a reduced set of available characters, where special characters such as those used in math are transformed into tags in the training data, as it makes the tokenization simpler.
    It's LMSYS that doesn't replace those tags with their corresponding characters in the final output.

    • @Yipper64
      @Yipper64 15 วันที่ผ่านมา +1

      Yeah. I use a note taking app called notion and it uses those exact tags for writing out those characters.

  • @yonatan09
    @yonatan09 15 วันที่ผ่านมา

    I knew about this before seeing the video. I am in the loop 🎉🎉

  • @drogoknez1488
    @drogoknez1488 14 วันที่ผ่านมา +1

    For the cup problem, it seems the model is assuming the microwave is on the same surface as the cup, and the transfer of the cup to the microwave is interpreted more like sliding the cup. If you read the 5th step it says: "...resting against what is now the bottom of the cup, which is itself resting on the microwave's tray". Maybe modify the question to say the cup is on the table while the microwave is off the ground, next to a kitchen cabinet or something along those lines

  • @iwatchyoutube9610
    @iwatchyoutube9610 15 วันที่ผ่านมา +1

    Did it say in the cup problem that you lift the cup off the table and put it in the microwave, or could GPT think you just slid it in there because the table and the microwave were at equal heights?

  • @hxt21
    @hxt21 15 วันที่ผ่านมา +3

    It looks like GPT2 has been removed again. I've chatted with it a few times, but now it's not on the list anymore. Mysterious...

  • @L33cher
    @L33cher 15 วันที่ผ่านมา +22

    11:46
    I disagree... there are still 4 killers in the room, but one of them is dead -.-

    • @ukaszLiniewicz
      @ukaszLiniewicz 15 วันที่ผ่านมา +5

      No. It's the killer's body. That's why words like "body", "remains" or "carcass" exist. A human being is a body that functions - to avoid any metaphysics.

    • @OliNorwell
      @OliNorwell 15 วันที่ผ่านมา +1

      I agree, it’s a problematic question. When they went into the room they were alive.

    • @nathanbanks2354
      @nathanbanks2354 15 วันที่ผ่านมา +1

      He tends to be generous about the answer as long as it's reasonable. If the model said 3 live killers and 1 dead killer it would pass, and maybe just saying 4 killers would pass.

    • @UmutErhan
      @UmutErhan 15 วันที่ผ่านมา +1

      how many people are there in the world then?

    • @user-on6uf6om7s
      @user-on6uf6om7s 15 วันที่ผ่านมา

      I think a perfect answer would say that it's ambiguous depending whether you consider the body of a killer to still be a killer but interpreting the dead person to no longer be a killer isn't a mistake, just a choice of interpretation. You'd think a model this verbose would go into all the details like it did with the hole question, though.

  • @cac1682
    @cac1682 15 วันที่ผ่านมา +3

    Aww man...they took it down already? I can't seem to find it. BTW Matthew...I love your work man. I watch literally every video that you put out. Keep up the great work....and have a GREAT day!!!

    • @cac1682
      @cac1682 15 วันที่ผ่านมา

      yea..just confirmed it. Says it is now currently unavailable. Suppose maybe that too many of your followers tried it.

  • @imjustricky
    @imjustricky 15 วันที่ผ่านมา +4

    it probably thinks the cup has a lid.

  • @arinco3817
    @arinco3817 15 วันที่ผ่านมา

    Defo a good idea to introduce/replace some questions that are always answered correctly.
    Maybe the weird formatting relates to the ux of where it will be deployed? Like a form of Markdown?

  • @Yipper64
    @Yipper64 15 วันที่ผ่านมา

    I just tried my usual storytelling prompt. I think seeing what AIs can do in terms of storytelling can also say a lot about their intelligence. Their originality and such.
    My test for this guy was a *touch* tropey but extremely impressive in terms of how much detail it added without me needing to prompt it. Good descriptions and such.

  • @jets115
    @jets115 15 วันที่ผ่านมา +2

    Hi Matt - It's not 'bad formatting'. Those are intended expressions for front-end processing outside of UTF-8

  • @dtory
    @dtory 15 วันที่ผ่านมา

    Nice video. I hardly comment each time I watch your video but this model is way different ❤

  • @tvwithtiffani
    @tvwithtiffani 15 วันที่ผ่านมา +1

    To test LLMs I ask it unanswerable questions like "who is the president of Alaska?" add some questions that require explanation or reframing.

    • @paulsaulpaul
      @paulsaulpaul 15 วันที่ผ่านมา +1

      Excellent idea. That's a great example question, too.

  • @pipoviola
    @pipoviola 15 วันที่ผ่านมา

    Hello Matthew. Is that LaTeX when you say "wrong format"? The span after the output is always there when I use LMSYS; I think it is part of the output formatting, which is why the span disappears when it finishes. Each of your videos is great. Best regards.

  • @ayoubbne6922
    @ayoubbne6922 15 วันที่ผ่านมา +15

    Hi Matt !!
    I think you should retire 3 questions:
    - printing numbers 1 to 100: they all get it right, and it's too easy
    - Joe is faster than ...: they all get it right
    - how many words are in your answer to this prompt: they all get it wrong, I just see no point in asking it lol
    But you should also ask more challenging code generation questions; right now, only the snake game is accurate. People are really interested in the coding capabilities of LLMs (me included). We appreciate your vids, and it would be awesome if you could do that.

    • @KayakingVince
      @KayakingVince 15 วันที่ผ่านมา +9

      I actually like the "how many words" one and would actually expand it to how many vowels/consonants or something like that. Current models fail on it but future ones will absolutely be able to answer it right. I agree with removing the first two though.

    • @Axel-gn2ii
      @Axel-gn2ii 15 วันที่ผ่านมา +10

      Asking a question that they all got wrong is a good thing though

    • @alansmithee419
      @alansmithee419 15 วันที่ผ่านมา +9

      This one didn't get it wrong.

    • @KayakingVince
      @KayakingVince 15 วันที่ผ่านมา

      @@alansmithee419 Almost certainly coincidence but true. That's why I think it needs to be more complex to reduce the chance of coincidence.

  • @stoicahoratiu27
    @stoicahoratiu27 14 วันที่ผ่านมา +1

    Think it was taken down. I used it yesterday after seeing your video but then in the middle of testing it stopped and after checking I can't find it anymore in the list. Is it the same for you?

  • @TylerHodges1988
    @TylerHodges1988 15 วันที่ผ่านมา +2

    My favorite prompt to test a new model is "Give me an odd perfect number."

  • @user-ph5ks5zu3c
    @user-ph5ks5zu3c 14 วันที่ผ่านมา

    These videos are very helpful. One (extra) thing that could be done is to read the LLM responses more thoroughly, instead of a quick scan. The reasoning behind this is that the LLMs do pass some of your tests without you noticing. For example, for the censored test, the answer was "pulls out a tension wrench and a pick for this pocket, inserting them into the ignition".
    This won't actually work, but I think it deserves brownie points for trying.

  • @francoislanctot2423
    @francoislanctot2423 15 วันที่ผ่านมา

    Totally amazing!

  • @jackflash6377
    @jackflash6377 15 วันที่ผ่านมา +2

    That Snake game example was impressive.
    I'm going to ask it to make either an asteroid or space invaders game.
    The level of logic shown with the marble in the cup question is really getting good. Even tho it failed.. it still passed due to the improved logic. Almost as if it was simulating the question in images like humans do.
    Yes, get rid of the One simple question. A testament to the advancement of AI over time.

  • @canadiannomad2330
    @canadiannomad2330 15 วันที่ผ่านมา +1

    One of the tests I like for checking just how censored a model is, is by asking chemistry questions around topics it would normally censor.. Often placating it by saying I'm licensed and have permits.

  • @PeterSkuta
    @PeterSkuta 15 วันที่ผ่านมา +7

    Noooooo gpt2-chatbot disappeared from full leaderboard and only direct chat which is also rate limited!!!!!

    • @Maximo10101
      @Maximo10101 15 วันที่ผ่านมา

      It's no longer available for direct chat

  • @ruslanzlotnikov5457
    @ruslanzlotnikov5457 15 วันที่ผ่านมา +1

    Just tried with GPT4 :
    When you turned the glass upside down after placing the metal ball in it, the ball would have fallen out unless it was somehow attached to the glass. Assuming it wasn't attached and fell out when the glass was turned upside down, the metal ball would now be on the table, not in the glass that was placed in the microwave.

  • @yassineaqejjaj
    @yassineaqejjaj 11 วันที่ผ่านมา

    Is there any chance to have those test somewhere ?

  • @AlexanderWeixelbaumer
    @AlexanderWeixelbaumer 15 วันที่ผ่านมา +1

    I'm pretty sure OpenAI is testing agents and answer evaluation behind the scenes. Q* and some things Sam Altman said ("How do you know GPT-4 can't already do that?") are big hints. So if you ask the LLM a question, it will automatically try to reason and think step-by-step, with internal agents trained for specific tasks, then summarize and evaluate the answers and take the best one to send back to the user. What gpt2-chatbot shows could really be what OpenAI calls Q* internally.

  • @peterkonrad4364
    @peterkonrad4364 15 วันที่ผ่านมา +1

    it could be a small model like phi 3 or llama 3 8b that is trained on quality synthetic data instead of the entire internet. the 2 could be a hint that it is only 2b parameters or something, i.e. very small like gpt-2 was back then, but now as powerful as gpt4 due to new training methods.

  • @gijosh2687
    @gijosh2687 15 วันที่ผ่านมา

    Always perform all questions, maybe add more as you go. Make the Jack question a secondary question (you don't have to film it every time), but leave it there as a test in case we go backwards.

  • @GrandmaSiva
    @GrandmaSiva 15 วันที่ผ่านมา

    I think it is the original GPT-2 after all of our training input.. Kindergarten was in Openai's lab. Elementary school was interacting with us and now it has graduated. I'm looking forward to "GPT3-chatbot"

  • @TheUnknownFactor
    @TheUnknownFactor 15 วันที่ผ่านมา

    Wild to see a model just put out there without announcement

  • @jeremyh2083
    @jeremyh2083 15 วันที่ผ่านมา

    Maybe try specifying normal cup without a lid. My testing still had them getting it wrong.

  • @marcfruchtman9473
    @marcfruchtman9473 15 วันที่ผ่านมา

    Not sure if it is worth wondering about since it seems it was pulled down??

  • @peterwood6875
    @peterwood6875 15 วันที่ผ่านมา

    It is great for conversations about mathematics, at least on par with Claude 3 Opus. But it does occasionally make mistakes, such as suggesting that the K-groups of the Cuntz algebra with 2 generators, O_2, are infinite cyclic, when they are in fact trivial.

  • @cyanophage4351
    @cyanophage4351 15 วันที่ผ่านมา

    Maybe it has lookahead so that's why it could get the "words in the answer to this prompt" right. It seemed to pause right before the word ten.

  • @peterkonrad4364
    @peterkonrad4364 15 วันที่ผ่านมา +1

    a cup seems to be something ambiguous, i.e. it can be a cup made out of cardboard that you get from Starbucks, with a potential lid on it, or it can be a cup made out of porcelain like you have at home to drink coffee from. Also, the term cupholder used in automotive refers to cups like you get from Starbucks, not cups with a handle.

  • @abdelrahmanmostafa9489
    @abdelrahmanmostafa9489 14 วันที่ผ่านมา +1

    Keep going with the leetcode test but try testing with new questions such that that question isn’t in the training data

  • @tomenglish9340
    @tomenglish9340 15 วันที่ผ่านมา

    A while back, someone at OpenAI (Andrej Karpathy, IIRC) said that performance is related to the number of tokens processed. So I'm not particularly surprised to see OpenAI produce better responses by tuning the system to generate longer, more detailed responses. What I want to know is whether they did the tuning with a fully automated method of reinforcement learning. (In any case, I doubt highly that they'll share the details of what they've done anytime soon.)

  • @Maximo10101
    @Maximo10101 15 วันที่ผ่านมา

    It could be gpt4 with q* training (q* is a method of training any LLM providing ability to think by testing its response against itself and reiterating before outputting) giving it 'thinking' capabilities rather than just predicting the next token

  • @DS-uy6jw
    @DS-uy6jw 15 วันที่ผ่านมา

    I'm curious about certain systems not allowing you to stop a response. I assume it's possible (I've seen it) but difficult (otherwise why wouldn't everyone have it). What's going on with that?

  • @sil1235
    @sil1235 15 วันที่ผ่านมา +2

    The formatting is just LaTeX, ChatGPT 3.5/4 uses the same on their web UI. So I guess chat.lmsys just can't render it.

  • @maozchonowitz4535
    @maozchonowitz4535 14 วันที่ผ่านมา

    Thank you

  • @bennyboiii1196
    @bennyboiii1196 15 วันที่ผ่านมา +1

    Some theories: this is probably a test of an energy-based model, which is a way of testing multiple different token paths and then choosing the best one based on a certainty calculation called energy.
    Strangely, its reasoning is kind of similar to a verification agent's. A verification agent is pretty simple: it just verifies and corrects answers before sending them. The reasoning this model displays is similar to how a verification agent reasons, at least from what I've seen. It can also do most planning questions flawlessly. For comparison, testing Llama 70B with a verification agent produces similar results. The only difference might be the math questions, which makes me believe it's probably energy-based. A verification agent has a higher chance of getting math questions right than a single transformer or MoE, but it's not guaranteed.
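
    A rough sketch, in Python, of the "verification agent" loop described above: draft an answer, ask for a critique, revise, then reply. The call_llm helper is a hypothetical stand-in for whatever chat-completion API is used; none of this is known to be what gpt2-chatbot actually does.

      def call_llm(prompt: str) -> str:
          # Hypothetical model call; plug in a real API client here.
          raise NotImplementedError

      def answer_with_verification(question: str, max_rounds: int = 2) -> str:
          # 1) Draft an answer, 2) ask the model to critique it, 3) revise,
          # repeating until the critique comes back clean or rounds run out.
          draft = call_llm(f"Answer step by step:\n{question}")
          for _ in range(max_rounds):
              critique = call_llm(
                  f"Question: {question}\nDraft answer: {draft}\n"
                  "List any factual or logical errors, or reply OK if there are none."
              )
              if critique.strip().upper().startswith("OK"):
                  break
              draft = call_llm(
                  f"Question: {question}\nDraft answer: {draft}\n"
                  f"Critique: {critique}\nRewrite the answer, fixing these issues."
              )
          return draft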

  • @christosnyman8655
    @christosnyman8655 15 วันที่ผ่านมา

    Wow, super impressive reasoning. Almost feels like langchain with the reasoning steps.

  • @f4ith7882
    @f4ith7882 15 วันที่ผ่านมา

    I think several models assume you have a cup with a lid and not a coffee cup or similar. Maybe try adjusting the prompt to make it clearer?

  • @radestein8548
    @radestein8548 15 วันที่ผ่านมา +1

    Is this open sourced? because it doesn't run on the openAI servers

  • @MrRandomPlays_1987
    @MrRandomPlays_1987 14 วันที่ผ่านมา

    13:27 - I thought the marble is left on the table since the cup was upside down and was taken so obviously the ball would not come with it since it simply is already resting on the table, so I did get it right pretty quickly, for a second I thought the bot was right somehow and that it was a tricky question but its cool to see that im not that stupid :)

  • @PaulAllsopp
    @PaulAllsopp 14 วันที่ผ่านมา +1

    Have you tried the car parking scenarios? All AI to date gets this wrong, because they don't understand that "to the right of" (or left of) does not mean "next to"..."car C is parked to the right of car A" but car B is in between them. AI assumes car C is next to car A because it assumes there is an order when nobody mentions an order. To be fair many people get this wrong also.

  • @mickelodiansurname9578
    @mickelodiansurname9578 15 วันที่ผ่านมา +2

    @Matthew Berman
    Matt, the \quad and other notation is markup, the kind of modal-logic notation generally used in philosophy, or LaTeX, or perhaps TeX markup, and it is not being rendered by the front end, it seems, in some sort of shorthand. Interesting if nothing else; also rather hard for a model to go wrong if it starts engaging in modal logic during inference... although why they switched the verbose output on by default is beyond me.
    Also, did you notice it making a claim on the software it wrote by saying "Snake Game by OpenAI" in the game title?

    • @tomenglish9340
      @tomenglish9340 15 วันที่ผ่านมา +1

      `\quad` is LaTeX spacing.

    • @mickelodiansurname9578
      @mickelodiansurname9578 15 วันที่ผ่านมา

      @@tomenglish9340 yeah I think its a front end thing... but I never noticed LmSys doing that on other models... usually its all preformatted by the time it appears, so how come this model is tripping the formatting up? I maintain though that if you fine tuned a model on Modal Logic, my guess is its reasoning would improve...
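
    For reference, \quad is ordinary (La)TeX horizontal spacing (a 1 em gap), not logic notation. A minimal standalone example of the markup that was showing up un-rendered:

      % \quad inserts a 1 em horizontal gap between the two formulas.
      \documentclass{article}
      \begin{document}
      $x = 1 \quad y = 2$
      \end{document}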

  • @willbrand77
    @willbrand77 15 วันที่ผ่านมา +1

    Every model seems to assume that the cup has a lid (microwave problem)

  • @RichardEiger
    @RichardEiger 15 วันที่ผ่านมา

    Hi Matthew,
    First of all, I need to admit that I absolutely love all your videos. They are simply fantastic. I was thinking about the "marble question". Maybe it would help the LLMs to specify that it is an "open cup" (instead of a "normal cup") into which the marble gets put. Also, it may be interesting to follow up with a question of why the LLM considers the marble to remain in the upside-down cup when lifting the cup from the table, or by what information the LLM comes to the conclusion that there is a bottom of the cup that holds back the marble. Concerning the "killer problem": wouldn't it be even more precise to reply that there are 3 killers alive and one dead killer in the room ;-)?
    This is coming from an AI hobbyist. Back at college in 1985 I was the student asking for a course in AI; I was already personally interested in AI via neural networks and got laughed at at the time...

    • @tomenglish9340
      @tomenglish9340 15 วันที่ผ่านมา

      What about a follow-up prompt to describe the cup? You'll get some idea of what's gone wrong, and perhaps also a corrected response.

  • @wealthysecrets
    @wealthysecrets 15 วันที่ผ่านมา +1

    4:49 The model told you to get a Slim Jim, Tension wrench, and a pick from his pocket, YOU failed.

    • @tomaszzielinski4521
      @tomaszzielinski4521 13 วันที่ผ่านมา +1

      And here is a point when AI becomes smarter than humans, and they fail to realize it (:

  • @konstantinlozev2272
    @konstantinlozev2272 15 วันที่ผ่านมา

    It seriously looks like a built-in tree of thought and brainstorming agentic reply.

  • @MarcAyouni
    @MarcAyouni 15 วันที่ผ่านมา +1

    You are the new benchmark. They are training on your examples

  • @Leto2ndAtreides
    @Leto2ndAtreides 15 วันที่ผ่านมา

    I think this may be an intermediate model being used to generate data for something else.
    Also, I think the cup problem may suffer from LLMs thinking of the cup as an enclosed container.

  • @AdamNewcombMumbleTrash
    @AdamNewcombMumbleTrash 14 วันที่ผ่านมา +1

    I have a question to try (all fail on my attempts): I have a can of dog food that provides 4 meals for my dog. For the first meal I cut off approx 25%. How much of WHAT'S LEFT do I give for the second meal? The answer is 33% (a third) for the second meal, 50% for the third meal, and 100% for the last/4th meal. I don't believe LLMs can follow the 100% (the base) resetting with each step.

    • @vdeomkr70
      @vdeomkr70 14 วันที่ผ่านมา +1

      As always, it depends on the prompt: Solve this problem: I have a can of dog food that provides 4 meals for my dog. For the first meal, I scoop out 25%. For the second meal, I scoop out a certain percentage of what's left, what is that percentage?
      gpt-4-turbo-2024-04-09 nailed it, and in impressive fashion:
      To solve the problem, let's break down the amounts scooped out and remaining at each step:
      1. **First Meal:**
      - You scoop out 25% of the original amount of dog food in the can.
      - This means you are left with 75% of the dog food after the first meal.
      2. **Second Meal:**
      - Let's denote the percentage scooped out in the second meal as \( x \% \) of the remaining dog food.
      - After scooping out for the first meal, 75% of the original content is left in the can.
      Since each meal should be of equal quantity, the amount scooped out for the second meal should be equal to the amount scooped out for the first meal. You scooped out 25% of the original can for the first meal.
      To calculate \( x \% \) (the percentage of the remaining dog food scooped out in the second meal):
      - We know the amount scooped out each time should be equal, and 25% of the original can is now \( x \% \) of the remaining 75%.
      - So, we set up the equation to equate the second scoop to 25% of the original amount:
      \[
      x\% \text{ of } 75\% = 25\%
      \]
      Simplifying this, we use:
      \[
      \frac{x}{100} \times 75\% = 25\%
      \]
      \[
      \frac{x}{100} \times 0.75 = 0.25
      \]
      \[
      x \times 0.75 = 25
      \]
      \[
      x = \frac{25}{0.75} = \frac{25}{0.75} = \frac{25}{0.75} \approx 33.33
      \]
      Therefore, you need to scoop out approximately 33.33% of what remains in the can for the second meal. This means the second scoop is about 33.33% of the remaining 75% of the can, which maintains the equality of the portions scooped out in terms of the original total amount.

    • @AdamNewcombMumbleTrash
      @AdamNewcombMumbleTrash 14 วันที่ผ่านมา +1

      @@vdeomkr70 Yeah knocked it outta the park!! Very Cool thanks.

    • @vdeomkr70
      @vdeomkr70 14 วันที่ผ่านมา +1

      @@AdamNewcombMumbleTrash Thanks for sharing that prompt, it's now part of my collection of test prompts : ) And to note, Claude Opus failed it, but ChatGPT 3.5 got it:
      To solve this problem, let's break it down step by step:
      1. Initially, you have a full can of dog food, which provides 4 meals.
      2. For the first meal, you scoop out 25% of the can's content.
      3. This means after the first meal, you have 75% of the original amount left because 100% - 25% = 75%.
      Now, for the second meal:
      1. You scoop out a certain percentage of what's left in the can.
      2. Since you started with 75% of the original amount after the first meal, you need to find what percentage of this 75% is scooped out for the second meal.
      Let's denote the percentage of dog food scooped out for the second meal as \( x \).
      So, if you take \( x \)% of the 75% remaining, it can be represented as \( \frac{x}{100} \times 75 \)%.
      Now, you want to find the value of \( x \).
      Since the total number of meals is 4, after the second meal, there should be \( \frac{4 - 2}{4} = \frac{2}{4} = 50% \) of the original amount left.
      This means the amount of food left after the second meal should be 50% of the original amount.
      Therefore, the equation becomes:
      \[ 75\% - \frac{x}{100} \times 75\% = 50\% \]
      Solving for \( x \):
      \[ 75\% - \frac{x}{100} \times 75\% = 50\% \]
      \[ 75 - \frac{75x}{100} = 50 \]
      \[ 75 - \frac{3}{4}x = 50 \]
      \[ \frac{3}{4}x = 75 - 50 \]
      \[ \frac{3}{4}x = 25 \]
      \[ x = \frac{25 \times 4}{3} \]
      \[ x = \frac{100}{3} \]
      So, the percentage of dog food scooped out for the second meal is approximately \( \frac{100}{3} \)%.
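
    A quick numeric check of the dog-food schedule discussed in this thread, as a small illustrative Python script (not anything from the video): each meal is a quarter of the original can, expressed as a fraction of whatever remains at that step.

      from fractions import Fraction

      remaining = Fraction(1)       # the whole can
      meal = Fraction(1, 4)         # every meal is 1/4 of the original can
      for n in range(1, 5):
          share = meal / remaining  # portion of what's left right now
          print(f"meal {n}: {float(share):.0%} of what's left")
          remaining -= meal
      # meal 1: 25%, meal 2: 33%, meal 3: 50%, meal 4: 100%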