Grok-1 FULLY TESTED - Fascinating Results!

  • Published May 13, 2024
  • Let's test Grok using our LLM rubric! How does it compare to other models?
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? ✅
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    Rent a GPU (MassedCompute) 🚀
    bit.ly/matthew-berman-youtube
    USE CODE "MatthewBerman" for 50% discount
    Media/Sponsorship Inquiries 📈
    bit.ly/44TC45V
    Links:
    Blog Announcement - x.ai/blog/grok-os
    LLM Leaderboard - bit.ly/3qHV0X7
    Chapters:
    0:00 - Intro
    0:51 - Testing
  • Science & Technology

Comments • 665

  • @MuckoMan
    @MuckoMan หลายเดือนก่อน +43

    I'm just a truck driver but I find this stuff amazing. I love everything open source, even if it's not working perfectly. Yet!

    • @Gatrehs
      @Gatrehs หลายเดือนก่อน +16

      "Just" a truck driver. Ridiculous.
      You provide an essential service to make the lives of a lot of people better and more comfortable.

    • @jondeik
      @jondeik หลายเดือนก่อน

      I drove for UPS for 7 years. I have more respect for tractor trailer drivers than anyone else, and I didn’t even do it. I drove the box trucks (they call them package cars).
      Also, your job or career has nothing to do with how your brain works. Smart people work regular jobs

    • @spirosch5276
      @spirosch5276 หลายเดือนก่อน +2

      "Just a truck driver?" LMAO, man. I have an undergrad and a postgrad degree, both framed and hanging in my bathroom. I hold a good position in a great company, yet I couldn't, in a million years, do the job you're doing, mate. Believe me, your contribution to society is greater than mine.

    • @ChrisOrillia
      @ChrisOrillia หลายเดือนก่อน +1

      you’re not *just* a truck driver 🙄… we’re curious people, man

    • @harshans7712
      @harshans7712 หลายเดือนก่อน +1

      Man!! "Just a truck driver" literally without your service nothing runs in this world man, we all citizens are really thankful for your wonderful service and we all are getting things at time just because of your service.

  • @labradore99
    @labradore99 หลายเดือนก่อน +327

    What the heck? Wasn't the point of the shirts drying example that the reasoning should figure out the possibility of drying 20 shirts in parallel and that it would still take 5 hours?

    • @JohnKerbaugh
      @JohnKerbaugh หลายเดือนก่อน +186

      I'm gonna fail the host on that one. 😂

    • @mickelodiansurname9578
      @mickelodiansurname9578 หลายเดือนก่อน +33

      I'm thinking about this a bit and I think the prompt is unfair... imagine it simply answered "It will take the same time!" when what was actually needed was how long they would take sequentially, because there is only one place to dry them. Just saying, the prompt never stated you had the ability to lay them all out at once, or that this was possible.

    • @TheSnakecarver
      @TheSnakecarver หลายเดือนก่อน +20

      It would take 4 hours, but who counts.

    • @christopherchilton-smith6482
      @christopherchilton-smith6482 หลายเดือนก่อน +6

      ​@@mickelodiansurname9578 You're correct. In my experience LLMs are severely limited in logic; you can't depend on them to infer consistently. If you make it clear in the prompt that the 5 shirts are all laid outside simultaneously, it will likely answer correctly.
      I don't use X and can't stand Elon, so I'll have to wait for a quantized version to see what's up with this thing. Still, I have yet to use an LLM that codes as consistently well as ChatGPT. Keeping my fingers crossed for a comparable local model soon, though I haven't tried Devin or AutoDev yet.

    • @LeonardLay
      @LeonardLay หลายเดือนก่อน +17

      He always says that it's a pass if the llm solves it sequentially or in parallel as long as it explains the reasoning which it did

  • @SODKGB
    @SODKGB หลายเดือนก่อน +8

    Great job, really enjoyed this fast Grok overview!

    • @tony8k
      @tony8k หลายเดือนก่อน

      Same

  • @cbnewham5633
    @cbnewham5633 หลายเดือนก่อน +71

    The problem I see is that, as you keep repeating the same tests, the likelihood is that the model will have been trained on them, since they will have been seen somewhere online. You need to change the tests and run them against all the LLMs available at the time so as to compare them; otherwise later models have a distinct advantage.

    • @reinerheiner1148
      @reinerheiner1148 หลายเดือนก่อน +9

      That's true, but this is more of an entertainment channel. Don't take it too seriously. To really test an LLM, these tests would be wildly insufficient anyway.

    • @thearchitect5405
      @thearchitect5405 หลายเดือนก่อน +2

      Grok doesn't even need to be trained on it, Grok has internet access, rendering this test meaningless.

    • @HUEHUEUHEPony
      @HUEHUEUHEPony หลายเดือนก่อน +2

      soy testing, waste of time

    • @dianagentu7478
      @dianagentu7478 หลายเดือนก่อน +1

      @@reinerheiner1148 Yes, tip me in the direction of real tests please? Also, all LLMs, regardless of company, have the same "favorite words"; I can't find any literature on why. Any ideas where I can look? I hope this has already been covered somewhere.

    • @mickelodiansurname9578
      @mickelodiansurname9578 หลายเดือนก่อน

      There's not a lot we can do about that. If the tests are changed now, the previous ones are no longer comparable, and if we do not change the questions then yes, they will end up in someone's training data... What Matt needs, but has not built, is an automated and private set of questions (maybe 1,000 questions in a CSV file) that he releases to nobody and that gives a score (a rough sketch of that idea follows below).
      Other than that, the results here will be highly subjective and not worth considering as a reliable measure.
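
      A minimal sketch of what such a private, automated benchmark could look like, assuming a hypothetical questions.csv with question/expected columns and a placeholder ask_model() standing in for whichever model endpoint is being tested:

```python
import csv

def ask_model(prompt: str) -> str:
    """Placeholder: call whatever LLM endpoint you want to benchmark."""
    raise NotImplementedError

def run_benchmark(path: str = "questions.csv") -> float:
    """Return the fraction of questions whose expected keyword appears in the model's answer."""
    total = correct = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # columns: question, expected
            total += 1
            if row["expected"].lower() in ask_model(row["question"]).lower():
                correct += 1
    return correct / total if total else 0.0

# Example usage once ask_model is wired up to a real model:
# print(f"score: {run_benchmark():.1%}")
```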

  • @executivelifehacks6747
    @executivelifehacks6747 หลายเดือนก่อน

    Thanks Matt, we were all wanting to see the results, cheers!

  • @lachland592
    @lachland592 หลายเดือนก่อน +19

    I don’t think this is equivalent to testing the open source Grok. The open version is apparently not fine tuned at all, but the version available through Twitter is clearly fine-tuned for chat.

    • @thearchitect5405
      @thearchitect5405 หลายเดือนก่อน +2

      Not to mention the tools set up allowing Grok to search Twitter for answers to these questions.

  • @igorip2005
    @igorip2005 หลายเดือนก่อน

    Cool, thanks for testing, I waited for it

  • @MxAi955
    @MxAi955 หลายเดือนก่อน +53

    It would really be cool to have a score count on the screen as you test the models; that would help with tracking the scores.

  • @Pietro-Caroleo-29
    @Pietro-Caroleo-29 หลายเดือนก่อน

    Thank you for today's presentation, Mr Berman.

  • @CYBONIX
    @CYBONIX หลายเดือนก่อน +1

    Matt, I consume a lot, and I mean a lot of "AI" content, podcast, etc.... You are by far one of my favorites. What you do here isn't easy, but you do it very well. Thank you~!

  • @eddieb8615
    @eddieb8615 หลายเดือนก่อน

    Always the best content!

  • @DiscOutpost
    @DiscOutpost 13 วันที่ผ่านมา

    This was very cool. I subscribed. For a couple of the Grok fails I could argue that wording played a role in the failure. But, thank you!

  • @mackblack5153
    @mackblack5153 หลายเดือนก่อน

    Love these tests!

  • @dsuess
    @dsuess หลายเดือนก่อน +12

    Regarding the Snake game, I asked Google's Gemini a similarly simple task and it couldn't even build one.
    At least the snake moved here

    • @compton8301
      @compton8301 หลายเดือนก่อน +2

      Gemini Advanced or the free one?

    • @jasonshere
      @jasonshere หลายเดือนก่อน +5

      All of Google's AI's are extremely limited.

  • @JacoduPlooy12134
    @JacoduPlooy12134 หลายเดือนก่อน +1

    Nice!
    Maybe at the end you can show the "leaderboard" so we can see how it stacks up against the others on perhaps a spreadsheet?

  • @Xardasflynn657
    @Xardasflynn657 หลายเดือนก่อน +36

    Isn't the answer for the killers question 4? The one that got killed is still in the room, so there are 4

    • @jackflash6377
      @jackflash6377 หลายเดือนก่อน +12

      Correct logically. Just because they died doesn't mean they left the room. They just changed state from "live" to "dead".

    • @YbisZX
      @YbisZX หลายเดือนก่อน +9

      @@jackflash6377 Maybe a killer is someone who can kill. A dead man can't kill, so formally he is not a killer anymore.

    • @Xardasflynn657
      @Xardasflynn657 หลายเดือนก่อน +10

      ​@@YbisZX logically, he's a dead killer, so a killer nonetheless, there's a logical mistake in the original answer

    • @jackflash6377
      @jackflash6377 หลายเดือนก่อน +3

      @@YbisZX Good point, but he WAS a killer and still IS a killer, just a dead killer. "Killer" can be past, present, and future tense. He was, is, and could be a killer. Nowhere in the prompt did it say they have to be alive.

    • @jammin023
      @jammin023 หลายเดือนก่อน +2

      Yeah, another thing (like the shirt problem) that this guy thinks is a pass but is actually a fail. Would help if he understood what the correct answers are.

  • @dwirtz0116
    @dwirtz0116 หลายเดือนก่อน

    Great video. 👍 I was thinking of a test you could add which I didn't see you touch on: recall of text previously discussed in the same session. It seems horrible at that. Also, I tried to get it to answer in lists of 10 and it couldn't do it repeatedly... 🤔🤓

  • @ff_ani
    @ff_ani หลายเดือนก่อน +2

    Maybe soon we will see an amazing collaboration of two technologies: fast Groq and huge Grok, and it will work really cool and fast!🎉

  • @thomassynths
    @thomassynths หลายเดือนก่อน +4

    The digging rate could very well remain constant for a fairly deep hole, provided some sort of conveyor system.

    • @timeless3d858
      @timeless3d858 หลายเดือนก่อน

      Yeah, they could develop a system to not get in each other's way, or, as usually happens when working collectively, they could delegate tasks so that they finish at an even faster rate.

    • @bigglyguy8429
      @bigglyguy8429 หลายเดือนก่อน +1

      It could just presume a trench. Expecting the AI to answer "The same, you idiot, cos there's not enough room for the others to dig" is asking a bit much, really?

  • @Sofian375
    @Sofian375 หลายเดือนก่อน +40

    How is the question about the shirts a pass?

    • @raoultesla2292
      @raoultesla2292 หลายเดือนก่อน +5

      If you wear a SpaceX t-shirt to your NeuraLink interview you will move up for implant installation.

    • @calysagora3615
      @calysagora3615 หลายเดือนก่อน +8

      It was not. He was blatantly wrong on that one.

    • @SlyNine
      @SlyNine หลายเดือนก่อน +2

      ​@@calysagora3615 you're blatantly wrong. A serial drying time is accepted as long as it states that's what it's doing.
      That means one at a time. Try to keep up.

    • @funkahontas
      @funkahontas หลายเดือนก่อน +4

      Because Matt is a clear Elon fanboy.

    • @Michael-ul7kv
      @Michael-ul7kv หลายเดือนก่อน

      Based on the information provided it's the most correct answer, and it explained how it got to the answer. Parallel drying takes in many assumptions; if it chose that answer it would need to provide the caveat that it would only be correct given more information.

  • @nqnam12345
    @nqnam12345 หลายเดือนก่อน

    thanks Matt!

  • @MrSuntask
    @MrSuntask หลายเดือนก่อน +13

    Matthew, no one on this planet dries t-shirts serially over 16 hours, one after the other.
    I don't think it is a pass if it divides the drying time by the number of t-shirts,
    because the main reason for this test is to check whether the LLM has an understanding of the real world.
    BTW, a third answer (somewhat more realistic, but still uncommon) would be to dry in batches: if your drying space can only handle 5 shirts at once it would be four batches.
    Great test though! Thank you.

    • @CuriousCattery
      @CuriousCattery หลายเดือนก่อน +1

      This is GPT-4's answer:
      The time it takes for shirts to dry outside in the sun does not increase with the number of shirts, assuming they all receive adequate sunlight and air flow simultaneously. If 4 shirts take 5 hours to dry, then 40 shirts would also take 5 hours to dry, provided that they are spread out in a way that doesn't hinder their exposure to the sun and air.

    • @OurSpaceshipEarth
      @OurSpaceshipEarth หลายเดือนก่อน +1

      We know that they don't understand anything; they just guess the next token. So far.

    • @mickelodiansurname9578
      @mickelodiansurname9578 หลายเดือนก่อน

      People who work in a laundry do this.... Ten or 20 dry today then tomorrow another 10 or 20...and so on.... Cos you cannot dry shirts that have not yet been washed. Just sayin...

    • @mickelodiansurname9578
      @mickelodiansurname9578 หลายเดือนก่อน

      @@OurSpaceshipEarth Well, if that's true, tens of thousands of data scientists... some of the brightest folks on earth... are wrong... Plus Devin and AutoGen and CrewAI don't work. However, first, these systems do actually work... so that would need to be explained without invoking reasoning... And secondly, I find it relatively common for people to point at scientists saying "what would they know?"...
      Can I point out how unlikely that is?

  • @johndallara3257
    @johndallara3257 หลายเดือนก่อน

    Marble in cup, then calling it a ball, adds an extra layer of reasoning: equating ball and marble. Very good tests and great info, thanks. Would you say the Grok access that comes with your paid X account is worth the price on its own?

  • @xsa-tube
    @xsa-tube หลายเดือนก่อน

    In the place where you ask about word count, the 2 invisible ones are and tags, I think so

    • @OurSpaceshipEarth
      @OurSpaceshipEarth หลายเดือนก่อน

      That's amazingly interesting; he really needs to know that, eh! I think the only correct answer would ever be simply "one."

  • @researchforumonline
    @researchforumonline หลายเดือนก่อน

    Thanks shared!

  • @DavesNotHereRightNow
    @DavesNotHereRightNow หลายเดือนก่อน +3

    Are you using Grok. 1.0? I thought the one on X was now 1.5?

  • @carlos_mann
    @carlos_mann หลายเดือนก่อน

    When running one of these locally, can you give it instructions to create an app that does whatever you specify?
    Or can you have it search through thousands of personal files, learn from the data, and be used as a search engine over your own files instead of the local search tools (i.e. File Explorer)?

  • @cjhmdm
    @cjhmdm หลายเดือนก่อน +17

    as for the "there are 12 words in my response to this prompt"... is it possible the AI is also counting "Grok" and "@grok" since both are technically part of the response? That would definitely make it 12 words if so.

    • @apache937
      @apache937 หลายเดือนก่อน +1

      no, that's not really how the LLM sees it internally

    • @kliersheed
      @kliersheed หลายเดือนก่อน

      @@apache937 How would you know? Maybe it learned that from some nitpicking smart-ass on Twitter? Could be an emergent ability. OP would have had to ask for its reasoning; sadly he didn't.

    • @apache937
      @apache937 หลายเดือนก่อน

      ​@@kliersheed Google "chat format oobabooga", click the first link, then read all of it

    • @thearchitect5405
      @thearchitect5405 หลายเดือนก่อน

      With the way these models work, they cannot plan out a sentence beforehand like a human can because they lack any form of internal dialogue, they'd need to plan it in text, which would count as their response. The only correct answer they can give is "one", otherwise the model just cannot physically do it.
      I'm surprised Grok didn't find the answer online considering it has internet access unlike other models tested with these questions.

  • @scottcastle9119
    @scottcastle9119 หลายเดือนก่อน

    I've got grok all year and I really like it

  • @realDeor
    @realDeor หลายเดือนก่อน

    I look forward to the next few months; I am really curious to see what people can achieve with it now that it's open source.

  • @FranciscoLopes_Bthere
    @FranciscoLopes_Bthere หลายเดือนก่อน

    I have an i9-10900K, 64 GB RAM, and a 3090, and I would like to give Grok a try. How can I set it up? Is there a smaller version that can be used without a datacenter?

  • @christianross2567
    @christianross2567 หลายเดือนก่อน

    Hey Matt! I noticed you were about half way through Gadaffi maxing and I was wondering if you were going to give us a video detailing your journey. Kind regards, your biggest fan.

  • @televerket
    @televerket หลายเดือนก่อน

    Logic question: don't you need to remove the 4th, dead killer from the room to be left with 3?
    Is this type of logic why we end up with too many killers/paperclips?

  • @Cambo866
    @Cambo866 หลายเดือนก่อน

    I think if it uses assumptions to give a response like it did in the digging the hole problem then you should follow up with a question getting it to explain why its assumption may fail.

  • @alby13
    @alby13 หลายเดือนก่อน

    What if you say that you are not looking for the exact amount or the calculated amount as the answer so that it might have a chance to answer you with what you are looking for?

  • @cacogenicist
    @cacogenicist หลายเดือนก่อน +1

    *In case anyone was wondering how Grok compares to Claude 3 Sonnet in that spatial/physical reasoning test, Claude nails it, although my prompt was somewhat more precisely worded and less ambiguous. Even the mid-tier Claude 3 is pretty freaking great:*
    To reason through this step-by-step:
    1) A person places a coffee cup upright on a table.
    2) They drop a marble into the cup, so the marble is now inside the cup.
    3) They quickly turn the cup upside down on the table. This means the open end of the cup is now facing down towards the table surface.
    4) When the cup was turned over, the marble would have fallen out of the cup and onto the table surface, since there is nothing to contain it inside the upside-down cup.
    5) The person then picks up the upside-down cup, without changing its orientation. So the cup remains upside-down.
    6) They place the upside-down cup into the microwave.
    Therefore, based on the sequence of steps, the marble would have fallen out of the cup when it was turned over, and would be left behind on the table surface. The marble is not in the microwave with the upside-down cup.

  • @kevinwells768
    @kevinwells768 หลายเดือนก่อน

    Nice. Would have been useful to see a leader chart of which models pass which test.

  • @Rich28448
    @Rich28448 หลายเดือนก่อน

    Can you check whether, having now done the test, you would get the same result if you ran it again? I.e., did it learn in real time?

  • @MN-jz3qy
    @MN-jz3qy หลายเดือนก่อน

    What's the link to the interface you are using to test the model?

  • @PseudoProphet
    @PseudoProphet หลายเดือนก่อน +59

    4:39 there actually are 12 tokens in that response, not 12 words . 😅😅

    • @matthew_berman
      @matthew_berman  หลายเดือนก่อน +12

      how do you know?

    • @PseudoProphet
      @PseudoProphet หลายเดือนก่อน +22

      @@matthew_berman I counted: all the words, 2 digits, and 1 ".". Total 12.

    • @mickelodiansurname9578
      @mickelodiansurname9578 หลายเดือนก่อน +5

      Pure luck... and by the way, the answer that wins is the model answering "ONE", as one word, every time.

    • @elwyn14
      @elwyn14 หลายเดือนก่อน

      Download the encoder and do it again to be sure 😁 ​@@PseudoProphet

    • @bluemodize7718
      @bluemodize7718 หลายเดือนก่อน

      @@PseudoProphet A word doesn't always count as one token; many words consist of 2 tokens. I don't know the exact rule for determining tokens, but OpenAI has a website that shows you the token count (a quick way to check it in code is sketched below).
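
      A quick way to see the word/token split for that sentence (a sketch using OpenAI's tiktoken tokenizer; Grok's own tokenizer will produce a different count):

```python
import tiktoken  # pip install tiktoken

sentence = "There are 12 words in my response to this prompt."
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode(sentence)

# Whitespace-separated words vs. tokenizer pieces.
print("words :", len(sentence.split()))
print("tokens:", len(tokens))
print([enc.decode([t]) for t in tokens])  # show each token as text
```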

  • @lun321
    @lun321 หลายเดือนก่อน

    Is it possible that Grok counted the period and 'twelve' (not 12 since it was already accounted for) as words, equalling 12?

  • @usethatherb4913
    @usethatherb4913 หลายเดือนก่อน

    What GPU power would you have needed to run it?

  • @pokerchannel6991
    @pokerchannel6991 หลายเดือนก่อน +1

    I found out quantizing is like compression, where certain data points are lumped together as one undifferentiated blob. It runs quicker, but some nuance is lost.

    • @apache937
      @apache937 หลายเดือนก่อน +1

      Yes, that's accurate. You can do some quantization without any real loss in accuracy; then it gets bad fast.

  • @u.v.s.5583
    @u.v.s.5583 หลายเดือนก่อน +1

    - Write a sentence that ends with apple.
    - The court of wizards sentences wizard X to death by turning him into an apple and leaving in this state for the rest of eternity.

  • @lemoniscate
    @lemoniscate หลายเดือนก่อน

    2:49 Man, out of everything, I did not expect to find out about the 3-week One Piece hiatus through this video

  • @gaminginfrench
    @gaminginfrench หลายเดือนก่อน

    Is there a smaller open source model I can get to read French books and summarise chapters? Or fix errors I make when writing in French?

    • @blayneallan
      @blayneallan หลายเดือนก่อน +2

      Mistral 7B probably given that Mistral is a French company.

  • @jonyfrany1319
    @jonyfrany1319 หลายเดือนก่อน

    When can we get the hardware for it ?

  • @gilbertb99
    @gilbertb99 หลายเดือนก่อน +6

    How do you know that whatever is deployed on twitter is not a newer or better model than grok-1?

    • @SlyNine
      @SlyNine หลายเดือนก่อน

      How do you know there's not a purple unicorn in your room that you can't see or feel?

    • @apache937
      @apache937 หลายเดือนก่อน

      or quantized in itself

    • @HAL-zl1lg
      @HAL-zl1lg หลายเดือนก่อน

      ​@@SlyNine X uses 1.5; 1.0 was the version that was released.

  • @TheOpacue
    @TheOpacue หลายเดือนก่อน

    What's "quantizing", or however it's spelled?

  • @akanbikhalid6928
    @akanbikhalid6928 หลายเดือนก่อน +13

    You should change the questions but keep them similar. If I were working at an AI company, I would be using your videos and others to fix the results.

    • @matthew_berman
      @matthew_berman  หลายเดือนก่อน +7

      I don't think they care about me ;)

    • @MrVnelis
      @MrVnelis หลายเดือนก่อน +11

      @@matthew_berman We do. I work at one of them, and most of your videos trigger tens of messages in our Slack because nobody outside of the research dept has the time to test all these models. So we totally rely on channels like yours.

    • @MrVnelis
      @MrVnelis หลายเดือนก่อน

      Please do IBM :-)

    • @SlyNine
      @SlyNine หลายเดือนก่อน

      ​@@MrVnelis You don't care very much if you are giving it the answers. That's like studying for a specific IQ test: the results are invalid and useless.

    • @snooks5607
      @snooks5607 หลายเดือนก่อน

      @@SlyNine ranking in benchmarks -> exposure -> VC funding

  • @user-yp9cd3zo2p
    @user-yp9cd3zo2p หลายเดือนก่อน +18

    Regarding the "10 sentences that end in apple" question: maybe add "each". It gave 10 sentences and the last one ended in "apple(s)", so it could be seen as pretty close.

    • @efausett
      @efausett หลายเดือนก่อน +1

      I just tried GPT4 and it passed this test with either version of the wording.

    • @apache937
      @apache937 หลายเดือนก่อน +3

      I don't like this suggestion; the models should be able to do what you want regardless of errors in the prompt

    • @jessiejanson1528
      @jessiejanson1528 หลายเดือนก่อน +1

      They should do what you tell them or ask for clarification.

  • @thesagaofblitz
    @thesagaofblitz หลายเดือนก่อน

    If you pay the $22 to access premium X, will that grant access to Grok?

  • @synchro-dentally1965
    @synchro-dentally1965 หลายเดือนก่อน

    For the question about the word count of the response: is it considering the '.' and '\0' characters as words? I mean, it considers '12' as a word

    • @apache937
      @apache937 หลายเดือนก่อน +1

      Maybe. I think someone said the Grok tokenizer sees it as 13 tokens (the GPT-4 tokenizer had that sentence as 12 tokens). Regardless, it's a fail

  • @kristoferkrus
    @kristoferkrus หลายเดือนก่อน

    How does it work for an open source model to have access to real-time information from X/Twitter? If you run it on your own (super)computer, how is it able to pull information from X when other models cannot? How does X know that it's the Grok model requesting the information?

    • @thearchitect5405
      @thearchitect5405 หลายเดือนก่อน

      The open source model won't have internet access in the same manner Grok does on Twitter.

    • @kristoferkrus
      @kristoferkrus หลายเดือนก่อน

      @@thearchitect5405 Okay, well that makes sense if that's the case.

  • @oo__ee
    @oo__ee หลายเดือนก่อน +2

    The model on X is RLHF'd whereas the open sourced model is the base model

  • @william5931
    @william5931 หลายเดือนก่อน +1

    I noticed it gave the answer before the reasoning a few times. Maybe asking it to start with step-by-step reasoning would improve the results, because it has more time/tokens to "think". I feel like it is just trying to come up with some reasoning to cover up its mistake.

    • @apache937
      @apache937 หลายเดือนก่อน +1

      yes this is better

  • @emolasher
    @emolasher หลายเดือนก่อน +1

    @6:25 Change it to "lifts the upside-down cup"; maybe it needs it to be more descriptive.

  • @sophiophile
    @sophiophile หลายเดือนก่อน

    The error with delay was that it didn't make it global where it was declared, from what I could see (you scrolled pretty fast). It looks like there were other issues as well.

  • @RobC1999
    @RobC1999 หลายเดือนก่อน +1

    For the cup problem, I wonder if the models would get it right if it was specified the cup does not have a lid

    • @apache937
      @apache937 หลายเดือนก่อน

      the point is to not make it too easy

    • @RobC1999
      @RobC1999 หลายเดือนก่อน +1

      @@apache937 I understand. But ambiguity in a question relies on the model anticipating the potential meanings and either picking one or trying to spell out all the possibilities. The model would be correct if the cup has a lid.

  • @picksalot1
    @picksalot1 หลายเดือนก่อน +1

    The question "How many words are in your response to this prompt?" is a tricky one, depending on how the "12" is interpreted. It has single digits "1" "2", and a two digit number of "12". If each of theses is defined as a "word," then the answer of "There are 12 words in my response to this prompt." is correct, as 9 words + 3 number/digit words = 12 words. It will be interesting to see when/how it defines digits/numbers as words. In a way, "12" is a compound word, not unlike the word "Database," which can be seen as 3 words/meanings: Data + base + Database = 3.
    Enjoying your Channel. Thanks, subscribed.

    • @calysagora3615
      @calysagora3615 หลายเดือนก่อน

      The test is exactly to see if the AI thinks about it in the same way we do, where numbers are not words, prompt names are not words, punctuation are not words, etc. Here Grok failed.

    • @picksalot1
      @picksalot1 หลายเดือนก่อน

      @@calysagora3615 Interesting. I didn't know how the AI thinks about numbers. Seems like an easy thing to fix.

  • @Josh-bq6rm
    @Josh-bq6rm หลายเดือนก่อน

    So I can use it to finally do my Calculus problems?

  • @MEMUNDOLOL
    @MEMUNDOLOL หลายเดือนก่อน

    Matthew, hi, I think it would be better to ask AIs not about the number of words in the response but the number of tokens, because they think in tokens; for LLMs there are no words, only tokens

  • @davewilson4427
    @davewilson4427 หลายเดือนก่อน +1

    I’m curious about whether or not the open source model still has live access to X. In a way I feel like we’re going to see an intense performance drop when someone finally runs this monster locally.🤷🏼‍♂️ just a thought.

    • @supercurioTube
      @supercurioTube หลายเดือนก่อน +6

      No, the open source model won't have access to X.
      What was released is also a base model, without any of the fine-tuning that enables it to follow instructions and hold discussions as a chatbot.
      X didn't release the fine-tuned model they run as a product, only its base before it was turned into something usable.
      I realize that many think that what Elon/X published is the same thing as what's online, but it really isn't.

    • @apache937
      @apache937 หลายเดือนก่อน

      @@supercurioTube Would have been nice to have both, but base is better than fine-tuned if only one is released

  • @solifugus
    @solifugus หลายเดือนก่อน

    What kind of GPU do I need to run this model (at minimum cost)? You can still see the long-standing weakness of neural net models since their conception: they learn patterns and don't actually work through logic. They only work through logic insofar as the patterns they learn comply with it.

    • @thearchitect5405
      @thearchitect5405 หลายเดือนก่อน

      There is no single GPU that could run this model; it takes a large rack of GPUs, which is why nobody has set up the open source model for chat yet. If you're interested in running models on your own device, you're better off trying a smaller model like Qwen1.5, which outperforms Grok by a decent margin and can run on a single GPU.
      If you do want to run the largest (reasonable) models on your own device, you'd need one or two A100s depending on what you're hoping to run. This won't work for Grok, but there's no incentive to use Grok since it's a lot worse than the current top open source models.

    • @solifugus
      @solifugus หลายเดือนก่อน

      @@thearchitect5405 There is the uncensored aspect. I don't want it to be a neo-Nazi or build bombs or anything like that but GPT and especially Claude's levels of censorship and extreme left-wing censorship is far beyond reason, in my view. Most people consider me to be centrist, btw.

    • @thearchitect5405
      @thearchitect5405 หลายเดือนก่อน

      @@solifugus Anthropic and OpenAI don't have up-to-date open source models. I will mention, though, that Grok isn't just worse than most of the top open source models, it's also more censored than a lot of them.
      Grok has right-wing ideology ingrained in it; with these LLMs it's pretty hard to avoid having one or the other because they primarily learn bias. This counts as censorship because it's trained to exclude responses.
      Also, GPT-4 and Claude 3 censorship isn't exactly extreme left-wing; they're actually generally less politically censored than Grok. Though prior to Claude 3, the Claude models were by far the most censored on the market. You may either be thinking of Gemini, or of very niche cherry-picked examples.

  • @Candyapplebone
    @Candyapplebone หลายเดือนก่อน

    Dude that hoodie is dope what brand is it?

  • @dasistdiewahrheit9585
    @dasistdiewahrheit9585 หลายเดือนก่อน

    The hoodie 😻

  • @Manithan123
    @Manithan123 หลายเดือนก่อน

    Can you please upload a video showing what's the minimum hardware requirement to run Grok-1 and how to install and run it?

    • @Sven_Dongle
      @Sven_Dongle หลายเดือนก่อน

      Minimum hardware for Grok 1 is 128 cores clocking at a minimum of 3.8 GHz and 1 TB of RAM and 8 H100 GPUs since the weights alone are 256 GB. Hit me up with $500000 and I'll set you up.
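
      As a rough sanity check on numbers like these, weight memory is roughly parameter count times bytes per parameter. A back-of-the-envelope sketch, assuming Grok-1's stated 314B parameters and ignoring activations and KV-cache overhead:

```python
PARAMS = 314e9  # Grok-1's publicly stated parameter count

# Approximate bytes per parameter at common precisions.
for name, bytes_per_param in {"fp16/bf16": 2, "int8": 1, "int4": 0.5}.items():
    gigabytes = PARAMS * bytes_per_param / 1e9
    print(f"{name:>9}: ~{gigabytes:,.0f} GB of weights")

# Whatever the exact figure, it far exceeds a single consumer GPU,
# which is why multi-GPU servers (e.g. several 80 GB cards) come up.
```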

  • @doonk6004
    @doonk6004 หลายเดือนก่อน +1

    Groq needs to provide Grok-1 on their API now that it's open source!!

  • @mishka3876
    @mishka3876 หลายเดือนก่อน

    For day to day I use gpt to balance my budget by uploading an Excel sheet. I send it photos of my fridge and pantry and ask for a shopping list based on missing items from previous photos. And get it to summarise and find discrepancies in pdf documents. This is what I want to see for testing.

  • @matthewbond375
    @matthewbond375 หลายเดือนก่อน +3

    In your cup/marble challenge, could you be confusing the LLM by asking it where the BALL is, rather than the MARBLE? Or is that your intention?

    • @matthew_berman
      @matthew_berman  หลายเดือนก่อน +5

      Oh wow... I don't know why I never caught this! Although I don't think it should matter, I changed it in my tests for the future. Thanks for pointing it out!

    • @ctwolf
      @ctwolf หลายเดือนก่อน +3

      ​@@matthew_berman You're one cool guy, just utilizing the feedback without ego, my guy.
      Praise be unto you, Matthew.

    • @matthewbond375
      @matthewbond375 หลายเดือนก่อน

      Thanks for clarifying! This one is stumping all my local models in interesting ways. None of them seem to be able to reason that the marble doesn't stay in the cup as it moves, at least without giving them additional input. Great work! Love the channel!

  • @landonoffmars9598
    @landonoffmars9598 หลายเดือนก่อน

    I've noticed I'm not subscribed, I really thought I was. So I'm subscribing. Have a good day.

  • @GuidedBreathing
    @GuidedBreathing หลายเดือนก่อน +2

    Guess this was done a bit fast ☺️
    Shirts in parallel take the same time, and sequentially, well... 5:05 Four killers: one dead, three alive.

    • @reinerheiner1148
      @reinerheiner1148 หลายเดือนก่อน

      It's not a mistake. He keeps those questions in there because you can see them either way, which results in people discussing this in the comments, which results in more engagement for the YouTube metrics, which results in better ranking of the video on YouTube...

  • @NickFallon88
    @NickFallon88 หลายเดือนก่อน

    It responds really fast

  • @Action2me
    @Action2me หลายเดือนก่อน

    What does quantized mean?

  • @attilakovacs6496
    @attilakovacs6496 หลายเดือนก่อน +12

    Seeing a pass on the shirts problem was SHOCKING.

    • @CM-zl2jw
      @CM-zl2jw หลายเดือนก่อน

      😂 can we make that a meme here too?

  • @jcorpac
    @jcorpac หลายเดือนก่อน

    Nice overview. I feel a little disappointed that it didn't respond to the word count problem with "One".

  • @frankjohannessen6383
    @frankjohannessen6383 หลายเดือนก่อน +1

    The digging question is a bit silly. What are we digging in? Sand: A wider hole can utilize more people better. Clay: you might dig a very narrow hole, but then you won't be able to throw the clay out easily, which means more people might be extra valuable since you can hand them the clay, instead of having to crawl in and out of the hole. Are we digging as fast as possible? Then more people can rotate which can easier utilize more people. What tools do we use? Shovels? excavators? just our bare hands? With excavators then just one is probably as fast as five. Depending on all the potential factors here one person can work more, less or about the same level of efficiency as five people.

  • @tony8k
    @tony8k หลายเดือนก่อน

    I enjoyed this video. Grok is really fast

  • @brainstormsurge154
    @brainstormsurge154 หลายเดือนก่อน

    The prompt, "How many words are in your response to this prompt," seems so simple but it's just how the LLM works that makes it difficult. A regular person could say, "One," or "There are three," so this question just shows how much differently we can think.

  • @ericchastain1863
    @ericchastain1863 หลายเดือนก่อน

    How to build sets of linear base structure for secluded area of projections for quantization methodologies

    • @ericchastain1863
      @ericchastain1863 หลายเดือนก่อน

      From checklist of the tensors

    • @ericchastain1863
      @ericchastain1863 หลายเดือนก่อน

      As the searching should be in self to find uses of tensor data from checklist or the check tensors

    • @ericchastain1863
      @ericchastain1863 หลายเดือนก่อน

      Had problems getting cuda on Kali in win 11 so idk if my 8gb GPU which I do have a Tesla 32gb gpu not in PC tho

    • @Sven_Dongle
      @Sven_Dongle หลายเดือนก่อน

      @@ericchastain1863 The voices in your head are leading you astray.

  • @olafsigursons
    @olafsigursons หลายเดือนก่อน

    Aren't the requirements for Grok-1 very light, like an i5-8250U CPU @ 1.60 GHz with 8 GB?

    • @apache937
      @apache937 หลายเดือนก่อน

      fun joke

  • @danielkorosec9944
    @danielkorosec9944 หลายเดือนก่อน

    Would be interested to know what grok would reply hitting the apple in 0.

  • @laser31415
    @laser31415 หลายเดือนก่อน

    Pi really struggled with the Apple test. Only when I asked "what is the last word in sentence 2" (of its own example) would it realize the word was "away", not "apple", and update its answer. Pi admitted it got hung up on "apple" being in the sentence but not at the end of it. This is a GREAT test.

  • @allecazzam8224
    @allecazzam8224 หลายเดือนก่อน

    Can you explain certain terminology when making your videos for those new to AI or coding. Thanks

  • @DM-dy6vn
    @DM-dy6vn หลายเดือนก่อน

    6:27 "It's upside down position" You might ask Grok to fix the orthography of your prompts too.

  • @miriamkapeller6754
    @miriamkapeller6754 หลายเดือนก่อน

    Keep in mind that you were testing the live version of Grok-1, which has been instruction-tuned and chat-tuned and also seems to use RAG, none of which apply to the open source model.
    Based on the lackluster answers given, I'm pretty sure Mixtral and Miqu-1 are better, despite being smaller in size.

  • @ingenierofelipeurreg
    @ingenierofelipeurreg หลายเดือนก่อน

    Please share a link to test it

  • @mrtim6479
    @mrtim6479 หลายเดือนก่อน

    Curious: if you correct the AI when it gets the logic wrong, and then later ask the same question, does the answer quality improve?

  • @english2success
    @english2success หลายเดือนก่อน

    I think the ball was not 'in' the cup. It was 'under' the cup, right?

  • @thakurlokesh
    @thakurlokesh หลายเดือนก่อน

    How do I run it on Google Colab?

  • @oscarstenberg2449
    @oscarstenberg2449 หลายเดือนก่อน

    Simple test that GPT-4 has some trouble with:
    Ask it to play tic tac toe with you - e.g. using ASCII art.
    It knows how to play but makes bad moves and notices it right after.
    May be interesting as the models get better.

  • @realityvanguard2052
    @realityvanguard2052 หลายเดือนก่อน

    So how much GPU power does one need to run Grok?

    • @Sven_Dongle
      @Sven_Dongle หลายเดือนก่อน

      Minimum 8 H100s.

  • @bingbong7316
    @bingbong7316 หลายเดือนก่อน

    It might get stuff wrong, _but it's fast_ !!

  • @theoriginalcyrex
    @theoriginalcyrex หลายเดือนก่อน +2

    You are incorrect about the shirts. If it takes 5 shirts 4 hours to dry, given ample room or dehumidification of some sort (let's say this is outside), it would take 4 hours for 20 shirts, and 4 hours even for 100 shirts, as long as they are not too close together and there is enough airflow.

  • @darkphase7799
    @darkphase7799 หลายเดือนก่อน

    Do you keep spreadsheet showing what models passed what tests?

  • @Eagleizer
    @Eagleizer หลายเดือนก่อน +4

    Ball in cup: you did not specify that the cup had no lid.

    • @kliersheed
      @kliersheed หลายเดือนก่อน

      Yup. It also considered it a "container", which is most often something closed that contains whatever is inside. I don't think its reasoning was wrong; it just didn't know what type of cup we're talking about, or that it had an open top. He would have had to specify that to make sure it's the reasoning that's the problem and not the knowledge/prompt. The same goes for many of his other questions: if he simply asks without specifying that this is about how a human would do it in the real world with human limitations, and the AI treats the question as a purely logical problem and gives statements true to that context, you can't really blame it on bad "reasoning".

    • @timeless3d858
      @timeless3d858 หลายเดือนก่อน

      that's like saying you didn't specify the man wasn't wearing a tophat.

    • @bigglyguy8429
      @bigglyguy8429 หลายเดือนก่อน

      I find a more useful metric is to tell it that it is wrong, and ask it to go back and figure out why? That kind of conversation gives you more of an idea if the thing is actually thinking about the real world.

    • @mk1st
      @mk1st หลายเดือนก่อน

      In future versions the AI will probably see the potential conflict in meaning and be able to ask back for clarification. I mean, that's what we would do if we didn't understand the question. I think it would be much more accurate if it did this.

    • @OurSpaceshipEarth
      @OurSpaceshipEarth หลายเดือนก่อน

      No cup ever has a lid; if it has a lid, it's not a cup. He should be saying "teacup"; he's not using the correct wording, like "glass" or "teacup" or such.

  • @DailyTuna
    @DailyTuna หลายเดือนก่อน +1

    You have the best AI channel on the Internet; I learn so much!

  • @keithnance4209
    @keithnance4209 22 วันที่ผ่านมา

    The cup logic is actually correct. The flaw is that the person putting the cup into the microwave can slide the cup to the edge of the table and use both hands to force the ball to remain inside the cup. The test should explicitly state that the person turns the cup right side up.

  • @polaris1985
    @polaris1985 29 วันที่ผ่านมา

    I couldn't understand the clothes-drying problem: if it takes 4 hours to dry 4 shirts then 10 shirts should take the same 4 hours, because they are separate shirts.

  • @costa2150
    @costa2150 หลายเดือนก่อน +1

    What does "quantized version" mean or do?

    • @Sven_Dongle
      @Sven_Dongle หลายเดือนก่อน +1

      The weights that make up the model are composed of floating point values; these values can be compressed via an algorithm (quantized) that reduces them to a set number of bits (typically 8 or even 4), thus reducing the overall size of the model and allowing it to run on smaller computing platforms (a toy sketch of the idea follows below). The raging dispute is the degree to which this reduces the overall accuracy of responses.
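
      As a toy illustration of that idea (not how production quantizers actually work, just the basic round-to-fewer-bits step and the precision it gives up):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights onto 256 integer levels, keeping a scale factor to undo it."""
    scale = float(np.abs(weights).max()) / 127.0
    return np.round(weights / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(5).astype(np.float32)
q, scale = quantize_int8(w)
print(w)                     # original 32-bit weights
print(dequantize(q, scale))  # reconstructed from 8-bit values: close, but not identical
```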

    • @costa2150
      @costa2150 หลายเดือนก่อน

      @@Sven_Dongle ah ok. So it's a form of compression. Kind of like when an image is compressed and the resolution is reduced at different scales. Thank you Sven.

    • @Sven_Dongle
      @Sven_Dongle หลายเดือนก่อน +1

      @@costa2150 Sort of, but image compression can actually lose data by dropping out chunks that it determines won't be "noticed", while quantization just reduces the size of each piece of data, though it can be argued information is also "lost" in this manner.

    • @costa2150
      @costa2150 หลายเดือนก่อน

      @@Sven_Dongle thank you for your explanation.