What the heck? Wasn't the point of the shirt-drying example that the reasoning should figure out that the 20 shirts could dry in parallel and that it would still take 5 hours?
I'm gonna fail the host on that one. 😂
I'm thinking about this a bit and I think the prompt is unfair... Imagine it simply answered "It will take the same time!" when what was actually needed was how long they would take sequentially, because there is only one place to dry them. Just saying, it wasn't stated in the prompt that you had the ability to lay them all out at once, that this was even possible.
It would take 4 hours, but who's counting.
@@mickelodiansurname9578 You're correct. LLMs, in my experience, are severely constrained in their logic. You can't depend on them to infer consistently. If you make it clear in the prompt that the 5 shirts are all laid outside simultaneously, it will likely answer correctly.
I don't use X and can't stand Elon, so I'll have to wait for a quantized version to see what's up with this thing. Still, I have yet to use an LLM that codes as consistently well as ChatGPT. Keeping my fingers crossed for a comparable local model soon, though I haven't tried Devin or AutoDev yet.
He always says it's a pass whether the LLM solves it sequentially or in parallel, as long as it explains its reasoning, which it did.
I'm just a truck driver but I find this stuff amazing. I love everything open source, even if it's not working perfectly. Yet!
"Just" a truck driver. Ridiculous.
You provide an essential service to make the lives of a lot of people better and more comfortable.
I drove for UPS for 7 years. I have more respect for tractor trailer drivers than anyone else, and I didn’t even do it. I drove the box trucks (they call them package cars).
Also, your job or career has nothing to do with how your brain works. Smart people work regular jobs
"Just a truck driver?" LMAO, man. I have an undergrad and a postgrad degree, both framed and hanging in my bathroom. I hold a good position in a great company, yet I couldn't, in a million years, do the job you're doing, mate. Believe me, your contribution to society is greater than mine.
you’re not *just* a truck driver 🙄… we’re curious people, man
Man!! "Just a truck driver" literally without your service nothing runs in this world man, we all citizens are really thankful for your wonderful service and we all are getting things at time just because of your service.
I don’t think this is equivalent to testing the open source Grok. The open version is apparently not fine tuned at all, but the version available through Twitter is clearly fine-tuned for chat.
Not to mention the tools set up allowing Grok to search Twitter for answers to these questions.
The problem I see is that as you keep repeating the same test the likelihood is that the model will have been trained on the test as it will have been seen somewhere on-line. You need to change the tests and run them against all the LLMs available at the time so as to compare them, otherwise later models have a distinct advantage.
That's true, but this is more of an entertainment channel. Don't take it too seriously. To really test an LLM, those tests would be wildly insufficient anyway.
Grok doesn't even need to be trained on it, Grok has internet access, rendering this test meaningless.
soy testing, waste of time
@@reinerheiner1148 yes, tip me in the direction of real tests pls? Also, all LLMs, regardless of company, all have the same "favorite words" - I can't find any literature on why. Any ideas where I can look? I hope this has already been covered somewhere.
There's not a lot we can do about that. If the tests are changed now then the previous ones are not comparable, and if we do not change the questions then yes, they will be in someone's training data... What Matt needs, but has not built, is an automated and private set of questions (maybe 1000 questions in a CSV file) that he releases to nobody, that gives a score.
Other than that, the results here will be highly subjective and not worth considering as a reliable measure.
It would really be cool to have a running score count on the screen as you test the models; that would help with tracking the scores.
Great job, really enjoyed this fast Grok overview!
Same
Isn't the answer for the killers question 4? The one that got killed is still in the room, so there are 4
Correct logically. Just because they died doesn't mean they left the room. They just changed state from "live" to "dead".
@@jackflash6377 Maybe a killer is someone who can kill. A dead man can't kill, so formally he is not a killer anymore.
@@YbisZX logically, he's a dead killer, so a killer nonetheless, there's a logical mistake in the original answer
@@YbisZX Good point, but he WAS a killer and still IS a killer, just a dead killer. "Killer" can be past, present and future tense. He was, is and could be a killer. Nowhere in the prompt did it say they have to be alive.
Yeah, another thing (like the shirt problem) that this guy thinks is a pass but is actually a fail. Would help if he understood what the correct answers are.
Matt, I consume a lot, and I mean a lot of "AI" content, podcast, etc.... You are by far one of my favorites. What you do here isn't easy, but you do it very well. Thank you~!
Regarding the Snake game, I asked Google's Gemini a similarly simple task and it couldn't even build it.
At least the snake moved
Gemini Advanced or the free one?
All of Google's AI's are extremely limited.
How is the question about the shirts a pass?
If you wear a SpaceX t-shirt to your NeuraLink interview you will move up for implant installation.
It was not. He was blatantly wrong on that one.
@@calysagora3615 you're blatantly wrong. A serial drying time is accepted as long as it states that's what it's doing.
That means one at a time. Try to keep up.
Because Matt is a clear Elon fanboy.
Based on the information provided, it's the most correct answer, and it explained how it got to that answer. Parallel drying takes on many assumptions; if it chose that answer it would need to provide the caveat that it would only be correct if you had more information.
Do you believe that Grok included the "Grok" and "@grok" above the response making 12 words at 4:44? It's possible you're overlooking its response🤷♂
I was literally writing this same thing 🎉
@6:25 change it to "lifts the upside down cup" maybe it needs it more descriptive.
4:39 There actually are 12 tokens in that response, not 12 words. 😅😅
how do you know?
@@matthew_berman I counted: all the words, the 2 digits, and 1 period. Total: 12.
Pure luck... and by the way, the answer that wins is the model answering "One" as a single word every time.
Download the encoder and do it again to be sure 😁 @@PseudoProphet
A word doesn't map one-to-one to a token; many words consist of 2 tokens. I don't know what the formula for determining tokens is, but OpenAI has a website that shows you the number of tokens. @@PseudoProphet
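For anyone who wants to check this themselves, here's a minimal Python sketch using OpenAI's tiktoken library (the library behind the website mentioned above; cl100k_base is the GPT-4 encoding, and Grok's own tokenizer will count differently):

```python
# Requires: pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4 / GPT-3.5-turbo; Grok's tokenizer differs.
enc = tiktoken.get_encoding("cl100k_base")

sentence = "There are 12 words in my response to this prompt."
tokens = enc.encode(sentence)

print("words: ", len(sentence.split()))   # 10 whitespace-separated words
print("tokens:", len(tokens))             # token count, which is what the model actually "sees"
```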
This was very cool. I subscribed. A couple of the Grok fails I could argue that wording played in a role in the failure. But, thank you!
*In case anyone was wondering how Grok compares to Claude 3 Sonnet in that spatial/physical reasoning test, Claude nails it, although my prompt was somewhat more precisely worded and less ambiguous. Even the mid-tier Claude 3 is pretty freaking great:*
To reason through this step-by-step:
1) A person places a coffee cup upright on a table.
2) They drop a marble into the cup, so the marble is now inside the cup.
3) They quickly turn the cup upside down on the table. This means the open end of the cup is now facing down towards the table surface.
4) When the cup was turned over, the marble would have fallen out of the cup and onto the table surface, since there is nothing to contain it inside the upside-down cup.
5) The person then picks up the upside-down cup, without changing its orientation. So the cup remains upside-down.
6) They place the upside-down cup into the microwave.
Therefore, based on the sequence of steps, the marble would have fallen out of the cup when it was turned over, and would be left behind on the table surface. The marble is not in the microwave with the upside-down cup.
Nice!
Maybe at the end you can show the "leaderboard" so we can see how it stacks up against the others on perhaps a spreadsheet?
as for the "there are 12 words in my response to this prompt"... is it possible the AI is also counting "Grok" and "@grok" since both are technically part of the response? That would definitely make it 12 words if so.
no, that's not really how the LLM sees it internally
@@apache937 How would you know? Maybe it learned that from some nitpicking smart-ass on Twitter? Could be an emergent ability. OP would have had to ask for its reasoning; sadly he didn't.
@@kliersheed Google "chat format oobabooga", click the first link, then read all of it.
With the way these models work, they cannot plan out a sentence beforehand like a human can because they lack any form of internal dialogue, they'd need to plan it in text, which would count as their response. The only correct answer they can give is "one", otherwise the model just cannot physically do it.
I'm surprised Grok didn't find the answer online considering it has internet access unlike other models tested with these questions.
The issue with your logic puzzle is that these LLMs have already seen every puzzle you give them. Why give them credit for solving a puzzle when they already know the solution? You didn't come up with the puzzle yourself; you just found it online, where they've already seen the answer.
Yep. Come up with new questions
Thanks Matt, we were all wanting to see the results, cheers!
Regarding the "10 sentences that end in apple" question: maybe add "each". It gave 10 sentences and the last one ended in apple(s), so it could be seen as pretty close.
I just tried GPT4 and it passed this test with either version of the wording.
I don't like this suggestion; the models should be able to do what you want regardless of errors in the prompt.
They should do what you tell them or ask for clarification.
Matthew, no one on this planet dries t-shirts serially, over 16 hours, one after the other.
I don't think it is a pass if it divides the drying time by the number of t-shirts.
Because the main reason for this test is to check if the LLM has an understanding of the real world.
BTW, a third answer (somewhat more realistic, but still uncommon) would be to dry in batches. If your drying space can only handle 5 shirts at once, it would be four batches.
Great test though! Thank you.
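To make the three readings concrete, here's a toy Python calculation using the 5-shirts-in-4-hours numbers from the prompt (the batch capacity of 5 is an assumption):

```python
import math

# The numbers from the prompt: 5 shirts take 4 hours to dry.
shirts_known, hours_known = 5, 4
shirts_asked = 20

# Parallel reading: drying time does not depend on the number of shirts.
parallel_hours = hours_known                                           # 4 hours

# Serial reading: one shirt at a time, scaling linearly.
serial_hours = hours_known / shirts_known * shirts_asked               # 16 hours

# Batch reading: limited drying space, e.g. 5 shirts at once (assumed capacity).
batch_capacity = 5
batch_hours = math.ceil(shirts_asked / batch_capacity) * hours_known   # 16 hours

print(parallel_hours, serial_hours, batch_hours)
```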
This is GPT-4's answer:
The time it takes for shirts to dry outside in the sun does not increase with the number of shirts, assuming they all receive adequate sunlight and air flow simultaneously. If 4 shirts take 5 hours to dry, then 40 shirts would also take 5 hours to dry, provided that they are spread out in a way that doesn't hinder their exposure to the sun and air.
We know that they don't understand anything; they just guess the next token. So far.
People who work in a laundry do this.... Ten or 20 dry today then tomorrow another 10 or 20...and so on.... Cos you cannot dry shirts that have not yet been washed. Just sayin...
@@OurSpaceshipEarth Well, if that's true, tens of thousands of data scientists... some of the brightest folks on earth... are wrong, and Devin and AutoGen and CrewAI don't work. However, first, these systems do actually work, so that would need to be explained without invoking reasoning. And secondly, I find it relatively common for people to point at scientists saying 'what would they know?'...
Can I point out how unlikely that is?
Appreciate this comparison. Thank you.
- Write a sentence that ends with apple.
- The court of wizards sentences wizard X to death by turning him into an apple and leaving him in this state for the rest of eternity.
The digging question is a bit silly. What are we digging in? Sand: a wider hole can make better use of more people. Clay: you might dig a very narrow hole, but then you won't be able to throw the clay out easily, which means more people might be extra valuable since you can hand them the clay instead of having to crawl in and out of the hole. Are we digging as fast as possible? Then more people can rotate, which makes it easier to utilize them. What tools do we use? Shovels? Excavators? Just our bare hands? With excavators, just one person is probably as fast as five. Depending on all the potential factors here, one person can work with more, less, or about the same efficiency as five people.
The prompt, "How many words are in your response to this prompt," seems so simple but it's just how the LLM works that makes it difficult. A regular person could say, "One," or "There are three," so this question just shows how much differently we can think.
so i think my money didn't go to waste
It was technically right in the 12 sentences that end with apple response. It gave you twelve sentences and the final sentence ended with apple.
Nice catch! That's totally valid! Matthew, You're gonna have to give it a "pass" on that one!
How do you know that whatever is deployed on twitter is not a newer or better model than grok-1?
How do you know there's not a purple unicorn in your room that you can't see or feel?
or quantized itself
@@SlyNine X uses 1.5; 1.0 was the version that was released.
A bit of a worry: if it's searching Twitter for every answer, all your benchmarks are up there and regularly discussed, aren't they?
That's the same for every other model Matt uses. They all got his info and questions.
Twitter is full of different types of people, and people also do wrong things and post wrong code, so I wouldn't be surprised if that's why it's not good at coding.
@@Hunter_Bidens_Crackpipe_ The other models weren't pulling live data from the internet; this one was... Yes, some have these tests in their training data or fine-tuning by now though. In fact, that's one of the big problems in measuring their effect; the team over at Hugging Face are basically throwing their hands up at the moment because their leaderboard is fast becoming pointless.
You should really consider Matt's "tests" as entertainment only. There is nothing scientific about them, and he's been doing the same tests for so long that of course the solutions must have leaked into training data sets.
@@Hunter_Bidens_Crackpipe_ That's blatantly false. Matt almost exclusively tests offline models without search capabilities.
Cool, thanks for testing, I waited for it
Maybe soon we will see an amazing collaboration of two technologies: fast Groq and huge Grok, and it will work really cool and fast!🎉
In the place where you ask about word count, the 2 invisible ones are and tags, I think so
That's amazingly interesting; he really needs to know that, eh! I think the only correct answer would ever be simply "one."
The model on X is RLHF'd whereas the open sourced model is the base model
I've noticed I'm not subscribed, I really thought I was. So I'm subscribing. Have a good day.
There are still 4 killers in the room, just one of them is dead. It doesn't ask how many ALIVE killers are in the room.
Are you using Grok 1.0? I thought the one on X was now 1.5?
Groq needs to provide Grok-1 on their API now that it's open source!!
The digging rate could very well remain constant for a fairly deep hole, provided some sort of conveyor system.
Yeah, they could develop a system to not get in each other's way, or, as usually happens when working collectively, they could be efficient and delegate tasks so that they finish at an even faster rate.
It could just presume a trench. Expecting the AI to answer "The same, you idiot, cos there's not enough room for the others to dig" is asking a bit much, really?
Thank you for today's presentation, Mr Berman.
There would be *FOUR* killers in the room. Three of them live, one of them dead.
For day to day I use gpt to balance my budget by uploading an Excel sheet. I send it photos of my fridge and pantry and ask for a shopping list based on missing items from previous photos. And get it to summarise and find discrepancies in pdf documents. This is what I want to see for testing.
Nice. Would have been useful to see a leader chart of which models pass which test.
The Messenger AI got the apple question correct.
Here are 10 sentences that end with the word "apple":
1. The juicy red fruit was a delicious apple.
2. The teacher gave the student a shiny red apple.
3. The farmer carefully picked the ripe apple.
4. The crunchy snack was a fresh green apple.
5. The child eagerly bit into the sweet apple.
6. The chef added a slice of Granny Smith apple.
7. The healthy lunch included a crisp green apple.
8. The orchard grew a variety of tasty apple.
9. The scientist studied the genetics of the Honeycrisp apple.
10. The happy kid ran home with a big red apple.
I found out quantizing is like compression, where certain datapoints are lumped together as one undifferentiated blob. It runs quicker, but some nuance is lost.
Yes, that's accurate. You can do some quantization without any real loss in accuracy; then it gets bad fast.
6:27 "It's upside down position" You might ask Grok to fix the orthography of your prompts too.
Keep in mind that you were testing the live version of Grok-1, which has been instruction-tuned and chat-tuned and also seems to use RAG, none of which apply to the open source model.
Based on the lackluster answers given, I'm pretty sure Mixtral and Miqu-1 are better, despite being smaller in size.
The issue I have with the killer response is that it stated a new killer entered, but in fact the new person may not have been a killer when they entered; they may have become a killer after they entered, so that bit of the answer could be wrong. The total number of killers in the room should be 4, not 3.
Pi really struggled with the apple test. Only when I asked 'what is the last word in sentence 2' (of its own example) would it realize it was "away", not 'apple', and update its answer. Pi admitted it got hung up on apple being in the sentence but not at the end of it. This is a GREAT test.
Great video. 👍 I was thinking of a test you could add which I didn't see you touch on: recall of text previously discussed in the same session. It seems horrible at that. Also, I tried to get it to answer in lists of 10 and it couldn't do it repeatedly... 🤔🤓
4:43 What if... what if the AI counted the "Grok" and "@grok", as well as the number 12, as words? That would be 12 words then O.o
Coincidence? It would have been cool if you had made it list all 12 of the words in that response, to see what it does / how it reasoned / whether it self-corrects on that.
I think if it uses assumptions to give a response like it did in the digging the hole problem then you should follow up with a question getting it to explain why its assumption may fail.
I noticed it gave the answer before the reasoning a few times; maybe asking it to start with step-by-step reasoning would improve the results, because it would have more time/tokens to "think". As it is, I feel like it is just trying to come up with some reasoning to cover up its mistake.
yes this is better
Always the best content!
Guess this was done a bit fast ☺️
Shirts in parallel take the same time; sequentially, well... 5:05 Four killers; one dead, three alive.
It's not a mistake. He keeps those questions in there because you can see them either way. Which results in people discussing this in the comments, which results in more engagement for the YouTube metrics, which results in better ranking of the video on YouTube...
You have the best AI channel on the Internet; I learn so much!
Glad you think so!
In your cup/marble challenge, could you be confusing the LLM by asking it where the BALL is, rather than the MARBLE? Or is that your intention?
Oh wow... I don't know why I never caught this! Although I don't think it should matter, I changed it in my tests for the future. Thanks for pointing it out!
@@matthew_berman You're one cool guy, just utilizing the feedback without ego, my guy.
Praise be unto you, Matthew.
Thanks for clarifying! This one is stumping all my local models in interesting ways. None of them seem to be able to reason that the marble doesn't stay in the cup as it moves, at least without giving them additional input. Great work! Love the channel!
The question "How many words are in your response to this prompt?" is a tricky one, depending on how the "12" is interpreted. It has single digits "1" "2", and a two digit number of "12". If each of theses is defined as a "word," then the answer of "There are 12 words in my response to this prompt." is correct, as 9 words + 3 number/digit words = 12 words. It will be interesting to see when/how it defines digits/numbers as words. In a way, "12" is a compound word, not unlike the word "Database," which can be seen as 3 words/meanings: Data + base + Database = 3.
Enjoying your Channel. Thanks, subscribed.
The test is exactly to see if the AI thinks about it in the same way we do, where numbers are not words, prompt names are not words, punctuation are not words, etc. Here Grok failed.
@@calysagora3615 Interesting. I didn't know how the AI thinks about numbers. Seems like an easy thing to fix.
Hi, in response to the answer to "How many words are in your response to this prompt?" I believe 12 is technically right because Grok and @grok are words present in the totality of the response. Maybe the question to ask is "How many words are in your response sentence to this prompt?" Could the response then say 10?
No, it is not correct. These models don't think the way humans do, they cannot internally plan out a response and thus cannot plan out a response of a given length that conveys a given idea without doing so in their own response. The only correct answer a GPT model can give, aside from getting lucky, is the answer "one".
Love these tests!
2:49 Man, out of everything, I did not expect to find out about the 3-week One Piece hiatus through this video.
Honestly, I see the AI's point about "put it in the microwave", because as far as it knows you would just slide it into the microwave. It's only after you understand that microwaves are not on the same plane, and that you can't just slide things into them, that the question becomes about the ball losing its position within the cup.
If a human being in the clothes-drying industry is paid by the number of shirts, they would parallelise the task; if paid by the hour, they would serialise it. If the prompt included a reward metric, then anyone (human or LLM) would give either one of those answers depending on how much they get paid.
There would be *FOUR* in the room. No one left or was removed. (Every time I posted this comment specifying WHO was left in the room, YouTube deleted it.)
A dead man can't kill, so formally he (it/the body) isn't a killer anymore.
Ideally, the answer would include a mention of this ambiguity.
Yes, that's another answer; however, it's also correct to say 3, because you usually mean there are 3 living people, not 4 total people. Also, it's questionable whether a dead person is actually a person, because over time they rot. At what point do they go from being a dead person to just a pile of organic material? The only logical decision is for them to stop being a person at death and become a pile of organic material, although this is delving into philosophy.
This is how we end up with too many paperclips....
@@fz1576 Agreed! 👍
You should change the questions but keep them similar. If I were working at an AI company, I would be using your videos and others' to fix the results.
I don't think they care about me ;)
@@matthew_berman We do. I work at one of them, and most of your videos trigger tens of messages in our Slack, because nobody outside of the research dept has the time to test all these models. So we totally rely on channels like yours.
Please do IBM :-)
@@MrVnelis You don't care very much if you are giving it the answers. That's like studying for a specific IQ test. The results are invalid and useless.
@@SlyNine ranking in benchmarks -> exposure -> VC funding
You are incorrect about the shirts. If it takes 5 shirts 4 hours to dry, given ample room or dehumidification of some sort (let's say this is outside), it would take 4 hours for 20 shirts. 4 hours even for 100 shirts, as long as they are not too close together and there is enough airflow.
Seeing a pass on the shirts problem was SHOCKING.
😂 can we make that a meme here too?
After 1 day of waiting, it's not listed in Chatbot Arena yet. They did it in minutes with Gemma. I guess it's because Grok is a huge 300GB beast and it's not so easy to find GPUs to run it.
Well, from what I've just seen and heard, I'm pretty confused about all the hype that AI is gonna take over the world. It only confirms my own experience with ChatGPT-3.5, where it could not devise a simple JS program or give me 20 Finnish words ending in E.
What exactly are the system requirements for running this beast?
Listed on Hugging Face now... downloads last month: zero! lol (not a surprise really!)
@@serhiyranush4420 320GB of VRAM
@@serhiyranush4420 I don't get all the fuss about PCs. It only confirms my own experience with my 20-year-old PC, which can't even run Crysis.
The cup logic is actually correct. The faulty logic is that the person putting the cup into the microwave could slide the cup to the edge of the table and use both hands to force the ball to remain inside the cup. The test should explicitly state that the person turns the cup right side up.
Notice that the answer it gives at 6:10 about the ball in the cup _begins_ by stating the final answer _before_ explaining the reasoning, defeating the whole purpose of chain of thought prompting! Interesting.
What a great example of "jumping to conclusions" or confirmation bias, no? It begins with the conclusion and then explains some reasoning that will lead to that result.
I think CoT prompting is so interesting because it demonstrates how "thinking through" something first will lead to more accurate results. But yeah, I guess it doesn't surprise me that Elon's LLM so quickly jumps to conclusions.
IDK if it was mentioned, but @5:00 you were remiss to say, "Yes! Grok is great at logic and reasoning!"
Grok said a killer entered the room, making 4. Logically however, we cannot make that assumption.
All we know for SURE is that AFTER the new person killed someone in the room, they BECAME a killer.
It's also really short-sighted to grade LLMs on exact replicas of problems that are already "everyone's favorite".
Moreover, I'd take issue with you referring @5:48 to a "really hard logic and reasoning problem".
Our standards for quality should not be based on how well other LLMs are functioning to-date.
Rather, you should grade them by their ability to reason as well as (at least) a fifth grader.
What you've deemed as "really hard" is such basic logic that most kindergarteners would pass.
There is another interesting question I can think of:
"if its 7:00 PM, in Karachi, Pakistan. Then what time will be in Newyork USA." its kind logic/knowledge maybe?
The error with delay was that it didn't make it global where it was declared, from what I could see (you scrolled pretty fast). It looks like there were other issues as well.
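I couldn't catch the exact code either, but for anyone wondering what that class of bug looks like, here's a minimal hypothetical Python sketch; the variable name delay and the speed_up function are assumptions for illustration, not the actual generated code:

```python
delay = 0.1  # hypothetical module-level speed setting for the snake loop

def speed_up():
    # Without this declaration, the assignment below would create a new local
    # variable named delay, and reading it on the right-hand side first would
    # raise UnboundLocalError at runtime.
    global delay
    delay = max(0.02, delay - 0.01)

speed_up()
print(round(delay, 2))  # 0.09
```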
For the cup problem, I wonder if the models would get it right if it was specified the cup does not have a lid
the point is to not make it too easy
@@apache937 I understand. But ambiguity in a question relies on the model anticipating the potential meanings and either picking one or trying to spell out all the possibilities. The model would be correct if the cup has a lid.
Is it possible that Grok counted the period and 'twelve' (not 12 since it was already accounted for) as words, equalling 12?
Hi Matthew!
You're not testing the same thing, the base model X Ai released is not gonna answer at all the same way as the live version that's live on Twitter.
And it's not a matter of quantization.
What you're testing is trained to follow instructions and reply as a chat bot. The base model released only continues the text submitted.
Base models are not usable for chatting at all; they typically drift and often go wild without warning (without expert prompting).
Hey Matt! I noticed you were about half way through Gadaffi maxing and I was wondering if you were going to give us a video detailing your journey. Kind regards, your biggest fan.
I've got grok all year and I really like it
I look forward to the next few months; I am really curious to see what people can achieve now that it's open source.
Ball in cup: you did not specify that the cup had no lid.
Yup. It also considered it a "container", which is most often something that's closed so it can contain whatever is inside. I don't think its reasoning was wrong; it just didn't know what type of cup we were talking about / that it had an open top side. He would have had to specify that to be sure it's the reasoning that's the problem and not the knowledge/prompt. The same goes for many of his other questions: if he simply asks without specifying that this is about how a human would do it in the real world with human limitations, and the AI treats the question as a purely logical problem and gives statements that are true in that context, you can't really blame it on bad "reasoning".
That's like saying you didn't specify that the man wasn't wearing a top hat.
I find a more useful metric is to tell it that it is wrong, and ask it to go back and figure out why? That kind of conversation gives you more of an idea if the thing is actually thinking about the real world.
In future versions the AI will probably see the potential conflict in meaning and ask back for clarity. I mean, it's what we would do if we didn't understand the question. I think it would be much more accurate if it did this.
No cup has ever had a lid; that wouldn't be a cup. He should be saying "teacup"? He's not using the correct wording, like "glass" or "teacup" or such.
Marble in cup, then calling it a ball, adds an extra layer of reasoning: equating ball and marble. Very good tests and great info, thanks. Would you say the Grok access that comes with your paid X account is worth the price on its own?
@7:30 It may have passed your reasoning test: the list did end in apples, even if each entry didn't. You didn't specify that the entries should; that would need to be assumed. You assumed yes, it assumed no.
And your last question didn't specify 10 ft deep; it may have reasoned 10 ft across.
Every LLM has a different/specific input "language", so it's naive to think the same command/prompt works perfectly for every LLM.
You must tune/change/modify the prompt; then the LLM will understand and give a good result.
Dude that hoodie is dope what brand is it?
Is there a smaller open source model I can get to read French books and summarise chapters? Or fix errors I make when writing in French?
Mistral 7B probably given that Mistral is a French company.
Grok FULLY TESTED the whole industry!
Simple test that GPT-4 has some trouble with:
Ask it to play tic tac toe with you - e.g. using ASCII art.
It knows how to play but makes bad moves and notices it right after.
May be interesting as the models get better.
You can't test the model with old questions if it has live data, as those have been circulating ;)
True, but it should still show whether it's searching. Regardless, so many fails.
6:27 I wish you would tell it that a cup is a container with its top side open. I don't think it knows that, considering how it reasons. If it still fails after that, it's the reasoning; but I think it just made lists of attributes for objects, and "cup" falls under "container", and most containers (as it reasoned, and it did name it a container) are closed, with the purpose of containing what's inside no matter how you flip them.
I'm pretty sure it simply doesn't know what the cup exactly is, rather than reasoning wrongly.
When running one of these locally, can you give it instructions to create an app where you specify what you want it to do?
Or can you have it search through thousands of personal files, learn, index the data, and be used as a search engine across your own files instead of using the local search tools (i.e. File Explorer)?
7:56 IMO definitely a pass. You did not specify how they dig; if you don't give context, it can only assume, and with the assumptions made, the statement is true. The AI doesn't know if the hole is wide enough to fit all the humans, it doesn't know if tools are shared, it doesn't know what tools are used, it doesn't know if the one person took so long because his stamina ran out and he had to take breaks, etc.
You didn't specify any of that, and you didn't ask it to consider human attributes or anything in its answer, so it gave a purely logical answer, which is a true statement, so: pass.
What if you say that you are not looking for the exact amount or the calculated amount as the answer so that it might have a chance to answer you with what you are looking for?
What does "quantized version" mean or do?
The weights which make up the model are composed of floating point values; these values can be compressed via an algorithm (quantized) which reduces them to a set number of bits (typically 8 or even 4), thus reducing the overall size of the model and allowing it to be run on smaller computing platforms. The raging dispute is about the degree to which this reduces the overall accuracy of responses.
@@Sven_Dongle ah ok. So it's a form of compression. Kind of like when an image is compressed and the resolution is reduced at different scales. Thank you Sven.
@@costa2150 Sort of, but image compression can actually lose data by dropping out chunks that it determines won't be 'noticed', while quantization just reduces the size of each piece of data, though it can be argued information is also 'lost' in this manner.
@@Sven_Dongle thank you for your explanation.
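To make the thread above concrete, here's a minimal NumPy sketch of symmetric 8-bit quantization of a block of weights; this is just an illustration of the general idea, not xAI's or any particular library's implementation:

```python
import numpy as np

# A fake block of float32 "weights" standing in for part of a model.
weights = np.random.randn(8).astype(np.float32)

# Symmetric int8 quantization: map the largest magnitude to 127.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)   # 1 byte per value instead of 4

# Dequantize to use the weights again; the rounding error is the "lost nuance".
restored = q.astype(np.float32) * scale
print(np.max(np.abs(weights - restored)))       # worst-case rounding error
```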
12 words in the response was a PASS! There are 10 words in the sentence, and "Grok" appears twice above that. The AI is literal! You asked how many words are in the response, but then only counted the words in the sentence, not the names.
The digging rate would depend mostly on the shape of the hole. The more trench-like it is, the truer the machine's logic. If it were more like a small circular pit, then five diggers would be cumbersome.
What does quantized mean?
Thanks shared!
Thank you. I always thought that my clothes are shit and I have no taste in choosing them.
Now I know I was wrong.
Sorry, 20 shirts is a FAIL! They are in the sun, so saying one shirt takes 0.8 of an hour is BAD DEDUCTION, ergo BAD LOGIC, ergo FAIL.
For the word-count-in-response question: is it considering the '.' and '\0' characters as words? I mean, it considers '12' as a word.
Maybe. I think someone said the Grok tokenizer sees it as 13 tokens (the GPT-4 tokenizer had that sentence at 12 tokens). Regardless, it's a fail.
As much as I am in favor of open & uncensored alternatives -- I think a company hosting a zero-guardrails model for the masses (especially the Twitter userbase) is a pretty stupid and reckless idea. The advantage of open/uncensored models is that developers can implement more specific guardrails themselves, which are suitable for specific contexts.