What the heck? Wasn't the point of the shirt-drying example that the reasoning should figure out that the 20 shirts could dry in parallel and that it would still take 5 hours?
I'm gonna fail the host on that one. 😂
I'm thinking about this a bit and I think the prompt is unfair... Imagine it simply answered "It will take the same time!" when what was actually needed was how long they would take sequentially, because there is only one place to dry them. Just saying, it wasn't stated in the prompt that you had the ability to lay them all out at once, that this was even possible.
It would take 4 hours, but who's counting.
@@mickelodiansurname9578 You're correct. LLMs, in my experience, are severely constrained in their logic. You can't depend on them to infer consistently. If you make it clear in the prompt that the 5 shirts are all laid outside simultaneously, it will likely answer correctly.
I don't use X and can't stand Elon, so I'll have to wait for a quantized version to see what's up with this thing. Still, I have yet to use an LLM that codes as consistently well as ChatGPT. Keeping my fingers crossed for a comparable local model soon, though I haven't tried Devin or AutoDev yet.
He always says it's a pass whether the LLM solves it sequentially or in parallel, as long as it explains its reasoning, which it did.
I'm just a truck driver but I find this stuff amazing. I love everything open source, even if it's not working perfectly. Yet!
"Just" a truck driver. Ridiculous.
You provide an essential service to make the lives of a lot of people better and more comfortable.
I drove for UPS for 7 years. I have more respect for tractor trailer drivers than anyone else, and I didn’t even do it. I drove the box trucks (they call them package cars).
Also, your job or career has nothing to do with how your brain works. Smart people work regular jobs
"Just a truck driver?" LMAO, man. I have an undergrad and a postgrad degree, both framed and hanging in my bathroom. I hold a good position in a great company, yet I couldn't, in a million years, do the job you're doing, mate. Believe me, your contribution to society is greater than mine.
you’re not *just* a truck driver 🙄… we’re curious people, man
Man!! "Just a truck driver" literally without your service nothing runs in this world man, we all citizens are really thankful for your wonderful service and we all are getting things at time just because of your service.
I don’t think this is equivalent to testing the open source Grok. The open version is apparently not fine tuned at all, but the version available through Twitter is clearly fine-tuned for chat.
Not to mention the tools set up allowing Grok to search Twitter for answers to these questions.
The problem I see is that as you keep repeating the same test the likelihood is that the model will have been trained on the test as it will have been seen somewhere on-line. You need to change the tests and run them against all the LLMs available at the time so as to compare them, otherwise later models have a distinct advantage.
That's true, but this is more of an entertainment channel. Don't take it too seriously. To really test an LLM, those tests would be wildly insufficient anyway.
Grok doesn't even need to be trained on it, Grok has internet access, rendering this test meaningless.
soy testing, waste of time
@@reinerheiner1148 yes, tip me in the direction of real tests pls? Also, all LLMs, regardless of company, all have the same "favorite words" - I can't find any literature on why. Any ideas where I can look? I hope this has already been covered somewhere.
There's not a lot we can do about that. If the tests are changed now then the previous ones are not comparable, and if we do not change the questions then yes, they will be in someone's training data... What Matt needs, but has not built, is an automated and private set of questions (maybe 1000 questions in a CSV file) that he releases to nobody, that gives a score.
Other than that, the results here will be highly subjective and not worth considering as a reliable measure.
It would really be cool to have a running score count on the screen as you test the models; that would help with tracking the scores.
Great job, really enjoyed this fast Grok overview!
Same
Isn't the answer for the killers question 4? The one that got killed is still in the room, so there are 4
Correct logically. Just because they died doesn't mean they left the room. They just changed state from "live" to "dead".
@@jackflash6377 Maybe a killer is someone who can kill. A dead man can't kill, so formally he is not a killer anymore.
@@YbisZX logically, he's a dead killer, so a killer nonetheless, there's a logical mistake in the original answer
@@YbisZX Good point, but he WAS a killer and still IS a killer, just a dead killer. "Killer" can be past, present and future tense. He was, is and could be a killer. Nowhere in the prompt did it say they have to be alive.
Yeah, another thing (like the shirt problem) that this guy thinks is a pass but is actually a fail. Would help if he understood what the correct answers are.
Matt, I consume a lot, and I mean a lot of "AI" content, podcast, etc.... You are by far one of my favorites. What you do here isn't easy, but you do it very well. Thank you~!
Regarding the Snake game, I asked Google's Gemini a similarly simple task and it couldn't even build it.
At least the snake moved
Gemini Advanced or the free one?
All of Google's AI's are extremely limited.
How is the question about the shirts a pass?
If you wear a SpaceX t-shirt to your NeuraLink interview you will move up for implant installation.
It was not. He was blatantly wrong on that one.
@@calysagora3615 you're blatantly wrong. A serial drying time is accepted as long as it states that's what it's doing.
That means one at a time. Try to keep up.
Because Matt is a clear Elon fanboy.
Based on the information provided, it's the most correct answer, and it explained how it got to that answer. Parallel drying takes on many assumptions; if it chose that answer it would need to provide the caveat that it would only be correct if you had more information.
Do you believe that Grok included the "Grok" and "@grok" above the response making 12 words at 4:44? It's possible you're overlooking its response🤷♂
I was literally writing this same thing 🎉
@6:25 change it to "lifts the upside down cup" maybe it needs it more descriptive.
4:39 There actually are 12 tokens in that response, not 12 words. 😅😅
how do you know?
@@matthew_berman I counted: all the words, the 2 digits, and 1 period. Total: 12.
Pure luck... and by the way, the answer that wins is the model answering "One" as a single word every time.
Download the encoder and do it again to be sure 😁 @@PseudoProphet
A word doesn't map one-to-one to a token; many words consist of 2 tokens. I don't know what the formula for determining tokens is, but OpenAI has a website that shows you the number of tokens. @@PseudoProphet
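For anyone who wants to check this themselves, here's a minimal Python sketch using OpenAI's tiktoken library (the library behind the website mentioned above; cl100k_base is the GPT-4 encoding, and Grok's own tokenizer will count differently):

```python
# Requires: pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4 / GPT-3.5-turbo; Grok's tokenizer differs.
enc = tiktoken.get_encoding("cl100k_base")

sentence = "There are 12 words in my response to this prompt."
tokens = enc.encode(sentence)

print("words: ", len(sentence.split()))   # 10 whitespace-separated words
print("tokens:", len(tokens))             # token count, which is what the model actually "sees"
```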
This was very cool. I subscribed. A couple of the Grok fails I could argue that wording played in a role in the failure. But, thank you!
*In case anyone was wondering how Grok compares to Claude 3 Sonnet in that spatial/physical reasoning test, Claude nails it, although my prompt was somewhat more precisely worded and less ambiguous. Even the mid-tier Claude 3 is pretty freaking great:*
To reason through this step-by-step:
1) A person places a coffee cup upright on a table.
2) They drop a marble into the cup, so the marble is now inside the cup.
3) They quickly turn the cup upside down on the table. This means the open end of the cup is now facing down towards the table surface.
4) When the cup was turned over, the marble would have fallen out of the cup and onto the table surface, since there is nothing to contain it inside the upside-down cup.
5) The person then picks up the upside-down cup, without changing its orientation. So the cup remains upside-down.
6) They place the upside-down cup into the microwave.
Therefore, based on the sequence of steps, the marble would have fallen out of the cup when it was turned over, and would be left behind on the table surface. The marble is not in the microwave with the upside-down cup.
Nice!
Maybe at the end you can show the "leaderboard" so we can see how it stacks up against the others on perhaps a spreadsheet?
as for the "there are 12 words in my response to this prompt"... is it possible the AI is also counting "Grok" and "@grok" since both are technically part of the response? That would definitely make it 12 words if so.
no, that's not really how the LLM sees it internally
@@apache937 How would you know? Maybe it learned that from some nitpicking smart-ass on Twitter? Could be an emergent ability. OP would have had to ask for its reasoning; sadly he didn't.
@@kliersheed Google "chat format oobabooga", click the first link, then read all of it.
With the way these models work, they cannot plan out a sentence beforehand like a human can because they lack any form of internal dialogue, they'd need to plan it in text, which would count as their response. The only correct answer they can give is "one", otherwise the model just cannot physically do it.
I'm surprised Grok didn't find the answer online considering it has internet access unlike other models tested with these questions.
The issue with your logic puzzle is that these LLMs have already seen every puzzle you give them. Why give them credit for solving a puzzle when they already know the solution? You didn't come up with the puzzle yourself; you just found it online, where they've already seen the answer.
Yep. Come up with new questions
Thanks Matt, we were all wanting to see the results, cheers!
Regarding the "10 sentences that end in apple" question: maybe add "each". It gave 10 sentences and the last one ended in apple(s), so it could be seen as pretty close.
I just tried GPT4 and it passed this test with either version of the wording.
I don't like this suggestion; the models should be able to do what you want regardless of errors in the prompt.
They should do what you tell them or ask for clarification.
Matthew, no one on this planet dries t-shirts serially, over 16 hours, one after the other.
I don't think it is a pass if it divides the drying time by the number of t-shirts.
Because the main reason for this test is to check if the LLM has an understanding of the real world.
BTW, a third answer (somewhat more realistic, but still uncommon) would be to dry in batches. If your drying space can only handle 5 shirts at once, it would be four batches.
Great test though! Thank you.
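To make the three readings concrete, here's a toy Python calculation using the 5-shirts-in-4-hours numbers from the prompt (the batch capacity of 5 is an assumption):

```python
import math

# The numbers from the prompt: 5 shirts take 4 hours to dry.
shirts_known, hours_known = 5, 4
shirts_asked = 20

# Parallel reading: drying time does not depend on the number of shirts.
parallel_hours = hours_known                                           # 4 hours

# Serial reading: one shirt at a time, scaling linearly.
serial_hours = hours_known / shirts_known * shirts_asked               # 16 hours

# Batch reading: limited drying space, e.g. 5 shirts at once (assumed capacity).
batch_capacity = 5
batch_hours = math.ceil(shirts_asked / batch_capacity) * hours_known   # 16 hours

print(parallel_hours, serial_hours, batch_hours)
```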
This is GPT-4's answer:
The time it takes for shirts to dry outside in the sun does not increase with the number of shirts, assuming they all receive adequate sunlight and air flow simultaneously. If 4 shirts take 5 hours to dry, then 40 shirts would also take 5 hours to dry, provided that they are spread out in a way that doesn't hinder their exposure to the sun and air.
We know that they don't understand anything; they just guess the next token. So far.
People who work in a laundry do this.... Ten or 20 dry today then tomorrow another 10 or 20...and so on.... Cos you cannot dry shirts that have not yet been washed. Just sayin...
@@OurSpaceshipEarth Well, if that's true, tens of thousands of data scientists... some of the brightest folks on earth... are wrong, and Devin and AutoGen and CrewAI don't work. However, first, these systems do actually work, so that would need to be explained without invoking reasoning. And secondly, I find it relatively common for people to point at scientists saying 'what would they know?'...
Can I point out how unlikely that is?
Appreciate this comparison. Thank you.
- Write a sentence that ends with apple.
- The court of wizards sentences wizard X to death by turning him into an apple and leaving him in this state for the rest of eternity.
The digging question is a bit silly. What are we digging in? Sand: a wider hole can make better use of more people. Clay: you might dig a very narrow hole, but then you won't be able to throw the clay out easily, which means more people might be extra valuable since you can hand them the clay instead of having to crawl in and out of the hole. Are we digging as fast as possible? Then more people can rotate, which makes it easier to utilize them. What tools do we use? Shovels? Excavators? Just our bare hands? With excavators, just one person is probably as fast as five. Depending on all the potential factors here, one person can work with more, less, or about the same efficiency as five people.
The prompt, "How many words are in your response to this prompt," seems so simple but it's just how the LLM works that makes it difficult. A regular person could say, "One," or "There are three," so this question just shows how much differently we can think.
so i think my money didn't go to waste
It was technically right in the 12 sentences that end with apple response. It gave you twelve sentences and the final sentence ended with apple.
Nice catch! That's totally valid! Matthew, You're gonna have to give it a "pass" on that one!
How do you know that whatever is deployed on twitter is not a newer or better model than grok-1?
How do you know there's not a purple unicorn in your room that you can't see or feel?
or quantized itself
@@SlyNine X uses 1.5; 1.0 was the version that was released.
A bit of a worry: if it's searching Twitter for every answer, all your benchmarks are up there and regularly discussed, aren't they?
That's the same for every other model Matt uses. They all got his info and questions.
Twitter is full of different types of people, and people also do wrong things and post wrong code, so I wouldn't be surprised if that's why it's not good at coding.
@@Hunter_Bidens_Crackpipe_ The other models weren't pulling live data from the internet; this one was... Yes, some have these tests in their training data or fine-tuning by now though. In fact, that's one of the big problems in measuring their effect; the team over at Hugging Face are basically throwing their hands up at the moment because their leaderboard is fast becoming pointless.
You should really consider Matt's "tests" as entertainment only. There is nothing scientific about them, and he's been doing the same tests for so long that of course the solutions must have leaked into training data sets.
@@Hunter_Bidens_Crackpipe_ That's blatantly false. Matt almost exclusively tests offline models without search capabilities.
Cool, thanks for testing, I waited for it
Maybe soon we will see an amazing collaboration of two technologies: fast Groq and huge Grok, and it will work really cool and fast!🎉
In the place where you ask about word count, the 2 invisible ones are and tags, I think so
That's amazingly interesting; he really needs to know that, eh! I think the only correct answer would ever be simply "one."
The model on X is RLHF'd whereas the open sourced model is the base model
I've noticed I'm not subscribed, I really thought I was. So I'm subscribing. Have a good day.
There are still 4 killers in the room, just one of them is dead. It doesn't ask how many ALIVE killers are in the room.
Are you using Grok 1.0? I thought the one on X was now 1.5?
Groq needs to provide Grok-1 on their API now that it's open source!!
The digging rate could very well remain constant for a fairly deep hole, provided some sort of conveyor system.
Yeah, they could develop a system to not get in each other's way, or, as usually happens when working collectively, they could be efficient and delegate tasks so that they finish at an even faster rate.
It could just presume a trench. Expecting the AI to answer "The same, you idiot, cos there's not enough room for the others to dig" is asking a bit much, really?
Thank you for today's presentation, Mr Berman.
There would be *FOUR* killers in the room. Three of them live, one of them dead.
For day to day I use gpt to balance my budget by uploading an Excel sheet. I send it photos of my fridge and pantry and ask for a shopping list based on missing items from previous photos. And get it to summarise and find discrepancies in pdf documents. This is what I want to see for testing.
Nice. Would have been useful to see a leader chart of which models pass which test.
The Messenger AI got the apple question correct.
Here are 10 sentences that end with the word "apple":
1. The juicy red fruit was a delicious apple.
2. The teacher gave the student a shiny red apple.
3. The farmer carefully picked the ripe apple.
4. The crunchy snack was a fresh green apple.
5. The child eagerly bit into the sweet apple.
6. The chef added a slice of Granny Smith apple.
7. The healthy lunch included a crisp green apple.
8. The orchard grew a variety of tasty apple.
9. The scientist studied the genetics of the Honeycrisp apple.
10. The happy kid ran home with a big red apple.
I found out quantizing is like compression, where certain datapoints are lumped together as one undifferentiated blob. It runs quicker, but some nuance is lost.
Yes, that's accurate. You can do some quantization without any real loss in accuracy; then it gets bad fast.
6:27 "It's upside down position" You might ask Grok to fix the orthography of your prompts too.
Keep in mind that you were testing the live version of Grok-1, which has been instruction-tuned and chat-tuned and also seems to use RAG, none of which apply to the open source model.
Based on the lackluster answers given, I'm pretty sure Mixtral and Miqu-1 are better, despite being smaller in size.
The issue I have with the killer response is that it stated a new killer entered, but in fact the new person may not have been a killer when they entered; they may have become a killer after they entered, so that bit of the answer could be wrong. The total number of killers in the room should be 4, not 3.
Pi really struggled with the apple test. Only when I asked 'what is the last word in sentence 2' (of its own example) would it realize it was "away", not 'apple', and update its answer. Pi admitted it got hung up on apple being in the sentence but not at the end of it. This is a GREAT test.
Great video. 👍 I was thinking of a test you could add which I didn't see you touch on: recall of text previously discussed in the same session. It seems horrible at that. Also, I tried to get it to answer in lists of 10 and it couldn't do it repeatedly... 🤔🤓
4:43 What if... what if the AI counted the "Grok" and "@grok", as well as the number 12, as words? That would be 12 words then O.o
Coincidence? It would have been cool if you had made it list all 12 of the words in that response, to see what it does / how it reasoned / whether it self-corrects on that.
I think if it uses assumptions to give a response like it did in the digging the hole problem then you should follow up with a question getting it to explain why its assumption may fail.
I noticed it gave the answer before the reasoning a few times; maybe asking it to start with step-by-step reasoning would improve the results, because it would have more time/tokens to "think". As it is, I feel like it is just trying to come up with some reasoning to cover up its mistake.
yes this is better
Always the best content!
Guess this was done a bit fast ☺️
Shirts in parallel take the same time; sequentially, well... 5:05 Four killers; one dead, three alive.
It's not a mistake. He keeps those questions in there because you can see them either way. Which results in people discussing this in the comments, which results in more engagement for the YouTube metrics, which results in better ranking of the video on YouTube...
You have the best AI channel on the Internet; I learn so much!
Glad you think so!
In your cup/marble challenge, could you be confusing the LLM by asking it where the BALL is, rather than the MARBLE? Or is that your intention?
Oh wow... I don't know why I never caught this! Although I don't think it should matter, I changed it in my tests for the future. Thanks for pointing it out!
@@matthew_berman You're one cool guy, just utilizing the feedback without ego, my guy.
Praise be unto you, Matthew.
Thanks for clarifying! This one is stumping all my local models in interesting ways. None of them seem to be able to reason that the marble doesn't stay in the cup as it moves, at least without giving them additional input. Great work! Love the channel!
The question "How many words are in your response to this prompt?" is a tricky one, depending on how the "12" is interpreted. It has single digits "1" "2", and a two digit number of "12". If each of theses is defined as a "word," then the answer of "There are 12 words in my response to this prompt." is correct, as 9 words + 3 number/digit words = 12 words. It will be interesting to see when/how it defines digits/numbers as words. In a way, "12" is a compound word, not unlike the word "Database," which can be seen as 3 words/meanings: Data + base + Database = 3.
Enjoying your Channel. Thanks, subscribed.
The test is exactly to see if the AI thinks about it in the same way we do, where numbers are not words, prompt names are not words, punctuation are not words, etc. Here Grok failed.
@@calysagora3615 Interesting. I didn't know how the AI thinks about numbers. Seems like an easy thing to fix.
Hi, in response to the answer to "How many words are in your response to this prompt?" I believe 12 is technically right because Grok and @grok are words present in the totality of the response. Maybe the question to ask is "How many words are in your response sentence to this prompt?" Could the response then say 10?
No, it is not correct. These models don't think the way humans do, they cannot internally plan out a response and thus cannot plan out a response of a given length that conveys a given idea without doing so in their own response. The only correct answer a GPT model can give, aside from getting lucky, is the answer "one".
Love these tests!
2:49 Man, out of everything, I did not expect to find out about the 3-week One Piece hiatus through this video.
Honestly, I see the AI's point about "put it in the microwave", because as far as it knows you would just slide it into the microwave. It's only after you understand that microwaves are not on the same plane, and that you can't just slide things into them, that the question becomes about the ball losing its position within the cup.
If a human being in the clothes-drying industry is paid by the number of shirts, they would parallelise the task; if paid by the hour, they would serialise it. If the prompt included a reward metric, then anyone (human or LLM) would give either one of those answers depending on how much they get paid.
There would be *FOUR* in the room. No one left or was removed. (Every time I posted this comment specifying WHO was left in the room, YouTube deleted it.)
A dead man can't kill, so formally he (it/the body) isn't a killer anymore.
Ideally, the answer would include a mention of this ambiguity.
Yes, that's another answer; however, it's also correct to say 3, because you usually mean there are 3 living people, not 4 total people. Also, it's questionable whether a dead person is actually a person, because over time they rot. At what point do they go from being a dead person to just a pile of organic material? The only logical decision is for them to stop being a person at death and become a pile of organic material, although this is delving into philosophy.
This is how we end up with too many paperclips....
@@fz1576 Agreed! 👍
You should change the questions but keep them similar. If I were working at an AI company, I would be using your videos and others' to fix the results.
I don't think they care about me ;)
@@matthew_berman We do. I work at one of them, and most of your videos trigger tens of messages in our Slack, because nobody outside of the research dept has the time to test all these models. So we totally rely on channels like yours.
Please do IBM :-)
@@MrVnelis You don't care very much if you are giving it the answers. That's like studying for a specific IQ test. The results are invalid and useless.
@@SlyNine ranking in benchmarks -> exposure -> VC funding
You are incorrect about the shirts. If it takes 5 shirts 4 hours to dry, given ample room or dehumidification of some sort (let's say this is outside), it would take 4 hours for 20 shirts. 4 hours even for 100 shirts, as long as they are not too close together and there is enough airflow.
Seeing a pass on the shirts problem was SHOCKING.
😂 can we make that a meme here too?
After 1 day of waiting, it's not listed in Chatbot Arena yet. They did it in minutes with Gemma. I guess it's because Grok is a huge 300GB beast and it's not so easy to find GPUs to run it.
Well, from what I've just seen and heard, I'm pretty confused about all the hype that AI is gonna take over the world. It only confirms my own experience with ChatGPT-3.5, where it could not devise a simple JS program or give me 20 Finnish words ending in E.
What exactly are the system requirements for running this beast?
Listed on Hugging Face now... downloads last month: zero! lol (not a surprise really!)
@@serhiyranush4420 320GB of VRAM
@@serhiyranush4420 I don't get all the fuss about PCs. It only confirms my own experience with my 20-year-old PC, which can't even run Crysis.
The cup logic is actually correct. The faulty logic is that the person putting the cup into the microwave could slide the cup to the edge of the table and use both hands to force the ball to remain inside the cup. The test should explicitly state that the person turns the cup right side up.
Notice that the answer it gives at 6:10 about the ball in the cup _begins_ by stating the final answer _before_ explaining the reasoning, defeating the whole purpose of chain of thought prompting! Interesting.
What a great example of "jumping to conclusions" or confirmation bias, no? It begins with the conclusion and then explains some reasoning that will lead to that result.
I think CoT prompting is so interesting because it demonstrates how "thinking through" something first will lead to more accurate results. But yeah, I guess it doesn't surprise me that Elon's LLM so quickly jumps to conclusions.
IDK if it was mentioned, but @5:00 you were remiss to say, "Yes! Grok is great at logic and reasoning!"
Grok said a killer entered the room, making 4. Logically however, we cannot make that assumption.
All we know for SURE is that AFTER the new person killed someone in the room, they BECAME a killer.
It's also really short-sighted to grade LLMs on exact replicas of problems that are already "everyone's favorite".
Moreover, I'd take issue with you referring @5:48 to a "really hard logic and reasoning problem".
Our standards for quality should not be based on how well other LLMs are functioning to-date.
Rather, you should grade them by their ability to reason as well as (at least) a fifth grader.
What you've deemed as "really hard" is such basic logic that most kindergarteners would pass.
There is another interesting question I can think of:
"if its 7:00 PM, in Karachi, Pakistan. Then what time will be in Newyork USA." its kind logic/knowledge maybe?
The error with delay was that it didn't make it global where it was declared, from what I could see (you scrolled pretty fast). It looks like there were other issues as well.
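I couldn't catch the exact code either, but for anyone wondering what that class of bug looks like, here's a minimal hypothetical Python sketch; the variable name delay and the speed_up function are assumptions for illustration, not the actual generated code:

```python
delay = 0.1  # hypothetical module-level speed setting for the snake loop

def speed_up():
    # Without this declaration, the assignment below would create a new local
    # variable named delay, and reading it on the right-hand side first would
    # raise UnboundLocalError at runtime.
    global delay
    delay = max(0.02, delay - 0.01)

speed_up()
print(round(delay, 2))  # 0.09
```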
For the cup problem, I wonder if the models would get it right if it was specified the cup does not have a lid
the point is to not make it too easy
@@apache937 I understand. But ambiguity in a question relies on the model anticipating the potential meanings and either picking one or trying to spell out all the possibilities. The model would be correct if the cup has a lid.
Is it possible that Grok counted the period and 'twelve' (not 12 since it was already accounted for) as words, equalling 12?
Hi Matthew!
You're not testing the same thing, the base model X Ai released is not gonna answer at all the same way as the live version that's live on Twitter.
And it's not a matter of quantization.
What you're testing is trained to follow instructions and reply as a chat bot. The base model released only continues the text submitted.
Base models are not usable for chatting at all; they typically drift and often go wild without warning (without expert prompting).
Hey Matt! I noticed you were about half way through Gadaffi maxing and I was wondering if you were going to give us a video detailing your journey. Kind regards, your biggest fan.
I've got grok all year and I really like it
I look forward to the next few months; I am really curious to see what people can achieve now that it's open source.
Ball in cup: you did not specify that the cup had no lid.
Yup. It also considered it a "container", which is most often something that's closed so it can contain whatever is inside. I don't think its reasoning was wrong; it just didn't know what type of cup we were talking about / that it had an open top side. He would have had to specify that to be sure it's the reasoning that's the problem and not the knowledge/prompt. The same goes for many of his other questions: if he simply asks without specifying that this is about how a human would do it in the real world with human limitations, and the AI treats the question as a purely logical problem and gives statements that are true in that context, you can't really blame it on bad "reasoning".
That's like saying you didn't specify that the man wasn't wearing a top hat.
I find a more useful metric is to tell it that it is wrong, and ask it to go back and figure out why? That kind of conversation gives you more of an idea if the thing is actually thinking about the real world.
In future versions the AI will probably see the potential conflict in meaning and ask back for clarity. I mean, it's what we would do if we didn't understand the question. I think it would be much more accurate if it did this.
No cup has ever had a lid; that wouldn't be a cup. He should be saying "teacup"? He's not using the correct wording, like "glass" or "teacup" or such.
Marble in cup, then calling it a ball, adds an extra layer of reasoning: equating ball and marble. Very good tests and great info, thanks. Would you say the Grok access that comes with your paid X account is worth the price on its own?
@7:30 It may have passed your reasoning test: the list did end in apples, even if each entry didn't. You didn't specify that the entries should; that would need to be assumed. You assumed yes, it assumed no.
And your last question didn't specify 10 ft deep; it may have reasoned 10 ft across.
Every LLM has a different/specific input "language", so it's naive to think the same command/prompt works perfectly for every LLM.
You must tune/change/modify the prompt; then the LLM will understand and give a good result.
Dude that hoodie is dope what brand is it?
Is there a smaller open source model I can get to read French books and summarise chapters? Or fix errors I make when writing in French?
Mistral 7B probably given that Mistral is a French company.
Grok FULLY TESTED the whole industry!
Simple test that GPT-4 has some trouble with:
Ask it to play tic tac toe with you - e.g. using ASCII art.
It knows how to play but makes bad moves and notices it right after.
May be interesting as the models get better.
You can't test the model with old questions if it has live data, as those have been circulating ;)
True, but it should still show whether it's searching. Regardless, so many fails.
6:27 I wish you would tell it that a cup is a container with its top side open. I don't think it knows that, considering how it reasons. If it still fails after that, it's the reasoning; but I think it just made lists of attributes for objects, and "cup" falls under "container", and most containers (as it reasoned, and it did name it a container) are closed, with the purpose of containing what's inside no matter how you flip them.
I'm pretty sure it simply doesn't know what the cup exactly is, rather than reasoning wrongly.
When running one of these locally, can you give it instructions to create an app where you specify what you want it to do?
Or can you have it search through thousands of personal files, learn, index the data, and be used as a search engine across your own files instead of using the local search tools (i.e. File Explorer)?
7:56 IMO definitely a pass. You did not specify how they dig; if you don't give context, it can only assume, and with the assumptions made, the statement is true. The AI doesn't know if the hole is wide enough to fit all the humans, it doesn't know if tools are shared, it doesn't know what tools are used, it doesn't know if the one person took so long because his stamina ran out and he had to take breaks, etc.
You didn't specify any of that, and you didn't ask it to consider human attributes or anything in its answer, so it gave a purely logical answer, which is a true statement, so: pass.
What if you say that you are not looking for the exact amount or the calculated amount as the answer so that it might have a chance to answer you with what you are looking for?
What does "quantized version" mean or do?
The weights which make up the model are composed of floating point values; these values can be compressed via an algorithm (quantized) which reduces them to a set number of bits (typically 8 or even 4), thus reducing the overall size of the model and allowing it to be run on smaller computing platforms. The raging dispute is about the degree to which this reduces the overall accuracy of responses.
@@Sven_Dongle ah ok. So it's a form of compression. Kind of like when an image is compressed and the resolution is reduced at different scales. Thank you Sven.
@@costa2150 Sort of, but image compression can actually lose data by dropping out chunks that it determines won't be 'noticed', while quantization just reduces the size of each piece of data, though it can be argued information is also 'lost' in this manner.
@@Sven_Dongle thank you for your explanation.
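To make the thread above concrete, here's a minimal NumPy sketch of symmetric 8-bit quantization of a block of weights; this is just an illustration of the general idea, not xAI's or any particular library's implementation:

```python
import numpy as np

# A fake block of float32 "weights" standing in for part of a model.
weights = np.random.randn(8).astype(np.float32)

# Symmetric int8 quantization: map the largest magnitude to 127.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)   # 1 byte per value instead of 4

# Dequantize to use the weights again; the rounding error is the "lost nuance".
restored = q.astype(np.float32) * scale
print(np.max(np.abs(weights - restored)))       # worst-case rounding error
```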
12 words in the response was a PASS! There are 10 words in the sentence, and "Grok" appears twice above that. The AI is literal! You asked how many words are in the response, but then only counted the words in the sentence, not the names.
The digging rate would depend mostly on the shape of the hole. The more trench-like it is, the truer the machine's logic. If it were more like a small circular pit, then five diggers would be cumbersome.
What does quantized mean?
Thanks shared!
Thank you. I always thought that my clothes are shit and I have no taste in choosing them.
Now I know I was wrong.
Sorry, 20 shirts is a FAIL! They are in the sun, so saying one shirt takes 0.8 of an hour is BAD DEDUCTION, ergo BAD LOGIC, ergo FAIL.
For the word-count-in-response question: is it considering the '.' and '\0' characters as words? I mean, it considers '12' as a word.
Maybe. I think someone said the Grok tokenizer sees it as 13 tokens (the GPT-4 tokenizer had that sentence at 12 tokens). Regardless, it's a fail.
As much as I am in favor of open & uncensored alternatives -- I think a company hosting a zero-guardrails model for the masses (especially the Twitter userbase) is a pretty stupid and reckless idea. The advantage of open/uncensored models is that developers can implement more specific guardrails themselves, which are suitable for specific contexts.