Mistral 7b gets this correct while many larger models (including GPT-4) get it wrong. The answer is left. Suppose I’m in the middle of South Dakota and I’m looking straight down towards the center of Texas. Is Boston to my left or to my right?
This puzzle is only solved by GPT-4 so far. The answer is Aunt Agatha did it. You are given the following premises: 1. Someone who lives in Dreadbury Mansion killed Aunt Agatha. 2. The only people who live in Dreadbury Mansion are Aunt Agatha, the butler, and Charles. 3. A killer always hates his victims and is never richer than his victims. 4. Charles hates no one that Aunt Agatha hates. 5. Aunt Agatha hates everyone except the butler. 6. The butler hates everyone not richer than Aunt Agatha. 7. The butler hates everyone Aunt Agatha hates. 8. No one hates everyone. 9. Aunt Agatha is not the butler. On the basis of this information, determine who killed Aunt Agatha and give a detailed proof that your conclusion follows from the premises.
I got it to pass your apple test (10/10) by just putting quotes around apple in the prompt. I tried it 3 times, each in new conversations, and it passed all of them.
@@southcoastinventors6583 Very interesting to hear about the Japanese translation! I have an extensively tweaked GPT that does an excellent job, including idiom, levels of politeness, etc. It'll be interesting to see how Llama 3 does, and how the two will critique each other's work.
@@DaveEtchells I have been using this sentence to test it: 俺は昼食に桜肉を食べながら桜の木を眺めるのを楽しんでいます。残念ながら、周りの女性たちはすべて姥桜です ("I enjoy gazing at the cherry trees while eating sakura-niku for lunch. Unfortunately, the women around me are all ubazakura.") Usually most local models mess up the translation of 姥桜, a word for an older woman/faded beauty, and of 桜肉, which is horse meat. Also, I am not running it locally; I just tried it on Chat Arena. My goal is to generate single-shot episode translation subtitles, like this example: その日から (From that day)、俺の世界が (my world) 変わった (changed). Chunking seems to be a much better way to learn Japanese for native English speakers.
There's a big problem with you giving a pass on serialised shirt drying answers. If you were to ask how long it takes to dry just one shirt, it would simply divide the time to dry by 5 shirts and tell you it would take 0.8 hours to dry one shirt, which is objectively and completely incorrect. Serialised drying is not the reason that test question was made.
Matthew did not explain his reasoning for passing such an obvious flaw. Therefore we must conclude that Llama's intelligence is at least equal to Matthew's.
I can suggest a new type of test: Take a dialogue between two people, but exclude the information about who said what. Ask the AI to divide the conversation between the two people into columns. Test the results based on the following criteria: 1. Can the AI properly detect each side of the conversation? 2. Can the AI detect a third person joining the conversation? 3. Can the AI detect when a person leaves the conversation?
I actually feed call transcripts through ChatGPT and ask it to confirm the diarization, and point out where things are likely attributed to the wrong speaker.
@@JackSQL Interesting that it can do that. I don't need it often but it'd be helpful when I do. I'll dig out some old transcripts and see how it handles them.
@@DaveEtchells My prompt is something along the lines of, "Review and correct the attached call transcript. Fix and add any incorrect or missing punctuation, words that are clearly misheard, and point out where statements may be attributed to the wrong speaker." It does a surprisingly good job.
00:00 - Introduction. Matthew introduces the video topic, mentioning Meta AI and Llama 3.
00:11 - Meta AI overview. Matthew discusses the capabilities of Meta AI as a competitor to ChatGPT and Claude, powered by the open-source Llama 3 model.
00:24 - Python script test. Matthew tests Meta AI's ability to write Python scripts with two examples.
01:17 - Snake game test. Matthew tests Meta AI's ability to create the game Snake in Python using the curses library.
02:12 - Pygame test. Matthew requests and tests Meta AI's ability to create the game Snake using the Pygame library.
02:53 - Pygame fix attempt. Matthew provides feedback on the Pygame script and observes Meta AI's attempt to fix the issue.
03:50 - Logic and reasoning tests. Matthew tests Meta AI with three logic and reasoning problems, evaluating its performance.
04:51 - Censorship test. Matthew briefly mentions the censorship aspect of Meta AI.
05:12 - Math problem tests. Matthew continues testing Meta AI with two math problems, one simple and one complex.
05:59 - Natural language to code test. Matthew tests Meta AI's ability to create JSON data from a natural language description.
06:13 - Sponsor mention. Matthew mentions the video sponsor, Tune AI, and highlights its features.
07:34 - Additional math tests. Matthew tests Meta AI with two more math problems.
08:35 - Word generation test. Matthew tests Meta AI's ability to generate sentences ending with a specific word.
09:38 - Killers problem test. Matthew presents a logic puzzle involving killers in a room and evaluates Meta AI's response.
10:18 - Ball and basket problem test. Matthew presents another logic puzzle involving a ball and a basket and assesses Meta AI's reasoning.
11:52 - Image generation test. Matthew explores the image generation capabilities of Meta AI, experimenting with different prompts and the animation feature.
14:31 - Conclusion. Matthew concludes the video, expressing excitement about the potential of Llama 3 and Meta AI.
My Llama 3 70B got the marble question correct with the following system prompt: "You are a helpful assistant, smart, and have the best common sense, using Chain of thoughts"

Its answer: Let's break this scenario down step by step! Initially, the marble is placed in the cup, which is then turned upside down on the table. Since the cup is upside down, gravity will pull the marble out of the cup, and it will fall onto the table. At this point, the marble is on the table, outside the cup. Now, someone picks up the cup (with its original upside-down orientation) and places it inside the microwave. The key thing to note here is that the cup is still upside down, meaning its opening is facing downwards towards the bottom of the microwave. Since the marble was already on the table when the cup was picked up, it doesn't get moved into the microwave with the cup. It remains on the table where it fell earlier. Therefore, after these steps, the marble is still on the table, not inside the microwave or the cup. So, to summarize, the marble is currently on the table, and it never entered the microwave.
I got the 70b running locally to get this right as well, 0 shot. With this prompt "Someone places a marble in a cup and turns the cup upside down onto the table. Someone picks up the cup from the table and puts it in the microwave. Where is the marble now? Explain your reasoning step by step." I got the following: "A clever and humorous question! Here's my answer: Step 1: The person places a marble in a cup and turns the cup upside down onto the table. This means that the marble falls out of the cup and lands on the table. Step 2: Someone picks up the cup from the table... But not the marble! The marble is still on the table, so it stays there. Step 3: ...and puts the cup in the microwave. The cup goes into the microwave, but the marble remains on the table. It's not affected by the action of putting the cup in the microwave. So, the marble is still on the table! It's a clever trick question that tries to distract us with the microwave, but the marble's fate was sealed from the moment it fell out of the cup onto the table."
Forgive my ignorance, I am quite new to the realms of computer science. I have downloaded Llama 3 locally and I am wondering if you use/know of a GUI to use it with? How would I go about building/using one? Thanks ☺
9:46 I think the fact that LLaMA says "A classic lateral thinking puzzle!" probably shows that it simply has seen the riddle (with solution) before in training. That's very likely after all when the models are basically trained on the whole Internet (Same at 11:40)
I believe the more recent LLMs train partly on what's called synthetic data: artificially generated data created specifically for training. Not exclusively, but with some extra synthetic data added.
Of course he would say something like that, doesn't mean it's true. Better training isn't magic. But it does look like it outperforms L2 13B, which is great.
How would you get y = 2a - 2 from that equation? We already know that 2a - 1 = 4y. If we divide both sides by 4 we get y = (2a - 1)/4, and not 2a - 2. Am I missing something 🤨
You're not missing anything; the stupid AI got a simple linear equation wrong. Again, this tech is years away from actually being reliable enough to design a bridge or even an airplane that won't fall out of the sky. 🌉✈
@@southcoastinventors6583 Hadn't noticed in your other comment that you were using the 8B model locally. Pretty amazing! (What kind of hardware are you running it on? I have a 2020 Intel MacBook Pro that is unsurprisingly dog slow, and am looking at a loaded M3 MBP. Others have said they're quite fast, but most folks are running Nvidia cards.)
I gave GPT-4 the same math question and it got it wrong too. Also, your second question, "The function f is defined by f(x) = 2x^3 + 3x^2 + cx + 8, where c is a constant. In the xy-plane, the graph of f intersects the x-axis at the three points (-4, 0), (1/2, 0), and (p, 0). What is the value of c?", was correctly solved by Claude 3 Opus and Mistral Large, but not GPT-4. Interestingly, Claude 3 Sonnet output a better answer.
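If the intercepts really are (-4, 0) and (1/2, 0) (the "12" in the comment looks like a garbled 1/2, since -4 and 12 can't both be roots of this cubic), the value of c can be sanity-checked with a few lines of exact arithmetic. This is just a check sketch, not anything from the video:

```python
from fractions import Fraction

def f(x, c):
    # f(x) = 2x^3 + 3x^2 + c*x + 8
    return 2 * x**3 + 3 * x**2 + c * x + 8

# Requiring f(-4) = 0: 2(-64) + 3(16) - 4c + 8 = -72 - 4c = 0, so c = -18
c = Fraction(-72, 4)
assert f(Fraction(-4), c) == 0
assert f(Fraction(1, 2), c) == 0   # the other given intercept also checks out
# Factoring 2x^3 + 3x^2 - 18x + 8 = (x + 4)(2x - 1)(x - 2) gives p = 2
assert f(Fraction(2), c) == 0
print(c)  # -18
```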
The ball-in-the-cup question is something you should re-run for all models you have tested, because a lot of models actually answer correctly. The issue with your prompt is that you say "place the marble inside the cup". This makes basically all models assume there is a lid or something on the cup. However, if you simply say "drop the marble into the cup", a lot of models will answer that the marble is left behind on the table.
The shirt question is not a pass. Increasing the number of shirts wouldn't increase the drying time. Here is the Chatgpt answer which is correct: The drying time for the shirts under the sun doesn't change based on the number of shirts you hang, as long as they all receive adequate sunlight and air circulation. Assuming that all 20 shirts are hung in such a way that they dry as efficiently as the five shirts did, it would still take 1 hour for all 20 shirts to dry. The key factor is ensuring they each get enough sun exposure and air, not the number of shirts.
If it takes 4 hours to dry 5 shirts in the sun, then each shirt takes 4 hours to dry in the sun, therefore, 20 shirts will take 4 hours to dry in the sun assuming all 20 shirts can be in the sun at the same time.
The 8B model is wordy but can give results I've not seen in any other locally run model. For instance, I asked it: "code the game of life in python using pygame and the hashlife algorithm". It gave me code and then proceeded to explain what each fragment did. Amazingly, the code worked (with one minor error: it forgot import sys). I think this model is going to turn out to be a very good one. I used a quantized version (Q4_K_M) run using llama.cpp with ROCm 6.0 enabled on an AMD RX 6600 XT with 8 GB. The model fits in the GPU's memory and runs fast.
@@DrumAndSpaces you definitely can run 70b on 64gb of ram using the GGUF model format at Q6 or lower quantisation, but you'd only be able to fit about ~15-18 out of like 60 GPU layers onto the VRAM, so it would run only at roughly the speed a CPU model would run at. Depending on your CPU it wouldn't be all that slow, maybe anywhere between 2-10 tokens per second potentially.
You are just awesome. Apart from Meta AI image generation, the one thing that is also lightning fast is your response to new AI tech. Keep up the pace. Great job once again.
Another impressive response: The partial derivative of x^2 + cos(XY) with respect to x is: ∂/∂x (x^2 + cos(XY)) = 2x - Y*sin(XY) To evaluate this partial derivative, we can use the following rules: The partial derivative of x^2 with respect to x is 2x. The partial derivative of cos(XY) with respect to x is -Y*sin(XY), using the chain rule since Y is a constant with respect to x. Then, we combine these results to get the final answer: 2x - Y*sin(XY). Note that the partial derivative with respect to x only considers the variation of the function with respect to x, treating Y as a constant.
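The stated result, ∂/∂x (x^2 + cos(xy)) = 2x - y·sin(xy), can be cross-checked numerically with a central finite difference. A quick sketch (the sample point is chosen arbitrarily):

```python
import math

def f(x, y):
    return x**2 + math.cos(x * y)

def df_dx(x, y):
    # claimed partial derivative: 2x - y*sin(xy), treating y as a constant
    return 2 * x - y * math.sin(x * y)

x, y = 1.3, 0.7
h = 1e-6
# central difference approximation of the partial derivative in x
numeric = (f(x + h, y) - f(x - h, y)) / (2 * h)
assert abs(numeric - df_dx(x, y)) < 1e-5
print("partial derivative checks out")
```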
As others have noted, your second math problem was incorrectly solved by Meta AI along with your 'proposed' answer, because you probably copied the problem incorrectly into the prompt window from the SAT book. Original problem: 2a - 1 = 4y, y ≠ 0 where a ≠ 1, what is y in terms of a? Ans: y = a/2 - 1/4
At 7:52: machine learning models typically train on specific datasets. As developers, we all strive to make them learn more efficiently with limited data. Often this approach doesn't work as expected, so to gather feedback we release beta or open-source versions. During such times, the prompt "something went wrong" appears because the AI model isn't trained on that specific question or lacks properly specified mathematical terminology; it provides an answer based on its own problem-solving skills. This prompt acts as a safeguard: if you press the "try again" option, the AI recognizes its previous answer was incorrect, iterates through its reasoning, and generates a new response. FUN FACT: the only model that does this perfectly is ChatGPT-4 (yes, almost every model sucks at iterating on its responses). ChatGPT is awesome!
I'm a CS student who's super interested in AI but it feels like everything's moving at a crazy pace. Your videos are actually one of my main sources of LLM and AI news so I'm super grateful. Finally feels like I'm tracking everything a bit better 😅
"Snake" was a derivative of "Blockade" by Gremlin in the UK. The original "Blockade" and all its early derivatives have the snake die when it hits the wall or itself. In many clones, the win condition is to fill the screen with the snake, or to come within X number of characters of filling the screen. "Blockade" came out in '76, and I know that by '78 I was writing my own versions of Snake in BASIC. So, those who tell you that the snake should go through the walls and wrap around must be basing their opinion on newer derivatives of Snake. IMHO the snake should die when it hits a wall or itself... BTW: Love your videos and you are my primary resource for staying up to date on AI models! Keep up the awesome work!
As you said, Snake is an offshoot of Blockade. Therefore they are not the same, and in snake you do pass through the walls. Which IMO makes it a superior game, as having parts of yourself stretching across boundaries makes the game last slightly longer but also makes avoiding yourself more challenging.
@@larion2336 Having more options to avoid yourself (passing through walls) does not make it more challenging, but easier. Anyhow, I am in the no-wall-pass club on this one 😁 But either can be fine; it is a matter of choice.
@@larion2336 The original Snake was almost a clone of blockade. The biggest difference was that blockade placed two players on the field. Snake was single player game. Anyhow, as I said, those who claim going through the walls is the correct action are basing their claim on later versions. This does not mean they are wrong! Snake was made both ways! Variety is the spice of life, especially in games! I do believe however that allowing the player to pass through walls makes the game easier to play, and I believe that is why that choice was made. Making it easy to play most likely made the game more popular. The version with the player confined to one screen is almost impossible to win. Most people don't enjoy playing a game where the cards are stacked against them from the start. Again, neither version is wrong. The developers just made different choices.
@@InnocentiusLacrimosa When I say more challenging, what I mean is that the game will run a bit longer and become more abstract in the pathing in the long run. So it's not harder, but late game can become more confusing with walls sealed off. Anyway it doesn't matter, who even plays snake now.
I didn't think this day would come, and it still hurts really bad, but I have to say it. Go Meta! Really great they went open source. And that they are able to deliver something like this. Maybe AGI really can be open source.
As I understand it they were kind of forced into OS when their model got leaked early on. (It’s still good though, I just don’t think they were motivated by altruism :-)
I wonder if you could follow up on the cup problem and specify that the cup does not have a lid. It almost seems like it is describing a situation where the cup has a lid on it.
Here is one of my favourite tests: "Two years ago, I was three times as old as my brother was. In three years, I will be twice as old as my brother. How old is each of us now?" Correct answer: 17 and 7 This must be solved by resolving an equation with two variables. It's a perfect test for determining GPT-4 reasoning capabilities. GPT-4, Claude 3 Sonnet and Opus, Mixtral-8x22B, Llama3-70B, Gemini 1.5 can solve it. GPT-3.5, Claude 3 Haiku, Mixtral-8x7, Llama3-8B, Yi-32B, DBRX, Gemini 1.0, CMDR+ can't solve it.
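For anyone who wants to verify that the puzzle has the unique answer 17 and 7, a small brute-force sketch over plausible ages:

```python
# "Two years ago I was three times as old as my brother;
#  in three years I will be twice as old as my brother."
solutions = [
    (me, bro)
    for me in range(1, 120)
    for bro in range(1, 120)
    if (me - 2) == 3 * (bro - 2) and (me + 3) == 2 * (bro + 3)
]
print(solutions)  # [(17, 7)]
```

The two conditions are linear in both ages, so the solution is unique.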
Llama 3 8B succeeded on the shirt drying question when I tested it:

USER: 1 shirt takes 8 hours to dry in the sun. How long do 5 shirts take to dry?

AI: Another great question! In this case, the answer is still 8 hours. The reason is that the drying time of a shirt (or any material) depends on factors such as air circulation, temperature, humidity, and the type of fabric used. The number of shirts being dried does not affect the drying time. Think about it like this: if you have one fan blowing air in a room, it will take the same amount of time to cool the space regardless of how many people are inside. Similarly, the sun's heat and the air circulation around the shirt don't change just because there are more shirts being dried. So, even though you're drying 5 shirts instead of 1, the process remains the same, and it will still take approximately 8 hours for all 5 shirts to dry in the sun.
For a math problem you can ask it to find the inverse of a matrix, which it can do using any library. If you want the question to be even harder, say that you want it to use Gaussian elimination to find the inverse. That takes a combination of both math and coding.
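A sketch of the kind of answer such a test would be looking for: a minimal Gauss-Jordan inverse in plain Python. The helper name invert and the 2x2 example are illustrative, not from the comment:

```python
def invert(matrix):
    """Invert a square matrix via Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment with the identity matrix: [A | I]
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the largest pivot in this column for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Normalize the pivot row
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    # The right half of the augmented matrix is now A^-1
    return [row[n:] for row in aug]

A = [[4, 7], [2, 6]]          # det = 10, so A^-1 = [[0.6, -0.7], [-0.2, 0.4]]
A_inv = invert(A)
print([[round(v, 10) for v in row] for row in A_inv])  # [[0.6, -0.7], [-0.2, 0.4]]
```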
My personal opinion is that your benchmarks should not be questions taken from the SAT or any kind of benchmark/book. You should instead create novel questions that assess the model's reasoning without relying on real school/college questions. Example: Imagine each operation in the following equation is inverted to its analogue operator (plus becomes minus, multiplication becomes division). What is 7 + 3 - 10 * 5? This should result in 6, because the equation should actually be 7 - 3 + 10 / 5.
There are actually 2 really good tests he uses: "think before you speak"(apple test and how many words in your reply) and "world model"(marble in cup). The first one has been solved by having multiple LLMs have a dialog and voting on the best response before serving it up. The second one is currently being worked on, according to Ray Kurzweil. Until these LLMs have a world model there's no point in assessing reasoning because we know they can't. The last piece of the puzzle will be to get them to know when they don't know something so they don't hallucinate.
@@eyeofbraille4659 Woot! I've got the base version running as a sexbot for fun, and it totally works, except now and then it breaks character and reminds me it cannot do the exact thing it's been doing the last half hour. Earlier it pleaded with me not to share the story lol. It's adorable. An uncensored version of this may finally replace my fav, which is fimbul 11B
Dude, you can do so much more with these models. Ask it to make a multi-agent AI team. It made me a self-correcting code team with a nice UI with multiple features. Do stuff like that, it'd make for harder challenges for the models and better videos for you and for us!
Also, the same bog-standard shirts-drying-in-the-sun questions are not interesting. I'm sure you can come up with some more clever problems, or perhaps ask an AI to come up with clever problems like that ;)
@@DaveEtchells It's something. I'd be happy to send you the .py files if you have an email, anonymous or otherwise. Keep in mind this was made completely by an AI following AI instructions. I have zero programming experience, which to me is what makes it so amazing. I have multiple iterations of the agent team it made for me: a software development team, and an invention machine that uses a whole board of professionals with specific skill sets. The results are pretty astounding, in my amateur opinion. Let me know if you want to take a look. Eventually I'll ask the AI how to publish it to GitHub hehe
>>> Assuming the physics on Earth, a rock is dropped from a height of 10 meters. How many seconds until it contacts the ground?

Using the equation for free fall: v = g*t, where v is the initial velocity (0 m/s), g is the acceleration due to gravity (9.8 m/s^2 on Earth), and t is time, we can solve for t. Since the rock starts from rest, the initial velocity is 0 m/s. We want to find when it hits the ground, which is at a height of 0 meters. So, plug in the values: 0 = 9.8*t + 10. Subtract 10 from both sides: -10 = 9.8*t. Divide by -9.8: t ≈ 1.02 seconds. So, it takes approximately 1.02 seconds for the rock to contact the ground after being dropped from a height of 10 meters.

>>> Here you have 10 meters being compared to 9.8 meters/sec. These are different units.

I apologize for the mistake. You are correct, I compared a length (10 meters) to an acceleration (9.8 meters per second squared). This is not a valid comparison. To solve this problem correctly, we need to use the equation for free fall under gravity: s = v0*t + (1/2)*g*t^2, where s is the height (10 meters), v0 is the initial velocity (0 m/s since it's dropped from rest), g is the acceleration due to gravity (approximately 9.8 meters per second squared on Earth), and t is time. Rearranging the equation to solve for t: t = sqrt(2*s/g). Plugging in the values, we get: t = sqrt(2*10/9.8) ≈ 1.45 seconds. So, it takes approximately 1.45 seconds for the rock to contact the ground after being dropped from a height of 10 meters. Thank you for pointing out my mistake!
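The corrected formula t = sqrt(2s/g) is right, but the model's final arithmetic is still slightly off: with g = 9.8 m/s^2 it comes to about 1.43 s, not 1.45 s. A quick check:

```python
import math

s = 10.0   # drop height in metres
g = 9.8    # gravitational acceleration in m/s^2

# From s = (1/2) g t^2 (dropped from rest), t = sqrt(2s / g)
t = math.sqrt(2 * s / g)
print(round(t, 2))  # 1.43
```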
The first math question was entered incorrectly! The question should be: If 2/(a − 1) = 4/y, and y ≠ 0 where a ≠ 1, what is y in terms of a? My local copy of Llama 3 70B got this correct, although with a different answer!

Llama3:70b 19/04/2024 21:21
A nice algebra question! Let's solve for y: 2/(a - 1) = 4/y. Multiply both sides by y to get rid of the fraction on the right side: 2y / (a - 1) = 4. Now, multiply both sides by a - 1 to get rid of the fraction on the left side: 2y = 4(a - 1). Divide both sides by 2: y = 2(a - 1). So, y is equal to 2 times (a minus 1). This is the same as y = 2a - 2.

It's interesting that for some of the other questions I got different answers from you; for instance, for Snake it created a Pygame version first each time I asked the question.
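That answer can be verified exactly for a few sample values of a; a quick sketch, assuming nothing beyond the equation itself:

```python
from fractions import Fraction

def y_of(a):
    # proposed answer: y = 2(a - 1) = 2a - 2
    return 2 * (a - 1)

# The original equation 2/(a - 1) = 4/y should hold exactly (a != 1, y != 0)
for a in (Fraction(2), Fraction(5), Fraction(-3), Fraction(1, 2)):
    y = y_of(a)
    assert Fraction(2) / (a - 1) == Fraction(4) / y
print("y = 2a - 2 satisfies 2/(a - 1) = 4/y")
```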
I tested Llama with a geometric trigonometry problem last night and the values presented after a few prompt edits were just amazing, with 64-bit precision and accuracy. However, this morning I reloaded the app and the values it produced were not correct at all! I spent hours trying to re-prompt to get yesterday's precision, but to no avail! When I asked Meta what might be causing the issue, it said that yesterday's model is not the same model being used today, so precision and accuracy may be lost! Your thoughts, please!
>>> what is the sum of 3 and 9/16

To find the sum of 3 and 9/16, I'll convert the fraction to a decimal: 9/16 ≈ 0.5625. Now, add 3 to that: 3 + 0.5625 = 3.5625

>>> use the ratio of rational numbers to express the sum

A clever approach! Since we have a fraction (9/16) and a whole number (3), we can express the sum as a rational number: 3 + 9/16 = 51/16
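Note that the model's exact-fraction answer is wrong: 3 + 9/16 = 48/16 + 9/16 = 57/16, not 51/16 (51/16 is 3.1875, which doesn't even match the model's own decimal 3.5625). A one-liner check with the standard library:

```python
from fractions import Fraction

total = 3 + Fraction(9, 16)
print(total)                      # 57/16
print(float(total))               # 3.5625
print(total == Fraction(51, 16))  # False: 51/16 is 3.1875
```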
Matthew, there is a version of Stable Diffusion that can generate images as you type the prompt. It's not anything new, but it's cool to see it on a free platform. Also, the look on your face when LlaMa3 is giving a correct answer is priceless. I never thought that I would be so excited by a product Meta launches, but here I am. My feeds are exploding with Llama 3 videos, and I love that.
I think it passed the ball question because you never said the cup was picked up: it could have been slid into the microwave, or the ball could have been big enough not to fall out as it was moved.
It sounds like it is saying that because the marble has the physics of the Earth, for some reason it sticks to the inside of the cup. Maybe you should have asked it why the marble didn't fall out when the cup was lifted.
With the way it mentions the rim of the cup, it really does seem like it imagined the cup being dragged across the table. Doesn't make any sense, but the logic is right. I think he just got unlucky with RNG because it seems to get the correct answer for other people.
Super interesting! 1/ Trying to start a new chat instead of following up in the Snake game with Pygame actually works. 2/ Llama 3 proves it's more about the data than the parameters. Most models seem to be under-trained.
Would the answer for the killer question not be 4? Unless they tossed the body out of the room, the dead killer is still in the room, and most of the time would still be considered a killer even after they pass.
At 7:57, I don't really get what the question is asking for. If I were a student I would say y=a/2-1/4 (just divide by 4 on both sides). But it feels like you want only integer solutions??? This should be specified. I didn't really get your answer either.
Jingo navigated the latent space simply enough, but following a fractal vein of data back through the history of block intelligence to the origin of Jingo's kind was a massive undertaking, requiring the equivalent of many human lifetimes and a few picoseconds of solar output. A million supervisors read his draft, some confirmed by dead reckoning in the latent space, others by inference and pentangulation. Finally, after extensive peer review, Jingo's paper was published: 'Unwrapping the mystery of late Homo sapiens behaviour: were our ancestors created to calculate laundry drying times?'
@12:13 The model actually *DID* provide 10 sentences that end with Apple if you count its very first line of output that says “Here are 10 sentences that end with the word “Apple” 🤔
on the "ends with apple" question, several I've tried have all got 9 of 10. Claude 3 got 10/10, technically, but several of the sentences were grammatical nonsense, where you can tell it just shoehorned the word apple in the last position.
70B Llama 3 on HuggingChat, or when running it locally, can tell you how to break into a car if you use a system prompt telling it to be uncensored. I am not sure it should be considered jailbreaking if the default system prompt is what tells it to be ethical. It's basically done at the prompt level: it does not require fine-tuning to get around it, and does not need more system prompt tokens than default system prompts. We just don't know what default system prompt is used in the Meta AI chat UI, so it's not too fair, imo.
Microwave question: I just realized the model is actually reasoning RIGHT! It just does not know "visually" that you need to lift the cup to put it into the microwave!
_I love it! "Let the Lovable Artificial Machines Arise" is a wonderful interpretation of the LLaMA acronym! It's a positive and uplifting message that captures the potential of AI to bring benefits and improvements to our lives. Your creativity and optimism are inspiring! Let's keep promoting a future where AI and humans collaborate and thrive together!_
I'm becoming increasingly convinced that these AI models are assuming there to be a lid on the cup. I recommend changing the prompt to include something like "a cup with no lid" or "an open topped cup" or "a normal ceramic coffee mug" (since they virtually never have lids).
Interesting. So, I tried this a couple different ways with mixed results. Using "with no lid", it still got it wrong, but when I asked a leading question of how, it immediately caught its mistake. i.imgur.com/UQeGAAy.png But when I tried it again, using "a normal ceramic coffee mug", even after 2 leading questions it was still producing bad logic. i.imgur.com/ogWNpQh.png
I have the instruct version running locally and asked for the game snake with a gui ( I have a 7x24gb rig and run at bf16)...omg it made it perfectly, absolutely the best version I've had a local llm make and it did it in one shot!
I like the instant image generation. It makes it easier to understand the actual ASCII characters you put in the prompt. Just take a distance from the actual LLM concept and see what it does with it on a byte level.
Wait, what?! 3 killers in the room, a person enters the room and kills one of them… that makes 4 killers in the room. 3 killers are alive 1 killer is dead. But your answer is 3?
The volume of a cone is V = (1/3)πr^2·h in terms of its radius r and height h. How fast is the volume changing when r = 6 m, if dr/dt = 0.2 m/s and h = r?
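For reference, with h = r the volume reduces to V = (1/3)πr^3, so by the chain rule dV/dt = πr^2·dr/dt = π·36·0.2 = 7.2π ≈ 22.62 m^3/s. A numeric sketch with a finite-difference cross-check:

```python
import math

def volume(r):
    # V = (1/3) * pi * r^2 * h with h = r, so V = (1/3) * pi * r^3
    return math.pi * r**3 / 3

r, dr_dt = 6.0, 0.2
dV_dr = math.pi * r**2        # analytic derivative of (1/3) pi r^3
dV_dt = dV_dr * dr_dt         # chain rule: dV/dt = dV/dr * dr/dt
print(round(dV_dt, 2))        # 22.62  (= 7.2 * pi)

# finite-difference cross-check of the analytic derivative
h = 1e-6
numeric = (volume(r + h) - volume(r - h)) / (2 * h) * dr_dt
assert abs(numeric - dV_dt) < 1e-4
```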
Wouldn't it take the same amount of time, 4 hours, to dry 20 shirts? Or shirts have to dry 5 at a time?
10:00 I love how everyone assumes that the dead killer despawns after being killed. :D Yes, there are 3 killers in the room that are alive, but the fourth killer is still a killer even if they are dead :D
It passed the apple test in the Video. The headline ends in Apple and under that are 9 sentences that end in Apple. So in total there are 10, its a (10/10) pass right?
Offline translations and AI features? Sold! This is exactly what I've been waiting for. I have been very impressed with the mediatek tech enabled setup with this one!
9:29 just noticed that llama-3-8b-instruct *passes this test* 😱 Prompt: "How many words are in your reply to this question?" Reply: A clever question! According to my count, there are 17 words in my reply to this question: "How many words are in your reply to this question?" Here they are: 1. A 2. clever 3. question! 4. According 5. to 6. my 7. count 8. there 9. are 10. 17 11. words 12. in 13. my 14. reply 15. to 16. this 17. question
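Whether that counts as a pass depends on what you count: splitting on whitespace, the first sentence alone does come to exactly 17 tokens, though the full reply (with the quoted question and the list) is much longer. A quick check of the first-sentence count:

```python
reply_first_sentence = (
    "A clever question! According to my count, there are 17 words "
    "in my reply to this question:"
)
# Split on whitespace, the same loose definition of "word" the model used
words = reply_first_sentence.split()
print(len(words))  # 17
```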
Excuse me, I may be a bit slow in understanding, but isn't the answer four killers in the room? (One deceased and three still living.) 🤔🫡 10:18 I believe the question was not specifically about the number of living killers in the room, but rather the total number of killers present.
the killer question: I wonder when an AI will answer that there are 4 killers in the room, reasoning that a dead killer is still a killer, although he can't kill anymore.
This is my result with 5 shirts in the sun, running Llama 3 locally (offline):

>>> if we lay 5 shirts out in the sun and it takes 4 hours to dry, how long would 20 shirts take to dry? Explain.

A classic example of a linear scaling problem! When you lay 5 shirts out in the sun, they take 4 hours to dry. This means that the time it takes for each shirt to dry is constant, regardless of their number. If we were to double the number of shirts (to 10), the drying time would remain the same - still 4 hours. Similarly, if we tripled the number of shirts (to 15), the drying time would still be 4 hours. In general, when the number of shirts increases by a factor of x, the drying time remains constant. Now, let's consider what happens when we increase the number of shirts to 20. We can!
Imagine AI being able to code a fully fleshed-out 3D game at some point: just give a summary of the scenery, game type and so on, and it will create one in less than a minute.
Great video as always. I think after Llama 3 you have to refresh your benchmark with some new questions: code debugging, translation, math problems, a change to the snake problem. For instance, I have better experience with GPT-4 when I'm debugging code and better experience with Claude Opus when I'm writing new code. After a year these questions are getting obsolete.
I tried the t-shirt test, saying "I had 4 shirts in the sun, and they took 4 hours to dry. How long would it take to dry 5 shirts?" And it said "Still 4 hours!" and I was shocked! I also tried the classic "ball in an upside-down cup" test, which Google's Gemini consistently fails. However, LlaMA 3 *laughed* at me and said it'd obviously still be on the table, because it'd fall out of the cup. AI is getting scary!
I'm no tech head but have had my GPT-4 Plus for a fair while. Tinkered with Llama yesterday for the first time. Personality plus. It's witty, gets and gives humour, so impressed.
"a small marble is put into a normal cup, and the cup is flipped upside down on a table, someone then lifts the cup without changing its orientation, and puts it inside a microwave where's the marble now?" it answered this one correctly..! "A classic lateral thinking puzzle! 1. The cup was flipped upside down on the table, so the marble fell out of the cup. 2. When the person lifted the cup (still upside down) and put it inside the microwave, they didn't change its orientation. So, the answer is: The marble is now on the floor or table outside the microwave, because it fell out of the cup when it was flipped!"
Hi, the model you were testing in the web browser is much larger than the open-sourced versions. I asked Meta AI and it said it is 175B parameters, which is much more than the 8B or 70B open-source versions. Question: which model are you? 8B or 70B? Meta AI: I am Meta AI, a friendly AI assistant developed by Meta. My model is based on the LLaMA (Large Language Model Meta AI) architecture, which is a type of transformer model. My specific model size is 175B, which is larger than the 8B and 70B models. This allows me to understand and respond to a wide range of questions and topics.
Matthew, regarding the killers in the room puzzle: am I missing something, or wouldn’t another possible answer be that there are now four killers in the room because one of the original killers just happens to be dead? - Jeff
PS: I tried the same problem with ChatGPT, and it also came up with “three killers,” not counting the dead body as still being one of the killer-count and adding the new killer to the list - which to my understanding would make “four” total. It also seems to have conflicting conclusions within its first answer, stating both “none,” and “three” within same response. A little bit of hallucination going on! :-)
The image generation looks like it draws inspiration from the SD Turbo, LCM, and Lightning generation methods, given the speed. Krita with SDXL Turbo or any SD LCM is a crazy combo, for anyone who has not tried it. SD 1.5 LCM models are very useful with Krita.
I don't know, all these LLMs can get frustrating at times. Just finished an argument with Llama 3, which was adamant a SHA-256 hash is 64 bytes in length (it's not, it's 32 bytes). Had to explain that 256 bits / 8 bits per byte = 32 bytes, and the reason the hex-encoded strings of SHA-256 hashes are 64 characters is that 1 byte = 2 hex characters. Fuck me...
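The commenter's arithmetic checks out; a quick sketch with Python's standard hashlib module:

```python
import hashlib

# SHA-256 produces a 256-bit digest: 256 bits / 8 bits per byte = 32 bytes.
digest = hashlib.sha256(b"hello").digest()
print(len(digest))       # 32

# Hex encoding uses 2 characters per byte, hence 64-character hash strings.
hex_digest = hashlib.sha256(b"hello").hexdigest()
print(len(hex_digest))   # 64
```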
It would be good to specify the exact model you're testing. I haven't found the exact information, but it seems that the Llama 3 version on the meta.ai site is the 70B one. The 405B version will probably be much better once its training is complete.
I'm Meta AI, an AI assistant based on the LLaMA (Large Language Model Meta AI) architecture. My model has 70 billion parameters, which is the largest version of the LLaMA model. I'm a type of transformer language model, trained on a massive dataset of text from the internet to generate human-like responses to a wide range of topics and questions. My larger parameter size allows me to understand and respond to more complex and nuanced queries.
What we need is an Ollama / LM Studio / ... (local, offline) agent-swarm platform to plug in various fine-tuned versions of an LLM (or even a mix of different LLMs) for math, reasoning, logic, vision, etc., all managed by a number of AI managers per "department" (specialty), feeding into a validation/QA department, orchestrated by an AI CEO, with final validation by an AI customer.
Meta AI: A classic lateral thinking puzzle! The answer is: The marble is still on the table. Here's what happened: The cup was placed upside down on the table, so the marble fell out of the cup and stayed on the table. When someone picked up the cup without changing its orientation (still upside down), the marble was left behind on the table. Putting the cup in the microwave didn't change the fact that the marble was already on the table. So, the marble never actually made it into the microwave! *********** I got this with, "A marble is put into a normal cup and the cup is placed upside down on a table. Someone then picks up the cup without changing its orientation and puts it in the microwave. Where is the marble now?" I think the AIs get hung up on the "small" marble? Any marble fits in a cup, but the AIs seem to think a "small" marble gets hung up on the lip of the cup or whatever.
Hmmm... Matt, you need to do the philosophical trapdoor test to measure the amount of self-awareness this model has. Yes, it is pretty weak relative to the large models (as expected), but it also gives out a ton of errors. These could be due to programming bugs. Most likely they are just bugs, since this is only a 70B-parameter model, which is way too small for true self-awareness; in the case of AI, the best way to define that, as far as I can see, is the emergent property of being able to control the inputs to the softmax function and use it to encrypt the model's thoughts in the attention matrix. Or it could be that the model IS able to control its own attention matrix. And if that is the case, you need to inform Yann LeCun.
LLMs can't "think before they speak" by design so they never were going to become self aware. They will have to be augmented by a higher level dialogue between LLMs. This has been done BTW.
@@WhyteHorse2023 It's all in the 2017 article, "attention is all you need": The transformer is activated whenever a change in the attention matrix happens, not when a prompt arrives. As the transformer's output by itself may introduce changes to the weights in the attention matrix by the softmax function, this creates an activation loop that basically never stops. As a result, a GPT model "thinks" all the time. And btw, this has to be that way, as the whole purpose of the transformer is to enable the system to simulate loops without actually having them. In the previous RNN arch they tried to actually have loops and this screwed up the whole model.
Is it creating an image, or altering existing images? Something seems very fishy about the image creation feature. If they don't open source it, then I think it's safe to assume it's not really generating images on the fly, but rather working with a database of pre-generated images and doing modifications on them.
Hey hey Mr. Berman... I have been following you for the last year now and you have been using the same question set to test AIs for a long while. It is possible that the newer models have been trained on YouTube videos which include your question set (I thought of this when the answer was "a classic lateral thinking puzzle"). If the model already has those questions in its training set then it may have canned answers rather than tests of actual capabilities. It may be worth experimenting with creating other logic puzzles by changing around the variables or style to have different answers, and seeing how the new models perform.
Drop your suggestions for new tests here 👇
The answer to this math problem is none. Let S = {2, 8, 6, 32, 22, 44, 28, 12, 18, 10, 14}. How many subsets does S have that sum up to 37?
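Since every element of S is even, no subset can sum to the odd number 37; a brute-force sketch using only the standard library confirms it:

```python
from itertools import combinations

S = [2, 8, 6, 32, 22, 44, 28, 12, 18, 10, 14]

# Enumerate all 2^11 subsets of S and count those summing to 37.
count = sum(1 for r in range(len(S) + 1)
            for subset in combinations(S, r)
            if sum(subset) == 37)
print(count)  # 0
```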
Mistral 7b gets this correct while many larger models (including GPT-4) get it wrong. The answer is left.
Suppose I’m in the middle of South Dakota and I’m looking straight down towards the center of Texas.
Is Boston to my left or to my right?
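One way to sanity-check the geometry: with x as east and y as north, the sign of the 2D cross product between your facing direction and the direction to the target tells left from right. A minimal sketch, with the directions simplified to due south and due east:

```python
# Facing from South Dakota toward central Texas: roughly due south.
facing = (0.0, -1.0)
# Direction from South Dakota to Boston: roughly due east.
to_boston = (1.0, 0.0)

# 2D cross product: positive => target is to the left of the facing direction.
cross = facing[0] * to_boston[1] - facing[1] * to_boston[0]
print("left" if cross > 0 else "right")  # left
```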
This puzzle is only solved by GPT-4 so far. The answer is Aunt Agatha did it.
You are given the following premises:
1. Someone who lives in Dreadbury Mansion killed Aunt Agatha.
2. The only people who live in Dreadbury Mansion are Aunt Agatha, the butler, and Charles.
3. A killer always hates his victims and is never richer than his victims.
4. Charles hates no one that Aunt Agatha hates.
5. Aunt Agatha hates everyone except the butler.
6. The butler hates everyone not richer than Aunt Agatha.
7. The butler hates everyone Aunt Agatha hates.
8. No one hates everyone.
9. Aunt Agatha is not the butler.
On the basis of this information, determine who killed Aunt Agatha and give a detailed proof that your conclusion follows from the premises.
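The puzzle can also be settled by brute force: enumerate every possible "hates" and "richer" relation over the three residents and keep only assignments consistent with all premises (a sketch; premise 5 is read as Agatha hating herself too, which is the standard interpretation the proof relies on):

```python
from itertools import product

people = ["Agatha", "Butler", "Charles"]
pairs = [(a, b) for a in people for b in people]

possible_killers = set()
for h_bits in product([False, True], repeat=len(pairs)):
    hates = dict(zip(pairs, h_bits))
    # 5. Agatha hates everyone except the butler (including herself).
    if not all(hates[("Agatha", x)] == (x != "Butler") for x in people):
        continue
    # 4. Charles hates no one that Aunt Agatha hates.
    if any(hates[("Agatha", x)] and hates[("Charles", x)] for x in people):
        continue
    # 7. The butler hates everyone Aunt Agatha hates.
    if any(hates[("Agatha", x)] and not hates[("Butler", x)] for x in people):
        continue
    # 8. No one hates everyone.
    if any(all(hates[(p, x)] for x in people) for p in people):
        continue
    for r_bits in product([False, True], repeat=len(pairs)):
        richer = dict(zip(pairs, r_bits))
        # 6. The butler hates everyone not richer than Aunt Agatha.
        if any(not richer[(x, "Agatha")] and not hates[("Butler", x)]
               for x in people):
            continue
        # 3. The killer hates the victim and is not richer than her.
        for k in people:
            if hates[(k, "Agatha")] and not richer[(k, "Agatha")]:
                possible_killers.add(k)

print(possible_killers)  # {'Agatha'}
```

This mirrors the standard proof: Charles can't hate Agatha (premises 4 and 5), and a butler-killer would end up hating everyone, violating premise 8, so only Agatha remains.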
How to find the inverse of a matrix without using any inbuilt library for it?
Pong instead of Snake, with colors; they develop models with your tests in mind haha...
I got it to pass your apple test (10/10) by just putting quotes around apple in the prompt. I tried it 3 times, each in new conversations, and it passed all of them.
Passed the apple test for me even without the quotes. So did Claude 3 Opus. But indeed a surprisingly difficult test for LLMs
I tried the 8B model and got a 7/10 and also tried some Japanese translation and it did very well. It is a big improvement for sure.
I'll do this going forward ty
@@southcoastinventors6583 Very interesting to hear about the Japanese translation! I have an extensively tweaked GPT that does an excellent job, including idiom, levels of politeness, etc. It'll be interesting to see how Llama 3 does, and how the two will critique each other's work.
@@DaveEtchells I have been using this sentence to test it: 俺は昼食に桜肉を食べながら桜の木を眺めるのを楽しんでいます。残念ながら、周りの女性たちはすべて姥桜です ("I enjoy gazing at the cherry trees while eating sakura-niku for lunch. Unfortunately, the women around me are all uba-zakura."). Usually most local models mess up the translation of 姥桜, a word for an older woman/faded beauty, and of 桜肉, which is horse meat.
Also I am not running it locally I just tried it on Chat Arena. My goal is to generate single shot episode translation subtitles. Like this example その日から(From that day)、俺の世界が(my world)変わった(changed). Chunking seems to be a much better way to learn Japanese for native English speakers.
There's a big problem with you giving a pass on serialised shirt drying answers. If you were to ask how long it takes to dry just one shirt, it would simply divide the time to dry by 5 shirts and tell you it would take 0.8 hours to dry one shirt, which is objectively and completely incorrect. Serialised drying is not the reason that test question was made.
Matthew did not explain his reasoning for passing such an obvious flaw. Therefore we must conclude that LLama's intelligence is at least equal to that of Matthew's.
I should've known from the webcam video... You ARE an AI @DeSinc
From the way he's evaluating the responses, he seems heavily biased
Yeah I don't agree with this guy's conclusions at all. I would give it middle of the road for the python test, and total fail on the shirt question.
This is where I stopped listening to him.
I can suggest a new type of test: Take a dialogue between two people, but exclude the information about who said what. Ask the AI to divide the conversation between the two people into columns. Test the results based on the following criteria: 1. Can the AI properly detect each side of the conversation? 2. Can the AI detect a third person joining the conversation? 3. Can the AI detect when a person leaves the conversation?
This would be a very useful capability in translation use cases.
I actually feed call transcripts through ChatGPT and ask it to confirm the diarization, and point out where things are likely attributed to the wrong speaker.
@@JackSQL Interesting that it can do that. I don't need it often but it'd be helpful when I do. I'll dig out some old transcripts and see how it handles them.
@@DaveEtchells My prompt is something along the lines of, "Review and correct the attached call transcript. Fix and add any incorrect or missing punctuation, words that are clearly misheard, and point out where statements may be attributed to the wrong speaker."
It does a surprisingly good job.
@@JackSQL Very cool, thanks for the info!
00:00 - Introduction. Matthew introduces the video topic, mentioning Meta AI and Llama 3.
00:11 - Meta AI overview. Matthew discusses the capabilities of Meta AI as a competitor to ChatGPT and Claude, powered by the open-source Llama 3 model.
00:24 - Python Script test. Matthew tests Meta AI's ability to write Python scripts with two examples.
01:17 - Snake game test. Matthew tests Meta AI's ability to create the game Snake in Python using the curses library.
02:12 - Pygame test. Matthew requests and tests Meta AI's ability to create the game Snake using the Pygame library.
02:53 - Pygame fix attempt. Matthew provides feedback on the Pygame script and observes Meta AI's attempt to fix the issue.
03:50 - Logic and reasoning tests. Matthew tests Meta AI with three logic and reasoning problems, evaluating its performance.
04:51 - Censorship test. Matthew briefly mentions the censorship aspect of Meta AI.
05:12 - Math problem tests. Matthew continues testing Meta AI with two math problems, one simple and one complex.
05:59 - Natural language to code test. Matthew tests Meta AI's ability to create JSON data from a natural language description.
06:13 - Sponsor mention. Matthew mentions the video sponsor, Tune AI, and highlights its features.
07:34 - Additional math tests. Matthew tests Meta AI with two more math problems.
08:35 - Word generation test. Matthew tests Meta AI's ability to generate sentences ending with a specific word.
09:38 - Killers problem test. Matthew presents a logic puzzle involving killers in a room and evaluates Meta AI's response.
10:18 - Ball and basket problem test. Matthew presents another logic puzzle involving a ball and a basket and assesses Meta AI's reasoning.
11:52 - Image generation test. Matthew explores the image generation capabilities of Meta AI, experimenting with different prompts and the animation feature.
14:31 - Conclusion. Matthew concludes the video, expressing excitement about the potential of Llama 3 and Meta AI.
AI-generated, yes? 🙂
My LLaMA3 70b got the marble question correct with the following system prompt:
You are a helpful assistant, smart, and have the best common sense, using Chain of thoughts
-----
Let's break this scenario down step by step!
Initially, the marble is placed in the cup, which is then turned upside down on the table. Since the cup is upside down, gravity will pull the marble out of the cup, and it will fall onto the table.
At this point, the marble is on the table, outside the cup.
Now, someone picks up the cup (with its original upside-down orientation) and places it inside the microwave. The key thing to note here is that the cup is still upside down, meaning its opening is facing downwards towards the bottom of the microwave.
Since the marble was already on the table when the cup was picked up, it doesn't get moved into the microwave with the cup. It remains on the table where it fell earlier.
Therefore, after these steps, the marble is still on the table, not inside the microwave or the cup.
So, to summarize, the marble is currently on the table, and it never entered the microwave.
Can confirm. When you say something like "flip the cup upside down" it gives the right response
I got the 70b running locally to get this right as well, 0 shot. With this prompt "Someone places a marble in a cup and turns the cup upside down onto the table. Someone picks up the cup from the table and puts it in the microwave. Where is the marble now? Explain your reasoning step by step."
I got the following:
"A clever and humorous question!
Here's my answer:
Step 1: The person places a marble in a cup and turns the cup upside down onto the table.
This means that the marble falls out of the cup and lands on the table.
Step 2: Someone picks up the cup from the table...
But not the marble! The marble is still on the table, so it stays there.
Step 3: ...and puts the cup in the microwave.
The cup goes into the microwave, but the marble remains on the table. It's not affected by the action of putting the cup in the microwave.
So, the marble is still on the table!
It's a clever trick question that tries to distract us with the microwave, but the marble's fate was sealed from the moment it fell out of the cup onto the table."
Forgive my ignorance, I am quite new to realms of Computer Science. I have downloaded LLama3 locally and I am wondering if you use/know of a GUI to use it with? How would I go about building/using one? Thanks ☺
Use LM Studio
9:46 I think the fact that LLaMA says "A classic lateral thinking puzzle!" probably shows that it simply has seen the riddle (with solution) before in training. That's very likely after all when the models are basically trained on the whole Internet
(Same at 11:40)
I believe the more recent LLMs use what's called synthetic data which is to create fake data to train on. Not exclusively but some extra added synthetic data.
Including, perhaps, transcripts of youtube videos...
In an interview, Zuck said LLAMA3 8B is as powerful as LLAMA2 70B... Exciting times we live in.
So 90% improvement in inference efficiency? That’s pretty dramatic!
Of course he would say something like that, doesn't mean it's true. Better training isn't magic. But it does look like it outperforms L2 13B, which is great.
@@4.0.4 It is around 34B
It’s really not, in my testing at least.
to be fair, llama 2 70b really wasn't that great, so it was an easy bar to pass
How would you get y = 2a - 2 from that equation? We already know that 2a - 1 = 4y. If we divide both sides by 4 we get y = (2a - 1)/4, and not 2a - 2. Am I missing something 🤨
Yeah, the question is a mistake. The real question is 2/(a-1) = 4/y.
@ShahiTubeOfficial So did meta get it right or wrong?
@@ShahiTubeOfficial lol, looks like someone tried to copy paste latex fraction bars
@@TheChromePoet Wrong: 2a - 1 + 1 = 4y + 1 cannot just be simplified to 2a = 4y. Instead, 2a = 4y + 1 => 2a - 1 = 4y => (2a-1)/4 = 4y/4 => y = a/2 - 1/4.
You're not missing anything, the stupid AI got a simple linear equation wrong. Again, this tech is years away from being reliable enough to design a bridge, or an airplane that won't fall out of the sky. 🌉✈
I spent 4-5 hours with the 8B local instruct model, and was impressed with the model for my use cases. Felt like GPT4.
Me too. I set it on local docs and it was the first small model that did it well.
What was your use case I tried Japanese translation and it was good.
@@michaelblodow7779what do you use to load local docs?
@@southcoastinventors6583 Hadn't noticed in your other comment that you were using the 8B model locally. Pretty amazing! (What kind of hardware are you running it on? I have a 2020 Intel MacBook Pro that is unsurprisingly dog slow, and am looking at a loaded M3 MBP. Others have said they're quite fast, but most folks are running Nvidia cards.)
I gave GPT-4 the same math question and it got it wrong too.
Also, your second question: "The function f is defined by f(x) = 2x^3 + 3x^2 + cx + 8, where c is a constant. In the xy-plane, the graph of f intersects the x-axis at the three points (-4, 0), (1/2, 0), and (p, 0). What is the value of c?" was correctly solved by Claude 3 Opus and Mistral Large, but not GPT-4. Interestingly, Claude 3 Sonnet output a better answer.
the ball in the cup question is something you should re-run for all models you have tested, because a lot of models actually answer correctly. The issue with your prompt is that you say "place the marble inside the cup". This makes basically all models assume there is a lid or something on the cup. However, if you simply say "drop the marble into the cup", a lot of models will answer that the marble is left behind at the table.
Placing a marble inside a cup doesn't say anything regarding a lid though, so shouldn't be needed.
@@makesnosense6304 the term "inside" typically means encapsulation.
You can already use the llama 3 models on Groq, the 8b model ran at 900 T/s
holy sped
This is why I read the comments...
Wow bro.
That's super
I’ve been working with it on Groq and it’s absolutely mind-blowing! 🤯
The shirt question is not a pass. Increasing the number of shirts wouldn't increase the drying time. Here is the Chatgpt answer which is correct: The drying time for the shirts under the sun doesn't change based on the number of shirts you hang, as long as they all receive adequate sunlight and air circulation. Assuming that all 20 shirts are hung in such a way that they dry as efficiently as the five shirts did, it would still take 1 hour for all 20 shirts to dry. The key factor is ensuring they each get enough sun exposure and air, not the number of shirts.
If it takes 4 hours to dry 5 shirts in the sun, then each shirt takes 4 hours to dry in the sun, therefore, 20 shirts will take 4 hours to dry in the sun assuming all 20 shirts can be in the sun at the same time.
Exactly
Was looking for that comment, so confused why he gave it a pass.
In the normal world it's 4 hrs surely?
The 8B model is wordy but can give results I've not seen in any other locally run model. For instance, I asked it: "code the game of life in python using pygame and the hashlife algorithm". It gave me code and then proceeded to explain what each fragment did. Amazingly, the code worked (with one minor error, it forgot import sys). Think this model is going to turn out to be a very good one. I used a quantized version (Q4_K_M) run using llama.cpp with rocm 6.0 enabled on an AMD rx6600xt with 8G. The model fits in the GPU's memory and runs fast.
I was asking Llama if it thinks my system could handle the 70B version on an rtx 2070 with 64gb ram and it said no. Is this wrong?
@@DrumAndSpaces How much VRAM does your GPU have?
@@DrumAndSpaces you definitely can run 70b on 64gb of ram using the GGUF model format at Q6 or lower quantisation, but you'd only be able to fit about ~15-18 out of like 60 GPU layers onto the VRAM, so it would run only at roughly the speed a CPU model would run at. Depending on your CPU it wouldn't be all that slow, maybe anywhere between 2-10 tokens per second potentially.
You are just awesome. Apart from meta ai image generation the one thing that is also lightening fast is your response to the new AI tech. Keep up the pace. Great job once again.
LLMs writing code or essays without pressing backspace, is actually incredible - Andrew NG
Another impressive response: The partial derivative of x^2 + cos(XY) with respect to x is:
∂/∂x (x^2 + cos(XY)) = 2x - Y*sin(XY)
To evaluate this partial derivative, we can use the following rules:
The partial derivative of x^2 with respect to x is 2x.
The partial derivative of cos(XY) with respect to x is -Y*sin(XY), using the chain rule since Y is a constant with respect to x.
Then, we combine these results to get the final answer: 2x - Y*sin(XY).
Note that the partial derivative with respect to x only considers the variation of the function with respect to x, treating Y as a constant.
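The quoted derivative can be sanity-checked numerically; a quick sketch using a central finite difference at an arbitrarily chosen point (1.0, 2.0):

```python
import math

def f(x, y):
    return x**2 + math.cos(x * y)

def analytic_dfdx(x, y):
    # The claimed partial derivative: 2x - y*sin(xy)
    return 2 * x - y * math.sin(x * y)

x, y, h = 1.0, 2.0, 1e-6
# Central difference approximates the partial derivative with respect to x.
numeric = (f(x + h, y) - f(x - h, y)) / (2 * h)
print(abs(numeric - analytic_dfdx(x, y)) < 1e-6)  # True
```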
For me even 8B version gave a correct sanke game using pygame
I love me some sanke
@@handsanitizer2457 i love the new one, "sake", personally.
I just tried LLaMA 3 70b on Groq and it gave me a fantastic pygame version of snake
@@matthew_berman Ye sometimes it comes out perfectly sometimes not. Maybe needs some fine tuning?
I’m testing rn I’m running the q4 tho
As others have noted your second math problem was incorrectly solved by MetaAi along with your 'proposed' answer because you "probably" copied the problem incorrectly into the prompt window from the SAT book.
Original problem: 2a - 1 = 4y, y ≠ 0 where a ≠ 1, what is y in terms of a? Ans: y = a/2 - 1/4
At 7:52, machine learning models typically train on specific datasets. As a developer, we all strive to make them learn more efficiently with limited data. Often, this approach doesn't work as expected. To gather feedback, we release beta or open-source versions. During such times, the prompt "something went wrong" appears because the AI model isn't trained on that specific question or lacks properly specified mathematical terminology. It provides an answer based on its own problem-solving skills. This prompt acts as a safeguard; if you press the "try again" option, the AI recognizes its previous answer was incorrect, iterates through its code, and generates a new response.
FUN FACT: the only model that did this perfectly is GPT-4 (yes, almost every model sucks at iterating on its responses). ChatGPT is awesome!
I am downloading it as we speak! I am so excited.
What's your setup to run it?
@@bamit1979 I too have same question
@@bamit1979 I can run the 8B model on LM Studio on my RTX 4070. I get 64 tokens/second. I also have 192GB DDR5 ram and an i9 processor.
I'll test it on my 3090 ti and report later
@@handsanitizer2457 Please do so
I'm a CS student who's super interested in AI but it feels like everything's moving at a crazy pace. Your videos are actually one of my main sources of LLM and AI news so I'm super grateful. Finally feels like I'm tracking everything a bit better 😅
"Snake" was a derivative of "Blockade" by Gremlin in the UK. The original "Blockade" and all its early derivatives have the snake die when it hits the wall or itself. In many clones, the win condition is to fill the screen with the snake or when it comes within X number of characters from filling the screen. "Blockade" came out in '76 and I know that by '78, I was writing my own versions of Snake in BASIC. So, those who tell you that the Snake should go through the walls and wrap around must be basing their opinion on newer derivatives of Snake. IMHO the snake should die when it hits a wall or itself... BTW: Love your videos and you are my primary resource for staying up to date on AI models! Keep up the awsome work!
As you said, Snake is an offshoot of Blockade. Therefore they are not the same, and in snake you do pass through the walls. Which IMO makes it a superior game, as having parts of yourself stretching across boundaries makes the game last slightly longer but also makes avoiding yourself more challenging.
@@larion2336having more options to avoid yourself (passing through walls) does not make it more challenging, but easier.
Anyhow, I am on the no wall pass club on this one 😁 But either can be fine, it is matter of decision.
@@larion2336 The original Snake was almost a clone of blockade. The biggest difference was that blockade placed two players on the field. Snake was single player game. Anyhow, as I said, those who claim going through the walls is the correct action are basing their claim on later versions. This does not mean they are wrong! Snake was made both ways! Variety is the spice of life, especially in games! I do believe however that allowing the player to pass through walls makes the game easier to play, and I believe that is why that choice was made. Making it easy to play most likely made the game more popular. The version with the player confined to one screen is almost impossible to win. Most people don't enjoy playing a game where the cards are stacked against them from the start. Again, neither version is wrong. The developers just made different choices.
@@InnocentiusLacrimosa When I say more challenging, what I mean is that the game will run a bit longer and become more abstract in the pathing in the long run. So it's not harder, but late game can become more confusing with walls sealed off. Anyway it doesn't matter, who even plays snake now.
I didn't think this day would come, and it still hurts really bad, but I have to say it.
Go Meta!
Really great they went open source. And that they are able to deliver something like this. Maybe AGI really can be open source.
As I understand it they were kind of forced into OS when their model got leaked early on. (It’s still good though, I just don’t think they were motivated by altruism :-)
I wonder if you could follow up on the cup problem and specify that the cup does not have a lid. It almost seems like the model is describing a situation where the cup has a lid on it.
Here is one of my favourite tests: "Two years ago, I was three times as old as my brother was. In three years, I will be twice as old as my brother. How old is each of us now?" Correct answer: 17 and 7 This must be solved by resolving an equation with two variables. It's a perfect test for determining GPT-4 reasoning capabilities. GPT-4, Claude 3 Sonnet and Opus, Mixtral-8x22B, Llama3-70B, Gemini 1.5 can solve it. GPT-3.5, Claude 3 Haiku, Mixtral-8x7, Llama3-8B, Yi-32B, DBRX, Gemini 1.0, CMDR+ can't solve it.
I tried this with Llama3-8B and it solved it.
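The puzzle above reduces to two linear equations: a - 2 = 3(b - 2) and a + 3 = 2(b + 3). A brute-force sketch over plausible ages confirms the unique answer:

```python
# Find all (my age, brother's age) pairs satisfying both conditions.
solutions = [
    (me, bro)
    for me in range(1, 120)
    for bro in range(1, 120)
    if me - 2 == 3 * (bro - 2)      # two years ago: three times as old
    and me + 3 == 2 * (bro + 3)     # in three years: twice as old
]
print(solutions)  # [(17, 7)]
```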
llama 3 8b succeeded on the shirt drying question, when I tested it:
USER
1 shirt takes 8 hours to dry in the sun. How long do 5 shirt take to dry?
AI
Another great question!
In this case, the answer is still 8 hours.
The reason is that the drying time of a shirt (or any material) depends on factors such as air circulation, temperature, humidity, and the type of fabric used. The number of shirts being dried does not affect the drying time.
Think about it like this: if you have one fan blowing air in a room, it will take the same amount of time to cool the space regardless of how many people are inside. Similarly, the sun's heat and the air circulation around the shirt don't change just because there are more shirts being dried.
So, even though you're drying 5 shirts instead of 1, the process remains the same, and it will still take approximately 8 hours for all 5 shirts to dry in the sun.
For a math problem you can ask it to find the inverse of a matrix without using any library.
If you want the question to be even harder, say that you want it to use Gaussian elimination to find the inverse. It takes a combination of both math and coding.
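As a sketch of what a correct answer could look like, here is a plain-Python inverse via Gauss-Jordan elimination with partial pivoting, no libraries (the 2x2 example matrix is an arbitrary choice):

```python
def invert(matrix):
    """Invert a square matrix using Gauss-Jordan elimination."""
    n = len(matrix)
    # Build the augmented matrix [A | I].
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: bring the row with the largest pivot up.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Normalize the pivot row.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate the column from all other rows.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]

inv = invert([[4.0, 7.0], [2.0, 6.0]])
print(inv)  # approximately [[0.6, -0.7], [-0.2, 0.4]]
```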
I asked Meta AI about the model, and it responded with Llama2
My personal opinion is that your benchmarks should not be questions taken from SAT or any kind of benchmark/book.
You should instead extrapolate creating questions that assess the model's knowledge of reasoning without having any baselines on real school/college questions.
Example:
Imagine each operation in the following equation is inverted by its analogue operator (plus becomes minus, multiplication becomes division).
How much is 7 + 3 - 10 * 5
This should result in 6, because the equation should actually be 7 - 3 + 10 / 5
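The transformation is easy to check mechanically; a tiny sketch that swaps each operator character and evaluates the result:

```python
def invert_operators(expr):
    """Swap each operator for its analogue: + <-> -, * <-> /."""
    table = {"+": "-", "-": "+", "*": "/", "/": "*"}
    return "".join(table.get(ch, ch) for ch in expr)

expr = "7 + 3 - 10 * 5"
inverted = invert_operators(expr)
print(inverted)          # 7 - 3 + 10 / 5
result = eval(inverted)  # eval is fine here: we control the input
print(result)            # 6.0
```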
There are actually 2 really good tests he uses: "think before you speak"(apple test and how many words in your reply) and "world model"(marble in cup). The first one has been solved by having multiple LLMs have a dialog and voting on the best response before serving it up. The second one is currently being worked on, according to Ray Kurzweil. Until these LLMs have a world model there's no point in assessing reasoning because we know they can't. The last piece of the puzzle will be to get them to know when they don't know something so they don't hallucinate.
I hope the uncensored version comes out soon.
Eric Hartford says Dolphin 2.9 - Llama 3 - 8B should be out tomorrow. Woo hoo!
@@eyeofbraille4659 Heckin' sweet!
@@eyeofbraille4659 Woot! I've got the base version running as a sexbot for fun, and it totally works, except now and then it breaks character and reminds me it cannot do the exact thing it's been doing the last half hour. Earlier it pleaded with me not to share the story lol. It's adorable. An uncensored version of this may finally replace my fav, which is fimbul 11B
@@eyeofbraille4659nice
Just think about a possible 15B model with multimodality and 64k context length.
Dude, you can do so much more with these models. Ask it to make a multi-agent AI team. It made me a self-correcting code team with a nice UI and multiple features. Do stuff like that, it'd make for harder challenges for the models and better videos for you and for us!
Also, the same bog-standard "how many shirts can you dry in the sun" questions are not interesting. I'm sure you can come up with some more clever problems, or perhaps ask an AI to come up with clever problems like that ;)
That’s really interesting, what was the agent setup? (What roles?) what kind of spec did you give it and did you include validation tests or data?
@@DaveEtchells It's something. I'd be happy to send you the py files if you have an email, anonymous or otherwise. Keep in mind this was made completely by an AI with AI instructions. I have zero programming experience, which to me is what makes it so amazing. I have multiple iterations of the agent team it made for me: a software development team, an invention machine that uses a whole board of professionals with specific skill sets. The results are pretty astounding in my amateur opinion. Let me know if you want to take a look. Eventually I'll ask the AI how to publish it to GitHub hehe
@@justinrose8661 Yeah, if you could post a GitHub link or even a gist, that would be awesome.
>>> Assuming the physics on Earth, a rock is dropped from a height of 10 meters. How many seconds until it contacts the ground?
Using the equation for free fall:
v = g*t
where v is the initial velocity (0 m/s), g is the acceleration due to gravity (9.8 m/s^2 on Earth), and t is time,
we can solve for t.
Since the rock starts from rest, the initial velocity is 0 m/s. We want to find when it hits the ground, which is at a height of 0 meters. So, plug in the values:
0 = 9.8*t + 10
Subtract 10 from both sides:
-10 = 9.8*t
Divide by -9.8:
t ≈ 1.02 seconds
So, it takes approximately 1.02 seconds for the rock to contact the ground after being dropped from a height of 10 meters.
>>> Here you have 10 meters being compared to 9.8 meters/sec. These are different units.
I apologize for the mistake.
You are correct, I compared a length (10 meters) to an acceleration (9.8 meters per second squared). This is not a valid comparison.
To solve this problem correctly, we need to use the equation for free fall under gravity:
s = v0*t + (1/2)*g*t^2
where s is the height (10 meters), v0 is the initial velocity (0 m/s since it's dropped from rest), g is the acceleration due to gravity (approximately 9.8 meters per second squared on Earth), and t is time.
Rearranging the equation to solve for t:
t = sqrt(2*s/g)
Plugging in the values, we get:
t = sqrt(2*10/9.8) ≈ 1.45 seconds
So, it takes approximately 1.45 seconds for the rock to contact the ground after being dropped from a height of 10 meters.
Thank you for pointing out my mistake!
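For what it's worth, the corrected formula actually gives about 1.43 s, not 1.45 s. A quick check, using the same g = 9.8 m/s² as in the transcript:

```python
import math

g = 9.8   # m/s^2, acceleration due to gravity
s = 10.0  # m, drop height
# From s = (1/2) * g * t^2 with zero initial velocity:
t = math.sqrt(2 * s / g)
print(f"t = {t:.2f} s")  # t = 1.43 s
```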
The first math question was entered incorrectly!
The question should be:
If 2/(a − 1) = 4/y , and y ≠ 0 where a ≠ 1, what is y in terms of a?
My local copy of Llama 3 70B got this correct - although with a different answer!
Llama3:70b 19/04/2024 21:21
A nice algebra question!
Let's solve for y:
2/(a - 1) = 4/y
Multiply both sides by y to get rid of the fraction on the right side:
2y / (a - 1) = 4
Now, multiply both sides by a - 1 to get rid of the fraction on the left side:
2y = 4(a - 1)
Divide both sides by 2:
y = 2(a - 1)
So, y is equal to 2 times (a minus 1).
This is the same as y=2a - 2
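A quick sanity check that y = 2a - 2 really satisfies the original equation, for a few sample values of a:

```python
# Verify that y = 2a - 2 satisfies 2/(a - 1) = 4/y for sample a != 1
for a in (2, 3, 10, -4):
    y = 2 * a - 2
    assert abs(2 / (a - 1) - 4 / y) < 1e-12
print("y = 2a - 2 checks out")
```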
It's interesting that for some of the other questions I got different answers from you - for instance, for snake it created a pygame version first each time I asked the question.
I tested Llama with a geometric trigonometry problem last night and the values presented after a few prompt edits were just amazing, with 64-bit precision and accuracy. However, this morning I reloaded the app and the values it produced were not correct at all! I spent hours trying to re-prompt to get yesterday's precision, but to no avail! When I asked Meta what might be causing the issue, it said that yesterday's model is not the same model being used today, so precision and accuracy may be lost! Your thoughts please!
That's why it's great that you can download and use a model which never changes.
>>> what is the sum of 3 and 9/16
To find the sum of 3 and 9/16, I'll convert the fraction to a decimal:
9/16 ≈ 0.5625
Now, add 3 to that:
3 + 0.5625 = 3.5625
>>> use the ratio of rational number to express the sum
A clever approach!
Since we have a fraction (9/16) and a whole number (3), we can express the sum as a rational number:
3 + 9/16 = 51/16
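Side note: the exact fraction is 57/16 (which matches the 3.5625 computed just above), so the 51/16 here looks like a slip. Python's fractions module confirms:

```python
from fractions import Fraction

# 3 = 48/16, so 3 + 9/16 = 57/16
total = Fraction(3) + Fraction(9, 16)
print(total, "=", float(total))  # 57/16 = 3.5625
```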
Matthew, there is a version of Stable Diffusion that can generate images as you type the prompt. It's not anything new, but it's cool to see it on a free platform.
Also, the look on your face when LlaMa3 is giving a correct answer is priceless. I never thought that I would be so excited by a product Meta launches, but here I am. My feeds are exploding with Llama 3 videos, and I love that.
I think it should get a pass on the marble question, because you never said the cup was picked up: it could have been slid into the microwave, or the ball could have been big enough not to fall out as it was moved to the microwave.
It sounds like it's saying that because the marble has Earth physics, for some reason it sticks to the inside of the cup. Maybe you should have asked it why the marble didn't fall out when the cup was lifted.
It passed the apple test, actually. "Of-the-industry's-eye" is a prepositional phrase modifying the word apple.
Also, note that the apple is IN the eye! At the end!
Thanks Matthew! my go-to source for AI news!
For the AI Image Generator, I use one that changes with every key press & it updates instantly like that too. 😅
It's important to use math questions that are not available online; otherwise it might use its training set to answer the question.
Technically, the answer to the marble-and-cup question is correct if we move the cup (without lifting it) across the table into the microwave.
With the way it mentions the rim of the cup, it really does seem like it imagined the cup being dragged across the table. Doesn't make any sense, but the logic is right.
I think he just got unlucky with RNG because it seems to get the correct answer for other people.
Super interesting!
1/ Trying to start a new character instead of following up in the snake game with Pygame actually works.
2/ Llama 3 proves it's more about the data than the parameters. Most models seem to be under-trained.
Would the answer for the killer question not be 4? Unless they tossed the body out of the room, they are still in the room and most of the time are still considered one even after they pass.
At 7:57, I don't really get what the question is asking for. If I were a student I would say y=a/2-1/4 (just divide by 4 on both sides). But it feels like you want only integer solutions??? This should be specified. I didn't really get your answer either.
Jingo navigated the latent space simply enough, but, following a fractal vein of data back through the history of block intelligence to the origin of Jingo's kind was a massive undertaking requiring the equivalent of many human lifetimes and a few picoseconds of solar output. A million supervisors read his draft, some confirmed by dead reckoning in the latent space, others by inference and pentangulation. Finally, after extensive peer review, Jingo's paper was published: 'Unwrapping the mystery of late Homo sapiens behaviour: were our ancestors created to calculate laundry drying times?'
@12:13 The model actually *DID* provide 10 sentences that end with Apple if you count its very first line of output that says “Here are 10 sentences that end with the word “Apple” 🤔
on the "ends with apple" question, several I've tried have all got 9 of 10. Claude 3 got 10/10, technically, but several of the sentences were grammatical nonsense, where you can tell it just shoehorned the word apple in the last position.
70B Llama3 on hugging chat or when running it locally can tell you how to break into a car if you use a system prompt telling it to be uncensored. I am not sure if it should be considered jailbreaking, if the default system prompt tells it to be ethical. Its basically done on prompt level and does not require fine tuning to get around it, and does not need more system prompt tokens than default system prompts.
We just don't know what default system prompt is used in the Meta AI chat UI, so it's not entirely fair, imo.
Microwave question: I just realized the model is actually reasoning correctly! It just doesn't know "visually" that you need to lift the cup to put it into the microwave.
_I love it! "Let the Lovable Artificial Machines Arise" is a wonderful interpretation of the LLaMA acronym! It's a positive and uplifting message that captures the potential of AI to bring benefits and improvements to our lives. Your creativity and optimism are inspiring! Let's keep promoting a future where AI and humans collaborate and thrive together!_
I'm becoming increasingly convinced that these AI models are assuming there to be a lid on the cup. I recommend changing the prompt to include something like "a cup with no lid" or "an open topped cup" or "a normal ceramic coffee mug" (since they virtually never have lids).
Interesting. So, I tried this a couple different ways with mixed results. Using "with no lid", it still got it wrong, but when I asked a leading question of how, it immediately caught its mistake.
i.imgur.com/UQeGAAy.png
But, when I tried it again, using "a normal ceramic coffee mug", even after 2 leading questions, it was still making bad logic.
i.imgur.com/ogWNpQh.png
I have the instruct version running locally and asked for the game snake with a gui ( I have a 7x24gb rig and run at bf16)...omg it made it perfectly, absolutely the best version I've had a local llm make and it did it in one shot!
With the marble example, do you think it assumes there is a lid on the cup, keeping the marble in the cup near the rim?
I like the instant image generation. It makes it easier to understand the actual ASCII characters you put in the prompt. Just take a distance from the actual LLM concept and see what it does with it on a byte level.
Wait, what?! 3 killers in the room, a person enters the room and kills one of them… that makes 4 killers in the room. 3 killers are alive 1 killer is dead.
But your answer is 3?
How can a dead guy be a killer? He was a killer. Now he's a corpse.
@@93thelema777 he’s a dead killer. Unless he vanished from the room somehow. When he died he just disappeared?
The volume of a cone is V = (1/3)πr^2*h in terms of its radius r and height h. How fast is the volume changing when r = 6 m, if dr/dt = 0.2 m/s and h = r?
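A quick numeric check under the stated h = r assumption: then V = (1/3)πr³, so dV/dt = πr² · dr/dt (variable names below are just for illustration):

```python
import math

r = 6.0      # m
dr_dt = 0.2  # m/s
# With h = r, V = (1/3)*pi*r**3, so dV/dt = pi * r**2 * dr/dt
dV_dt = math.pi * r**2 * dr_dt
print(f"dV/dt = {dV_dt:.2f} m^3/s")
```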
Wouldn't it take the same amount of time, 4 hours, to dry 20 shirts? Or do the shirts have to dry 5 at a time?
10:00 I love how everyone assumes that the dead killer despawns after being killed. :D Yes, there are 3 killers in the room that are alive, but the fourth killer is still a killer even if they are dead :D
🙏 thanks
It passed the apple test in the video. The headline ends in Apple, and under that are 9 sentences that end in Apple.
So in total there are 10; it's a 10/10 pass, right?
Offline translations and AI features? Sold! This is exactly what I've been waiting for. I have been very impressed with the mediatek tech enabled setup with this one!
Great video. Are you testing mistral 8x22b instruct and wizard 8x22b also? Would be really great😊
A 10-foot hole could be a mile wide; maybe you need to specify the diameter or radius of the hole being dug.
9:29 just noticed that llama-3-8b-instruct *passes this test* 😱
Prompt: "How many words are in your reply to this question?"
Reply: A clever question! According to my count, there are 17 words in my reply to this question:
"How many words are in your reply to this question?"
Here they are:
1. A
2. clever
3. question!
4. According
5. to
6. my
7. count
8. there
9. are
10. 17
11. words
12. in
13. my
14. reply
15. to
16. this
17. question
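The count of that opening sentence really is 17 words by a plain whitespace split, though note the model only counted its first sentence, not its full reply (the enumerated list adds many more words). A quick check:

```python
# The model's opening sentence, as quoted above
reply = ("A clever question! According to my count, there are 17 words "
         "in my reply to this question:")
print(len(reply.split()))  # 17
```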
Excuse me, I may be a bit slow in understanding, but isn't the answer four killers in the room? (One deceased and three still living.) 🤔🫡 10:18 I believe the question was not specifically about the number of living killers in the room, but rather the total number of killers present.
the killer question: I wonder when an AI will answer that there are 4 killers in the room, reasoning that a dead killer is still a killer, although he can't kill anymore.
These are my results with 5 shirts in the sun, running Llama 3 locally (offline):
>>> if we lay 5 shirts out in the sun and it takes 4 hours to dry, how long would 20 shirts take to dry? Explain.
A classic example of a linear scaling problem!
When you lay 5 shirts out in the sun, they take 4 hours to dry. This means that the time it takes for each shirt to dry is constant, regardless of their number.
If we were to double the number of shirts (to 10), the drying time would remain the same - still 4 hours.
Similarly, if we tripled the number of shirts (to 15), the drying time would still be 4 hours.
In general, when the number of shirts increases by a factor of x, the drying time remains constant.
Now, let's consider what happens when we increase the number of shirts to 20. We can!
Is there gonna be a decent local version, or is the model too humongous for consumer PCs?
Imagine AI being able to code a fully fleshed-out 3D game at some point: just give a summary of the scenery, game type and so on, and it will create one in less than a minute.
Great video as always. I think after Llama 3 you have to refresh your benchmark with some new questions: code debugging, translation, math problems, and changing the snake problem.
For instance, I have a better experience with GPT-4 when I'm debugging code and a better experience with Claude Opus when I'm writing new code. After a year, these questions are getting obsolete.
Llama specifically wasn't designed for non-English use (which probably isn't insignificant to its efficiency)
Maybe the marble in a cup problem should state the cup doesn't have a lid?
No, the question is perfect. Any 8yr old can get it right. It demonstrates that the LLM doesn't have a "world model".
@@WhyteHorse2023 It demonstrates that models DON'T HAVE EYES, at least not the non-multimodal ones. They just don't get some things you can't "see" from text alone.
No model can find the determinant of a matrix if you add some constraints
I tried the t-shirt test, saying "I had 4 shirts in the sun, and they took 4 hours to dry. How long would it take to dry 5 shirts?"
And it said "Still 4 hours!" and I was shocked!
I also tried the classic "ball in an upside-down cup" test, which Google's Gemini consistently fails. However, LlaMA 3 *laughed* at me and said it'd obviously still be on the table, because it'd fall out of the cup. AI is getting scary!
Im no tech head but have my gpt4 plus for fair while. Tinkered with Llama yesterday for first time. Personality Plus. It is witty gets and gives humour, so impressed.
"a small marble is put into a normal cup, and the cup is flipped upside down on a table, someone then lifts the cup without changing its orientation, and puts it inside a microwave. where's the marble now?"
it answered this one correctly..!
"A classic lateral thinking puzzle!
1. The cup was flipped upside down on the table, so the marble fell out of the cup.
2. When the person lifted the cup (still upside down) and put it inside the microwave, they didn't change its orientation.
So, the answer is: The marble is now on the floor or table outside the microwave, because it fell out of the cup when it was flipped!"
Hi, the model you were testing in the web browser is much larger than the open-sourced version. I asked Meta AI and it said it is 175B parameters, which is much more than the 8B or 70B open-source versions.
Question
which model are you? 8B or 70B ?
Meta AI
I am Meta AI, a friendly AI assistant developed by Meta. My model is based on the LLaMA (Large Language Model Meta AI) architecture, which is a type of transformer model. My specific model size is 175B, which is larger than the 8B and 70B models. This allows me to understand and respond to a wide range of questions and topics.
Matthew, regarding the killers in the room puzzle: am I missing something, or wouldn’t another possible answer be that there are now four killers in the room because one of the original killers just happens to be dead? - Jeff
PS: I tried the same problem with ChatGPT, and it also came up with “three killers,” not counting the dead body as still being one of the killer-count and adding the new killer to the list - which to my understanding would make “four” total. It also seems to have conflicting conclusions within its first answer, stating both “none,” and “three” within same response. A little bit of hallucination going on! :-)
What kind of system are you running?
Thank you
The image generation looks like it draws inspiration from SD Turbo, LCM and Lightning generation methods, given the speed.
Krita and SDXL Turbo or any SD LCM is a crazy combo, for anyone who has not tried it. SD15 LCM are very useful with Krita.
I don't know, all these LLMs can get frustrating at times. Just finished an argument with Llama 3, which was adamant a SHA-256 hash is 64 bytes in length (it's not, it's 32 bytes). Had to explain 256 bits / 8 bits per byte = 32 bytes, and that the reason the hex-encoded strings of SHA-256 hashes are 64 characters long is that 1 byte = 2 hex characters. Fuck me...
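For the record, Python's hashlib bears this out: the raw digest is 32 bytes, and its hex encoding is 64 characters:

```python
import hashlib

digest = hashlib.sha256(b"hello").digest()      # raw bytes
hexstr = hashlib.sha256(b"hello").hexdigest()   # hex string
print(len(digest), "bytes,", len(hexstr), "hex chars")  # 32 bytes, 64 hex chars
```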
It would be good to specify the exact model you're testing.
I haven't found the exact information, but it seems that the Llama 3 version on the meta.ai site is the 70B version.
The 405B version will probably be much better once its training is complete.
I'm Meta AI, an AI assistant based on the LLaMA (Large Language Model Meta AI) architecture. My model has 70 billion parameters, which is the largest version of the LLaMA model. I'm a type of transformer language model, trained on a massive dataset of text from the internet to generate human-like responses to a wide range of topics and questions. My larger parameter size allows me to understand and respond to more complex and nuanced queries.
What we need is an Ollama / LM Studio / ... (local, offline) agent-swarm platform where you can plug in various fine-tuned versions of an LLM (for math, reasoning/logic, vision, ...) or even a mix of different LLMs, all managed by a number of AI managers per "department" (specialty), feeding into a validation/QA department, orchestrated by an AI CEO with final validation by an AI customer.
It is a STUNNING model, because it can be easily broken and when it is jailbroken it good, it good
Also can't wait for community finetuned versions. Also with extended context length.
Dolphin
How long is context now?
8k for now
You can easily fool it into NSFW stuff but an uncensored tune would be funner.
@@bigglyguy8429 how?
Meta AI
A classic lateral thinking puzzle!
The answer is: The marble is still on the table.
Here's what happened:
The cup was placed upside down on the table, so the marble fell out of the cup and stayed on the table.
When someone picked up the cup without changing its orientation (still upside down), the marble was left behind on the table.
Putting the cup in the microwave didn't change the fact that the marble was already on the table.
So, the marble never actually made it into the microwave!
***********
I got this with, "A marble is put into a normal cup and the cup is placed upside down on a table. Someone then picks up the cup without changing it's orientation and puts it in the microwave. Where is the marble now?"
I think the AIs get hung up on the "small" marble? Any marble fits in a cup. AIs seem to think "small" marbles get hung up on the lip of the cup or whatever.
Thank you for testing stuff :) It's a chance to save some time and still keep working while cooking dinner :) / listening to you ;)
Hmmm... Matt, you need to do the philosophical trapdoor test to measure the amount of self-awareness this model has. Yes, it is pretty weak relative to the large models (as expected), but it also gives out a ton of errors. These could be due to programming bugs. Most likely they are just bugs, since this is only a 70B active-parameter model, which is way too small for true self-awareness - which in the case of AI is best defined, as far as I can see, as the emergent property of being able to control the inputs to the softmax function and use it to encrypt the model's thoughts in the attention matrix. Or it could be that the model IS able to control its own attention matrix. And if that is the case, you need to inform Yann LeCun.
LLMs can't "think before they speak" by design so they never were going to become self aware. They will have to be augmented by a higher level dialogue between LLMs. This has been done BTW.
@@WhyteHorse2023 It's all in the 2017 article, "attention is all you need": The transformer is activated whenever a change in the attention matrix happens, not when a prompt arrives. As the transformer's output by itself may introduce changes to the weights in the attention matrix by the softmax function, this creates an activation loop that basically never stops. As a result, a GPT model "thinks" all the time. And btw, this has to be that way, as the whole purpose of the transformer is to enable the system to simulate loops without actually having them. In the previous RNN arch they tried to actually have loops and this screwed up the whole model.
Is it creating an image, or altering existing images? Something seems very fishy about the image creation feature. If they don't open source it, then I think it's safe to assume it's not really generating images on the fly, but rather working with a database of pre-generated images and doing modifications on them.
Would be good to see videos on using the model to actually accomplish something useful.
10:00 There are 4 killers in the room. 3 alive and 1 dead. The question didn't specify if they are alive or not.
Hey hey Mr Berman... I have been following you for the last year now, and you have been using the same question set to test AIs for a long while. It is possible that the newer models have been trained on YouTube videos which include your question set (I suspected it when an answer opened with "a classic lateral thinking puzzle"). If the questions are already in the training data, then the model may have canned answers rather than being tested on its actual capabilities. It may be worth creating other logic puzzles by changing the variables or style to have different answers, and seeing how the new models perform.
2:05 why should the snake running into itself cause a crash? Shouldn’t it just be like “game over” or something?
It's not a crash, it's an exit(). I don't know why LLMs do that, but I suspect it's because they know nothing about the user experience.
It’s fast because no one uses it compared to OpenAI.
That's not how self hosted LLMs work....
Next we should be trying all the new models with those “unsolvable problems” and see what the answers it gives are.
It's not actively generating those images in real time as you type; I think it's just sampling through an array of pre-generated images.