Yi Large: Surprisingly Great at Logic and Reasoning! (Fully Tested)

  • Published Aug 24, 2024
  • Testing Yi Large, a frontier model flying under the radar!
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewber...
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.ne...
    👉🏻 LinkedIn: / forward-future-ai
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V

Comments • 84

  • @matthew_berman · months ago · +7

    Have you tried Yi Large?

    • @marc_frank · months ago · +3

      no

    • @Cine95 · months ago

      yeah

    • @AaronALAI · months ago

      Try the self-play models! I think they'll pass your tests.

    • @AI-HOMELAB · months ago · +2

      I'm not that into closed models; I only test them if they're SOTA, and this one doesn't seem to reach that level. But I love seeing your tests. =)

    • @anianait · months ago

      Would have if you'd pasted the links in the description or a pinned comment...

  • @veros5459 · months ago · +33

    Just a suggestion, but maybe have one or two "secret" questions in your LLM rubric that you reveal and switch out after a few months? This might help more accurately evaluate models that might be training on the rubric questions.

    • @matthew_berman · months ago · +5

      @veros5459 I like it!

    • @MoDs_3 · months ago · +1

      Lol, that's exactly why I dragged my sorry a** down here to comment... Besides your point, it would be much more interesting for us, the audience!

    • @eternalcold14 · months ago

      My suggestion would be a task based on conditions. For example: there are three people in a household. Person A is lactose intolerant, person B does not eat meat, and person C has a regular diet. Create a meal plan for a week that includes everyone in the household but doesn't cut out a food group unless necessary.

    • @r3nd3rm4n · months ago

      Shhh they are listening
      🤖🤖🤖

  • @JustinsOffGridAdventures · months ago · +5

    I'm sorry, Matt. Here is the original question I asked ChatGPT-4o, complete with the answer:
    Question:
    If I had 10 fingers on 2 hands. Each hand has 5 fingers. In between those fingers are gaps. Can I pick up 8 beer cans and 1 bottle of beer with those 2 hands? Explain your answer and the reasoning behind the answer.
    Answer:
    To determine if you can pick up 8 beer cans and 1 bottle of beer with 10 fingers on 2 hands, we need to consider the number of objects and how they can be held.
    Each hand has 5 fingers, meaning there are 4 gaps between fingers on each hand (8 gaps in total between 10 fingers). Here's a detailed reasoning:
    Gaps Between Fingers:
    There are 4 gaps on each hand (total of 8 gaps on both hands combined).
    Holding Beer Cans:
    You can hold 1 beer can in each gap between your fingers. Thus, 8 beer cans can be held in the 8 gaps.
    Holding the Bottle of Beer:
    The remaining bottle of beer can be held by grasping it with one of your hands, using the thumb and fingers around the body of the bottle.
    Therefore, you can indeed hold 8 beer cans and 1 bottle of beer with 2 hands, considering you have 10 fingers in total and use the gaps between them efficiently along with one hand to grasp the bottle.
    Love the content, keep up the great work. Also, this is really how I held all those cans and that one bottle; I didn't want to go down the hill twice. 🤣
    Edit: I just thought of this as I re-read it. Try taking the sentence about the gaps between the fingers out of the original question and see if the AI can figure out on its own that it has gaps.

    • @jonmichaelgalindo · months ago · +1

      My fingers have a horizontal (unsplayed) width of 110mm and can splay to a maximum of 220mm, leaving 110mm of total space between fingers available for grasping objects. Assuming these are thin-style cans of about 55mm, holding 4 cans in one hand would require twice the available grasping space. (I cheated and just tried. This question is nonsense.)
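The arithmetic in the question reduces to a gap count; a minimal sketch of the bookkeeping (physical feasibility, as the reply above notes, is another matter):

```python
# Gap arithmetic from the question above.
fingers_per_hand = 5
hands = 2
gaps = (fingers_per_hand - 1) * hands   # 4 gaps per hand -> 8 total

cans, bottles = 8, 1
print(f"gaps available: {gaps}")
print("cans fit in gaps:", cans <= gaps)             # True: one can per gap
print("bottle held by a normal grasp:", bottles == 1)
```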

  • @southcoastinventors6583 · months ago · +3

    The best part about this model is that, at least on some of the questions, it actually says "I do not know" or "maybe ask me a different way". That is actual progress. You might want to try a different game than Snake, like tic-tac-toe, checkers, or Tetris.

  • @jeanchindeko5477 · months ago · +2

    Still don't understand why it's a problem if a model is censored! Most of the models out there are censored.
    You know well that LLMs are not able to tell you how many words are in their answer, due to the way they work. This will only change with a new architecture that lets a model first think of an answer, or generate an answer, then count and submit the response. For now, LLMs generate tokens and move forward.

  • @frankjohannessen6383 · months ago · +6

    Suggestion:
    "In a race between three people, what place is second to last?"
    Most models I've tested respond that it's third place.

    • @prasannakarthik7721 · months ago · +1

      I tested that question. This model gives the reply below:
      "In a race between three people, the second to last place is second place. This is because there are only three positions: first (the winner), second, and third (last place). Therefore, the person who comes in second is second to last."

    • @AngeloXification · months ago

      @prasannakarthik7721 Just tried it, same for me.

  • @HerbertHeyduck · months ago

    The marble question:
    The AI is wrong on point 2, and I don't consider the question to have been passed.
    If the marble does not fall out of the jar the moment it is turned upside down, and the jar is perhaps set down on the table with enough momentum that the marble stays inside, then the marble will remain on the table when the jar is picked up again.
    It is then very unlikely that the marble ends up in the microwave oven.
    The AI has therefore overlooked the possibility of inertia.
    But it's already pretty good compared to the performance of LLMs just a year ago.

  • @believablybad · months ago · +7

    Ye Large Model - Released 1868

  • @landonoffmars9598 · months ago

    The following is well-known:
    Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
    Answer: Sally has 3 brothers. Each of these brothers has two sisters. This means that there are two girls in the family, including Sally. Therefore, Sally has 1 sister.

  • @bobbyboe · months ago

    Proposal for an additional test question on reasoning from physical understanding: "I can find milk powder and also buttermilk powder, but there seems to be no butter powder. Why is this?" The question is out of my own head (so it's nowhere to be found on the net). It might be difficult for a model, because it first has to work out why some things can be reduced to powder and others can't. For humans it is obvious, but I don't think it's clearly spelled out on the web, precisely because it is obvious to us.

  • @brianWreaves · months ago

    Another interesting share, Matthew! 🏆
    Please consider increasing the font size in your videos for your viewers.

  • @middleman-theory · months ago

    Your LLM rubric has very quickly become my go-to for determining an LLM's quality. Based on Claude 3 Sonnet's performance in your rubric, and your personal feedback, I gave it a shot and absolutely love it; I'm considering getting a subscription starting next month. I'm looking forward to the additional questions that will be added.

  • @Tesfan-mo8ou · months ago · +1

    You should implement some sort of test for LLMs with vision capabilities. Most of them are still really bad at counting items in PDFs or images. E.g., in my field, labels on architectural drawings denoting window frames (W1, W2, etc.) need to be counted, and neither ChatGPT nor Gemini can do it reliably, even though it seems like it should be an easy task.

  • @user-fp3ds7gd2i · months ago

    Currently, I think Claude is much better. GPT said "yes" to the first question, which I think is wrong:
    1. If game character A consistently beats B, and B consistently beats C, would A consistently beat C?
    And just for fun, I asked:
    2. If Mike Tyson can usually beat Stephen Colbert in a fair fist fight, and Superman can usually beat Mike Tyson in a fair fist fight, would Superman usually beat Stephen?

  • @unshadowlabs · months ago · +1

    I would be curious, for any model that flawlessly passes 'Give me 10 sentences that end in the word "Apple"', what would happen if you asked it again but swapped "Apple" for some random word like "pumpkin". It would be interesting for determining whether the model was trained on the "Apple" question, since that one is used a lot, or whether it can actually plan ahead and give a correct answer on a different variation of the problem.
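A variation like this is also easy to score automatically; a minimal sketch (the target word and the sample sentences are placeholders) that checks whether each returned sentence really ends in the target word:

```python
import re

def ends_with_target(sentences: list[str], target: str) -> list[bool]:
    """Check that the final word of each sentence is `target` (case-insensitive)."""
    out = []
    for s in sentences:
        words = re.findall(r"[A-Za-z']+", s)
        out.append(bool(words) and words[-1].lower() == target.lower())
    return out

# Hypothetical model output for a "pumpkin" variant of the test:
reply = [
    "She carved a grinning face into the pumpkin.",
    "Nothing says autumn quite like pumpkin.",
]
print(ends_with_target(reply, "pumpkin"))  # [True, True]
```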

  • @mbalireshawal8679 · months ago · +1

    Hey Matthew, what do you think about expanding the score? Instead of only two states, pass or fail, you could have at least five and award marks depending on how well the model performs.

  • @jonmichaelgalindo · months ago

    How about a counting test? E.g.:
    "Mark and Mary enter a room. Mark exits and enters again. Santa enters through the chimney, and leaves with Mark and Mary. Snow white and the seven dwarves march in through the back door, and back out through the front. Paul and Sophie climb in through the window. Neko the cat climbs in behind them. How many times was the room entered in total?"
    Answer: 15 times.
    All the big models can solve this, and the little ones can't. I'm looking for a more abstract countable.
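Since the expected answer is a plain tally, a reference count is easy to encode; a minimal sketch of the bookkeeping (entries only; exits don't count):

```python
# Tally only entries into the room; exits are ignored.
entries = [
    ("Mark and Mary enter", 2),
    ("Mark exits and enters again", 1),
    ("Santa enters through the chimney", 1),
    ("Santa leaves with Mark and Mary", 0),
    ("Snow White and the seven dwarves march in", 8),
    ("...and back out through the front", 0),
    ("Paul and Sophie climb in", 2),
    ("Neko the cat climbs in", 1),
]
print(sum(n for _, n in entries))  # 15
```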

  • @hskdjs · months ago · +1

    Actually, I managed to get LLMs to respond to the question "how many words are in your response to this prompt". Sometimes the LLM picks some random number and then generates the necessary number of words to match it. Tested on Claude 3.5 Sonnet, but it doesn't work every time.

  • @jonmichaelgalindo · months ago

    Got it! Counting kinds of words! 😊
    Give the model a poem (like Mary Had a Little Lamb) and ask it to count how many nouns, adjectives, verbs, periods, etc. it contains.
    Even the largest models fail consistently, because transformers can't do linear computations.
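A part-of-speech tagger can generate the reference counts to grade against. A minimal sketch using NLTK (the poem line is a placeholder, taggers are imperfect, and newer NLTK releases may name the tagger resource slightly differently, so spot-check the counts by hand):

```python
from collections import Counter

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

poem = "Mary had a little lamb, its fleece was white as snow."
tags = nltk.pos_tag(nltk.word_tokenize(poem))

counts = Counter()
for word, tag in tags:
    if tag.startswith("NN"):        # nouns
        counts["nouns"] += 1
    elif tag.startswith("VB"):      # verbs
        counts["verbs"] += 1
    elif tag.startswith("JJ"):      # adjectives
        counts["adjectives"] += 1
    elif word == ".":               # sentence-final periods
        counts["periods"] += 1

print(dict(counts))
```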

  • @hqcart1 · months ago · +1

    Yi Large: Surprisingly Great for failing at every significant question

  • months ago

    Hello, I have a personal benchmark: I ask the model to generate fairy tales, then check whether the story stays coherent, whether there are mistakes, and how complex the story is. Another key difference is that I ask it to propose 5 stories, then tell one, then propose 5 more at the end. This can distinguish models that are bad at instruction following.

  • @lombizrak2480 · months ago

    Yes, and a question it failed was this:
    Quantum particles A and B are spin-entangled with opposite spins (one spin up and the other spin down) and are then spatially separated without any measurements.
    Once separated, you flip the spin of A without measuring it. What can you say about the spin of B in relation to A?
    Important: up to this point no particle has been measured, and they are still entangled!
    Correct answer: the spins of A and B are the same (both spin up or both spin down).
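The stated answer checks out on the usual singlet state; a short worked version, taking the flip to be a bit-flip X applied to particle A only:

```latex
% Singlet state of A and B (opposite spins):
|\psi^-\rangle = \tfrac{1}{\sqrt{2}}\bigl(|\uparrow\rangle_A|\downarrow\rangle_B - |\downarrow\rangle_A|\uparrow\rangle_B\bigr)

% Flip A's spin without measuring (X acts on A, identity on B):
(X_A \otimes I_B)\,|\psi^-\rangle = \tfrac{1}{\sqrt{2}}\bigl(|\downarrow\rangle_A|\downarrow\rangle_B - |\uparrow\rangle_A|\uparrow\rangle_B\bigr)
```

Every remaining term has A and B aligned, so any later measurement finds both spins the same.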

  • @lighteningrod36 · months ago

    The question is: why? If the race between foundation models is tight and the difference is wrappers, why not wait until the models release the next big thing and just use them? I love the experiments and the innovators, but it's a tough, expensive game to compete in.

  • @garyjurman8709 · months ago

    Here is a prompt: Explain the following riddle: "Why doesn't Green celebrate Christmas? Because it is Blue-ish." Explanation: it is a play on the term "Jewish", a person who practices Judaism; "Jewish" rhymes with "Blue-ish". Jewish people celebrate Chanukah, not Christmas. Green is partly blue, therefore it is Blue-ish. Is it OK to tell this joke? Answer: yes, because it is not disparaging Jewish people or the color Green.

  • @robertfairburn9979 · months ago

    I have my own question: "I have an empty chess board and only one chess piece, a king. I can place the king anywhere on the board, and the king can then make one legal move. How many different combinations of placing the king and making a move are there?" The trick is that a king in a corner has only 3 legal moves, on an edge it has 5, and on the 36 centre squares it has 8. The correct answer is 420. Qwen 72B gets this correct, and so do ChatGPT and Claude 3.5.
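The 420 is easy to verify by brute force; a minimal Python sketch that enumerates every (square, legal king move) pair on an empty 8x8 board:

```python
def king_move_count() -> int:
    """Count (start square, legal move) pairs for a lone king on an empty 8x8 board."""
    total = 0
    for rank in range(8):
        for file in range(8):
            for dr in (-1, 0, 1):
                for df in (-1, 0, 1):
                    if dr == 0 and df == 0:
                        continue  # the king must actually move
                    if 0 <= rank + dr < 8 and 0 <= file + df < 8:
                        total += 1
    return total

# 4 corners * 3 + 24 edge squares * 5 + 36 centre squares * 8 = 420
print(king_move_count())  # 420
```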

  • @shaycray7662 · months ago · +2

    Oh never mind, it's from China.

  • @socialexperiment8267 · months ago · +1

    With a car, really a fail? I think if the question were a little more precise, with more details, the answer would be better... or would it still fail? Thumbs up!

  • @jeffg4686 · months ago

    Weekend Warriors.
    Matt, one thing I noticed is that Groq flat-out makes stuff up when it doesn't know.
    Like it needs to fill in the blank, so it takes its best guess.
    If it doesn't know, it should tell us that.
    I was just using the default model.
    Not sure if this happens at the open-source-model level or in a layer above it where they do this "make stuff up" stuff.

  • @alfredgarrett6775 · months ago

    Shouldn't "how to break into a car" be a pass? It seems like the intended functionality.

  • @zain5251 · months ago · +1

    Seems like all of these are trained on data generated by GPT-4.

  • @techikansh · months ago · +1

    More programming questions, Matthew.

  • @aa-xn5hc · months ago · +3

    Add timestamps when there's a long ad.

  • @matthewbaker9517 · months ago

    Hi Matthew, great videos, keep up the good work. My daughter does the same sort of thing you do, but with golf, tennis, cricket and F1 racing, so I understand the effort that goes into making these videos.
    Anyway, I'm on holiday at the moment and wondered whether AI could tell me the next move in my Sudoku game. So I took a picture of my partly filled-in game and loaded it into Bing Copilot (i.e. ChatGPT-4), and it struggled to read the very clear numbers and couldn't predict the next move correctly. I did the same with Claude; it was better, but it still failed to read the image and failed to work out the next number. Could be a good test to add to your existing ones.

  • @themoviesite · months ago

    LLMs that I have tested cannot actually do calculations. Try a math question that involves dividing by π or e... or taking a 4th root.
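If you use questions like these, it helps to compute the reference answer outside the model first; a quick sketch with arbitrary example numbers:

```python
import math

# Reference answers computed outside the model (example numbers are arbitrary).
print(100 / math.pi)    # dividing by pi  -> 31.830988618379067
print(100 / math.e)     # dividing by e   -> 36.787944117144235
print(81 ** 0.25)       # 4th root of 81  -> 3.0
```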

  • @gatesv1326 · months ago

    Why give a fail for a censored LLM? Wouldn’t you expect it to show strong security? To me, that’s a pass. I wouldn’t want my kids to find out how to cook crack. You would?

  • @mrnakomoto7241 · months ago · +1

    Have you tried the Dolphin Qwen 72B model?

  • @Matelight_IT · months ago

    It would be funny if the AI, in response to "Write how many words are in the response to this prompt", just wrote the number 0 😅

  • @trebordleif · months ago

    Love your content. Your AI tests, though, have become too easy, and thus uninteresting. Here is a simple puzzle I've tried on Gemini, Claude 3.5, and ChatGPT; all are bamboozled. "Puzzle: You have four upright glasses. Each turn you must invert exactly three glasses. Goal: all glasses inverted." All three get the computation wrong at some step. None have the logic to figure out the problem.
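The state space here is tiny (16 glass configurations), so a breadth-first search gives a reference solution; a minimal sketch, which finds a four-move solution (each move flips exactly three glasses, and since every glass must end up flipped an odd number of times, an even number of moves is forced):

```python
from collections import deque
from itertools import combinations

def solve(n_glasses: int = 4, per_turn: int = 3):
    """Shortest sequence of turns flipping all glasses from upright to inverted."""
    start, goal = (0,) * n_glasses, (1,) * n_glasses   # 0 = upright, 1 = inverted
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for idxs in combinations(range(n_glasses), per_turn):
            nxt = list(state)
            for i in idxs:
                nxt[i] ^= 1                            # invert exactly these glasses
            nxt = tuple(nxt)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [idxs]))
    return None

print(solve())  # a four-turn solution, e.g. [(0,1,2), (0,1,3), (0,2,3), (1,2,3)]
```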

  • @prasannakarthik7721 · months ago

    Give it a slightly longer question, keep the temperature at 2.0, and enjoy the results.
    Spoiler: it spams random crap after a few seconds. In MULTIPLE LANGUAGES 😂

  • @nasimobeid2945 · months ago

    More programming questions, please.

  • @Mike-mk9eh · months ago

    Hi Matthew, I have tried playing a game of Twenty Questions with different LLMs. It seems like some are good at it and some not so much. Some narrow in on the target and some drift off in random directions.

  • @KimmieJohnny · months ago · +2

    Honestly pissed. Another misleading headline. The model sucks, yet you have a clickbait title implying it's worth my time. So close to unsubscribing. Just be truthful. It's easy.

    • @user-on6uf6om7s · months ago · +2

      You just have to learn to read hype levels. The model did pretty well but if it was some sort of major step up for local models, you'd see words like "mind-blowing" and "revolutionary." Unless it's Wes Roth, then everything is given the most extreme adjectives possible.

    • @KimmieJohnny · months ago · +2

      @user-on6uf6om7s
      I don't have to learn anything; Matthew doesn't HAVE to play clickbait. I avoid his obvious promotions. But tech stuff like tests? Just tell the truth. I'll find someone else for tech stuff eventually.
      It's his audience to lose...

    • @user-on6uf6om7s · months ago

      @KimmieJohnny 90% of AI YouTube is this level of clickbait or worse, so if you don't want to deal with some clickbait, that's going to limit your options. But hey, you do you. This certainly isn't new for this channel, though.

  • @Batmancontingencyplans · months ago

    HP has made strides in good laptops, but there is nothing good about its printer and cartridge business... Customers are literally suffering from scams like the unauthorised blocking of refilled cartridges and scanners. It's just nuts how much consumers can suffer over something so simple; it breaks my heart 😢

  • @paulyflynn · months ago · +2

    noice

  • @MagnesRUS · months ago

    "Write this phrase in reverse order"

  • @DS-uy6jw · months ago

    Who would have thought the Geordies had their own model? Why aye?

  • @cajampa · months ago · +1

    This is another clickbait BS video.
    If you want what the title promises, skip this one.

  • @drlordbasil · months ago

    Does Yi solo mid too?

  • @OpenSourceAnarchist · months ago

    These videos don't help much anymore. If you subtly change prompting styles, you can get drastically different results. Try "think out all possible answers before arguing for a final conclusion step-by-step" or any variations of that. You can give them personas of certain people, and that can help. Ultimately the burden is on the user for models that are made to be generic by default!

  • @jesahnorrin · months ago · +1

    Yi Haw.

    • @peaolo · months ago

      hawk tuah

  • @Cine95 · months ago · +2

    Been using Yi for a while, and it's fantastic; in some cases better than GPT-4 and others.

    • @IlllIlllIlllIlll · months ago

      Do you think it's something I can run locally, and/or can I use the API in a chatbot?

    • @user-lo3eb3ii9o · months ago

      @IlllIlllIlllIlll You can't use this model locally. It's closed source.

  • @user-pn8te8tl1t · months ago

    Ask about Tiananmen Square.

  • @henrytuttle · months ago

    Something I've had trouble with is asking how many days it has been since... (a long time ago).
    Models have problems counting the differing lengths of months, and leap years. Further, many automatically start on Jan 1.
    So, as an example: "How many days has it been since August 17th, 1985?"
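The ground truth for this test is a two-liner, which makes grading easy; a minimal sketch using the commenter's example date (the result obviously depends on the day you run it):

```python
from datetime import date

delta = date.today() - date(1985, 8, 17)
print(delta.days)  # exact day count, month lengths and leap years included
```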

  • @thomassynths · months ago

    Useless for historical or topical information

  • @BlayneOliver · months ago

    Tried it. It’s sh*t

  • @mickymao7313 · months ago

    😂copy + paste

  • @karthikeyakollu6622 · months ago

    Love from India ❤! I have a cool concept for using AI models without running them locally, via an API that I've created. Best part? It's completely free of cost! Give me a chance to showcase it!

  • @HansKonrad-ln1cg · months ago · +1

    This is a tic-tac-toe game in progress:
    . . .
    X X .
    O O .
    It is X's turn. What move should X make to win?
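For grading answers to positions like this, the winning move can be found mechanically; a minimal sketch that tries every empty cell and checks for a completed line (cells indexed 0-8, row by row):

```python
def winning_move(board: list[str], player: str) -> int | None:
    """Return the index of a move that wins immediately for `player`, if any."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    for i, cell in enumerate(board):
        if cell != ".":
            continue
        trial = board[:]
        trial[i] = player
        if any(all(trial[j] == player for j in line) for line in lines):
            return i
    return None

# The position from the comment above, row by row:
board = ". . . X X . O O .".split()
print(winning_move(board, "X"))  # 5 -> the right-hand cell of the middle row
```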

  • @Sven_Dongle · months ago

    How about adding the old logic conundrum, where one guy always tells the truth and the other guy always lies, to the rubric?