bro don't you think it's time to bump the test up a bit? Maybe make it more relevant to real life: have the model understand existing code and solve/build something related to that code.
If it doesn't beat r1 then I'm not interested
When an open-source LLM beats another, it is progress and good. When a closed-source one beats open source, I'm on my toes to see us take the lead again. TEAM OPEN SOURCE!!!!! Thanks King.
Since when is 38 the right answer for number 5? Plus, I would consider it cheating when a model uses code execution to get the answer, since the model doesn't calculate the answer itself and relies on the code execution. And even on a simple calculation like number 5, it STILL gets the wrong answer.
38 or 40 is a correct answer depending on how you calculate it. It doesn't do code execution; it does reasoning in code format and predicts the output instead of running it (kind of a dry run).
38 is incorrect. Getting that as the answer shows a misunderstanding of the question.
The language is "overstated by 20%", meaning the 20% is in reference to the original number.
Also, 0.8 x 48 would not give exactly 38; it gives 38.4, a non-integer. That alone should indicate the answer is incorrect, given that the question is about apples.
This approach is often used to test whether someone truly understands a question.
I don't know where you went to school, AICodeKing, but it is not between 38 and 40. It is 40 since 48/1.2=40 as it's put as an overstatement of 20% in the question and you want to therefore reduce the number (48) by 20%. It cannot be 38 or 39. It's basic math a 405b model should be able to do.
It is time to add some new "reasoning" questions. Maybe some logic "misguided attention" type questions to see if the models can get past their training and think things through better.
@@hedgehog6300 Reducing 48 by 20% does not give you 40. It gives you 38.4. But increasing 40 by 20% does give you 48. The correct answer is 40 because the question said 48 was overstating the correct answer by 20%, so the problem asks you to find the number that, when increased by 20%, gives you 48. You do NOT want to reduce 48 by 20%. That gives you the wrong answer and is not what the question calls for.
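To make the arithmetic in this thread concrete, here's a quick check in plain Python of the two interpretations people are debating (the question wording is as quoted above: a count of 48 apples "overstates" the true count by 20%):

```python
overstated = 48

# Reading 1 (what yields 38.4): "reduce 48 by 20%"
reading_1 = overstated * 0.8   # 38.4 -- not a whole number of apples

# Reading 2 (what the question asks): 48 is the true count increased
# by 20%, so solve x * 1.2 = 48 for x
reading_2 = overstated / 1.2   # 40, and 40 * 1.2 gives back 48

print(round(reading_1, 1), round(reading_2, 1))
```

The round-trip check (40 increased by 20% is exactly 48, while 38.4 is fractional) is why 40 is the defensible answer.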
Thanks for your Great Videos
Great content king. Please start sharing the website you use in every video for the explored topic.
I just spend time watching your videos and don't build anything. Slow down, please!
😂😂😅
Seriously. The speed of AI updates across YouTube has got brain.exe using up too much memory!
Make a video on Codename Goose. It's a game changer.
why ?
I used it on a small personal project and it worked pretty well aside from its crazy consumption of tokens (ran into rate limits despite having a tier 3 account), but then I tried it on a couple of files in a corporate codebase and it ended up deleting half the code.
Goose is crazy. I've done some insane robotic processes and automation with it.
@kolsongar curious, can you elaborate?
Provide link of it as well
Please provide links as well 🙏
Oh, that butterfly!
I think that if a model were extensively trained on questions like 3 or 4, it would likely perform worse. This is because such training has no practical use in real life and unnecessarily distorts the biases within the LLM.
I know people are slow and may not realize this, but the reason all the scores are just about the same is that they are all copies of one original tech. Until that tech advances, they won't either; however, there's a fair chance one of the copy companies may crack something new while at it that would change the game.
Good Video King
This butterfly is really good!
There's no news from Claude 🥲
Feb 5 or Feb 13
@@phoneywheeze Thank you!
Tulu 3 8B + Ollama is the best local model for me.
Could you make a video about Mistral Small 3 24B?
I would like to know how you can run a model locally with cline 😊
get yourself 250k to buy GPUs that can run R1; that will work. Smaller models don't work with Cline because Cline has a sys prompt that makes it impossible to work with small models. Cline has a good workflow, but the prompts are really bad for safe development. Cline is no meta-tool; it cannot work on itself in real time, as Aider does.
@ 250k what?
@@arpo71 $
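For anyone asking how to run a local model with Cline, here's a minimal sketch assuming a standard Ollama setup. The model name below is just an illustrative example, not a recommendation (and per the thread above, small models may struggle with Cline's system prompt):

```shell
# Install Ollama from https://ollama.com, then pull a coding model.
# (qwen2.5-coder:7b is only an example; pick whatever fits your hardware.)
ollama pull qwen2.5-coder:7b

# Ollama serves a local API on port 11434 by default.
# In Cline's settings, choose "Ollama" as the API provider and point it at:
#   http://localhost:11434

# Sanity check that the server is up and the model is installed:
curl http://localhost:11434/api/tags
```

This is a setup/config fragment, so exact menus and model tags may differ by Cline version.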
where small 3 tho
Wow!!!! 🎉🎉
yo bro can I get Lovable AI Pro for free man, it's my dream and I love your videos
First comment