This LLM Is WAY BETTER than I thought - Mistral Small 24B - 2501 Fully Tested
- Published Feb 10, 2025
- Mistral Small 24B 2501 is a surprisingly strong coding model, much better than I originally realized. This LLM requires a much lower temperature setting than most. I put the model up against itself at different temperature settings and found that temperature matters a lot for this model.
Note: the scoring at the top of the app currently has a bug that I didn't catch until after all the footage was done.
My Links 🔗
👉🏻 Subscribe: / @gosucoder
👉🏻 Twitter: x.com/adamwlarson
👉🏻 LinkedIn: / adamwilliamlarson
My computer specs
GPU: 7900xtx
CPU: 7800x3d
RAM: DDR5 6000 MHz
Media/Sponsorship Inquiries ✅
gosucoderyt@gmail.com
Links:
huggingface.co...
I am using a low temperature of 0.1 for DeepSeek and Qwen as well. For my coding purposes it makes a huge positive difference. I am running Mistral Small 3 on an ASRock DeskMeet mini PC with a Ryzen 5 8600G CPU, 64 GB of RAM (6,000 MHz), a Samsung 1 TB NVMe drive, and a 4 TB HDD. That system runs LLMs up to 48 GB and cost me only $900. Its maximum power consumption is 65 watts. Inference is not the fastest (around 16 TOPS) but sufficient for my coding and authoring purposes. The coding results are clean and of good quality, saving me a lot of time.
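For context on why a 0.1 temperature changes behavior so much: temperature divides the token logits before the softmax, so values below 1 sharpen the distribution toward the most likely token. A minimal Python sketch (the logit values here are made up for illustration, not taken from any real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]

# At temperature 1.0 the distribution stays relatively spread out;
# at 0.1 almost all probability mass collapses onto the top token.
p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.1)
print(p_default[0], p_low[0])
```

With these numbers the top token goes from roughly 63% of the mass at temperature 1.0 to essentially all of it at 0.1, which is why low-temperature coding runs feel so much more consistent.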
which model 7b or?
@sentinel-q6j Small 3 is new; it's 24B. Not sure if they really mean the whole thing?
No additional GPU?
thx for the temp! Nice channel...keep it up.
love your tests
I'm looking forward to the Roo code video with this model
Temperature matters, prompt matters, somehow even time of day matters... We see so many people testing just one prompt and concluding a model is good or bad, but you really have to dive into the statistics and experiment with the ideal temperature and prompt engineering.
I've noticed that too.
I had it code that standard letters falling and bouncing demo and it failed, but not by too much.
I use a very low temp that vLLM lets me set: 0.01.
I'm curious, is it worse than 0.15?
I need consistent output. 0.15 will not make it deterministic.
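The determinism point can be illustrated: at 0.15 the sampler still occasionally picks a runner-up token, while near-zero temperature effectively collapses to greedy (argmax) decoding, which is what gives repeatable output. A rough sketch with made-up logits:

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample one token index from temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Hypothetical logits where tokens 0 and 1 are fairly close.
logits = [3.0, 2.5, 1.0]
rng = random.Random(0)

# At 0.01 the top token wins essentially every draw; at 0.15 the
# runner-up still gets picked now and then, so runs can diverge.
picks_001 = [sample_token(logits, 0.01, rng) for _ in range(1000)]
picks_015 = [sample_token(logits, 0.15, rng) for _ in range(1000)]
print(set(picks_001), set(picks_015))
```

So 0.15 is close to deterministic but not quite; if you need byte-identical reruns, near-zero temperature (or an explicit greedy mode) is the safer choice.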
Great work.
How does this compare with Deepseek R1.?
DeepSeek R1 is definitely better but it’s a lot bigger model.
I personally find higher temperatures should only be used for roleplaying or story writing, to allow for random and less focused responses, i.e. imaginative, emotive. For tasks I use a low temperature setting. If it's an untrained task I will slowly adjust upwards until it's perfect.
Please post more about this model, especially for agents.
It would be interesting to see what would happen if you gave each model three chances at each problem.
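Giving each model several chances per problem is essentially the pass@k idea: a problem counts as solved if any of k sampled attempts passes. A toy sketch of why three tries changes the picture, where `attempt` is a hypothetical stand-in for one scored model run and the 40% per-try success rate is invented for illustration:

```python
import random

def attempt(rng, success_rate=0.4):
    """Hypothetical stand-in for one model run scored pass/fail."""
    return rng.random() < success_rate

def solved_within(k, rng):
    """A problem counts as solved if any of k independent attempts passes."""
    return any(attempt(rng) for _ in range(k))

rng = random.Random(42)
trials = 10_000

# With a 40% per-try rate, three tries should solve roughly
# 1 - 0.6**3 = 78.4% of problems instead of 40%.
one_try = sum(solved_within(1, rng) for _ in range(trials)) / trials
three_tries = sum(solved_within(3, rng) for _ in range(trials)) / trials
print(one_try, three_tries)
```

The gap between one-shot and three-shot scores would also show how much of each model's failure rate is just sampling noise rather than a real capability limit.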
Also, in my experience LLM responses vary greatly by how good your prompt is, and the same exact prompt could be great on one model but be lackluster on another. Of course it would be very hard to test this objectively because the number of potential prompts one could use for any given problem is infinite, but I would be willing to bet you could get much better results from both models by using better prompts, and of course by having more of a back-and-forth conversation rather than relying on just one single prompt.
I prompted it to triple-check my physics and calculus homework. I don't know how they demonstrate these AI models on the toughest math problems that exist when it can't get sophomore-level university physics right. It's accurate about 1 in 5 times. So I have to use several different models to see how closely they agree, and then swap the answers from one model to another so they're checking each other's work. I'd give it about 90% accuracy when I'm done, which is still not acceptable in my opinion for what is essentially a calculator.
Yes I definitely agree with the prompt mattering a lot.
The snake game is actually nothing special. The first model I remember that could do it was Llama 3.1 8B! Only an 8B model, and that was many months ago. Now many other models can produce code for a snake game, Qwen 7B/14B definitely. At the moment I'm not impressed with this model. My favorite is Qwen2.5-14B-Instruct-1M, but I will test it a bit more to make sure.
but can it play crysis
Hahaha!
Maybe give a bit of context?
Do you mean on the model itself or ?