This LLM Is WAY BETTER than I thought - Mistral Small 24B - 2501 Fully Tested

  • Published Feb 10, 2025
  • Mistral Small 24B 2501 is a crazy good model at coding, much better than I originally realized. This LLM requires a much lower temperature setting. I pitted the model against itself at different temperature settings and found that temperature matters a lot for this model.
    Note: the scoring at the top of the app currently has a bug that I didn’t catch until after all the footage was done.
    My Links 🔗
    👉🏻 Subscribe: / @gosucoder
    👉🏻 Twitter: x.com/adamwlarson
    👉🏻 LinkedIn: / adamwilliamlarson
    My computer specs
    GPU: 7900xtx
    CPU: 7800x3d
    RAM: DDR5 6000 MHz
    Media/Sponsorship Inquiries ✅
    gosucoderyt@gmail.com
    Links:
    huggingface.co...

Comments • 24

  • @HaraldEngels
    @HaraldEngels 7 days ago +16

    I am using a low temperature of 0.1 for DeepSeek and Qwen as well. For my coding purposes it makes a huge positive difference. I am running Mistral Small 3 on a mini PC, an ASRock DeskMeet with a Ryzen 5 8600G CPU, 64 GB RAM (6,000 MHz), a Samsung 1 TB NVMe drive, and a 4 TB HDD. That system runs LLMs up to 48 GB and cost me only $900. Its max power consumption is 65 watts. Inference is not the fastest (at 16 TOPS) but sufficient for my coding and authoring purposes. The coding results are clean and of good quality, saving me a lot of time.

    • @sentinel-q6j
      @sentinel-q6j 7 days ago

      which model 7b or?

    • @changer1285
      @changer1285 5 days ago

      @@sentinel-q6j Small 3 is new; it's 24B. Not sure if they really mean the whole thing?

    • @changer1285
      @changer1285 5 days ago

      No additional GPU?

  • @florianstephan5745
    @florianstephan5745 7 days ago +4

    Thanks for the temp tip! Nice channel... keep it up.

  • @DoppsPkin
    @DoppsPkin 7 days ago +1

    love your tests

  • @tteokl
    @tteokl 7 days ago +3

    I'm looking forward to the Roo Code video with this model.

  • @jwickerszh
    @jwickerszh 7 days ago +5

    Temperature matters, prompt matters, time of day also matters somehow... We see so many people just testing one prompt and concluding a model is good or bad, but you really have to dive into the statistics of it and experiment with the ideal temperature and prompt engineering.

    • @GosuCoder
      @GosuCoder 7 days ago +1

      I've noticed that too.

  • @jeffwads
    @jeffwads 5 days ago

    I had it code the standard falling-and-bouncing-letters demo and it failed, but not by much.

  • @dataprospect
    @dataprospect 2 days ago

    I use a very low temperature, which vLLM lets me set: 0.01.
    I am curious whether it is worse than 0.15.
    I need consistent output, and 0.15 will not make it deterministic.
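
    The determinism point above can be sketched numerically: temperature divides the logits before the softmax, so as it approaches zero the sampling distribution collapses onto the highest-logit token, which is exactly what greedy decoding would pick every time. The logits below are made up for illustration; note that in a real engine like vLLM, strictly repeatable output also requires a fixed seed and deterministic kernels.

```python
import math
import random

def sample(logits, temperature, rng):
    # Scale logits by 1/T, apply a numerically stable softmax,
    # then draw one token index from the resulting distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.5, 0.5]  # hypothetical next-token logits

for t in (1.0, 0.15, 0.01):
    rng = random.Random(0)
    picks = [sample(logits, t, rng) for _ in range(1000)]
    share = picks.count(0) / len(picks)
    print(f"T={t}: top token chosen {share:.0%} of the time")
```

    At T=1.0 the top token wins only a bit over half the time; at 0.15 it wins almost always; at 0.01 the distribution is effectively greedy, so the output is consistent for a fixed seed.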

  • @srinidhihebbar196
    @srinidhihebbar196 7 days ago +2

    Great work.
    How does this compare with DeepSeek R1?

    • @GosuCoder
      @GosuCoder 7 days ago

      DeepSeek R1 is definitely better, but it’s a much bigger model.

  • @xspydazx
    @xspydazx 6 days ago

    I personally find higher temperatures should only be used for roleplaying or story writing, to allow for more random and less focused responses, i.e. imaginative, emotive ones. For tasks I use a low temperature setting. If it is an untrained task, I slowly adjust upwards until it's perfect.

  • @sfl1986
    @sfl1986 7 days ago

    Please post more about this model, especially for agents.

  • @glyph6757
    @glyph6757 7 days ago +1

    It would be interesting to see what would happen if you gave each model three chances at each problem.
    Also, in my experience LLM responses vary greatly by how good your prompt is, and the same exact prompt could be great on one model but be lackluster on another. Of course it would be very hard to test this objectively because the number of potential prompts one could use for any given problem is infinite, but I would be willing to bet you could get much better results from both models by using better prompts, and of course by having more of a back-and-forth conversation rather than relying on just one single prompt.

    • @ThePolarOpposite
      @ThePolarOpposite 7 days ago

      I prompted it to triple-check my physics and calculus homework. I don't know how they demonstrate these AI models on the toughest math problems that exist when it doesn't get sophomore-level university physics correct. Only about 1 in 5 answers is accurate. So I have to use several different models to see how closely they agree, and then swap the answers from one model to another so they're checking each other's work. I'd give it about 90% accuracy when I'm done, which is still not acceptable in my opinion for what is essentially a calculator.

    • @GosuCoder
      @GosuCoder 7 days ago +1

      Yes I definitely agree with the prompt mattering a lot.

  • @mlsterlous
    @mlsterlous 7 days ago +2

    The Snake game is actually nothing special. The first model I remember that could do this was Llama 3.1 8B! Only an 8B model, and that was many months ago. Now many other models can produce code for a Snake game, Qwen 7B/14B definitely. At the moment I'm not impressed with this model. My favorite is Qwen2.5-14B-Instruct-1M, but I will test it a bit more to make sure.

  • @kingsuperbus
    @kingsuperbus 5 days ago

    but can it play crysis

    • @GosuCoder
      @GosuCoder 5 days ago

      Hahaha!

  • @armiman123
    @armiman123 8 days ago +1

    Maybe give a bit of context?

    • @GosuCoder
      @GosuCoder 7 days ago

      Do you mean on the model itself, or?