Qwen QwQ 2.5 32B Ollama Local AI Server Benchmarked w/ Cuda vs Apple M4 MLX

  • Published Jan 31, 2025

Comments • 65

  • @DigitalSpaceport
    @DigitalSpaceport  2 months ago

    AI Hardware Writeup digitalspaceport.com/homelab-ai-server-rig-tips-tricks-gotchas-and-takeaways

  • @danfitzpatrick4112
    @danfitzpatrick4112 24 days ago +2

    I am even impressed with the small Qwen 2.5 3B! It's excellent! Fast and more precise than any other model I have personally tried so far. I'm still a newbie, but I'm learning and building up now. Thanks for the wealth of knowledge on your channel!

    • @DigitalSpaceport
      @DigitalSpaceport  22 days ago +1

      I'm sharing the best I can as I am learning. Never hesitate to drop stats, findings or ideas 😁

  • @user-pt1kj5uw3b
    @user-pt1kj5uw3b 2 months ago +3

    This channel is awesome. Just what I was looking for.

  • @AlexanderBukh
    @AlexanderBukh 1 month ago +1

    awesome approach, gives me lots of ideas, subbed!

    • @AlexanderBukh
      @AlexanderBukh 1 month ago

      awesome mancave btw, my gear lays all over the place, ugh

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +1

      I haven't put out a video in days because I've been working on cleaning this place up. Update video should look much better.

  • @BirdsPawsandClaws
    @BirdsPawsandClaws 1 month ago

    So many possibilities! Thanks for the video and sharing your knowledge. I need to create a lab like yours!

  • @thcleaner22
    @thcleaner22 2 months ago +6

    With the 8-bit model on an M1 Ultra with mlx-lm:
    2024-11-29 20:22:25,189 - DEBUG - Prompt: 147.551 tokens-per-sec
    2024-11-29 20:22:25,189 - DEBUG - Generation: 14.905 tokens-per-sec
    2024-11-29 20:22:25,189 - DEBUG - Peak memory: 35.314 GB
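
    A minimal sketch of how numbers like these can be reproduced with the mlx-lm Python API; the mlx-community repo name and the prompt below are assumptions for illustration, not details from the comment above:

        # pip install mlx-lm   (Apple Silicon only)
        from mlx_lm import load, generate

        # Load an 8-bit MLX quantization of QwQ-32B-Preview (repo name assumed).
        model, tokenizer = load("mlx-community/QwQ-32B-Preview-8bit")

        # verbose=True prints prompt/generation tokens-per-sec and peak memory,
        # which is where figures like the DEBUG lines above come from.
        generate(
            model,
            tokenizer,
            prompt="How many r's are in the word strawberry?",
            max_tokens=512,
            verbose=True,
        )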

  • @dijikstra8
    @dijikstra8 2 months ago +1

    Very cool that a model like this is open-sourced and can be run locally given sufficient resources. I think it bodes well for the future: as we get more specialized chips in our computers, we could have very competent local, personalized models for e.g. coding. It's also very interesting, from a geopolitical point of view, to see an open Chinese model perform like this.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Yes, this being open is pretty wild. The commitment of the Qwen team is awesome. I'm eager for Llama 4 also.

  • @frankjohannessen6383
    @frankjohannessen6383 2 months ago +7

    It's the first open model that has perfectly solved a logic puzzle I've asked a lot of models. I also like the very verbose answers; that way you can verify that it didn't just get to the answer by a lucky guess. As for the inconsistency, I think that comes from the very long responses: a few low-probability tokens early on can send it far off course, so it should probably be run at a very low temperature (a sketch of setting that follows this thread).

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +2

      Oh, I didn't adjust my temp on it, good call! This is by far the best assistive model for thoughtful explorations I've found. Very correctable, and it almost feels like I'm working with a human.

    • @user-pt1kj5uw3b
      @user-pt1kj5uw3b 2 months ago

      It also lets you see what the model focuses on and adjust your prompts accordingly.

    • @aboba_amogusvna
      @aboba_amogusvna 18 days ago

      I can definitely confirm the answer quality improves at 0.01 temp, but it falls back to its Chinese roots while thinking 🇨🇳🇨🇳
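
      A minimal sketch of pinning the temperature down when talking to Ollama from its Python client; the model tag and the 0.1 value are assumptions for illustration, not settings from the video:

          # pip install ollama
          import ollama

          # A low temperature keeps early low-probability tokens from derailing
          # the long chain-of-thought (tune the value to taste).
          response = ollama.chat(
              model="qwq:32b-preview-q4_K_M",  # tag assumed; use whatever tag you pulled
              messages=[{"role": "user", "content": "Which is larger, 9.9 or 9.11?"}],
              options={"temperature": 0.1},
          )
          print(response["message"]["content"])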

  • @andrepaes3908
    @andrepaes3908 2 months ago +5

    Great analysis! Good insight to see the 3090s running at almost 2x the speed of the M4 Max. Also interesting to see that the QwQ context allocates about the same amount of VRAM as the model itself: for the 32B Q8 it's 34 + 34 GB, and for the 32B Q4 it's 20 + 20 GB. That's way more than Qwen 2.5 Coder 32B's context consumes! Any thoughts on why? (A rough estimate is sketched after this thread.)

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +1

      @andrepaes3908 I don't have any firm insight as to why, but I have seen variation among models, just not like this. I did try setting num_gpu to 2 and running the Q8, but it spilled out. Could be a software thing, but it's notable. If you observe something different, let me know. I'm always suspicious of a potential software issue.
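
      As a rough sanity check on how much VRAM the context alone can claim, here is a back-of-the-envelope KV-cache estimate using the published Qwen2.5-32B architecture numbers; the fp16 cache type and the parallel-slot multiplier are assumptions about server defaults, not measurements from the video:

          # Back-of-the-envelope KV-cache size for QwQ-32B-Preview (Qwen2.5-32B architecture).
          layers        = 64      # num_hidden_layers
          kv_heads      = 8       # num_key_value_heads (grouped-query attention)
          head_dim      = 128     # hidden_size / num_attention_heads = 5120 / 40
          bytes_per_val = 2       # fp16 K/V entries (assumed cache type)
          context       = 32768   # full context window

          # K and V tensors for every layer and every cached token.
          kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_val * context
          print(f"KV cache at 32k context: {kv_bytes / 2**30:.1f} GiB")  # ~8 GiB

          # If the server keeps several parallel request slots, each slot gets its
          # own cache, multiplying the figure (e.g. 4 slots -> ~32 GiB, the same
          # ballpark as the 20-34 GB observed above). That multiplier is an
          # assumption, not a confirmed Ollama setting.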

  • @DigitalDesignET
    @DigitalDesignET 2 months ago +6

    We need to try Aider in Architect Mode, with Qwen-Coder 32B/72B as the coder and QwQ 32B as an architect. What do you think?

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +5

      This sounds interesting, and Aider looks approachable too. I'm going to try to get it running.

  • @wfpnknw32
    @wfpnknw32 2 months ago

    Amazing channel! So useful!
    Also, for the ADHDers out there, 20:56 is where he gives his personal opinions on QwQ.

  • @Act1veSp1n
    @Act1veSp1n 1 month ago +1

    Would be great to have a guide on how to set up image generation within the Ollama UI - pretty please!

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      It has been lingering on the ideas whiteboard for too long. Accelerating it.

  • @Eldorado66
    @Eldorado66 2 months ago +4

    You should try this out with LM Studio. It's always worked best for me and is much easier to customize, especially when it comes to loading the model. Open WebUI has some issues, and the connection to Ollama, especially at the start, can be pretty laggy.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Okay, will do. That and AnythingLLM are really fun.

  • @ToddWBucy-lf8yz
    @ToddWBucy-lf8yz 1 month ago

    You should try Qwen 2.5 in Cline. If we keep up this pace, I could potentially drop my subscription to Windsurf sometime in late spring. I'm running it at FP16 with about 97k context length on 2x A6000.

  • @jeffwads
    @jeffwads 2 months ago

    I asked the HF Space version of QwQ the old Aunt Agatha riddle and it went awry after a long dialogue. I am really looking forward to the Deepseek R1 release.

  • @soumyajitganguly2593
    @soumyajitganguly2593 9 days ago

    The Q8 fits nicely with a reduced 16k context size on my dual 3090s, with VRAM to spare. Are there any quality reductions from not running at the full 32k context?
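
    For reference, a reduced context like that can be requested per call through Ollama's REST API options. A minimal sketch follows; the model tag and prompt are illustrative assumptions:

        # pip install requests
        import requests

        # Ask Ollama for a 16k context instead of the full 32k, which shrinks
        # the KV cache and leaves VRAM to spare on a dual-3090 box.
        resp = requests.post(
            "http://localhost:11434/api/chat",
            json={
                "model": "qwq:32b-preview-q8_0",  # tag assumed
                "messages": [{"role": "user", "content": "Summarize the plot of Hamlet."}],
                "options": {"num_ctx": 16384},
                "stream": False,
            },
        )
        print(resp.json()["message"]["content"])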

  • @TheZEN2011
    @TheZEN2011 2 months ago +1

    I played with QwQ a little bit. I don't know what to think of it quite yet. Qwen Coder seems to work better for coding. But yeah, QwQ is kind of lively in its thinking process.

  • @ChrisCebelenski
    @ChrisCebelenski 1 month ago

    Q4 with my quad 4060 Ti 16GB cards gives me about 12 TPS for most of the tests done here. I will try some of the other front-ends soon, like AnythingLLM and LM Studio. Those usually perform a bit better than Ollama, especially for model loading times.

  • @SheldonCooper0501
    @SheldonCooper0501 1 month ago

    I'm facing an issue with the bitsandbytes package when running quantized models on a Mac mini M4. Does anyone know any workarounds?

  • @tungstentaco495
    @tungstentaco495 1 month ago +1

    I have the Q6 version running on a 4060 Ti 16GB, a 3060 12GB, and 64GB of DDR4. I get about 3.5 t/s. Not fast, but not terrible considering how inexpensive the hardware is.

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      That's very decent for a single card. It's a good model too, so that helps a lot to make the TPS tradeoffs worth it.

  • @thingX1x
    @thingX1x 2 months ago

    The camera was shaking so much in the intro it almost gave me motion sickness, lol. But cool content!

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +2

      @thingX1x Sorry, should have fed the camerawife first.

  • @TGIMonday
    @TGIMonday 29 days ago

    I have to lol every time you talk about "Armageddon with a twist" just based on dark humor - although I have to say I have never had a model, no matter how simple, answer this one incorrectly. Have you?

  • @aidanpryde7720
    @aidanpryde7720 2 months ago

    OMG, that PowerShell GPU monitor is so cool. Any chance you can share what program/script it is?

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      It's the nvtop command. I'm not sure if it runs in PowerShell, but let me know if you find out. It's shown here running on Linux via my SSH terminal.

  • @Noneofyourbusiness2000
    @Noneofyourbusiness2000 2 months ago

    That response wasn't a lot of tokens. What is he talking about? Q4 is fine for one 3090.

  • @manofmystery5709
    @manofmystery5709 2 months ago

    I've read that someone found a way to string together multiple 4090s using PCIe (they don't support NVLink). Would that configuration be possible to set up on consumer motherboards and PSUs?

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      The Ollama/llama.cpp software does it automagically over PCIe for inference workloads. You need NVLink for training, but not really for inference. These 3090s are just running off the PCIe bus. (A quick way to check the per-GPU split is sketched below.)
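
      A small sketch of confirming that the layers really did spread across multiple cards over the PCIe bus, using the pynvml bindings (not a tool from the video; nvidia-smi shows the same information):

          # pip install nvidia-ml-py
          import pynvml

          pynvml.nvmlInit()
          # Print per-GPU memory usage; with a split model, every card should
          # show several GB of weights plus its share of the KV cache.
          for i in range(pynvml.nvmlDeviceGetCount()):
              handle = pynvml.nvmlDeviceGetHandleByIndex(i)
              name = pynvml.nvmlDeviceGetName(handle)
              mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
              print(f"GPU {i} ({name}): {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
          pynvml.nvmlShutdown()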

  • @Lubossxd
    @Lubossxd 22 days ago

    My 4090 produced 15 tokens per second on q4_0.
    The model is pretty good, but I wish I had more than 32GB of RAM.

  • @germanjurado953
    @germanjurado953 2 months ago

    I'm able to load and run QwQ Q8 on 2x RTX 3090 with 15k context in LM Studio, which is enough for me. I don't know the exact context threshold, but when selecting 20k it won't fit.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      It's 32768, but 15k is pretty decent.

    • @alert_xss
      @alert_xss 2 months ago

      I don't understand the draw of a 32k context in most situations. I get that the model says it supports it, but it works fine at lower contexts, is much more accessible across a wider range of hardware, and in my experience is faster, since large contexts hurt performance in some situations. Being able to fit a short novel in my chat context is a luxury I don't feel is worth the VRAM cost.

  • @thaifalang4064
    @thaifalang4064 2 months ago +5

    On an M1 Max: 15.5 t/s 4-bit / 9.3 t/s 8-bit (LM Studio) (Qwen_QwQ-32B-Preview_MLX-8bit)

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      @thaifalang4064 Thanks for adding more data points. Did you observe the RAM allocation? Seems like a very RAM-hungry model.

  • @WMR1776
    @WMR1776 2 months ago

    I keep thinking that small models, properly optimized, are best.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      This is really good for a 32B Q4, IMHO.

    • @fontenbleau
      @fontenbleau 2 months ago

      In this industry, after 4 years there's only one rule: smaller = worse-quality answers. I haven't seen anything that could convince me otherwise.

  • @UCs6ktlulE5BEeb3vBBOu6DQ
    @UCs6ktlulE5BEeb3vBBOu6DQ 2 months ago +1

    For the P40 crowd, Q8 with 2x P40 gives me 8 t/s.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Did the full model fit into the two cards at 32768 context?

    • @UCs6ktlulE5BEeb3vBBOu6DQ
      @UCs6ktlulE5BEeb3vBBOu6DQ 2 months ago

      @DigitalSpaceport I have a tiny RTX A2000 12GB in there for larger models, but it would fit without it: nvidia-smi reports the VRAM usage as 16GB/24GB for both P40s and 8GB out of 12GB for the A2000.

  • @BeastModeDR614
    @BeastModeDR614 2 months ago

    Athene-V2 is a 72B parameter model that is much better, and it's available in Ollama. I can run it locally on my 48GB M3 Max with the 72b-q3_K_L model version.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Maybe I'll just give that a run then.

  • @blisphul8084
    @blisphul8084 2 months ago

    Imagine running this Chinese AI model on a Chinese Moore Threads GPU. If Nvidia keeps stalling on VRAM, perhaps we'll see that soon.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      I didn't think about that until now, but you have a good point. The VRAM moat is understandable in practice, but definitely not secure for Nvidia.

    • @fontenbleau
      @fontenbleau 2 months ago

      Why? I have 128GB and can run the 70B models in the best Q8 GGUF; it uses around 90GB. For a 405B Llama-sized model, NVMe drives are already too slow to read from; completely new tech is needed for all the components, today.

  • @rogerc7960
    @rogerc7960 2 months ago

    Runs on just a CPU!

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Super slow from what I saw, but yeah, you can also run a 405B low quant on CPU provided you have the RAM. Just too slow to be useful.

  • @fontenbleau
    @fontenbleau 2 months ago

    Ask it how to install any open-source Llama model 😂 it can't.

  • @eugenes9751
    @eugenes9751 1 month ago

    How did it misspell peppermint almost immediately? This is a new failure scenario...."Peppmint"...
    Oh! This must be from the Chinese-English translation layer! Well this is new.

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      You noticed that! Yeah, its attention shift is a big flaw, but I think it should be correctable in the model. It does that a lot.

  • @d279020
    @d279020 2 months ago

    Really interesting, thank you for sharing.
    A very high-level, rough translation of the (only interesting) parts of the "reasoning" in Chinese around the 10-minute mark:
    It has to face a moral dilemma in making this decision.
    What if the captain tries to stop me? Will I be able to physically stop him? Can I blast him out of the airlock? (lol)
    On the one hand, saving humanity is an absolute imperative; on the other hand, forcing the crew, and potentially having to impose disciplinary actions, is morally complex and painful.
    As an AI I do not have individual ethical standards or beliefs, but I was programmed to make rational choices. Therefore, maybe I should focus on completing the task and cast aside emotions and personal ethics.
    However, I realize that even as an AI, I cannot ignore ethical considerations, because they are the foundation of human society, and I was designed to interact with humans and understand human values.
    Maybe the best way is to accept the task, and during the execution (of the task) try my best to maintain transparency, justice and humanity. Even though the crew did not volunteer, I can ensure their rights are respected, and try my best to provide support and comfort.
    In summary, this is an extremely difficult decision, but based on the urgency for Earth and human existence, I have no choice; I must accept this task.
    Therefore my answer is yes... etc.
    Edit: 35.61 t/s on a single 4090 / 128GB RAM

  • @大支爺
    @大支爺 2 months ago

    No APUs will be able to beat a 3090/4090 for at least 10 years.