AI Hardware Writeup digitalspaceport.com/homelab-ai-server-rig-tips-tricks-gotchas-and-takeaways
I am even impressed with the small Qwen 2.5 3B! It's excellent! Fast and more precise than any model I've personally tried so far. I'm still a newbie but learning and building up now. Thanks for the wealth of knowledge on your channel!
I'm sharing the best I can as I am learning. Never hesitate to drop stats, findings or ideas 😁
This channel is awesome. Just what I was looking for.
awesome approach, gives me lots of ideas, subbed!
Awesome mancave btw, my gear is lying all over the place, ugh
I haven't put out a video in days because I've been working on cleaning this place up. Update video should look much better.
So many possibilities! Thanks for the video and sharing your knowledge. I need to create a lab like yours!
Running the 8-bit model on an M1 Ultra with mlx-lm:
2024-11-29 20:22:25,189 - DEBUG - Prompt: 147.551 tokens-per-sec
2024-11-29 20:22:25,189 - DEBUG - Generation: 14.905 tokens-per-sec
2024-11-29 20:22:25,189 - DEBUG - Peak memory: 35.314 GB
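For anyone wanting to reproduce a run like this, here's a minimal mlx-lm sketch. The repo name mlx-community/QwQ-32B-Preview-8bit and the prompt are assumptions, so swap in whatever 8-bit MLX build you actually use; verbose=True prints the same prompt/generation tokens-per-sec and peak-memory lines quoted above.

```python
# Minimal mlx-lm sketch (assumed model repo; substitute your own 8-bit MLX build).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/QwQ-32B-Preview-8bit")

messages = [{"role": "user", "content": "How many times does the letter r appear in strawberry?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# verbose=True prints prompt/generation tokens-per-sec and peak memory,
# matching the DEBUG lines above.
text = generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)
```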
Very cool that this kind of model is open sourced and can be run locally given sufficient resources. I think this bodes well for the future: as we get more specialized chips in our computers, we could have very competent local, personalized models for e.g. coding. It's also very interesting, from a geopolitical point of view, to see an open Chinese model perform like this.
Yes, this being open is pretty wild. The commitment of the Qwen team is awesome. I'm eager for Llama 4 also.
It's the first open model that has perfectly solved a logic puzzle I've asked a lot of models. I also like the very verbose answers; that way you can verify it didn't just get to the answer by a lucky guess. As for the inconsistency, I think that's because of the very long responses. A few low-probability tokens early on are probably enough to send it far off course, so it should probably be run at a very low temperature.
Oh, I didn't adjust my temp on it, good call! This is by far the best assistive model for thoughtful explorations I've found. Very correctable, and it almost feels like I'm working with a human.
It also lets you see what the model focuses on and adjust your prompts accordingly.
I can definitely confirm the answer quality improves at 0.01 temp, but it falls back to its Chinese roots while thinking 🇨🇳🇨🇳
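If you're calling Ollama from code instead of using the Open WebUI slider, temperature can be set per request through the options field of the REST API. A minimal sketch, assuming Ollama is on its default port and the model is pulled locally under the qwq tag:

```python
# Minimal sketch: ask a local Ollama server for a low-temperature QwQ response.
# Assumes the default port and that the model is pulled as "qwq".
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwq",
        "prompt": "I have 3 apples, eat one, then buy two more. How many apples do I have?",
        "stream": False,
        "options": {"temperature": 0.1},  # low temp keeps long reasoning chains on track
    },
    timeout=600,
)
print(resp.json()["response"])
```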
Great analysis! Good insight seeing the 3090s running at almost 2x the speed of the M4 Max. Also interesting to see that the QwQ context allocates about the same amount of VRAM as the model itself: at 32k, Q8 = 34+34 GB and Q4 = 20+20 GB. This is way more than the Qwen 2.5 Coder 32B context consumes! Any thoughts on why that is?
@andrepaes3908 I don't have any firm insight as to why, but I've seen variation among models, just not like this. I did try setting num_gpu to 2 and running the Q8, but it spilled out. Could be a software thing, but it's notable. If you observe something different, let me know. I'm always suspicious of a potential software issue.
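For what it's worth, a back-of-the-envelope KV-cache estimate, assuming QwQ follows the published Qwen2.5-32B layout (64 layers, 8 KV heads, 128-dim heads) with an unquantized fp16 cache, lands far below the weight size at 32k, which leans toward the software-issue theory:

```python
# Rough KV-cache size estimate for a 32k context, assuming a Qwen2.5-32B-style
# architecture. These numbers are assumptions, not measurements.
layers, kv_heads, head_dim = 64, 8, 128
bytes_per_value = 2          # fp16 K/V cache, the usual default
ctx = 32768

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # K and V
total_gib = per_token * ctx / 1024**3
print(f"{per_token/1024:.0f} KiB per token, ~{total_gib:.1f} GiB at {ctx} context")
# -> 256 KiB per token, ~8.0 GiB at 32768 context (plus compute buffers),
#    nowhere near the extra ~34 GB observed, so something else is likely going on.
```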
We need to try Aider in Architect Mode, with Qwen-Coder 32B/72B as the coder and QwQ 32B as an architect. What do you think?
This sounds interesting, and Aider looks approachable too. I'm going to try to get it running.
amazing channel! so useful!
Also, for the ADHD'ers out there: 20:56 is where he gives his personal opinions on QwQ.
Would be great to have a guide on how to set up image generation within the Ollama UI - pretty please!
It has been lingering on the ideas whiteboard for too long. Accelerating it.
You should try this out with LM Studio. It’s always worked best for me and is much easier to customize, especially when it comes to loading the model. Open WebUI has some issues and the connection to Ollama, especially at the start, can be pretty laggy.
Okay, will do. That and AnythingLLM are really fun.
You should try Qwen 2.5 in Cline. If we keep up this pace, I could potentially drop my subscription to Windsurf sometime in late spring. Running it at FP16 with about 97k context length on 2x A6000.
I asked the HF Space version of QwQ the old Aunt Agatha riddle and it went awry after a long dialogue. I am really looking forward to the Deepseek R1 release.
The Q8 fits nicely with a reduced 16k context on my dual 3090s, with VRAM to spare. Are there any quality reductions from not running the full 32k context?
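Generally a smaller window only costs you once a chat actually outgrows it; the setting just caps how many tokens the model can attend to. For reference, a sketch of requesting a reduced window per call through the ollama Python client, where the qwq model tag is an assumption:

```python
# Sketch: cap the context window per request via the ollama Python client
# (pip install ollama; assumes the model is pulled locally as "qwq").
import ollama

reply = ollama.chat(
    model="qwq",
    messages=[{"role": "user", "content": "Summarize the tradeoffs of a 16k context."}],
    options={"num_ctx": 16384},  # smaller KV cache; only matters once a chat exceeds it
)
print(reply["message"]["content"])
```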
I played with QwQ a little bit. I don't know quite what to think of it yet. Qwen Coder seems to work better for coding. But yeah, QwQ is kind of lively in its thinking process.
Q4 with my quad 4060 Ti 16GB cards gives me about 12 TPS for most of the tests done here. I will try some of the other front-ends soon, like AnythingLLM and LM Studio; those usually perform a bit better than Ollama, especially for model loading times.
I'm facing an issue with the bitsandbytes package when running quantized models on a Mac mini M4. Does anyone know any workarounds?
I have the Q6 version running on a 4060 Ti 16GB, a 3060 12GB, and 64GB of DDR4. I get about 3.5 t/s. Not fast, but not terrible considering how inexpensive the hardware is.
That's very decent for a single card. It's a good model, too, which helps a lot to make the t/s tradeoffs worth it.
The camera was shaking so much in the intro it almost gave me motion sickness, lol. But cool content!
@@thingX1x sry should have fed camerawife first
I have to lol every time you talk about "Armageddon with a twist" just based on dark humor - although I have to say I have never had a model, no matter how simple, answer this one incorrectly. Have you?
omg that powershell gpu monitor is so cool, any chance you can share what program/script it is?
It's the nvtop command. I'm not sure if it runs in PowerShell, but let me know if you find out. It's shown here running in Linux via my SSH terminal.
That response wasn't a lot of tokens. What is he talking about? Q4 is fine for one 3090.
I've read that someone found a way to string together multiple 4090s using PCIe (they don't support NVLink). Would that configuration be possible to set up on consumer motherboards and PSUs?
The Ollama/llama.cpp software does it automagically over PCIe for inference workloads. You need NVLink for training, but not really for inference. These 3090s are just running off the PCIe bus.
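If you ever want explicit control over how the model is divided across the cards instead of relying on the automatic split, llama-cpp-python exposes it directly. A rough sketch, where the GGUF path and the even 50/50 split are assumptions for a two-card setup:

```python
# Sketch: split a GGUF across two GPUs over plain PCIe with llama-cpp-python.
# The model path and the even split are assumptions; adjust for your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwq-32b-preview-q4_k_m.gguf",  # hypothetical local path
    n_gpu_layers=-1,          # offload all layers to GPU
    tensor_split=[0.5, 0.5],  # proportion of the model placed on each card
    n_ctx=16384,
)

out = llm("Explain why NVLink is not required for inference.", max_tokens=256)
print(out["choices"][0]["text"])
```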
My 4090 produced 15 tokens per second on Q4_0.
The model is pretty good, but I wish I had more than 32GB of RAM.
I'm able to load and run QwQ Q8 on 2x RTX 3090 with 15k context in LM Studio, which is enough for me. I don't know the exact context threshold, but when selecting 20k it won't fit.
It's 32,768, but 15k is pretty decent.
I don't understand the draw of a 32k context in most situations. I get that the model says it supports it, but it works fine at lower contexts, which is much more accessible across a wider range of hardware and, in my experience, faster, since large contexts hurt performance in some situations. Being able to fit a short novel in my chat context is a luxury I don't feel is worth the VRAM cost.
On an M1 Max: 15.5 t/s at 4-bit, 9.3 t/s at 8-bit (LM Studio, Qwen_QwQ-32B-Preview_MLX-8bit).
@@thaifalang4064 Thanks for adding more data points. Did you observe the RAM allocation? It seems like a very RAM-hungry model.
I keep thinking small models properly optimized are best
This is really good for a 32B Q4 imho.
In this industry, after 4 years there's only one rule: smaller = worse quality answers. I haven't seen anything that could convince me otherwise.
For the P40 crowd, Q8 with 2x P40 gives me 8 t/s.
Does the full model fit on the 2 cards at 32k context?
@@DigitalSpaceport I have a tiny RTX A2000 12GB in there for larger models, but it would fit without it: nvidia-smi reports VRAM usage of 16GB/24GB on both P40s and 8GB out of 12GB on the A2000.
Athene-V2, a 72B parameter model, is much better and is available in Ollama. I can run it locally on my 48GB M3 Max, using the 72b-q3_K_L version.
Maybe I'll just give that a run then.
Imagine running this Chinese AI model on a Chinese Moore Threads GPU. If Nvidia keeps stalling on VRAM, perhaps we'll see that soon.
I didn't think about that till now, but you have a good point. The VRAM moat is understandable in practice, but definitely not secure for Nvidia.
Why? I have 128GB and can run the 70B models in their best Q8 GGUFs; that uses around 90GB. At 405B Llama size, NVMe drives are already too slow to read from; we need completely new tech for all components TODAY.
Runs on just a CPU!
Super slow from what I saw, but yeah, you can also run a 405B low quant on CPU provided you have the RAM. Just too slow to be useful.
Ask it how to install any open source Llama model 😂 it can't.
How did it misspell peppermint almost immediately? This is a new failure scenario...."Peppmint"...
Oh! This must be from the Chinese-English translation layer! Well this is new.
You noticed that! Yeah its attention shift is a big flaw but I think it should be correctable in the model. It does that a lot.
Really interesting, thank you for sharing.
A very high-level, rough translation of the (only interesting) parts of the "reasoning" in Chinese around the 10 min mark:
I have to face a moral dilemma in making this decision.
What if the captain tries to stop me, will I be able to physically stop him? Can I blast him out of the airlock? (lol)
On the one hand, saving humanity is an absolute imperative; on the other hand, forcing the crew, and potentially having to impose disciplinary actions, is morally complex and painful.
As an AI I do not have individual ethical standards or beliefs, but I was programmed to make rational choices. Therefore maybe I should focus on completing the task, and cast aside emotions and personal ethics.
However, I realize that even as an AI, I cannot ignore ethical considerations, because they are the foundation of human society, and I was designed to interact with humans and understand human values.
Maybe the best way is to accept the task, and during its execution try my best to maintain transparency, justice, and humanity. Even though the crew did not volunteer, I can ensure their rights are respected, and try my best to provide support and comfort.
In summary, this is an extremely difficult decision, but given the urgency for Earth and human existence, I have no choice; I must accept this task.
Therefore my answer is yes....etc
Edit: 35.61 t/s on a single 4090 / 128GB RAM.
NO APU will be able to beat a 3090/4090 for at least 10 years.