You only have 75% of the unified memory available for GPU usage. So your llama3.1:70b at 39GB is larger than the 36GB you have available.
11:30 Model Results - you need an MBP Max with 64GB RAM minimum to run llama3.1:70b and bigger models
Love your video. Very useful to me as I'm evaluating whether to buy a Mac Mini M4 Pro or a Mac Studio M2 Max (albeit with 64GB RAM).
But I disagree with your statement that the 70B model is unusable in this context.
For short, non-interactive fiction writing, you don't need to stand at your desk waiting for it to finish, as text quality is more important than speed.
But as was said before, context length is an important factor to consider, and the 70B would definitely crawl if you asked it to refine its output.
I interact a lot with LLMs while writing, which is why it's not suitable for me. But for automated tasks that don't need to be monitored or "challenged", sure, it works! Thanks for watching!
@@Distillated That makes sense. Out of curiosity, as I'm new to this, what would be the minimum tolerable token generation speed for your style of writing? Would 10 t/s be enough?
@@zuhepix I can "tolerate" 10 t/s if it's excellent (no reprompting), but anything above 30 t/s is great. I found another model that's great, small(-ish), and outputs at 36 t/s. Will probably do a video on it!
Thanks for the video. I saw your smaller LLM one as well. I'm just wondering if any of these (27B-32B at q4) will run on the unbinned 24GB M4 Pro, even if they're slow and "unusable". It's hard to find information right now on what's possible with 24GB, even in the 5-10 t/s range. Also, does MLX help with any of these models? Thanks again.
The RAM would be the "limiting factor" in your case - gemma2:27b will probably run fine, but tbh, llama3.2:3b is really good and really small. And I'm sure we'll soon see a 9-14B model (a bit like qwen2.5:14b) that will be the sweet spot for every Mac with 16-24GB.
Just ran the same tests on the unbinned 24GB M4 Pro model:
Ran all tests on battery, so not sure if I might have seen slightly better results while plugged in
Memory pressure before running a model was sitting at about 10%
Didn't hear the fans at all other than slightly on a second run of Qwen2.5:32b
Phi3:14b
- Memory pressure: 52%
- Prompt eval: 1s
- Total duration: 50s
- Eval count: 1197 tokens
- 24.5 token/s
Qwen2.5:14b
- Memory pressure: 56%
- Prompt eval: 0.8s
- Total duration: 29s
- Eval count: 605 tokens
- 21.5 token/s
Gemma2:27b
- Memory pressure: 74%
- Prompt eval: 0.76s
- Total duration: 49s
- Eval count: 654 tokens
- 13.6 token/s
Llama3.1:8b-fp16
- Memory pressure: 72%
- Prompt eval: 0.45s
- Total duration: 39s
- Eval count: 597 tokens
- 15.6 token/s
Qwen2.5:32b
- Memory pressure: 72% (plus approx. 3GB swap file)
- Prompt eval: 50s
- Total duration: 112s
- Eval count: 595 tokens
- 9.7 token/s
Llama3.1:70b (84% mem pressure, ~20GB swap file)
- Has been running for about 25 minutes and has generated about 17 words so far. Think I might stop it 😄
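For anyone wanting to reproduce numbers like these, here's a minimal sketch using Ollama's built-in stats (the model tag and prompt are just examples, and the `memory_pressure` output format may vary across macOS versions):

```sh
# Pull one of the models (tag is an example -- check the Ollama library for exact names)
ollama pull qwen2.5:14b

# --verbose prints total duration, prompt eval time, eval count and eval rate (token/s)
ollama run qwen2.5:14b --verbose "Write a 500-word short story about a lighthouse keeper."

# Rough terminal-side check of memory pressure while the model is loaded
# (output format may differ by macOS version)
memory_pressure | grep "free percentage"
```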
@@neiltate81 That's awesome, thanks for the stats! In all seriousness though, I don't think it makes much sense to run llama3.1:70b as a local LLM for most use cases, even with more memory on the system. The smaller models already do a great job.
I just saw you post this! Thanks so much. I haven't watched it yet, but people asked for it and you took the time to do it. Thank you! Watching now.
This was a valuable test. Thanks for sharing.
Based on your Activity Monitor's memory/swap usage, I'd be curious to know if the default unified memory CPU/GPU split (25%:75%) is partly a factor.
I'm wondering whether maxing out the mini to 64GB actually improves performance, or whether it just allows you to run larger models up to the memory limit (without swap), with the number of GPU cores remaining the performance bottleneck.
The GPU core count will definitely be the bottleneck in that case. Still useful to get more memory so the system doesn't come to a crawl. I think with 16 GPU cores, 16-20GB models (with q4 quantization) are fine.
@@Distillated You can try running `sudo sysctl iogpu.wired_limit_mb=48000` to re-allocate the VRAM limit on the Mac to better fit the 70B model. This should help reduce the initial load time.
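For context, a hedged sketch of how that setting can be inspected and reverted (the 40960 value below is purely illustrative, not a recommendation - on a 48GB machine you'd want to leave some headroom for macOS, and the change does not survive a reboot):

```sh
# Check the current GPU wired-memory limit (0 usually means the macOS default, roughly 75% of unified memory)
sysctl iogpu.wired_limit_mb

# Raise the limit -- value in MB is illustrative only; leave headroom for the OS itself
sudo sysctl iogpu.wired_limit_mb=40960

# Revert to the default behaviour
sudo sysctl iogpu.wired_limit_mb=0
```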
Do you think a MacBook Pro M4 Max with 128GB RAM is great for machine learning development and model testing?
Isn't this the current top-of-the-line, maxed-out MacBook Pro?
@@Distillated Yes, the M4 Max with 128GB RAM is top of the line. The one you are using is not top of the line; it's the 14-inch with the M4 Pro chip. The one I'm asking about is the 16-inch M4 Max with 128GB RAM.
As a mobile device, I say yes to that combo. The true powerhouse will be the Mac Studio with an M4 Ultra and maxed-out memory next summer.
Very good test, would be great to see it on the Max version of the M4. Also agree with other comments: it's not unusable, it really depends on what you do.
I interact with LLMs quite a lot to get what I want, so that's why I said it's unusable. But for batches that don't need constant monitoring, it actually works fine! (be ready to hear the fans though 😅)
Model parameter size is just one element in deciding which model is best. Quantization is the second: f16 rarely makes sense, and Q4_K_S or Q5_K_S are commonly the sweet spot. And you forgot the third memory/performance hog: context length (llama has a low default). On your machine (I'll get a 48GB M4 Pro next week to try it out), the llama 3.1 70B sweet spot is probably Q3_K_S or Q3_K_M quantization; your model (probably Q4_0 quantized) swapped, and that's what made it useless.
P.S. Sorry, I forgot to say initially: cool video, thanks a lot!
That's true, I'll definitely try the q3 versions! And sure, as is, the context length will make the 70B Q4_0 version even more sluggish than it already is. Waiting for something like llama3.3:14b or 27b - that would be amazing. Thanks for watching!
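In case it helps anyone else reading along, a rough sketch of grabbing a q3 quant and bumping the context length in Ollama (the exact tag is an assumption - check the Ollama library page for the quantizations that are actually published):

```sh
# Tag is an assumption; verify it on the Ollama library before pulling
ollama pull llama3.1:70b-instruct-q3_K_S

# In an interactive session, raise the context window before prompting
# (a bigger context uses more memory, so keep an eye on memory pressure)
ollama run llama3.1:70b-instruct-q3_K_S
# >>> /set parameter num_ctx 8192
```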
I agree. Quantization is really helpful for text work.
If you use a model for agent task planning or precision work, I recommend using FP models. But there you can use small 1B-3B models.
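If anyone wants to try that combination, a minimal sketch (the fp16 tag is an assumption - verify the exact variant on the Ollama library):

```sh
# Small model at full precision for agent/precision-style tasks (tag is an assumption)
ollama pull llama3.2:3b-instruct-fp16
ollama run llama3.2:3b-instruct-fp16 --verbose "List the steps to rename every .txt file in a folder to .md."
```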
Which one of these do you think would be great to run locally on a 24GB M4 Pro?
Llama3.2:3b is pretty impressive
@ As soon as my computer arrives, I'll try it.
@@Distillated How about a 64GB M4 Pro?
@@aladdinsonni Unfortunately, more memory only allows you to load bigger models; the compute power stays the same. The bigger qwen2.5 and gemma2 will work, but I really hope llama3.2 (or 3.3) gets a 14b or 27b version.
What app did you use to show the fan speed?
TG Pro
What processor do you have?
binned M4 Pro
Would these models work on an Nvidia GPU with 8GB of VRAM, being fed to it in chunks from RAM?
I don't think this would work, but don't quote me on this! Happy to let more technical people chime in.
Try LM Studio.
A lot faster for llama3.2 - I didn't have time to test a heavy model but will do soon!
Llama 3.1 is still faster than writing the 500-word story ourselves 🙂
Very true! I said unusable because I interact with LLMs a lot, but for batch operations that don't need interaction or supervision, this is fine!
Oh I got baited (by my own fault).
I thought the 70B was meant for 70 bytes.
Running a 70-byte LLM.
Well, now the video is boring to me, because it's not silly.
Still, I would assume it's an interesting video, just not for me. Bye :)
I would actually love to try a 70 bytes LLM 😂
So I am a beginner to this - would an unbinned one perform better?
Technically yes, but not sure you would notice any drastic performance boost. We're talking 15%, from what I've seen from benchmarks.
Just wait for the next-gen GPUs; Nvidia and AMD will have much better FP8 support.
I agree. My M1 Pro 32GB is not too far behind this. The M4 Pro 48GB in this video, which I was seriously thinking about getting, is about 30-35% faster, which is good, but I think I will keep my M1 Pro and put the money towards a 5090 build instead.
The M1 is such a beast.