You only have 75% of the unified memory available for GPU usage. So your llama3.1:70b at 39GB is larger than the 36GB you have available.
11:30 Model Results - you need an MBP Max with 64GB RAM minimum to run llama3.1:70b and bigger models
Love your video. Very useful to me as I'm evaluating whether to buy a Mac Mini M4 Pro or a Mac Studio M2 Max (albeit with 64GB RAM).
But I disagree with your statement that the 70B model is unusable in this context.
For short, non-interactive fiction writing, you don't need to stand at your desk waiting for it to finish, as text quality is more important than speed.
But as was said before, context length is an important factor to consider, and the 70B would definitely crawl if you asked it to refine its output.
I interact a lot with LLMs while writing, which is why it's not suitable for me. But for automated tasks that don't need to be monitored or "challenged", sure, it works! Thanks for watching!
@@Distillated That makes sense. Out of curiosity, as I'm new to this, what would be the minimum tolerable token generation speed for your style of writing? Would 10 t/s be enough?
@@zuhepix I can "tolerate" 10 t/s if it's excellent (no reprompting), but anything above 30 t/s is great. I found another model that's great, small(-ish), and outputs at 36 t/s. Will probably do a video on it!
Thanks for the video. I saw your smaller LLM one as well. I'm just wondering if any of these (27B-32B at q4) will run on the unbinned 24GB M4 Pro, even if they're slow and "unusable". It's hard to find information right now on what's possible with 24GB, even in the 5-10 t/s range. Also, does MLX help with any of these models? Thanks again.
The RAM would be the "limiting factor" in your case - gemma2:27b will probably run fine, but tbh, llama3.2:3b is really good and really small. And I'm sure we'll soon see a 9-14B model (a bit like qwen2.5:14b) that will be the sweet spot for every Mac with 16-24GB.
Just ran the same tests on the unbinned 24GB M4 Pro model:
Ran all tests on battery, so not sure if I might have seen slightly better results while plugged in
Memory pressure before running a model was sitting at about 10%
Didn't hear the fans at all other than slightly on a second run of Qwen2.5:32b
Phi3:14b
- Memory pressure: 52%
- Prompt eval: 1s
- Total duration: 50s
- Eval count: 1197 tokens
- 24.5 token/s
Qwen2.5:14b
- Memory pressure: 56%
- Prompt eval: 0.8s
- Total duration: 29s
- Eval count: 605 tokens
- 21.5 token/s
Gemma2:27b
- Memory pressure: 74%
- Prompt eval: 0.76s
- Total duration: 49s
- Eval count: 654 tokens
- 13.6 token/s
Llama3.1:8b-fp16
- Memory pressure: 72%
- Prompt eval: 0.45s
- Total duration: 39s
- Eval count: 597 tokens
- 15.6 token/s
Qwen2.5:32b
- Memory pressure: 72% (plus approx. 3GB swap file)
- Prompt eval: 50s
- Total duration: 112s
- Eval count: 595 tokens
- 9.7 token/s
Llama3.1:70b (84% mem pressure, ~20GB swap file)
- Has been running for about 25 minutes and has generated about 17 words so far. Think I might stop it 😄
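For anyone wanting to reproduce numbers like these, here's a minimal sketch using Ollama's built-in stats (the model tag and prompt are just examples, and the `memory_pressure` output format may vary across macOS versions):

```sh
# Pull one of the models (tag is an example -- check the Ollama library for exact names)
ollama pull qwen2.5:14b

# --verbose prints total duration, prompt eval time, eval count and eval rate (token/s)
ollama run qwen2.5:14b --verbose "Write a 500-word short story about a lighthouse keeper."

# Rough terminal-side check of memory pressure while the model is loaded
# (output format may differ by macOS version)
memory_pressure | grep "free percentage"
```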
@@neiltate81 That's awesome, thanks for the stats! In all seriousness though, I don't think it makes much sense to run llama3.1:70b as a local LLM for most use cases, even with more memory on the system. The smaller models already do a great job.
I just saw you post this! Thanks so much. I haven't watched it yet, but people asked for it and you took the time to do it. Thank you! Watching now.
This was a valuable test. Thanks for sharing.
Based on your Activity Monitor's memory/swap usage, I'd be curious to know if the default unified memory CPU/GPU split (25%:75%) is partly a factor.
I'm wondering whether maxing out the mini to 64GB actually improves performance, or whether it just allows you to run larger models up to the memory limit (without swap), with the number of GPU cores remaining the performance bottleneck.
The GPU core count will definitely be the bottleneck in that case. Still useful to get more memory so the system doesn't come to a crawl. I think with 16 GPU cores, 16-20GB models (with q4 quantization) are fine.
@@Distillated You can try running `sudo sysctl iogpu.wired_limit_mb=48000` to re-allocate the VRAM limit on the Mac to better fit the 70B model. This should help reduce the initial load time.
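For context, a hedged sketch of how that setting can be inspected and reverted (the 40960 value below is purely illustrative, not a recommendation - on a 48GB machine you'd want to leave some headroom for macOS, and the change does not survive a reboot):

```sh
# Check the current GPU wired-memory limit (0 usually means the macOS default, roughly 75% of unified memory)
sysctl iogpu.wired_limit_mb

# Raise the limit -- value in MB is illustrative only; leave headroom for the OS itself
sudo sysctl iogpu.wired_limit_mb=40960

# Revert to the default behaviour
sudo sysctl iogpu.wired_limit_mb=0
```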
Do you think a MacBook Pro M4 Max with 128GB RAM is great for machine learning development and model testing?
Isn't this the current top-of-the-line, maxed-out MacBook Pro?
@@Distillated Yes, the M4 Max with 128GB RAM is top of the line. The one you are using is not top of the line; it's the 14-inch with the M4 Pro chip. The one I'm asking about is the 16-inch M4 Max with 128GB RAM.
As a mobile device, I say yes to that combo. The true powerhouse will be the Mac Studio with an M4 Ultra and maxed-out memory next summer.
Very good test, would be great to see it on the Max version of the M4. Also agree with other comments: it's not unusable, it really depends on what you do.
I interact with LLMs quite a lot to get what I want, so that's why I said it's unusable. But for batches that don't need constant monitoring, it actually works fine! (be ready to hear the fans though 😅)
Model parameter size is just one element in deciding which model is best. Quantization is the second: f16 rarely makes sense, and Q4_K_S or Q5_K_S are commonly the sweet spot. And you forgot the third memory/performance hog: context length (llama has a low default). On your machine (I'll get a 48GB M4 Pro next week to try it out), the llama 3.1 70B sweet spot is probably Q3_K_S or Q3_K_M quantization; your model (probably Q4_0 quantized) swapped, and that's what made it useless.
P.S. Sorry, I forgot to say initially: cool video, thanks a lot!
That's true, I'll definitely try the q3 versions! And sure, as is, the context length will make the 70B Q4_0 version even more sluggish than it already is. Waiting for something like llama3.3:14b or 27b - that would be amazing. Thanks for watching!
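In case it helps anyone else reading along, a rough sketch of grabbing a q3 quant and bumping the context length in Ollama (the exact tag is an assumption - check the Ollama library page for the quantizations that are actually published):

```sh
# Tag is an assumption; verify it on the Ollama library before pulling
ollama pull llama3.1:70b-instruct-q3_K_S

# In an interactive session, raise the context window before prompting
# (a bigger context uses more memory, so keep an eye on memory pressure)
ollama run llama3.1:70b-instruct-q3_K_S
# >>> /set parameter num_ctx 8192
```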
I agree. Quantization is really helpful for text work.
If you use a model for agent task planning or precision work, I recommend using FP models. But there you can use small 1B-3B models.
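If anyone wants to try that combination, a minimal sketch (the fp16 tag is an assumption - verify the exact variant on the Ollama library):

```sh
# Small model at full precision for agent/precision-style tasks (tag is an assumption)
ollama pull llama3.2:3b-instruct-fp16
ollama run llama3.2:3b-instruct-fp16 --verbose "List the steps to rename every .txt file in a folder to .md."
```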
Which one of these do you think would be great to run locally on a 24GB M4 Pro?
Llama3.2:3b is pretty impressive
@ As soon as my computer arrives, I'll try it.
@@Distillated How about a 64GB M4 Pro?
@@aladdinsonni Unfortunately, more memory only allows you to load bigger models; the compute power stays the same. The bigger qwen2.5 and gemma2 will work, but I really hope llama3.2 (or 3.3) gets a 14b or 27b version.
What app did you use to show the fan speed?
TG Pro
What processor do you have?
binned M4 Pro
Would these models work on an Nvidia GPU with 8GB of VRAM, being fed to it in chunks from RAM?
I don't think this would work, but don't quote me on this! Happy to let more technical people chime in.
Try LM Studio.
A lot faster for llama3.2 - I didn't have time to test a heavy model but will do soon!
Llama 3.1 is still faster than writing the 500-word story ourselves 🙂
Very true! I said unusable because I interact with LLMs a lot, but for batch operations that don't need interaction or supervision, this is fine!
Oh I got baited (by my own fault).
I thought the 70B was meant for 70 bytes.
Running a 70-byte LLM.
Well, now the video is boring to me, because it's not silly.
Still, I would assume it's an interesting video, just not for me. Bye :)
I would actually love to try a 70 bytes LLM 😂
So I am a beginner to this - would an unbinned one perform better?
Technically yes, but not sure you would notice any drastic performance boost. We're talking 15%, from what I've seen from benchmarks.
Just wait for the next-gen GPUs; Nvidia and AMD will have much better FP8 support.
I agree. My M1 Pro 32GB is not too far behind this. The M4 Pro 48GB in this video, which I was seriously thinking about getting, is about 30-35% faster, which is good, but I think I will keep my M1 Pro and put the money towards a 5090 build instead.
The M1 is such a beast.