We need more channels like this that perform and showcase proper testing and hardware requirements for different models. Good job. I hope you will produce more related and useful content like this in the future. I wish your channel massive growth!
Wow, I appreciate that! Just playing around in the lab and hoping people get something from it
Ah, nice. That's what I was looking for.
Not TPS, we already could predict that, but the combined power draw of 3 or 4 cards is what I was looking for. I expected power draw to be fairly low, but I was still surprised by all cards not pulling more than 200W at any given time during inference.
Thanks for the test!
Yea I concur, these 4060's sip power - and that was a large reason I chose them for my specific needs originally. Still surprises me though.
The 4060 Ti 16GB is far from power hungry, with a TDP of 165W. And you can use Afterburner or nvidia-smi to limit power to something like 100-120W without a noticeable impact on inference. The 4060 Ti performs slightly better than the RTX 2000 Ada 16GB.
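For anyone who would rather script that cap than click through Afterburner, here's a minimal sketch (mine, not from the video) that shells out to nvidia-smi; it assumes the standard -pl flag, needs root, and the 120W figure is just the range mentioned above:

```python
import subprocess

def set_power_limit(watts: int = 120) -> None:
    """Cap every detected NVIDIA GPU at `watts` via nvidia-smi (requires root)."""
    # List the GPU indices, e.g. "0\n1\n2\n3"
    query = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    for idx in query.stdout.split():
        # -i selects the GPU, -pl sets the board power limit in watts
        subprocess.run(["nvidia-smi", "-i", idx, "-pl", str(watts)], check=True)

if __name__ == "__main__":
    set_power_limit(120)  # roughly the 100-120W range suggested above
```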
Outstanding!
So with a small investment a person could run a 70B model locally.
In the past I have not been so happy with the Q4 size. It would be very interesting to see a comparison with the Q6 or Q8 model.
Thanks for the time spent and the valuable info.
Sure no problem we can do higher quants!
Very helpful, thanks! 2x 3090 give the same 48 GB VRAM and better performance than 3x 4060 Ti 16 GB.
Glad it helped! We did both these cards in the leaderboard series if you want performance differences, check those videos out!
Cooling has always been my biggest problem. My 2c about riser cables, just so someone else doesn't run into the issues I had: if you intend on having more than 6 cards in a single machine, look into retimer or at the least redriver cards.
Redrivers effectively boost the signal, which hopefully helps reduce signal loss.
Retimers re-package the signal and do a little more processing, and they're ultimately what I ended up getting.
If you do this, it's then possible to splurge on retimers with bifurcation built in, allowing you to extend your x16 slots into x8/x8 slots if your board permits it. In theory, with a standard Threadripper motherboard with 7 x16 slots, you could get up to 14 RTX 4090s onto a single root.
It wouldn't be helpful for training, but if you were like me and wanted to run many models at the same time while also having enough GPUs to try super large models, this might be useful.
You're a mad man! I've been working on a GPU node for my home k3s cluster, and getting into hosting a few jupyter containers. My hardware GPUs are mostly AMD 7900XTXs, but I look forward to testing out ollama and the 70B model on multiple GPUs. Also trying to do some locally hosted fine tuning. If all else fails, I can roast a few marshmallows. Hope the pizza was good!
Sounds like you are also a mad man, prob with a good beard or mustache! If you are rocking those AMD cards - the viewers here would love to see some collab/results - lots of questions about AMD....
How's the performance from the 7900 XTX? I'm looking into this as an option vs getting a P40.
@@peterxyz3541 Sorry for the delay! The 7900XTX is on sale right now for ~$770 and does REALLY well for inference workloads. I was regularly approaching 100 tokens/s. The toolchain for anything else (like fine tuning) is completely in Nvidia's ballpark. So if all you're wanting to do is LLM inference, the XTX works VERY well. I'm quite pleased. (Running it with Ollama + Ubuntu 24.04).
I'm getting an M4 Max MBP with 64GB RAM for work. One of the first things I'll do is try to run this model :-)
That's a nice machine! It should eat these cards alive - we recently ran my M1 Max through our leaderboard series.
I ran a test with LM Studio on only 1 GPU, a 4060 Ti 16GB:
100% CPU -> 1.38 tk/s (00/80 GPU OffLoad)
CPU + GPU -> 2.00 tk/s (31/80 GPU OffLoad)
100% GPU -> 0.42 tk/s (80/80 GPU OffLoad)
CPU Ryzen 9 7900 (105W power config), MSI RTX4060TI 16GB (Core 2895 MHz + VRAM 2237 MHz), DDR5 96GB 6000MT/s
mradermacher Meta-Llama-3-70B-Instruct.Q4_K_M.gguf
Your model is too big to fit in a single GPU. With 16GB you should stick with models up to 14B.
Why does it not run faster when using 4 cards instead of only three, and why do they only use ~50W instead of their max of 165W?
Do you have an explanation for that and know where the bottleneck is? Interesting test btw :)
190W with 4 GPUs seems pretty decent for the performance it gives!
I concur!
What about two 4060Ti's? I'm curious what kind of system memory offload is needed for 8k context with that setup, and what the tk/s is.
Thank so you much for this
Thanks for watching! I hope there is valuable information for the community, or at least some fun going on here
I know it'd be SLOW… but very curious what 70B at FP16 would do? As that's what we need, even if we have to wait all night for an answer (but couldn't wait 4 nights!). 🙃
I would love to know the cheapest GPU you'd need for a dedicated text-to-speech application running something like XTTSv2, for having your LLMs talk to you as quickly as possible. I imagine speed will be key here, but how much VRAM, and what is fast enough? Inquiring minds... I mean we all want our own Iron Man homelab with JARVIS to talk to, right?
I don't work much with TTS, or STT - but we can go down that road and see where it takes us
I have my home-rolled virtual assistant running with Whisper and Piper; both are CPU loads, thus I use the GPU only for Llama.
I use 7900XTX and 2 7600XT to run 70b Q4_K_M. I get 5-7 tokens per second but I still play with this setup.
So RTX 3090/RTX4090 and 2x RTX 4060ti would be enough to run 70b.
2x fast 24GB cards like RTX3090/4090/7900XTX would be better for the speed.
Thank you very much for this 🙂
My pleasure 😊
Thanks 👍
Thank you too
What are your thoughts/experiences on individual 4060 cards vs sets of 3090 cards with NVLink?
Wish I had some thoughts on it. I don't have 3090's in the lab to test with (there may or may not be one in the very near future).... the A4500's support NVLink, but I haven't bothered to go down that route for inference and just let CUDA/etc do its job. I could make assumptions on speed/etc, more so for loading, but they would just be that: assumptions.
I have a 3090 + 4060 Ti (40GB total). Llama-based 70B Q4_K_S models run at ~7 tokens/sec as long as I reduce context size to fit everything in the GPUs. Whenever I increase context/quants and the model spills over to the CPU (5600X), speed drops to ~2.5 tokens/sec.
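That context-size effect comes down to the KV cache competing with the weights for VRAM. A rough back-of-the-envelope, assuming Llama 3 70B's published shape (80 layers, 8 KV heads via GQA, head dim 128) and an FP16 cache - treat the numbers as an estimate, not a measurement:

```python
# Approximate KV-cache footprint for Llama 3 70B with an FP16 (2-byte) cache.
layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2

per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V tensors
print(f"KV cache per token: {per_token / 1024:.0f} KiB")       # ~320 KiB

for ctx in (2048, 4096, 8192):
    total_gib = ctx * per_token / 1024**3
    print(f"{ctx:>5} tokens -> ~{total_gib:.1f} GiB of VRAM on top of the weights")
```

At ~2.5 GiB for an 8k context, it's easy to see how bumping the context pushes the last layers off a 40GB pair of cards and onto the CPU.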
The graphs are showing ~20-25% GPU usage on each of the four GPUs. That also explains the ~180W total power draw, rather than 400W or so (100W per GPU). Could you please explain why it is not consuming all the GPU power?
It implies it's limited by memory bandwidth. In theory you can squeeze more tokens/sec out of it by running multiple requests in parallel.
@@spookym0f0 I'm more inclined to think it is limited by single-threaded CPU usage, resulting in tasks running sequentially on the GPUs. Indeed, for some reason we don't see parallelism on the CPU side, just one process at 100% usage.
Probably that 128-bit memory bus on those cards
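For anyone who wants to try the parallel-request suggestion a couple of replies up, here's a minimal sketch against a local OpenAI-compatible server. The URL, port, and model name are placeholders, and it only helps if the server actually batches concurrent requests (llama.cpp's server, for instance, needs its parallel-slots option enabled):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"  # hypothetical local endpoint
MODEL = "llama3-70b-q4"                             # hypothetical model name

def ask(prompt: str) -> str:
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

prompts = [f"Give me fact #{i} about GPUs." for i in range(4)]
# Fire the requests concurrently; aggregate tokens/s should rise if the server batches.
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])
```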
Thanks for the great content! Can you recommend any of these mobos for local ML and light gaming and content creation?
MSI MEG X670E ACE
ProArt X670E
ROG Strix X670E-E Gaming WiFi
Hmm... I think rather than 3x 4060 Ti 16GB I'd prefer trying for 2x RTX 3090. Similar total price point, same overall memory and should be about 70% faster.
That's what I'm running and I'm extremely pleased with the results.
Very plausible depending on needs, pocketbook, power bill, etc - but that's prob exactly why people are watching this guy make questionable decisions with my money? All trying to figure it out.
Can you test the phi-3.5 model? Would I be able to run it with 2 RTX 3090?
Can the motherboard handle that much power consumption from the 3.3V and 5V from the pcie slots for the gpus without powered risers or extra pci-e power like other workstation motherboards?
Good day! Do you have all the 4060s on full x16 PCIe lanes? I would like to see a test with them cut down to x8 lanes and how that affects the speed.
4060 Ti's only use/support x8 - and these tests were all running at x8.
Would 3x 7900 XTX 24GB perform better? Since the price ain't that different but you get 50% more VRAM - so you might stretch it to Q6 and get some performance?
The software and models only run reliably on Nvidia for now
@savethebiosphere with what card did you experience unstable performance?
I was watching the GPU utilization percentage and it seems the average is around 25% for each GPU, or am I wrong? Is it expected to be so, or is there any configuration to utilize more GPU % during inference?
It's more about spreading the load and using VRAM across the board rather than getting more speed or processing power. Also, llama.cpp (under the hood here) isn't as great at multi-GPU setups as, say, vLLM or some others, depending on your use cases/etc.
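For anyone driving this from llama-cpp-python instead of a UI, the split across cards is controlled with tensor_split. A minimal sketch, assuming those bindings; the model path and the even ratios are placeholders for your own setup:

```python
from llama_cpp import Llama

# Spread the layers over three GPUs; tensor_split is a per-device ratio list.
llm = Llama(
    model_path="/models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,               # offload every layer that fits
    tensor_split=[1.0, 1.0, 1.0],  # equal share per card
    n_ctx=4096,
)

out = llm.create_completion("Briefly explain tensor splitting.", max_tokens=64)
print(out["choices"][0]["text"])
```

Even split this way, the cards still take turns layer by layer, which is consistent with the ~25% utilization seen in the video.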
Can you also add *Stable Diffusion* to your tests?
I don't have a ton of experience in the image/video generation side, but we absolutely could start doing those type tests and learn together
I would be very curious how well those AMD cards will run
AMD cards are one third slower than Nvidia for LLM inference.
"For Science": Running llama3 7B on a single 24gb 7900XTX card with the prompt: "Make me a good Gobi Manchurian recipe." yielded:
Prompt eval count: 22
Prompt eval rate: 1193.7 tokens/s
Eval count: 783 tokens
Eval duration: 8.3s
Eval rate: 94.31 tokens/s.
Server specs:
Mobo: MSI MPG Z490 ATX
CPU: i9-10850K
RAM: 96gb
OS: Ubuntu Server 22.04 - Integrated GPU used for console so the AMD is dedicated to ollama workloads
Drive: 1TB NVMe Samsung SSD 970 EVO
PSU: 1000W Corsair RM1000x
I need to get a 4U case and a more capable motherboard to extend my GPU count.
It seems to me, looking at the output of the 4060's running here, that the 4060 is a bit too slow to be a productive experience for interactive work, and is better suited for automated processes where you're not waiting on the output. I see you have your 4060's for sale, would you agree on this? What is your take on the 4060 16gb at this point?
I think the 4060 (16GB) is a great card - people will flame me for that but hey. It's absolutely useable for interactive work if you are not expecting lightning-fast responses. Though on small models the 4060 really flies for what it is and how much power they use. Lower, lower, lower your expectations until your goals are met.....
I did sell some of my 4060's, but only because I replaced them with A4500's in my main rig, so they got rotated out. I kept 4 of them, which most days are doing agentic things in automations/bots/etc while sipping power, or are in my kids' gaming rigs.
Just a note: sitting here with a 32" 1440P screen, and I can barely read the text you are showing
Thanks for the feedback - it was recorded in, and is best viewed at, 4K, but I have tried to do much better in the newer videos. This one is a bit older.
Intel Arc A770 any good for this? (With ZLUDA?)
I don't have any to test with, but I do believe llama.cpp, etc, etc support Intel ARC with SYCL
localai.io/features/gpu-acceleration/#intel-acceleration-sycl
One 4090 + 192GB DDR5 and you're good to go.
Yep more than enough for most things depending on your needs, and how deep your budget is.
hmmm to go 2x 3090 or 3x 4060Ti....
Check out the newer video on the channel where I bring a 3090 into the lab - maybe it will help inform ya?
200w, crazy
Agreed!
Thx so much, your experiment shows that for 70B parameters, just using 3x 16GB GPUs is optimal in terms of cost/performance. Can we assume that 3x RX 6800 XT 16GB (used, costing about 150 USD less) could more or less handle 70B parameters?
I am sure they would handle it as far as offloading goes; can't say what kind of speed, of course. I am also not experienced with ROCm, or splitting on AMD cards, but just to make a point: I could dust off 24GB M40's from 2015 and run Llama 3 70B on them, or even run it purely on CPU with 48GB+ of RAM.... it just wouldn't be quick at all. I can show you folks that if you really want!
Thank u Mr. @@RoboTFAI, it looks cool. I use a 1060 6GB with 64GB RAM; it works, but tokens/sec was very bad. I am very confused about choosing between the 4060 Ti 16GB, the RX 6800 XT 16GB, or even the RX 7900 GRE 16GB, because I'd like to try the Llama 3 70B model with minimal cost - 3x RX 6800 XT I think is affordable. The constraint with the RX cards, as I saw in the Tom's Hardware benchmarks, is that when used for Stable Diffusion they really suck!
405b
Please, record your screen in 8k next time, i'd like to put my new microscope to good use.
looks like he is running 4k with 100% scaling :(
Test with the EXL2 format instead of GGUF. =) RTX for ExLlama, not for llama.cpp. =)
Still selling 4060s?
still have a few - reach out robot@robotf.ai or ping me on reddit/etc
Looks like it'd be better to buy RAM and offload the models there, since there is little to no difference in speed between CPU only and 4x 4060 Ti's. It would also save a lot of money on expensive GPUs. Perhaps just buying one or two 4060 Ti's for smaller 8B models would make more sense, but for larger 70B models it's useless. If one is super desperate, already has one 4060 Ti, and wants to load a 70B model, it'd make more sense to simply use the one 4060 with the rest of the model offloaded to RAM (if you have enough RAM), since you won't be missing out on anything in terms of speed compared to 3 or 4 4060 Ti's.
Do you know if the speed of DDR5 RAM on new motherboards is fast enough to partially store models on it?
I run some models not fully offloaded on machines with only DDR4 in them - it's not fast by any means. I don't have any machines with DDR5 atm... except my MBP, and that wouldn't be a fair comparison. We can test mixing CPU (RAM) and GPU (VRAM).
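Mixing RAM and VRAM is mostly a matter of capping how many layers get offloaded. A minimal sketch with llama-cpp-python, same caveats as the earlier one (bindings assumed, path and layer count are placeholders to tune until the GPU is just full):

```python
from llama_cpp import Llama

# Offload only 40 of Llama 3 70B's 80 layers to VRAM; the rest run from system RAM.
llm = Llama(
    model_path="/models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,   # raise until VRAM is nearly full, lower if it OOMs
    n_ctx=2048,
)
print(llm.create_completion("Say hi.", max_tokens=16)["choices"][0]["text"])
```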
Is the model actually giving coherent, useful answers?? Accuracy should be tested. What's the point of running a model that's inaccurate?
Yep - Llama 3 is really pretty good. I don't focus on accuracy here (so far) as that is fairly subjective depending on what you are using specific models for and a tough subject to broach. I could attempt it but I would suggest folks like @matthew_berman (www.youtube.com/@matthew_berman) who I think does a really good job at comparing open source models when they get released.
@@RoboTFAI thanks for the reply. I really appreciate the tests you're doing. I'm new and purchased my 3060 laptop thinking I could run some models. Quickly realized there's not enough (VRAM) power, or these models suck, and thought running locally was hype. I think watching his channel is what led to yours being on my timeline.
Do you have Windows memory compression turned off? The biggest bottleneck in Windows.
I don't have windows machines in the lab, all linux based in the Kubernetes clusters