LocalAI LLM Testing: Distributed Inference on a network? Llama 3.1 70B on Multi GPUs/Multiple Nodes

RoboTF AI

มุมมอง 7 861

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 2 ก.พ. 2025

ความคิดเห็น • 84

@TerenceGardner 5 หลายเดือนก่อน ⁺⁷
I honestly think this is the coolest AI related channel on youtube. I hope it keeps growing.
@RoboTFAI 5 หลายเดือนก่อน ⁺¹
Wow thanks! Much appreciated
@tedguy2743 6 หลายเดือนก่อน ⁺²
I can tolerate nonesense ads from TH-cam to watch your videos, your content is incredibly helpful
@RoboTFAI 6 หลายเดือนก่อน ⁺¹
Sorry no control over that! Much appreciated that you put up with it.
@marekkroplewski6760 5 หลายเดือนก่อน ⁺¹⁰
The useful comparison would be to test llama3.1 8B against 70B and distributed 405B. Since you can already run a model, spreading it over more nodes is not usefull. So running a larger model distributed vs smaller model and comparing quality and inference speed is a usefull test. Great channel!
@harshilbhatt1556 2 หลายเดือนก่อน ⁺¹
Great video, as someone who's starting their master's thesis on a similar topic, this was incredibly helpful.
@RoboTFAI 2 หลายเดือนก่อน
Glad it was helpful!
@TrevorSullivan 4 หลายเดือนก่อน ⁺¹
I love open source and I love hardware. This video speaks to me! Love it!! 👨🏻‍💻
@RoboTFAI 4 หลายเดือนก่อน ⁺¹
Hey much appreciated! Glad you enjoyed it.
@senisasAs 5 หลายเดือนก่อน ⁺⁶
As you asked :) it would ne nice to see HOW-TO.
Really nice content and topic itself 👍
@RoboTFAI 5 หลายเดือนก่อน ⁺⁴
Awesome, thank you! How to's coming eventually gotta find the time which I have none of!
@dllsmartphone3214 6 หลายเดือนก่อน ⁺³
just another very rare and useful topic covered in a video. why we dont have more youtube channels like this? 😅
@RoboTFAI 6 หลายเดือนก่อน ⁺¹
Glad you liked it!
@EngkuFizz 4 หลายเดือนก่อน
Because it requires two of the most important things but not all people have these advantages. Firstly the money to buy all those expensive GPUs, and secondly is the knowledge, because I think this topic is already considered a complex one
@_skiel หลายเดือนก่อน
Awesome video - next step: a productive use-case + verification of data quality
@tbranch227 6 หลายเดือนก่อน
Also, you are the man for setting this up and demoing! That's no mean feat!!!
@RoboTFAI 6 หลายเดือนก่อน
Hey thanks for that, much appreciated! Sometimes there is hours of testing/behind scenes setup for some of these.
@tbranch227 6 หลายเดือนก่อน
@@RoboTFAI oh yeah, not many people appreciate everything that goes into a video or really designing a whole channel. It's a ton of work.
@tbranch227 6 หลายเดือนก่อน ⁺³
What I learned was that, if you absolutely have to run the massive models, there's an awesome new way to do it. However, my overall strategy is around mixture of agents and councils of smaller models run in parallel to get the accuracy and speed I need. I don't think this thing will work worth a damn over the Internet except in university settings, but that is a very cool use case. Maybe the big boys can federate their future gigawatt data centers together and make even larger models across giant private networks? This is definitely going to be useful.
@RoboTFAI 6 หลายเดือนก่อน ⁺²
I concur with your strategy, and is also my strategy depending on what I am doing/playing with. We will see where this tech goes, it's at least fun to play with and would allow folks like myself to build a highly available LLM api endpoint with a swarm of machines behind it for our own use.
@nadpro16 6 หลายเดือนก่อน ⁺⁶
I would be interested in a how to, I have a home lab that I would like to try this with.
@animecharacter5866 6 หลายเดือนก่อน
time to watch a great video :D
@RoboTFAI 6 หลายเดือนก่อน ⁺¹
Hope you enjoyed!
@pata-tata557 6 หลายเดือนก่อน
Consider me a subscriber, these bench tests are amazing
@RoboTFAI 6 หลายเดือนก่อน
Welcome aboard!
@sergiynazarenko1542 15 วันที่ผ่านมา
Did bro just cover up the sound of his toilet flushing by going "wow did yall hear thise jets flying over the house?"😂😂
@RoboTFAI 14 วันที่ผ่านมา ⁺¹
If only I could have the full lab setup in the bathroom! 🚽
@Boyracer73 หลายเดือนก่อน
This is relevant to my interests 🤔
@RoboTFAI หลายเดือนก่อน ⁺¹
Thanks for watching! Mine also.... 🤣
@tblankfein8493 6 หลายเดือนก่อน ⁺²
would be very interesting if you would do this distributed inference but on the 405b size. Anyway, great video!
@marekkroplewski6760 5 หลายเดือนก่อน
I second that! 70B vs 405B showdown or just plain 405B test would be of great value. Amazing work anyways!
@RoboTFAI 5 หลายเดือนก่อน ⁺¹
Working on it! Will have to see if can even offload it entirely to VRAM in my lab...most likely not hahaha depending on quant and context size....but let's find out! Might actually have to use kid's gaming machines.
@marekkroplewski6760 5 หลายเดือนก่อน
There is also llamafiles from Mozilla. You have a good CPU, maybe you could also compare llama3.1 8B cpu vs gpu, same with 70B and maybe 405B 😮. Now with llamafiles they claim quite a speedup.
@johnykes 2 หลายเดือนก่อน
Any advantages (except being possible to run big models) of distributed inference? I also tried and it works worse with each added node :)
Do you think it's too early? Will appear some advantages over time?
@RoboTFAI 2 หลายเดือนก่อน
Good questions! Yea running big models on distributed hardware is prob biggest advantage. More so for homelab folks that don't have hardware resources to build single multi-gpu machines/etc.
The other side which I didn't touch on too much is the ability to cluster your api endpoint (load balancing requests/etc) with "servers" in front of a pool of workers. Lots of scale, in which you could run a small model on many nodes balancing requests a bit more than just "parallel requests" in most of our software.
It is REALLY early, but the tech around everything AI/LLM is advancing at such a pace it's hard to keep up with!
@LaDiables 5 หลายเดือนก่อน
This is neat
However question, it appears to be slow for serialized prompts. Does sending parallel prompts/batched change the equation in terms of total tok/sec?
@jelliott3604 2 หลายเดือนก่อน
Dual 40Gb/s port Melanox ConnectX-3 infiniband/ethernet cards are around/less than £40 on ebay
@yvikhlya 5 หลายเดือนก่อน ⁺⁶
So, each time you add more resources to your system, you make it slower. This is pretty bad at all. Why bother adding more nodes, just run everything on a single node.
@ckckck12 4 หลายเดือนก่อน ⁺²
This was my question too. If we've added more video cards with more nodes and doing so makes the tokens per second go down, what's the gain here? I'm not seeing it. I feel like there's something I don't understand about this stuff.
@sondrax 3 หลายเดือนก่อน
@@ckckck12ha! Welcome to the Club of No Return! Everytime I read / watch enough to level up… I just end up SEEMINGLY … to realize I understand LESS of the full picture!
🙃
@markmoorcroft7570 หลายเดือนก่อน
My Apple federal contact was telling me about TB -> fiber channel interfaces. Wondering about a cluster of Mac M4 minis with fiber channel.
@RoboTFAI หลายเดือนก่อน
In theory... slower than a single one, however would have capabilities to run VERY large models or many smaller models at once. With FC based storage would also help with initial loading time of models (fast storage -> fast memory/vram = quicker loading times)
@ricardocosta9336 5 หลายเดือนก่อน ⁺¹
Can you talk more about your kubernetes and hypervisor setup?
@RoboTFAI 5 หลายเดือนก่อน ⁺¹
Coming! Though I don't do hypervisors much anymore (one node under TrueNAS but mainly for my github actions runners through ARC) - mostly bare metal nodes that are low power (n100/i5/etc/etc).....minus the GPU nodes of course!
@nlay42 3 หลายเดือนก่อน
@@RoboTFAI Yes, very interested in the Kubernetes setup!
@Antonios1831 4 หลายเดือนก่อน
Thank you for such a great video! It would be interesting to know performance drop if we use same GPUs.
Like to test 3xA4500 on every node and run the same test. So we can compare 6xGPU on a single node VS 6xGPU across two nodes.
@RoboTFAI 4 หลายเดือนก่อน
Great suggestion! I think we can pull that off
@julienyt1600 4 หลายเดือนก่อน
Any chance you could share your kubernetes manifest changes to make it work ?
@Ulvens 4 หลายเดือนก่อน
I'm trying to do this locally. I have two separate systems, with 3090's connected via NVlinks. It's on a 2.5gbps network, not connected to the internet. Do I need to get a 10Gbps to make this work better, or should I run them all docked in the same system, and drop the NVlinks. Eventhough the NVlinks give me 48gb Vram. My hope was to get them to network, to push 96. Sorry for the technical errors in my question, but I'm pretty noob. It's the first time I have setup a network, and first time I've used SLI/NVlink. So this might not make any sence, but I have to try. Thank you for the great video.
@RoboTFAI 4 หลายเดือนก่อน
Don't need 10 gig network - but it would be faster to "load" the model the first time. Network during inference is relatively low in the swarm.
NVLink isn't something I have tested, as is not necessary for multi-gpu setups - however I am sure it has it's benefits, and more likely during training which is different beast than just inference.
@nickmajkic1436 5 หลายเดือนก่อน
You probably have this in another video but what are you using for server monitoring in the background?
@RoboTFAI 5 หลายเดือนก่อน
I assume you are referring to Grafana (with Prometheus). Along with the DGCM exporter that is part of Nvidia GPU Operator for Kubernetes docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
@shawnvines2514 6 หลายเดือนก่อน ⁺¹
Very interesting information. I'm definitely looking forward to this winter heating my office with CPU heat only :) Since LM Studio supports AMD GPUs, I'm wondering whether to get a AMD 7900 XTX 24GB card instead of a NVidia 4090 given the current price difference. I'm trying to pickup as much general experience as possible. I'm trying to learn Comfy UI right now and understand the basic concepts on my 4060 Ti 16GB.
@Pinlights 6 หลายเดือนก่อน ⁺³
I've been using Ollama for inference on a 7900XTX and it has been fantastic. Running inference with Llama3-8b, I was getting 94 tokens/s. With LLama 3.1 8B, I was getting 88 tokens/s. Now I'm going to test it with LocalAI. We're in a fantastic period right now where things are really improving by the day.
@RoboTFAI 6 หลายเดือนก่อน ⁺¹
Thanks! I like to think we are all just playing around trying to figure this all out... the tech is moving so fast no single person can keep up. That's why communities to learn from are the best.
@ckckck12 4 หลายเดือนก่อน
Please explain how a greater amount of compute power with a lower (about half) tokens per second, is an improvement? This is a real question, not rhetorical. To me that seems like a bad thing but I'm guessing more work is getting done at that slower rate? Like.... A bunch more work total but at a slower per node rate or something? I don't get why this is exciting. I want to.
@RoboTFAI 4 หลายเดือนก่อน ⁺²
We do A/B testing so this is just an example of this tech against a baseline (having it all on one machine), so while not an improvement on TPS we are showing the capabilities of the distributed inference and what it can lead to. I would suggest watching Part 2 of this where we use this tech to pull off 405B on this same distributed hardware, which is not something I could pull off with one machine (with my GPU's at least) that's what makes it exciting to me at least! th-cam.com/video/CKC2O9lcLig/w-d-xo.html
@ckckck12 4 หลายเดือนก่อน
@@RoboTFAI thank you!
@Johan-rm6ec 5 หลายเดือนก่อน
What i would like to know 2 times a 4060 ti 16GB. Is it more usable with LM Studio and various models or is a 4070 ti 16gb super a better option? Cost is the same here around 900 euro's.
@soumyajitganguly2593 3 หลายเดือนก่อน
I use a 3090 with a 4060Ti.. its definitely doable in LM studio, ollama, text-gen-web-ui etc. 2 cards total 32GB is always better for bigger models. But if all you care about is speed and ok with smaller models, then 4070Ti super is faster.
@testales 5 หลายเดือนก่อน
If I connect a worker it goes to CPU mode "create_backend: using CPU backend" - what am I missing? I've installed local-ai on both computers and I can do local inference (p2p off) on both and the GPUs are used in that case.
@RoboTFAI 5 หลายเดือนก่อน
How are you running it? Just local-ai directly or through docker? You would still need to set the correct environment variables (and with docker pass in "--gpus all") when starting the worker. If you are using docker just for the workers/etc make sure you have the nvidia-container-toolkit installed also (pre-req for passing NVIDIA cards to docker containers).
If all that is covered, I would need more info on the setup. Feel free to reach out and I can attempt to help
@testales 5 หลายเดือนก่อน
@@RoboTFAI Thanks for responding and let me say, that there are not many videos out about getting distributed inference running so your take on this is most welcome! I don't use docker and tried to keep it as simple as possible to reduce the amount of error sources. I just used to curl command on the very top of the installation page (curl http[:] localai io [/] install.sh | sh). This installs local-ai as service. So I changed and set the required environment variables in the service or the enviroment file (ADDRESS, TOKEN, LD_LIBRARY_PATH ) and installed the latest version of the Nvidia toolkit since my 12.4 and 550 driver were already too old and I got errors at first. Now I'm at toolikit 12.6 and driver version 560 and local inference works. So far I only tested with the Meta LLama 3.1 8b model in Q4, that can be installed directly via the web UI. I then enabled P2P and set the token on the server side in another environment variable, so it stays the same everytime. I created a worker also as service on my second machine to connect to the first using that token. The connection is sucessful and I can also do chats but only on CPU. I've then simplified it even more, disabled all services, switchted to the service user and ran local-ai (run with --p2p) as server on the main machine and another instance as worker on both machines, all in terminal sessions. Both workers connect but in CPU mode. I don't know if that is supposed to be that case but on the page in the screenshots you can see the same. What's in the log on your workers? I get someling like that:
{"level":"INFO","time":"2024-08-10T16:46:54.792+0200","caller":"node(...)","message":" Starting EdgeVPN network"}
create_backend: using CPU backend
Starting RPC server on 127.0.0.1:44469, backend memory: 63969 MB
There are no errors on any of the 3 running instances, the clients show the connection to the server instance and the server instance does server things. But it takes ages just to load the model and the inference is on CPU. None of the involved GPUs loads anything. Also I wonder how the model is supposed to work. I had expected that there must be a local version of it on the clients too but it doesn't seem to be the case. Yet transfering like 40-50GB of model data over the LAN each time you load a 70b model is very ineffcient. I couldn't find any documentation on this issue either.
Edit: Reposted, seems mentioning a curl request is forbidden now...
@RoboTFAI 4 หลายเดือนก่อน
Yea I agree on the loading across the network being fairly inefficient but the tech is also really new still. As far as the setup goes do you have your model setup for GPU (gpu_layers, f16, etc)? localai.io/features/gpu-acceleration/
@testales 4 หลายเดือนก่อน
@@RoboTFAI I had described every tiny bit in detail and even included some log messages but censortube deleted my answer. Twice. I didn't notice it did it again and since now month has past, I don't remember the details anymore. I've just checked that f16 is set in the yaml but I have not specified layers as the model should fit on either system's VRAM. It also runs in GPU model if I used it locally but remotely it connects only in CPU mode to the server system.
@braeder 2 หลายเดือนก่อน
Looks much fast on two nodes than 1
@blast_0230 5 หลายเดือนก่อน
super video 👍
can you try some amd gpu like mi50 (120$) for 16Go vram and if you have budget the mi100 pls.
i like this bench content
@AlienAnthony 5 หลายเดือนก่อน
Could you make this run on jentson orins?????
@RoboTFAI 5 หลายเดือนก่อน
Good question - want to sponsor me some to play with? 😁
@AlienAnthony 5 หลายเดือนก่อน
If I get the funds. I was interested in them myself. For price over vram + power consumption. I would invest in a cluster software for these. Speed might be diminished but it's certainly a cheaper option then building a server for inference only using self contained devices.
@borriskarlov8140 6 หลายเดือนก่อน
what about distributed training?
@RoboTFAI 6 หลายเดือนก่อน
That a good question...that I don't currently have an answer for!
@cuserroro 6 หลายเดือนก่อน
I would assume a network bottle lack.
also as there must be a reason why nvidia and others just have created new network interfaces to interconect the vram much fuster.
but i'm in for testing! :)
@cuserroro 6 หลายเดือนก่อน
lag ...
@six1free 6 หลายเดือนก่อน
just when I doubted you' have anything relevant for me ... wohooo
coincidentally I've recently been looking into 2.5 and 10G
my guess is since your using about 50MB/s through pcie 2.5G will be plenty
this means I'll be able to build a gpu server and easily add it to the existing server setup... joy
@RoboTFAI 6 หลายเดือนก่อน ⁺¹
2.5G would be plenty, honestly was more like 10-20MB/s per "node" during inference - the loading was dramatically more. Glad I could provide some relevance for ya!
@pubswork2063 3 หลายเดือนก่อน
im working hard to run it with amds i feel youve been a miner, a miner can run gpus no mater how many. if gpu miner could. sell ai tokens! also multi agent approach can us small devices
@RoboTFAI 3 หลายเดือนก่อน
I get that a lot, I was actually never big into crypto or mining. I am however big into distributed systems, hardware, and architecture. GPU's (besides video encoding) didn't enter my lab until heavy AI usage. I play a lot with agents, and running many models in parallel while trying to figure all of this like rest you!
@pubswork2063 3 หลายเดือนก่อน
@@RoboTFAI the push of hardware in minning ıs some thing you can only see by benchmarking unified optimised oc that we aim to have the line dead flat. ai is real world its like you can half mine with the same intelligence ! i thing old school data centre sttuf start failing when every server starting to have 8 gpus!!!!!! data centres are too cool comparing to mines!!!
@pubswork2063 3 หลายเดือนก่อน
work on parallel every thing multi lan multi optical fibres theır not that expensive but ı feel ai and lights are better fit! multi nmve on different pcies you need to map every single pcie lines to that cpu loaded up to the point that oems have use fans too cool motherboards multi rams that u do even training the main mode to use small swarmed agent models in smaller vrams something like treadreaper even the cheapest oldest has 4 time lines like 4 16 lane pcie 4090!
@c0nsumption 2 หลายเดือนก่อน
".... just a set of jets flying over my house"
And there's your like, sir 🧍🏽‍♂
@RoboTFAI 2 หลายเดือนก่อน
😁 much appreciated!

ต่อไป

เล่นอัตโนมัติ

LocalAI LLM Testing: Part 2 Network Distributed Inference Llama 3.1 405B Q2 in the Lab!