Great video, as someone who's starting their master's thesis on a similar topic, this was incredibly helpful.
Glad it was helpful!
I honestly think this is the coolest AI related channel on youtube. I hope it keeps growing.
Wow thanks! Much appreciated
The useful comparison would be to test llama3.1 8B against 70B and distributed 405B. Since you can already run a model, spreading it over more nodes is not useful. So running a larger model distributed vs a smaller model and comparing quality and inference speed is a useful test. Great channel!
As you asked :) it would be nice to see a HOW-TO.
Really nice content and topic itself 👍
Awesome, thank you! How-tos are coming eventually, just gotta find the time, which I have none of!
Just another very rare and useful topic covered in a video. Why don't we have more YouTube channels like this? 😅
Glad you liked it!
Because it requires two of the most important things, and not everyone has these advantages: firstly the money to buy all those expensive GPUs, and secondly the knowledge, because I think this topic is already considered a complex one.
I love open source and I love hardware. This video speaks to me! Love it!! 👨🏻💻
Hey much appreciated! Glad you enjoyed it.
I can tolerate nonsense ads from YouTube to watch your videos; your content is incredibly helpful.
Sorry no control over that! Much appreciated that you put up with it.
What I learned was that if you absolutely have to run the massive models, there's an awesome new way to do it. However, my overall strategy is around mixture of agents and councils of smaller models run in parallel to get the accuracy and speed I need. I don't think this thing will work worth a damn over the Internet except in university settings, but that is a very cool use case. Maybe the big boys can federate their future gigawatt data centers together and make even larger models across giant private networks? This is definitely going to be useful.
I concur with your strategy, and it is also my strategy depending on what I am doing/playing with. We will see where this tech goes; it's at least fun to play with, and it would allow folks like myself to build a highly available LLM API endpoint with a swarm of machines behind it for our own use.
I would be interested in a how-to; I have a home lab that I would like to try this with.
Also, you are the man for setting this up and demoing! That's no mean feat!!!
Hey thanks for that, much appreciated! Sometimes there are hours of testing/behind-the-scenes setup for some of these.
@@RoboTFAI oh yeah, not many people appreciate everything that goes into a video or really designing a whole channel. It's a ton of work.
Would be very interesting if you did this distributed inference at the 405B size. Anyway, great video!
I second that! 70B vs 405B showdown or just plain 405B test would be of great value. Amazing work anyways!
Working on it! Will have to see if I can even offload it entirely to VRAM in my lab... most likely not hahaha, depending on quant and context size... but let's find out! Might actually have to use the kids' gaming machines.
There are also llamafiles from Mozilla. You have a good CPU, so maybe you could also compare llama3.1 8B CPU vs GPU, same with 70B and maybe 405B 😮. With llamafiles they claim quite a speedup.
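Basic usage is just download-and-run; the filename/URL below are placeholders, grab a real one from Mozilla's llamafile page (and -ngl is the usual llama.cpp-style GPU offload flag):
# a llamafile is a single self-contained executable (placeholder URL/filename)
wget <url-of-a-llama-3.1-8b-llamafile-from-mozillas-page> -O llama-3.1-8b.llamafile
chmod +x llama-3.1-8b.llamafile
./llama-3.1-8b.llamafile              # CPU run, starts a local chat/server UI
./llama-3.1-8b.llamafile -ngl 999     # offload layers to the GPU for the CPU-vs-GPU comparison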
Dual-port 40Gb/s Mellanox ConnectX-3 InfiniBand/Ethernet cards are around or less than £40 on eBay.
Thank you for such a great video! It would be interesting to know the performance drop if we use the same GPUs.
I'd like to test 3x A4500 on every node and run the same test, so we can compare 6 GPUs on a single node vs 6 GPUs across two nodes.
Great suggestion! I think we can pull that off
Consider me a subscriber, these bench tests are amazing
Welcome aboard!
Very interesting information. I'm definitely looking forward to heating my office with CPU heat only this winter :) Since LM Studio supports AMD GPUs, I'm wondering whether to get an AMD 7900 XTX 24GB card instead of an NVIDIA 4090, given the current price difference. I'm trying to pick up as much general experience as possible. I'm learning ComfyUI right now and understanding the basic concepts on my 4060 Ti 16GB.
I've been using Ollama for inference on a 7900 XTX and it has been fantastic. Running inference with Llama 3 8B, I was getting 94 tokens/s. With Llama 3.1 8B, I was getting 88 tokens/s. Now I'm going to test it with LocalAI. We're in a fantastic period right now where things are really improving by the day.
Thanks! I like to think we are all just playing around trying to figure this all out... the tech is moving so fast no single person can keep up. That's why communities to learn from are the best.
Looks much faster on two nodes than one.
This is neat
However, a question: it appears to be slow for serialized prompts. Does sending parallel/batched prompts change the equation in terms of total tok/sec?
Can you talk more about your kubernetes and hypervisor setup?
Coming! Though I don't do hypervisors much anymore (one node under TrueNAS, but mainly for my GitHub Actions runners through ARC) - mostly bare-metal nodes that are low power (N100/i5/etc)... minus the GPU nodes of course!
@@RoboTFAI Yes, very interested in the Kubernetes setup!
Any advantages of distributed inference (besides being able to run big models)? I also tried it, and it works worse with each added node :)
Do you think it's too early? Will some advantages appear over time?
Good questions! Yea, running big models on distributed hardware is prob the biggest advantage. More so for homelab folks that don't have the hardware resources to build single multi-GPU machines/etc.
The other side, which I didn't touch on too much, is the ability to cluster your API endpoint (load balancing requests/etc) with "servers" in front of a pool of workers - lots of scale, where you could run a small model on many nodes, balancing requests a bit more than just the "parallel requests" in most of our software (rough sketch below).
It is REALLY early, but the tech around everything AI/LLM is advancing at such a pace it's hard to keep up with!
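For the curious, a super rough sketch of that idea - just a generic reverse proxy doing round-robin over a couple of LocalAI instances; hostnames/ports here are made up, not my lab:
# generic nginx round-robin over two LocalAI API instances (placeholder hosts)
sudo tee /etc/nginx/conf.d/localai.conf <<'EOF'
upstream localai_pool {
    server node-a:8080;    # LocalAI instance 1
    server node-b:8080;    # LocalAI instance 2
}
server {
    listen 80;
    location / {
        proxy_pass http://localai_pool;
    }
}
EOF
sudo nginx -s reload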
super video 👍
Can you try some AMD GPUs, like the MI50 ($120) with 16GB VRAM, and if you have the budget, the MI100 please?
I like this bench content.
time to watch a great video :D
Hope you enjoyed!
You probably have this in another video but what are you using for server monitoring in the background?
I assume you are referring to Grafana (with Prometheus), along with the DCGM exporter that is part of the NVIDIA GPU Operator for Kubernetes: docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
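For anyone wanting to replicate it, roughly this (from memory - double check the docs above for current chart values):
# install the NVIDIA GPU Operator (bundles the DCGM exporter) via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
# then scrape the dcgm-exporter service with Prometheus (ServiceMonitor or a
# static scrape job) and import a DCGM dashboard into Grafana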
I'm trying to do this locally. I have two separate systems, each with 3090s connected via NVLink. It's on a 2.5Gbps network, not connected to the internet. Do I need to get 10Gbps to make this work better, or should I run them all docked in the same system and drop the NVLinks, even though the NVLinks give me 48GB VRAM? My hope was to get them to network, to push 96. Sorry for the technical errors in my question, but I'm pretty noob. It's the first time I have set up a network, and the first time I've used SLI/NVLink. So this might not make any sense, but I have to try. Thank you for the great video.
Don't need a 10 gig network - but it would be faster to "load" the model the first time. Network traffic during inference is relatively low in the swarm.
NVLink isn't something I have tested, as it is not necessary for multi-GPU setups - however I am sure it has its benefits, more likely during training, which is a different beast than just inference.
Any chance you could share your Kubernetes manifest changes to make it work?
Please explain how a greater amount of compute power with lower (about half) tokens per second is an improvement? This is a real question, not rhetorical. To me that seems like a bad thing, but I'm guessing more work is getting done at that slower rate? Like... a bunch more work total, but at a slower per-node rate or something? I don't get why this is exciting. I want to.
We do A/B testing, so this is just an example of this tech against a baseline (having it all on one machine); while not an improvement in TPS, we are showing the capabilities of distributed inference and what it can lead to. I would suggest watching Part 2, where we use this tech to pull off 405B on this same distributed hardware, which is not something I could pull off with one machine (with my GPUs at least) - that's what makes it exciting, to me at least! th-cam.com/video/CKC2O9lcLig/w-d-xo.html
@@RoboTFAI thank you!
If I connect a worker it goes to CPU mode "create_backend: using CPU backend" - what am I missing? I've installed local-ai on both computers and I can do local inference (p2p off) on both and the GPUs are used in that case.
How are you running it? Just local-ai directly, or through Docker? You would still need to set the correct environment variables (and with Docker, pass in "--gpus all") when starting the worker. If you are using Docker just for the workers/etc, make sure you also have the nvidia-container-toolkit installed (a pre-req for passing NVIDIA cards to Docker containers) - rough example below.
If all that is covered, I would need more info on the setup. Feel free to reach out and I can attempt to help.
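Something roughly like this for a Dockerized worker (image tag is a placeholder and the worker subcommand is from memory - check the LocalAI distributed inference docs for your version):
# host needs the NVIDIA driver + nvidia-container-toolkit installed first
docker run -d --gpus all --network host \
  -e TOKEN="<p2p token from the server instance>" \
  localai/localai:<gpu-cuda-tag> \
  worker p2p-llama-cpp-rpc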
@@RoboTFAI Thanks for responding, and let me say that there are not many videos out about getting distributed inference running, so your take on this is most welcome! I don't use Docker and tried to keep it as simple as possible to reduce the number of error sources. I just used the curl command at the very top of the installation page (curl http[:] localai io [/] install.sh | sh). This installs local-ai as a service. So I changed and set the required environment variables in the service or the environment file (ADDRESS, TOKEN, LD_LIBRARY_PATH - a rough reconstruction is at the end of this comment) and installed the latest version of the NVIDIA toolkit, since my 12.4 and 550 driver were already too old and I got errors at first. Now I'm at toolkit 12.6 and driver version 560, and local inference works. So far I've only tested with the Meta Llama 3.1 8B model in Q4, which can be installed directly via the web UI.
I then enabled P2P and set the token on the server side in another environment variable, so it stays the same every time. I created a worker, also as a service, on my second machine to connect to the first using that token. The connection is successful and I can also do chats, but only on CPU. I've then simplified it even more: disabled all services, switched to the service user, and ran local-ai (run with --p2p) as the server on the main machine and another instance as a worker on both machines, all in terminal sessions. Both workers connect, but in CPU mode. I don't know if that is supposed to be the case, but on the page in the screenshots you can see the same. What's in the log on your workers? I get something like this:
{"level":"INFO","time":"2024-08-10T16:46:54.792+0200","caller":"node(...)","message":" Starting EdgeVPN network"}
create_backend: using CPU backend
Starting RPC server on 127.0.0.1:44469, backend memory: 63969 MB
There are no errors on any of the 3 running instances; the clients show the connection to the server instance, and the server instance does server things. But it takes ages just to load the model, and the inference is on CPU. None of the involved GPUs loads anything. Also, I wonder how the model distribution is supposed to work. I had expected that there must be a local copy of it on the clients too, but that doesn't seem to be the case. Yet transferring 40-50GB of model data over the LAN each time you load a 70B model is very inefficient. I couldn't find any documentation on this issue either.
Edit: Reposted, seems mentioning a curl request is forbidden now...
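For reference, roughly what I had put in the service override on the worker (values are placeholders, since the original details got eaten; the service name may differ on your install):
# rough reconstruction of the worker's systemd override (placeholder values)
sudo mkdir -p /etc/systemd/system/local-ai.service.d
sudo tee /etc/systemd/system/local-ai.service.d/override.conf <<'EOF'
[Service]
Environment="ADDRESS=0.0.0.0:8080"
Environment="TOKEN=<p2p token copied from the server instance>"
Environment="LD_LIBRARY_PATH=/usr/local/cuda/lib64"
EOF
sudo systemctl daemon-reload && sudo systemctl restart local-ai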
Yea, I agree the loading across the network is fairly inefficient, but the tech is also really new still. As far as the setup goes, do you have your model set up for GPU (gpu_layers, f16, etc.)? localai.io/features/gpu-acceleration/
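A minimal sketch of what I mean, per those docs (model name/path are placeholders, exact fields can vary by version):
# minimal model yaml sketch with GPU offload enabled (placeholder name/path)
cat <<'EOF' > models/llama-3.1-8b.yaml
name: llama-3.1-8b
parameters:
  model: <path-or-uri-to-your-gguf>
f16: true
gpu_layers: 99   # offload as many layers as fit in VRAM
EOF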
@@RoboTFAI I had described every tiny bit in detail and even included some log messages, but censortube deleted my answer. Twice. I didn't notice it did it again, and since a month has passed now, I don't remember the details anymore. I've just checked that f16 is set in the yaml, but I have not specified layers, as the model should fit in either system's VRAM. It also runs in GPU mode if I use it locally, but remotely it connects to the server system only in CPU mode.
What I would like to know: 2x 4060 Ti 16GB - is it more usable with LM Studio and various models, or is a 4070 Ti Super 16GB a better option? The cost is the same here, around 900 euros.
I use a 3090 with a 4060 Ti... it's definitely doable in LM Studio, ollama, text-gen-webui, etc. Two cards totaling 32GB are always better for bigger models. But if all you care about is speed and you're OK with smaller models, then the 4070 Ti Super is faster.
I'm working hard to run it with AMDs. I feel you've been a miner - a miner can run GPUs no matter how many. If GPU miners could sell AI tokens! Also, a multi-agent approach can use small devices.
I get that a lot; I was actually never big into crypto or mining. I am, however, big into distributed systems, hardware, and architecture. GPUs (besides video encoding) didn't enter my lab until heavy AI usage. I play a lot with agents and running many models in parallel, while trying to figure all of this out like the rest of you!
@@RoboTFAI The push of hardware in mining is something you can only see by benchmarking a unified, optimised OC where we aim to have the line dead flat. AI is real-world work - it's like you can only half-mine with the same intelligence! I think old-school data centre stuff starts failing when every server starts to have 8 GPUs!!!!!! Data centres are too cool compared to mines!!!
Just when I doubted you'd have anything relevant for me... wohooo!
Coincidentally, I've recently been looking into 2.5G and 10G.
My guess is that since you're using about 50MB/s through PCIe, 2.5G will be plenty.
This means I'll be able to build a GPU server and easily add it to the existing server setup... joy!
2.5G would be plenty; honestly it was more like 10-20MB/s per "node" during inference - the loading was dramatically more. Glad I could provide some relevance for ya!
Could you make this run on Jetson Orins?????
Good question - want to sponsor me some to play with? 😁
If I get the funds - I was interested in them myself, for the price over VRAM plus power consumption. I would invest in cluster software for these. Speed might be diminished, but it's certainly a cheaper option than building a server for inference only, using self-contained devices.
So, each time you add more resources to your system, you make it slower. That's pretty bad. Why bother adding more nodes - just run everything on a single node.
This was my question too. If we've added more video cards with more nodes and doing so makes the tokens per second go down, what's the gain here? I'm not seeing it. I feel like there's something I don't understand about this stuff.
@@ckckck12 Ha! Welcome to the Club of No Return! Every time I read/watch enough to level up... I just seemingly end up realizing I understand LESS of the full picture!
🙃
what about distributed training?
That's a good question... that I don't currently have an answer for!
I would assume a network bottleneck.
Also, there must be a reason why NVIDIA and others have just created new network interfaces to interconnect the VRAM much faster.
But I'm in for testing! :)
lag ...
Work on parallelizing everything: multi-LAN, multiple optical fibres - they're not that expensive, but I feel AI and light are a better fit! Multiple NVMe drives on different PCIe slots - you need to map every single PCIe lane to that CPU, loaded up to the point that OEMs have to use fans to cool motherboards. Multiple RAM channels too, so you could even train the main model to use small swarmed agent models in smaller VRAM. Something like a Threadripper - even the cheapest, oldest one has 4 times the lanes, like 4x 16-lane PCIe 4090s!
".... just a set of jets flying over my house"
And there's your like, sir 🧍🏽♂
😁 much appreciated!