This is great, and a very professional test environment. I was especially impressed by the ability to switch GPUs using the Kubernetes cluster.
Thanks, much appreciated!
GREAT video... learned a lot from it. It's hard to find good AI benchmark videos on YouTube.
Glad you enjoyed it!
Very informative, thank you!
Detailed, to the point and exactly what we need to know.
I use a 3090 at home, a 4060 at work and on my coding machine I use an old GTX 1080Ti with 11GB. It does OK for Continue in VSCode but it is slow.
Tell your wife "it's for science" in your best Doc from Back to the Future voice.
Thank you again.
I tried, she wasn't having it lol!
Wish I could have the 1080Ti someday if you wanted to get rid of it. Don't forget about me in a third world country, lol.
Great video. It is definitely nice to see a benchmark across different Nvidia boards with something similar to what I have run before. At the end of June, I bought parts and built a computer for AI development with a Ryzen 7 7800X3D for $339 and a 4060 Ti 16GB for $450. I bought it to begin local development while waiting on the RTX 5090, but it looks like that will be delayed for a while.
I've just been using LM Studio and Anything LLM for running local LLMs to analyze data, and using many Python open source projects for audio and image processing.
LM Studio and Anything LLM are both in my toolbox on my daily-driver laptop as well! Excellent tools. I also use Continue/etc. integrated into VSCode, pointed at the lab or at LM Studio locally.
That's great, I'm thinking of doing the same thing. Do you mind telling me how your setup is working so far? Is the 4060 Ti 16GB good enough for code generation, or are you seeing lots of errors? Thanks!
Exactly as requested, more useful videos. Thank you for your content...
I'm thinking about buying an H100 80GB because I wanna run Mistral Large 2 so badly 😅
Get Gaudi 3, it is cheaper at 15k per card for the same performance as an H100, and you get double the VRAM.
@@iheuzio What a nice suggestion, it looks promising.
More VRAM, more speed, and a lower price. If this is true... well, I can wait a little; I will definitely consider the Intel one now. Thank you very much!
Undervolt the 3090 and it will give you basically the same performance at around 220-250 watts.
No need to undervolt, there is a simple nvidia-smi command to set the power limit.
@@maxmustermann194 That doesn't work as well, the frequency jumps like crazy (well, if you use tensors it works, because tensors only need a frequency of 1500 on the GPU; more than that makes almost no difference).
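For anyone curious what the nvidia-smi power-limit approach mentioned above looks like in practice, here is a minimal sketch. The GPU index and the 250 W target are illustrative values, not recommendations; setting the limit requires root/admin, and you should check your card's accepted range first.

```python
# Minimal sketch: read the current/allowed power limits, then cap the card.
# GPU index 0 and 250 W are placeholders. Setting the limit needs root/admin.
import subprocess

def query_power_limits(gpu_index: int = 0) -> str:
    # power.min_limit / power.max_limit show the range the driver will accept
    return subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.limit,power.min_limit,power.max_limit",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def set_power_limit(gpu_index: int = 0, watts: int = 250) -> None:
    # Equivalent to running: sudo nvidia-smi -i 0 -pl 250
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)], check=True)

if __name__ == "__main__":
    print(query_power_limits(0))   # e.g. "350.00 W, 100.00 W, 366.00 W"
    # set_power_limit(0, 250)      # uncomment to actually apply the cap
```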
Test out the 3060 12GB cards comprehensively please! Also would be nice to hear your opinions on what the best card combos might be for cost to performance.
Pro tip: You can reduce the power draw of the RTX 3090 by 90 watts via undervolting without any speed reduction during LLM inference.
Yep for sure, and good info for people on power limiting. I wasn't going to do that for this test, of course.
The read speed of an LLM (prompt eval tokens/s) only depends on the compute speed of the hardware (which depends on the number and frequency of tensor cores, the number and frequency of CUDA cores, and the chip generation). The write speed of an LLM (eval tokens/s) only depends on the memory bandwidth (in GB/s) of the hardware and the chip generation.
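A back-of-envelope sketch of the bandwidth point above: at batch size 1, generation speed is roughly bounded by how fast the GPU can stream the full set of weights from VRAM once per token. The model size and bandwidth figures below are illustrative; real throughput will be lower because of KV-cache reads and kernel overhead.

```python
# Rough ceiling on eval tokens/s from memory bandwidth alone (batch size 1).
# Assumes every generated token reads approximately the whole quantized model once.

def max_eval_tokens_per_s(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    return mem_bandwidth_gb_s / model_size_gb

# e.g. RTX 3090 (~936 GB/s) running an ~5 GB 8B Q4 GGUF:
print(round(max_eval_tokens_per_s(936, 5.0)))   # ~187 tok/s theoretical ceiling
# e.g. Tesla M40 (~288 GB/s) with the same model:
print(round(max_eval_tokens_per_s(288, 5.0)))   # ~58 tok/s theoretical ceiling
```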
Nice, but then there is price too if it's just a test lab. Considering in Australia a 3090 is USD $1500-$2000, and a 4090 $2500 USD. So tempted to get an old Tesla, but the 3090 just works in a simple motherboard.
What about AMD using ROCm?
Superb! Could you make a tutorial on how to set up and implement everything needed (SOFTWARE WISE) to achieve what you did here?
Yes, soon
Yes, I would love to learn how to do what you do in the setup. Looking forward to the video @RoboTFAI
The GGUF model file format is meant for CPU inference. For GPU inference, use the GPTQ file format (.safetensors). The GGUF format takes more space and has lower quality, but can be used on a CPU.
GGUF is my territory, and for that you need a decent CPU and any server motherboard with plenty of RAM slots (12 of them), the only cheap way to get a real terabyte of RAM and run at the best q8 quality. Speed doesn't affect quality in this area; if you can afford the space for quality, even the slowest setup gives the same result as the cloud, just later in time.
Can you please test the A770 16GB card? Thanks
Right, even just Intel and AMD in general.
Me with a 4090 and a 1500W PSU, chuckling about your concern for burning the house down at 300W. :D I triggered a 15A breaker earlier this week, found out the outlet I have my microwave plugged into is on the same circuit as my home office, and I must have been pushing the GPU at the time. So glad I don't have insane power costs like Europe. Btw, if you need a "the youtubers asked for it, it's a business write-off" excuse for a 4090... you really need a 4090 for testing data comparisons for the people. :D Thanks for all the tests. Would love to see a 4070 Ti in there too, to fight with the 4060.
Haha - but I have been known to burn up a power supply or three, a big UPS, a couple of breakers, etc - luckily the lab has a few dedicated 20A circuits these days. There is absolutely a fire extinguisher hanging in the workshop/lab!
Hey, I would love to buy a 4090, and a million other cards! To be honest, I never really planned on a channel, I put up a video from a discussion with friends (basically to prove them wrong with data) and somehow you folks seem to like what this crazy guy does in his lab? If the channel continues to grow and happens to make money one day, I'm happy to throw it all back into the channel. For now my budget is not much 💸
@@RoboTFAI Haha well you've earned this sub, curious what you end up testing next. "Not much" as you have a handful of $1k GPUs. Carry on good sir. o7
For shootouts you should set your seed value for the run so they are deterministic between cards.
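A sketch of what the seed suggestion above could look like when benchmarking against an OpenAI-compatible local endpoint (LocalAI / LM Studio style). The URL, model name, and whether the "seed" field is actually honored all depend on your backend, so treat these as placeholders.

```python
# Pin temperature and seed so every card in a shootout samples the same way.
import requests

BASE_URL = "http://localhost:8080/v1"      # placeholder local endpoint
MODEL = "llama-3.1-8b-instruct"            # placeholder model name

def benchmark_request(prompt: str) -> str:
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,   # deterministic-leaning sampling
            "seed": 42,         # fixed seed, if the backend supports it
            "max_tokens": 512,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(benchmark_request("Write a short story about a robot lab."))
```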
Something I am really wondering about is Radeon VII vs RX 6950XT (to keep it inside the AMD family).
Having stuff work with ROCm is bothersome and most of what is available for NVIDIA just refuses to run, but as long as only inference is involved it works well (tried some tuning with no success so far).
Would the massive HBM2 bandwidth be able to score any win against more recent, more capable compute? Or if no win were to be seen for the HBM2, how would it affect the scaling?
I'm impressed.. perhaps because I used HP DL580 G7s to mine ETH years ago that are just sitting around and will take these M40s in pairs nicely, with PCIe Optane and a 25Gb/s RJ45 card so they can all 'talk together'.. like 4 of them. With 1200W (208/240V, or 1050W at 125V), of course bandwidth limited on the board. PCIe 2 can run PCIe 3 cards pretty well, but I'm not so sure about PCIe 4 cards. I would have liked it if you could have had the RTX Titan run with this group, though. My other thought is that quant size may vary the accuracy of the output in more subjective matters; questions that can be interpreted in differing ways at the lower quant levels will vary the output results, especially in training. I think that if time is not that big a consideration, the cheapness of the M40 makes it an appealing card set at the 48GB+ level running SLI connections, but I still have not evaluated the bus connections (lane 0 connected to lane 32). Presently setting up one machine with dual P40s.
Hey, you make really interesting and comprehensive videos! Many thanks for that. What I always ask myself and I think maybe many others too(?):
What exactly do you use to connect the GPUs? Your system looks like a mining rig. Is there any performance loss between this kind of extension and a direct connection via a PCIe 16x lane?
Have you already been able to test things like NVLink with your systems? Does it make sense to use different GPU models, or does this create some kind of bottlenecks?
What do you think is important when it comes to choosing hardware to build such a system?
Sorry for all the questions. I just find the whole topic really exciting.
Hey much appreciated!
I use PCIe extenders - make sure they are 8/16x capable at your PCIe level (3,4,5) - I have had good luck with these ones www.amazon.com/gp/product/B09NB9D9PH
I haven't noticed any performance difference of direct vs using these extenders.....but sounds like a good idea for a test...
NVLink = I haven't seen a reason to do it for my needs. It adds extra cost and almost all LLM software/CUDA/etc. supports splitting without it. Could it be a performance increase when using two cards... I don't know, again sounds like something we could test but I have no budget left at the moment.
Your last question is fairly subjective without knowing your requirements, as I don't think most people are doing what I do with my systems - do you want to be able to run big models? small models? multiple parallel models? looking for tokens per second or power usage?
Performance vs Power vs Cost vs Needs (let's be honest it's Wants) - I find this tends to be different for everyone since most people will put one of those at the top of their priorities.
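Related to the riser question above, one quick sanity check that an extender isn't silently dropping PCIe generation or lane width is to compare the current vs maximum link values while the card is under load (cards downshift the link at idle). This is a small sketch using standard nvidia-smi query fields; output formatting may vary by driver version.

```python
# Print negotiated vs maximum PCIe link gen/width for every visible GPU.
import subprocess

def pcie_link_status() -> str:
    return subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,pcie.link.gen.current,pcie.link.gen.max,"
         "pcie.link.width.current,pcie.link.width.max",
         "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout

print(pcie_link_status())
# e.g. "0, NVIDIA GeForce RTX 3090, 4, 4, 16, 16" means a Gen4 x16 link was negotiated.
```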
Hey, question for you: what drivers/process is required to have a 3090 and a K80 running side by side? Depending on the driver I install, it's either one or the other. I believe I need multi-GPU support enabled? Not sure... maybe you might have the clue I need. Cheers.
K80? That's a Kepler card, which I think Nvidia removed support for a few years ago in the drivers/cuda. Not sure you will get them to fully function together just from that.
Paradox of this area - speed does not affect quality. If you can afford max quality (lots of RAM) but on very slow hardware, you'll get the same result as the cloud, just later. Slow AI is even getting more popular in the corporate sector.
Good video, very detailed. I like that you looked at all aspects, power, price, efficiency etc.
I don't suppose you have an AMD card lying around to compare as well? :D
Thanks! I do not have any AMD cards around but would be willing to test them if I got my hands on a few to borrow
My new phone runs the Rocket 3B LLM (~3GB) and gives answers in under 2 seconds.
It has a 3.3GHz chip on a 4nm process with AI hardware support + 16GB DDR5 RAM.
I can use an offline picture generation AI which finishes a 512x512 picture with 20 steps in around 2 minutes.
That's absolutely INSANE IMO
It's crazy, mobile is where I always predicted small models would reign. The technology and the software are advancing at a pace I haven't seen in my career.
Great job! Llama 3.1 is really much better, so I would encourage you to go on a quest: how to run different flavours of 3.1 most efficiently on commodity hardware. The IT projects around LLMs will explode IMO, because the model family is good and a lot of companies cannot share their data with public clouds.
I think you can take your evaluation a little further and tell us cost/token and total power/token. I'm interested in seeing some more high-end builds too. What hardware do we need to achieve 100t/s for instance and beyond? Thanks for the video! This was great!
Some people achieved 100t/s with an RTX 4090. More important than choosing hardware is choosing the right software.
@@Viewable11 I'd say that your argument has flaws. Changing software is a lot easier than getting your money back fully for a GPU and buying another.
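A rough sketch of the cost/token and power/token metrics suggested a couple of comments above. Every input here is an illustrative placeholder; plug in your own card price, measured wall power, electricity rate, and benchmarked tokens/s.

```python
# Back-of-envelope cost-per-token and energy-per-token calculator.

def tokens_per_kwh(tokens_per_s: float, watts: float) -> float:
    # tokens generated per kWh of energy drawn at the wall
    return tokens_per_s * 3600 / (watts / 1000)

def energy_cost_per_million_tokens(tokens_per_s: float, watts: float, usd_per_kwh: float) -> float:
    return 1_000_000 / tokens_per_kwh(tokens_per_s, watts) * usd_per_kwh

def hardware_cost_per_million_tokens(card_price_usd: float, tokens_per_s: float,
                                     amortization_hours: float) -> float:
    total_tokens = tokens_per_s * 3600 * amortization_hours
    return card_price_usd / total_tokens * 1_000_000

# Hypothetical card: 50 tok/s at 300 W, $0.15/kWh, $900 card amortized over 3 years of 24/7 use
print(round(energy_cost_per_million_tokens(50, 300, 0.15), 2))             # ~$0.25 per 1M tokens in electricity
print(round(hardware_cost_per_million_tokens(900, 50, 3 * 365 * 24), 2))   # ~$0.19 per 1M tokens in hardware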
I have two Tesla P40s here, but I have been unsuccessful in my tries at making use of both for my AI workloads. Especially my Stable Diffusion trainings are taking very long. Do you know how I could make them appear as one large GPU?
How many shrouds and fan sizes have you tried on Tesla GPUs? I want to get a quieter run, which a larger fan could theoretically do, but the shroud funneling might be a source of noise, so I don't know what to get for best silence.
Hmm a few, I originally had them in a server that got retired so just some 3D printed shrouds. For bench testing I use high speed fans (very loud)....They require a good amount of air through them to keep them cool.
great work !😊
Many many thanks
I am curious to know if anyone has done tests of e.g. a 3060 vs 3090 vs 4090 on big models that do not fit in VRAM but with GPU offload? E.g. I get 2 tokens/second with a 3060 and a 7950X on 40GB models... Anyone know how a 3090 performs here? DDR5 RAM at 6000MT/s btw.
Mixing of GPUs and CPUs is something we do here, we can dive in further.
@RoboTF AI Thanks for the video(s), they have been very helpful. I would love to see one that goes over the software/drivers as well as the CPU/memory/motherboard you use to set this up. One that would answer the question "If I wanted to combine 2 RTX 3090s so that LM Studio would be able to utilize 48GB of VRAM, what software would I need?" The problem is there's a ton of content for exactly the opposite use case, so much so that the GPT-y bots that I've asked assume that I want to share one GPU with many VMs. Does that video exist? If not... I'll subscribe and wait.
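A hedged sketch for the two-3090 question above: LM Studio, Ollama, LocalAI and similar tools sit on top of llama.cpp, which can split one model's layers across two cards so the combined ~48GB of VRAM is usable for a single model. The example below uses the llama-cpp-python binding directly rather than LM Studio itself; the model path is a placeholder and the 50/50 split assumes two identical 24GB cards.

```python
# Split a single GGUF model's layers across two GPUs with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,          # offload all layers (use a smaller number for partial offload)
    tensor_split=[0.5, 0.5],  # proportion of the model per GPU: half on each 3090
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize why two 24GB cards can host a ~40GB model."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```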
thank you for the great video!
I have actually zero bucks. I hope you can give me recommendations on this one.
I currently have spent 200 bucks on an X99-WS motherboard, so I'll have 4 PCIe slots at full x16 if I don't hook up any M.2 NVMes, I assume.
So that's awesome; it also has a 10C/20T Xeon, low profile, 32GB RAM, and an okayish CPU cooler.
I have already saved $200 more and I don't know what to do. I was going to buy one or two P40s and later upgrade to 4 of them, but now I cannot even afford one;
they're going for almost 300 bucks, I'm afraid. One option is to go with M40s, but I'm afraid they're trash for LLMs and specifically for Stable Diffusion stuff. They're pretty old, although your video shows they're quite good.
I'm lost and I'd love to get help from you. If you think you'd have time, we could discuss it. I can mail you or anything you'd think is appropriate.
special thanks.
K1
Feel free to reach out, 'tis a community! I have several M40s from when I first started down this road that I would be willing to part with.... it's a slippery slope
@@RoboTFAI You're fantastic! Thanks. I'd love to reach out. I would appreciate having your email or something so I can discuss! Maybe we can make a deal on your M40s if you have any spare ones that you don't use? Thanks.
@@k1tajfar714 robot@robotf.ai or can find me on reddit/discord/etc - though not as active as I would like to be.
Can someone explain in a nutshell what this is? Is it an AI language model like ChatGPT that runs entirely offline on my own computer?
That's exactly what it is, if we're talking about LocalAI (localai.io). It's an open source API that mimics OpenAI (ChatGPT) to run open source models.
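A minimal sketch of what "mimics OpenAI" means in practice: the regular OpenAI Python client pointed at a LocalAI instance on your own machine, so nothing leaves your computer. Port 8080 is LocalAI's usual default and the model name is whatever you have installed locally; both are assumptions here.

```python
# Use the standard OpenAI client against a local, offline LocalAI endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your local LocalAI endpoint (assumed default port)
    api_key="not-needed-for-local",       # a real key is not required locally
)

reply = client.chat.completions.create(
    model="llama-3.1-8b-instruct",        # placeholder local model name
    messages=[{"role": "user", "content": "Explain what running an LLM offline means."}],
)
print(reply.choices[0].message.content)
```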
Just wanted to say I really appreciate your content and would appreciate even more if you can find a way to enlarge texts so it’ll be easier to read. Thank you so much
Sorry - recorded and best viewed at 4K - I'll try to do better at making things larger for people on smaller screens. Thanks for the feedback!
Could you please test out the 4070 Ti Super?
There may or may not be one in the lab the channel hasn't seen yet 😜
These letters are very small. It's like a blank screen.
Sorry - recorded and best viewed at 4K
How does it feel? 🙂 It feels like we don't see much of your screen 😀 We can trust that you're telling the truth 😀 Joking a bit, thanks for the review, but please do zoom in a bit next time so we can see more.
Sorry - recorded and best viewed at 4K - I'll try to do better at making things larger for people on smaller screens. Thanks for the feedback!
@@RoboTFAI 👍 many thanks in advance
llama is pure garbage. Worse than GPT. It refuses to answer some of the most basic questions.
I don't understand why you use a Mac to view the server, that's the most questionable part of the whole system. I've used MacBooks myself, but the latest macOS is a dead OS compared to earlier ones; devs abandoned it, and that's why they shove iPhone apps in there. There's a Mac mini lying on my table; I would never use it for this, it overheats like an oven even for web browser use.