I saw a Reddit post of a guy running 4x P100 16GB for under $1300 and getting 30 tokens a second with vLLM on 70B Llama 3 lol. I'm so happy to see other builds like dual 3090s too. So far I have managed to pick up one Titan RTX, and I'm hoping to shoot for a 3090 or another Titan RTX.
It's been very cool to see the use cases of older cards for local LLM setups. I want to grab a Tesla P40 at some point and put it in a restomod LLM PC, if nothing more than for the cool factor of how it looks.
That original pc case is actually really cool
It is an old Lian Li case. I agree, and I actually bought another similar older case because I liked it so much.
Excellent, I am getting ready to build something like this, using an Epyc CPU and Supermicro MB.
That will be a great system. It's been a ton of fun to have and the amount of new repos I am finding that allow me to take full advantage of the system has really been enjoyable.
Titan RTXs can bridge memory for 48GB of VRAM; at half precision, that's as good as 96GB. They can do some serious AI work. The only other similar option is the A6000, but it costs 2x as much as a pair of Titans. They're from the RTX 2000 generation, so a bit slower than a 3080, but not by much.
I did not know that the Titans can pool memory. That would be very useful for some of the image/video generation models, as even with an NVLink the pair won't appear as one card (at least from what I'm aware of RE my 3090s).
@@enilenis Very good point. Yes, in the Open Sora test I did, and based on others' feedback when trying it, the bottleneck is always the VRAM. I would be fine with trading speed for more VRAM.
Great video!
Thanks very much!
Great! Thanks for sharing, that’s exactly what I’m trying to do.
Sure thing! It is a fun setup indeed!
Do you plan to use NVLink with the new Ryzen setup?
It is something I would like to add once I swap over to a Threadripper. I have seen conflicting opinions on how much it helps, but I would like it for "completeness" if nothing more.
@@OminousIndustries It's true, I work in the VFX industry. We used 3090s a lot in pooled render rigs. It's one feature I miss in the 4090.
Good setup. The only issue I have with it is how close the GPUs are stacked together. I understand that these are some thickkk boy GPUs, but you should have at least ONE PCIe slot of separation MINIMUM between these 2 GPUs. The least of your worries will be the top GPU overheating; the CPU will get too hot for too long.
Yes, it was pretty bad cooling-wise. The GPUs are now watercooled and do not get very hot at all anymore!
The video is very cool; the 3090s could look very beautiful in a nicer case.
Thanks very much! I am going to be swapping everything over into a Thermaltake View 71 case very soon.
Unlike the inside of that case!
Thanks for this video.
Sure thing, thanks!
Bro, use nvtop. You're welcome.
I'm going to install that tonight for my Intel GPU build; I previously hadn't found a monitor for that GPU on Linux.
Thank you for suggesting it. Super useful!
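For anyone who would rather poll the NVIDIA cards from a script than keep nvtop open, here is a minimal sketch using the nvidia-ml-py package (pynvml); it only covers NVIDIA GPUs and the output formatting is just an example, so treat it as a starting point rather than a replacement for nvtop.

```python
# Minimal GPU poll using nvidia-ml-py (pip install nvidia-ml-py).
# Prints temperature, utilization and VRAM usage for each NVIDIA card.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i} ({name}): {temp}C, {util.gpu}% util, "
              f"{mem.used / 1024**3:.1f}/{mem.total / 1024**3:.1f} GiB VRAM")
finally:
    pynvml.nvmlShutdown()
```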
Could this be used as a badass gaming computer? Would two 3090Ti’s run games faster than a single 4090?
I am not much of a gamer, but from what I have read, you're better off with a single 4090 as it is faster and can take advantage of newly released game-centric technologies. I do not believe VRAM is as big of a consideration for gaming as it is for something like LLMs, which is what this machine was made for.
I am awaiting my second 3090 ti, probably going to end up water cooling. How has it been for you with heat management?
I have not seen crazy temps while running local LLMs. I did render something in KeyShot Pro that made the cards far too hot, but for any LLM stuff it hasn't been too bad at all.
I wonder how hot the GPUs will get at full load?
They would get too hot, I ended up liquid cooling them as I was not comfortable with the potential heat they would generate at full load.
Would connecting the two 3090 Tis with NVLink make it more capable of handling AI models?
I have heard conflicting info on this, so I won't speak with certainty. With that said, it is my understanding that if you're just running LLMs and such, it won't make a noticeable difference. I have heard that in training instances it may add some benefit, but I am not able to verify that myself.
@@OminousIndustries The newer PCIe ports can substitute for NVLink. There will be a time delay in training etc., but not that bad, and they can do mem pooling as well.
Hello, I'm just building a setup with two 4080 Founders Edition cards on the Asus Z790 WiFi board, but I'm wondering if there will be a problem with cooling and temperatures, since these are such large 3-slot cards sitting one above the other. Do you have any problems with that?
Honestly, yes. I would look for alternate cooling solutions for the long term. If I have both cards training a model, the top card has sometimes gone north of 85C, which is not good. I believe as an alternate solution you can also power limit the cards to help prevent them from getting too hot, but at this point I am seriously looking into some water cooling solutions.
@@OminousIndustries You don't tray mount one card vertically?
@@rafal_mazur Not a possibility in this current setup. Once I swap everything over to the Thermaltake View 71 I have (embarrassingly) had sitting in the box for a few months, I will explore alternate mounting options, though I will likely just water cool the system at that point since I will have to "rebuild" it all anyways.
Nice! Need more videos like this 😶😶
Thanks very much !!
Extremely toasty if you do not lock the power limit under 300W.
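On the power-limit point above: here is a minimal sketch of capping both cards with nvidia-smi, driven from Python. The 300 W cap and the GPU indices 0 and 1 are only examples for a two-card box; it needs root/sudo, and you can check each card's supported range first with nvidia-smi -q -d POWER.

```python
# Hedged sketch: cap two GPUs at 300 W using nvidia-smi (requires root/sudo).
# The wattage and indices are illustrative; adjust them for your own cards.
import subprocess

POWER_LIMIT_W = 300

for gpu_index in (0, 1):
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(POWER_LIMIT_W)],
        check=True,
    )

# Confirm the new limits took effect.
subprocess.run(
    ["nvidia-smi", "--query-gpu=index,power.limit", "--format=csv"],
    check=True,
)
```

Note that the limit resets on reboot, so it has to be reapplied (e.g., from a startup script) if you want it to stick.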
Is there any real notable difference between 3090s and 4090s when running AI? Great video btw
Thank you! I do not have personal experience with a 4090 for any AI-related tasks, so take what I say with a grain of salt. My understanding is that while there will be a speed differential in favor of the 4090 for certain tasks, the VRAM is the important part in terms of being able to "fit" the thing being run. To simplify, both cards would be able to run the same "items", though the 4090 would be faster in terms of generation speed. Another example is offline video gen stuff like Open Sora: the bare minimum generation needs a 24GB card, so both the 3090 and 4090 would be able to generate a result, though speeds may differ. In terms of training and multi-card setups, there appears to be a lot more to consider beyond the speed differential of the two cards, such as memory bandwidth, a CPU with support for the necessary number of PCIe lanes, etc. The r/LocalLLaMA subreddit is a very good resource, as a lot of folks there have experience with these sorts of builds and setups. If the budget is there, I would go with the 4090(s), though with that said, a great many of us are happily having fun and running neat things with 3xxx series cards as well.
For some large LLMs, the difference lies in whether you can run the model at all, depending on whether you have 24GB or 48GB of memory.
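To put rough numbers on the 24GB vs 48GB point, here is a back-of-the-envelope sketch of weight memory for a quantized model. It only counts the weights (no KV cache, activations, or framework overhead), so treat the results as a lower bound.

```python
# Rough VRAM estimate for model weights only; real usage is higher once
# you add KV cache, activations and framework overhead.

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB for a given parameter count and quant width."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1024**3

for bits in (16, 8, 4.5, 4):
    print(f"70B at ~{bits} bits/weight: ~{weight_gib(70, bits):.0f} GiB")
```

That works out to roughly 130 GiB at 16-bit, 65 GiB at 8-bit, and around 33-37 GiB in the Q4 range, which is why a ~Q4 70B model fits across 2x 24GB cards while the full-precision version does not.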
Is heat an issue for you? Considering getting another 3090 for my build, but worried about overheating. Do they really generate that much heat during inference?
Yes, it was, and yes they do, though training is what made them too hot for me to be comfortable with. I ended up watercooling both of them and they run much, much better. That said, the case I had them in was from the late 2000s and was not designed to provide adequate airflow to a setup like this, so I believe that had I air cooled them with a proper modern case and fans, it may not have necessitated going liquid cooled.
Ok, so I am very interested in local LLMs and found that my system is way too weak for my liking. But I really have to ask: what are you doing with this technology? I have no "real" use case for it and wouldn't consider buying two new GPUs for it. What are the actual beneficial use cases for it? Maybe coding?
I have a business that utilizes LLMs for some of my products, so it is a 50/50 split between business-related research and hobbyist tinkering. The requirements to run LLMs locally are heavily dependent on the type and size of model you want to run. You don't need a large VRAM setup like this to fool around with them; I just went for this so that I could run larger models like 70B models. Some of the smaller models would run fine on an older card like a 3060, which can be had without breaking the bank. Some of the model "curators" post the VRAM requirements for their models on Hugging Face, bartowski being one who lists them.
@@OminousIndustries thank you for the insights really appreciate it
@@M4XD4B0ZZ Of course!
Does the dual GPU use NVLink bridge, Bro?
It doesn't. Now that the cards are watercooled I do not believe the bridge would even fit on them either so I will likely never have it installed.
@@OminousIndustries Ok, thanks! This is very inspiring!
Are you able to run Llama 3.2 90B, or does this exceed the available VRAM?
I have not tried that. I believe hypothetically I could run it in some form of quant, but I don't know if it would be one worth using, as I have read that things seem to get a bit sketchy output-wise below Q4.
What motherboard are you using? I need to find one that will fit two big cards like yours.
It is an MSI Pro Z690. With that said, I think any ATX mobo with multiple PCIe slots would accommodate the two cards. Be mindful of temps with a setup like this however, as the cards were getting uncomfortably warm during training.
@@OminousIndustries Hi, what cpu are you using?
@@jesusleguiza77 It is a 12th gen i7
Hi, I have 2x 3090 on an Asus Crosshair X670E Hero; could you show me how to enable NVLink please?
I unfortunately have not yet NVLinked these cards, so I won't be much help with this. I would suggest heading over to one of the ML-adjacent subreddits like r/LocalLLaMA, where I'm sure at least a few people have gone through those steps and could help you out!
NVIDIA RTX NVLink Bridge P/N: NVRTXLK2 or NVRTXLK3
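For what it's worth, there is no software toggle to "enable" NVLink on Linux beyond installing the physical bridge and a current driver; you mostly just verify that the driver sees the link. A hedged sketch of the two usual nvidia-smi checks, wrapped in Python for convenience:

```python
# Hedged sketch: verify NVLink after installing a physical bridge.
import subprocess

# Per-link status; active NVLink lanes show up here with their speeds.
subprocess.run(["nvidia-smi", "nvlink", "--status"], check=True)

# Topology matrix; an "NV#" entry between GPU0 and GPU1 means NVLink is in use,
# while "PHB"/"SYS" means traffic is going over PCIe instead.
subprocess.run(["nvidia-smi", "topo", "-m"], check=True)
```

Frameworks like PyTorch should then pick the link up automatically through NCCL when doing multi-GPU work.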
Do you use this setup just for running 70B LLMs, or for fine-tuning?
Just for 70B models initially, but I now use it a lot for SD and other random experiments, which I suppose don't necessarily need two cards, but oh well lol
@@OminousIndustries Thanks for replying man, nice info. I was asking because I want to have the same setup lol, but with dual RTX 3090 non-Ti.
@@ALEFA-ID Of course! It is an awesome setup to have regardless of whether the cards are Ti or not, and 48GB builds are sick! Check out r/LocalLLaMA on Reddit as well if you're interested in the community based around home setups like this.
@@OminousIndustries Yeah that's so great, can't wait to have that setup. Thanks again dude!
@@ALEFA-ID Of course! I am excited for you hahah if you have any other questions or anything feel free to reach out anytime! I just got two waterblocks shipped so I will be posting a watercool build video soon(ish) :)
I have the same setup. I use to make AI furry p0rn.
Wouldn't 4x P40 be cheaper and better performance-wise?
It would only be better in terms of how large a model I could run. They are slower at running the models, and having four separate cards would have added additional considerations like a new mobo and having to deal with linking them all to use the VRAM with whichever service I was going to use them with.
How'd you increase your swap file? I have the same issues with 72B models running dual 3090s
These instructions should work, though I have only used them on 22.04: wiki.crowncloud.net/?How_to_Add_Swap_Space_on_Ubuntu_22_04#Add+Swap
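The gist of that guide, for anyone who doesn't want to click through: allocate a swap file, lock down its permissions, format it, and turn it on. A hedged sketch of those same steps driven from Python; the 64G size is only an example, and it assumes Ubuntu with sudo available.

```python
# Hedged sketch of the standard Ubuntu swap-file steps (needs sudo/root).
# The 64G size is illustrative; size it to your RAM shortfall and free disk.
import subprocess

steps = [
    ["sudo", "fallocate", "-l", "64G", "/swapfile"],  # reserve the file
    ["sudo", "chmod", "600", "/swapfile"],            # restrict permissions
    ["sudo", "mkswap", "/swapfile"],                  # format it as swap
    ["sudo", "swapon", "/swapfile"],                  # enable it immediately
]

for cmd in steps:
    subprocess.run(cmd, check=True)

# To keep the swap file across reboots, append this line to /etc/fstab:
#   /swapfile none swap sw 0 0
```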
Hello! I was trying to find 3090 cards but only found different brands (Galax and EVGA). Would that affect performance, or should I keep searching for the same model of card?
I can't say for sure as I have not personally tried with different cards, but I will say that I have seen people combine multiple cards of the same family, like a 3060 + 3090, and make it work, so I wouldn't think there would be any issue. Different cards can have slight differences like clock speeds, etc., though I don't believe that would make a noticeable difference.
@@OminousIndustries Thank you for the answer!
@@Chilly.Mint. Sure thing! Be sure to check out the localllama subreddit as well, it contains a lot of personal experience from people running setups like this one.
What is the aim for using OpenDalle? Is it just... for fun, or is there some monetary gain to be had through this?
Personally I just use it for fun. Some people use these uncensored image models to generate NSFW images that they then release on Patreon, etc. to make some money, but that is not in my wheelhouse.
Where did you buy your card?
I got it at Micro Center, they were selling them refurbished. Not sure if they still have any in stock. They also had 3090s.
What CPU and motherboard? What is the temperature of the cards? Thanks!
The CPU is an i7-12700K and the mobo is an MSI PRO Z690-A. I purchased them as a Micro Center bundle about a year ago. I have not seen the card temps get over about 75C when using text-gen-webui. I was using KeyShot Pro for something and decided to use both cards to render the project and they got far too hot, so cooling is the first priority to be upgraded.
@@OminousIndustries Okay thanks. Yeah, there's not much space in that case. I have a bigger case; I'm looking to get another 3090 or 4090 and possibly water cool them. Would be nice to get an A6000 but that's too much right now.
@@codescholar7345 I have a Thermaltake View 71 to swap them into when I get the time. The A6000 would be awesome, but yeah, that price could get you a dual 4090 setup. A water cooling setup would be very cool and a good move for these situations.
@@OminousIndustries Hi, did the PCIe slots end up working at x8 and x8? Regards
Also curious on this one: does the MSI PRO Z690-A support PCIe bifurcation at x8/x8?
Hi Ominous, I'm looking for a good spec to train LLMs/AI (and game sometimes) with a budget of $3000. Is 2x 3090 Ti 24GB the best option for my budget?
I believe the consensus is still that dual 24GB cards (like the 3090) are the best "budget move"; however, I would head over to www.reddit.com/r/LocalLLaMA/ and browse/ask there, as there are a lot of knowledgeable people there who can provide good insight on this. For what it's worth, I don't believe it will make a huge difference if you get non-Ti cards; I just bought them because I wanted to match the first card I had, which happened to be a Ti.
So, any trouble?
Still going strong, the cards are now water cooled as they were getting too hot being that close together for certain tasks.
What about llama3?
I tested a small version of it in one of my more recent videos!
A Cooler Master CPU cooler is proper.
It definitely is, had to keep it simple this time, though!
next: make aquarian
u should totally let me buy ur build
I'm in the middle of building a custom loop and I'm about ready to toss it out the window so maybe hahaha
@@OminousIndustries dude ong id buy it
@@Jasonlifts I finished it and am happy with it again lol
BIG price... I guess 200 bucks too high.
I think your build is slightly undersized for that model...
Do you mean the physical components, or for a 70B model? It runs Q4 EXL2 quants of Llama 3 70B very well and at a decent speed.
@@OminousIndustries Well, I am jealous! I want to build a rig. I thought you needed at least 140 gigs of VRAM for a 70B model...
@@braeder It's a lot of fun to have. Not necessarily: quantization essentially removes some of the "precision" of the model, but in turn allows it to be a much more manageable size so it can be run on less horsepower. There are tons of quants of many models on Hugging Face, so it's pretty good pickings to find something you like that will fit on a specific setup.
@@OminousIndustries I see! I am running a 30B on my CPU right now... taking forever! I want a script that can offload some calculations onto my GPU for a program I am writing. But building a GPU server will really open up some possibilities! I was thinking of used K80s.
@@braeder Yes, the CPU will struggle due to the way the models are run. The older Tesla cards are good, but there could be potential compatibility issues with some libraries and things like that, though I can't say for certain without experience with those cards. I think having a couple of 12GB 3060s would open up a lot of possibilities, as they will allow you to use most common current libraries and such. For navigating this space, this subreddit is a really good resource to find info on good setups and other related things: www.reddit.com/r/LocalLLaMA/