Also wanted to say that I think the research path you chose has countless possible applications, and to the best of my knowledge there isn't any other software geared towards telling AI programs apart from actual people on social media. It's a brilliant idea, dude. I wish you the best of luck going forward. I'm rooting for you! Cheers!
Thanks for the praise and luck, I thrive on luck!
I really appreciated the time and effort you took explaining all this in detail. I've known about NVLink, but I've never heard it explained or seen examples of how it can be beneficial like you did here. Awesome video, man. Thanks for the content. Good job, brotha.
I have so much in my head, wish I had more time to share it. Thanks for watching.
I bought my EVGA 3090 Hydro Copper from eBay. It was supposed to be "used" but turned out to be brand new/open box. Peel ply wasn't even removed. I imagine someone bought it for a water cooled rig that they never got around to building. Water cooled cards are the same price, and sometimes cheaper than air cooled cards on eBay. I will add more in time for local inferencing, which I'm already happy with but I imagine my needs will grow over time. Really nice rig you've built here.
I bought a Hydro Copper Kingpin in great shape, and I think people avoid them because of all the waterworks. I sort of collected these slowly over time, and oddly people weren't buying the 3090 Ti on eBay when I was looking. You just have to check the model number and make sure what you are getting is the correct model. I once bought a 3090 thinking it was a Ti, and when it wasn't I returned it.
That would make a lot of points if you used it for folding.
I used to do Folding@home on my PlayStation 3 for fun. Later, I realized it was like crypto mining: your power bill was higher for another's gain.
Sweet build! I'm building a similar rig using NVLink with 3090s, since the ROI is currently great compared to the enterprise cards. I'm surprised more people aren't adopting this approach for local AI. Also, what are the CPU/MB specs (Threadripper + Asus Pro)? And how useful has Intel Optane been? Would love to see an update video detailing any changes, additional GPUs, or insights! Thanks!
Yes, the ROI works if you can make money on this compared to an enterprise setup. For me this is implicitly making money in that it helped me get my doctorate, which should increase my income by way of Generative AI and Cybersecurity knowledge. Here is the playlist for my Hephaestus Build th-cam.com/play/PLqL965J4xElJek_JlCG60EOddZZykeb1y.html - I have another video - th-cam.com/video/r_48PaGLMnA/w-d-xo.html that shows the CPU, motherboard, and memory: AMD Ryzen Threadripper Pro 5975WX, Asus Pro WS WRX80E-SAGE SE WIFI II, and 512GB of Micron 64GB DDR4-3200 RDIMM 2Rx4. The Optane is supposed to be great due to its IOPS, so latency is really low, making it a fast drive for repetitive reads/writes. The durability is supposed to be insane with the Optane, but I'm not sure; only time will tell. Again, check out the playlist, and as I work on stuff and have time I plan to make videos and updates. My rigs get upgrades over time as I mess with stuff.
Does using NVLink help with LLM inference speed? Like, say, Aphrodite Engine or vLLM. I'm going to switch from the standard prosumer "two 24GB cards on koboldcpp" to "actual parallelism across 8 32GB cards" and am trying to get the most out of them because they're older.
Love your build btw.
The NVLink so far hasn't technically sped up inference because most of the time the LLM I'm running fits on one GPU. If you can run everything on one GPU the response is best; I haven't yet tried to see if I can make a single inference span multiple GPUs. That's something I may have to dig into as LLMs keep getting larger. I took a quick look at Aphrodite and think there will be many frameworks like it that scale better across GPUs. I did notice I was able to span and leverage a mix of cards with two 3090s, two 3090 Tis, and three 4090s using PyTorch.
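If you do want one inference to span several cards, the simplest thing I've seen is letting HuggingFace Accelerate shard the layers for you. Rough sketch below; the model name is just a placeholder and this isn't exactly what I ran:

```python
# Hedged sketch: sharding one large model across every visible GPU with
# device_map="auto" (HuggingFace Transformers + Accelerate). The model name
# is a placeholder, not the exact model used in the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-70b-hf"  # placeholder large model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision so more layers fit per card
    device_map="auto",          # Accelerate splits layers across the GPUs
)

inputs = tokenizer("Hello there", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```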
Hi! I watched your video about using NVLink that you posted about six months ago, and I found it really fascinating. I was wondering if you had the chance to compare the performance of a language model (like llama) with and without NVLink? If so, how much of a difference did you notice in terms of speed and efficiency? Also, do language models generally support this interface? I’d really appreciate it if you could share your experience with us. Thanks a lot in advance!
I was going to make a video to show this, but my research deadlines approached fast. I found that with BERT I was able to get about a 20% uplift using the NVLink. I wrote new code to work with Llama and didn't perform the same test. PyTorch does some loading under the hood, so I'm not exactly sure whether it helped greatly or not. When I have a moment I'll try to confirm and make a video as well.
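For anyone wanting to confirm the bridge is actually in play before trusting benchmark numbers, a quick peer-access check from PyTorch plus nvidia-smi goes a long way. Just a sketch of how I'd verify it:

```python
# Hedged sketch: confirming GPU-to-GPU peer access (what NVLink enables)
# before comparing with/without numbers.
import torch

if torch.cuda.device_count() >= 2:
    print("GPU0 -> GPU1 peer access:", torch.cuda.can_device_access_peer(0, 1))

# From a shell, `nvidia-smi nvlink --status` and `nvidia-smi topo -m`
# show whether the bridge is active between a given pair of cards.
```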
@@nispoe Thanks so much for the response! That 20% uplift with BERT is really interesting. I understand how research deadlines can be, so no worries! If you do get a chance to test it with Llama or any other models in the future, I'd love to hear about it. Looking forward to that video whenever you have time to make it - I'm sure it'll be insightful. Best of luck with your research!
@@alirezashekari7674 I finished my research, so I'm now a doc. I did run Llama 2 and 3. I found that, depending on what I was training, the three 4090s were at times so much more powerful than the four 3090s.
I've been a bit concerned by some of the things I've read about trying to use multiple PSUs to power GPUs, basically warning against powering the GPUs with a PSU that isn't also powering the motherboard because of separate grounds and synchronization issues. Did you have any concerns about that? Any issues running things off multiple PSUs when the various system components are essentially plugged directly (granted, with riser cables) into the PCIe slots on the motherboard?
There may be synchronization issues between the motherboard and GPUs, but with what I have done it all seems to work fine. You just need to make sure to power on the GPUs before the motherboard is turned on. This may make the fans on the 3090s spin at full speed initially; the 4090s seem to control this better. If you work with LN2 and overclock GPUs, this is something people do all the time. Also, if you have the PCIe cables attached and the GPUs not powered on, you'll see red lights where power is being supplied and nothing will happen with the motherboard and OS; the card will not be seen as installed. I should make a video about this, there are some things to know here that many probably don't.
Hi. Very instructive - thanks for sharing. Question: using NVLink, what is the biggest model size you are able to run? Is the limit equal to 2x 24 GB, since the bridge connects only 2 cards? Or can you run bigger models? And I mean not due to quantization or with llama.cpp, but in GPU VRAM.
There is a difference between fine-tuning and running an inference. The largest I could tune was 140 billion parameters; I could maybe do more, but stopped there for my research. I did try running larger models for inference and could go much larger, but again I was focused on my dissertation; I played with Falcon, which I think was 180 billion parameters. The one thing to note about the NVLink is that it only helps with sending data between the cards; I was not able to figure out how to pool memory between them. Running 2 cards definitely helps run larger models, but the difference with NVLink is the time savings. So just having 48 GB lets you run larger models, and if you use quantization and other techniques then you can get into the 7 billion models. I should make some videos to show what I mean...
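To give an idea of the quantization side, this is roughly how you load a big model in 4-bit so it squeezes into less VRAM than the full-precision weights would need. A sketch only; the model name is a placeholder and not necessarily what I ran:

```python
# Hedged sketch: 4-bit loading with bitsandbytes via Transformers, so a model
# far larger than the full-precision weights can fit in the available VRAM.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B",            # placeholder for a large model
    quantization_config=bnb_config,  # weights stored in 4-bit
    device_map="auto",               # spread whatever is left across the GPUs
)
```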
Thanks for the update. Super cool set-up. Can you NVLink different brands of the same GPU model? I have a bunch of RTX 3080s and a single RTX 3090. I might get another 3090 for the NVLink.
I know I was able to connect an EVGA 3090 Kingpin with an EVGA 3090 XC3. It comes down to the height the NVLink connector has to reach. The other 3090s are taller and have problems connecting, but you can make it work by propping a card up so the NVLink lines up. If you are asking whether you can mix a 3090 from EVGA with one from Asus or MSI, I believe you can, but you may have to adjust the card height to make the NVLink match up.
@@nispoe yeah it makes sense. I have been reading up on the NVLink on reddit since I watched your video to see the performance gains. I hope you can make a video about the temperatures if you run the GPUs for hours. I suppose you can always add Noctua fans. That's the beauty of using an open frame.
@vitalis Yes, I did one AI training session where all GPUs were running for more than 27 hours continuously. When I do something that runs long, I'll try to capture and show temps, fan speed, memory usage, and such. I do have Noctua fans I could put in place, but I live in the Midwest and temps are cooler here at the moment.
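When I do capture a long run, the plan is basically to poll nvidia-smi into a CSV for the whole duration. Something like this sketch; the file name and one-minute interval are arbitrary choices:

```python
# Hedged sketch: logging temperature, fan speed, and memory use for the whole
# length of a long training run by polling nvidia-smi.
import subprocess
import time

QUERY = "timestamp,index,temperature.gpu,fan.speed,memory.used,utilization.gpu"

with open("gpu_log.csv", "a") as log:
    while True:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        log.write(result.stdout)
        log.flush()
        time.sleep(60)  # one sample per minute is plenty for a 27-hour run
```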
Great video! What case are you using? 😊 No additional fans in use for the GPU's?
It's the Kingwin KC 8-GPU open case; I have a video on it. I put it on an IKEA cart I bought a while ago. I was going to add extra fans, but it seems I'm okay without them at the moment while running some AI stuff on it.
That is cool. I'd like to know what motherboard you are using.
I plan to build one like yours but start with 2 GPUs first, then add 2 more GPUs later.
Thanks for sharing your work.
I went with the Asus Pro WS WRX80E-SAGE SE WIFI II. I think having x16 lanes on all slots is amazing. The GPUs are all PCIe 4.0, and I did find something new about the last 3 slots and their redrivers. Because of this I ended up upgrading the cables to the newer Linkup PCIe 5.0 AVA5 cables. I messed with the redrivers as well and improved the signal, but it took a while and is probably better explained in a video.
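A quick way to see whether a redriver or cable is hurting you is to check the negotiated PCIe link for each card; a marginal signal usually shows up as a slot silently dropping to a lower generation or width. Rough sketch:

```python
# Hedged sketch: print the current PCIe generation and width each GPU
# actually negotiated, which is where bad cables/redrivers show up first.
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
     "--format=csv"],
    capture_output=True, text=True,
)
print(result.stdout)
```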
@@nispoe Was searching for this info. Thanks for your content
I'm building a somewhat similar rig for AI as well (same MOBO, CPU, PSU, and NVMes).
I'm wondering if you have any lessons learned from installing the CUDA drivers? I've never done that before, and apparently it's a pain, so it's got me a bit nervous!
Yeah, it is somewhat of a pain, but once you've done it enough times it's good experience and good to know. I have some instructions on GitHub, and some lessons learned are in my research paper. This is the GitHub location github.com/nispoe/kuk-praxis-hephaestus/blob/main/ai-machine-setup.md and here is my published dissertation scholar.google.com/scholar?hl=en&as_sdt=0%2C14&q=Detecting+Machine-Generated+News+Using+Fine-Tuned+Transformers&btnG=
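Once the drivers are in, the first sanity check I'd run is just confirming PyTorch can actually see all the cards:

```python
# Quick post-install sanity check: the driver, CUDA runtime, and PyTorch
# build all agree and every GPU is visible.
import torch

print("CUDA available:", torch.cuda.is_available())
print("PyTorch built for CUDA:", torch.version.cuda)
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```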
@@nispoe Awesome, thank you! Much appreciated!
@@datascienceharp I have other videos with the machine completed but haven't posted them yet. After building I had to get started running my experiments. Now that I'm done, I hope to work on a video explaining what I had to do to add 7 cards and what I learned about the PCIe redrivers on the last slots, since they are farther from the CPU.
@@nispoe looking forward to those!
Does this setup give you two VRAM pools of 48GB each? Or does it result in a single VRAM pool of 96GB?
Short answer is 96GB, and not a "pool" as you might think. I was expecting the NVLink to act as a way to pool VRAM between cards, but it only seems to help with the time it takes to transfer data between them. I was messing with various frameworks while training: I tried HuggingFace Accelerate and Meta PyTorch Distributed Data Parallel (DDP), and in the end used Low-Rank Adaptation (LoRA), which I believe leverages PyTorch DDP. What happened was that adding more cards enabled me to work with larger models because I could distribute the load across the 4 GPUs with 24GB of VRAM each, which is where the 96GB for the four cards comes from. Later I added the final 3 GPUs and was able to run with a combined 168GB of VRAM. I was thinking (and maybe I just didn't find the configuration for it) that NVLink would give you 48GB with two cards in some way, and it does not. NVLink speeds up data exchanges between the two linked cards instead of routing through the PCIe lanes to the CPU. To fine-tune larger models or run a larger LLM instance, VRAM is the major constraint. I was ultimately able to train a model that was 140 billion parameters. I think I could go larger but only went so far; I was utilizing roughly 60 or 70% of the memory and ran some fine-tuning that went for around 15 days. I also had deadlines for my doctoral research, so I had to stop myself from going too far. As a hobby, time willing, I planned on digging into this more to get a deeper understanding. Planned on making more videos to show this, but life... it happens...
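For what the DDP side of that looks like in practice, here is a minimal skeleton: every GPU holds a full copy of the model, the batches are split between them, and the gradients get synced each step, which is the traffic NVLink/PCIe bandwidth actually affects. The tiny Linear model and hyperparameters are placeholders, not my training setup:

```python
# Hedged sketch: minimal PyTorch DistributedDataParallel skeleton.
# Launch with e.g.:  torchrun --nproc_per_node=4 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group("nccl")            # NCCL uses NVLink/P2P when present
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):                     # stand-in training loop
        x = torch.randn(8, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        loss.backward()                        # gradients all-reduced across GPUs here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```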
So pretty much one of the new 5090s is faster than all four of those cards, minus the VRAM.
@@videocruzer Yes, if you can run on one 4090 or 5090 and the VRAM is enough, that is what you want to do to save time. I found that my constraint was more the VRAM, so multiple cards for VRAM mattered more than GPU processing power. There were times I hit limits training purely because I didn't have enough VRAM.
@@nispoe I get the RAM thing. Me, I don't plan on using the AI stuff until I can just punch in a whole script and/or screenplay, as I have had way too much of my property stolen over the decades. I am guessing that the software access has some super confusing end user agreement that will stiff the average user for their screenplay at the end of the day.
@@videocruzer I think my situation is different; I used these GPUs to learn more about setting up the hardware and software for my doctorate. I had to run for months to get the results I needed to support my dissertation's hypotheses. I think the end user agreements are what you would expect for hardware and for software frameworks like CUDA, Meta/Facebook PyTorch, etc. Having the cards locally instead of paying thousands for a cloud GPU environment has the perk that I can keep making mistakes and keep using the hardware after the initial purchase cost. In one of the articles I read, it sounded like one research team spent $25,000 on training, and who knows how much on mistakes. Sorry to hear you have had so much property stolen over the decades... maybe time to be somewhere else and have your hardware somewhere else?
@@nispoe Naa, I am a descendant of Polish grandparents that survived their Nazi concentration camp experiences and a British grandmother that was a spy for the Allied side during the 2nd great war (Gwenn Fussy was assaulted and dropped from her overhead transfer lift 5 times in a row, almost every Saturday shift, until she was dead from her injuries), not to mention my Canadian grandfather that survived the war against the Germans. Surviving in a city after being hit with a hard ball and stuffed into a deep freezer, because my Polish grandfather stood up to three of his neighbors that were threatening his family members with a bat about 1 year before the hit on his grandson happened, just makes the story that much funnier. Being forced to live in one of the most corrupted cities in Canada, you end up connecting to the crowd that flushes them out. It has been interesting to say the least. You know the saying, there are always survivors that live long enough to tell the tale. BTW, one learns valuable skills since the mid 70's. Look for the book, movie and video game deal tagged "Mares Leg" ;)
@videocruzer Sounds interesting; my daughter is a quarter Polish. I think many families have some sort of history they have lived through. I feel lucky to have gone through my own hardships and still be able to tinker with all this tech. I just recently had a new son, and never thought I would have another kid after my daughter. I want to try to give them what I never had, and not have them experience the bad things I have. Just like what I hope any parent would want.
Yo, Shadow of the Erdtree just dropped. You gonna do a level 713 playthrough?
Really want to, but too busy right now. Working through my doctorate and defense is this month.
What CPU, board, and how much ram are you using? Thanks
@@patrickdepaolo8932 Pro WS WRX80E-SAGE SE WIFI II and 512 GB of Crucial ECC 3200 RAM
@@nispoe Thanks! I’m making something similar for my clients soon so I want to get an idea of what works. Which threadripper did you use? I’m not incredibly familiar with AI applications but do you actually need a powerful CPU or do you just need the PCIe lanes to utilize the GPUs? Thanks so much, the video was helpful
@@patrickdepaolo8932 I used a 5975WX Threadripper Pro and this motherboard so I can run 16x PCIe 4.0 on all 7 GPUs.
@@patrickdepaolo8932 This machine is what I used to just recently finish my doctorate. I was able to train AI models with it.
@@nispoe awesome man! Thanks so much.
cool shit
thanks, glad you liked it
How many tokens per second do you get?
Sorry, I thought you were asking something else. As for tokens per second: I don't know exactly, but next time I run my 3090s I'll check. I've been messing with a single 4090 in this rig for the past few weeks. When I do LLM processing I'm taking words and tokenizing them. When I search an embedding database on one 4090, it can churn through 99 embeddings in 0.00003 seconds, and each embedding chunk has a mean of 569.93 words/tokens. So if you want a rough tokens-per-second figure from that, it works out to about 1,880,769,000 tokens per second.
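For a more standard tokens-per-second number, the usual approach is to time a generate() call and divide the new tokens by the elapsed seconds; that gives a figure comparable to what other people quote. Sketch only, with a placeholder model:

```python
# Hedged sketch: measuring generation throughput in tokens per second.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()

inputs = tokenizer("Explain NVLink in one paragraph.", return_tensors="pt").to("cuda")
start = time.time()
output = model.generate(**inputs, max_new_tokens=256)
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```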
Is there a 4 way bridge to connect all 4 of them together?
@@lowreadst9 Unfortunately you cannot NVLink 4 of these, since they are consumer cards. I think the datacenter setups have interlinks between all cards for transfer efficiency. I was going to add 2 more cards and link them, but instead put in 3 more 4090s so I could finish the research for my doctorate. I have a video of this, and after the doctorate I'll try messing with other configurations and show some training timings.
I'm curious about the behaviour of the NVLinks - I was chatting with an LLM and it implied that the second NVLinked GPU didn't need to be connected to the PCIe bus. With your rig it would be trivial to test that by just disconnecting the second PCIe extender for a pair of cards. Could you say whether that works or not? It would in effect allow a lot more NVLinked GPUs to be connected to a PCIe-lane-limited motherboard (although clearly there would be tradeoffs with a new potential bottleneck).
@@drmartinbartos I tried this and unfortunately it does not work. When you do this, the other GPU isn't controlled by the motherboard and there are no signals to control things like the fans (the GPU fans run at max speed by default). I also thought that with the NVLink the video memory could be pooled in some way for larger models, giving 48GB, but it doesn't work that way either. I think maybe at some point it could have, or there are driver settings that don't allow it; maybe there is some setting that has to be enabled explicitly? I have to tinker more with this, and if I find something I'll let you know or make a video.
EVGA
I have a fondness for the company; it broke my heart when they said they were done with GPUs.
3090 XC3 Ultras are better in that they consume less power. For AI, you won't get any benefit from those larger, power-hungry cards.
These take what power they need when running, and the way my AI jobs run, they won't max out on power most of the time. Yes, I would probably love to have efficient datacenter or even workstation cards, but this is what I have and it was far cheaper. I have a bad hobby of collecting hardware, so there's more to why I started with these cards; I actually do have an EVGA 3090 XC3 that is water cooled. I was initially going to start with some water-cooled cards, but due to time I switched to air-cooled for a quick setup.