That's pretty neat that Exxact sent you a loaner system to test out! I never worked there but drive past it on my commute. Small world!
Would love to see you try things out with CUDA. Not only from the perspective of what those GPUs can do, but also to show how much abstraction there actually is between a Python library and the GPU, and what it actually means to go “low level”.
Okay, adding that to my list. I rather enjoy accessing CUDA directly.
The bottom line, from the point of view of an ML practitioner who's not going to access CUDA directly (or at least not often), is:
1. NVLink won't make a big difference with data parallelization (although some slight advantage will still be appreciated).
2. NVLink *will* make a substantial difference with MODEL parallelization, since activations must cross between the cards on every pass (see the sketch below).
@Jeff: you should definitely do a video in which you show this practically.
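For anyone wondering what that distinction looks like in practice, here is a minimal PyTorch sketch (my own illustration with made-up layer sizes, not from the video): in data parallelism each GPU holds a full model copy and mostly exchanges gradients, while in model parallelism the activations themselves hop between cards on every forward and backward pass, which is exactly the traffic NVLink accelerates.

```python
import torch
import torch.nn as nn

# Model parallelism: the network is split across two GPUs, so activations
# must cross the GPU-to-GPU link (NVLink or PCIe) on every pass.
class SplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Layer sizes here are arbitrary, for illustration only.
        self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # cross-GPU hop every forward pass

# Data parallelism: each GPU gets a full replica and a slice of the batch;
# only gradients are exchanged, once per step.
model = nn.Linear(4096, 4096).to("cuda:0")
dp_model = nn.DataParallel(model, device_ids=[0, 1])
```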
I have been scratching my head over this. I agree. Most modelling I see at surface level allows data parallelization and not model parallelization (unless it is some custom thing), so NVLink will make no difference at all. This video does not explain this, so I gave it the thumbs down. Please correct me if I am wrong.
God sent this video !!! Thanks Jeff!
It's a pity that NVLink/SLI are entirely inaccessible to OpenCL. Makes it useless for non-proprietary software. At least PCIe bandwidth is rapidly increasing and becoming a good alternative, yet PCIe peer-to-peer transfer for Nvidia GPUs is also not accessible to OpenCL, so everything has to go through CPU memory once.
PS: 8:30 20 MB is not nearly enough to saturate a PCIe/NVLink transfer. What you're seeing here is only the transfer latency.
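To see that latency-vs-bandwidth point concretely, a quick sketch along these lines (my own, with hypothetical sizes, using PyTorch rather than the video's pycuda) shows tiny copies mostly measuring launch latency while larger ones approach the link's real throughput:

```python
import torch

# Time device-to-device copies of increasing size; tiny buffers mostly
# measure launch latency, large ones approach real link bandwidth.
for mb in (20, 200, 2000):
    n = mb * 1024 * 1024 // 4  # number of float32 elements
    src = torch.empty(n, dtype=torch.float32, device="cuda:0")
    dst = torch.empty(n, dtype=torch.float32, device="cuda:1")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    dst.copy_(src)  # peer copy, over NVLink if enabled
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end)
    print(f"{mb:5d} MB: {ms:8.3f} ms  ~{mb / ms:.1f} GB/s")
```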
Excellent video. I also have two RTX 3090 gpu cards connected with NVLink. My goal is to use pycuda as you have done, but I also want to scale up to the cloud (probably using AWS as you have shown). I think I am on a good learning path. I want the experience of prototyping my code using my two RTX 3090s and NVLink and then scale it up to the cloud to see how the speed scales with more 3090s. Like you have done in this video I will not be using TensorFlow or PyTorch.
Always inspiring and educational - thank you!
Thanks for the great explanation... I have 1 Quadro M5000 in... 2nd coming and SLI.... planning to switch to Nvlink eventually... Mostly do editing and big number crunching stuff. Installed a game to try it.... got bored playing in 5 minutes and went back to building... lol
Thanks for explaining a lot of this.
Thanks for your dual GPU series. Since the NVLink port was dropped on the RTX 4090, do you think it is still efficient to build a 4090 x2 workstation for deep learning?
Good question! Well, the devs from Nvidia superseded SLI -> NVLink -> PCIe 5 memory allocation (it starts with the Lovelace Quadro cards and is thought to be implemented on RTX Blackwell consumer cards). Don't quote me on the name of this tech; I read it in some article but forgot its actual name. PCIe 5 motherboards don't need SLI bridges, they are super fast anyway. I got one for just 180€ and am using two RTX 3090s in multi-GPU mode, super fast.
@@amanda.collaud what motherboard is it?
@@amanda.collaud But the RTX 3090 supports PCIe 4.0 only, so it wouldn't run any faster in a PCIe 5.0 slot.
Some questions (probably naïve):
1 - They look like custom built 3090s - what are the temperatures (GPU, Mem, Hotspot) when both are under full load?
2 - Any impact of the lower GPU blocking the upper one's airflow?
3 - Will 3090 and 3080 on the same system help in sharing the training load?
Can only answer 3. You cannot NVLink two different GPUs, and if you use a 3090 and a 3080 as two independent GPUs via PCIe slots, synchronized training effectively limits the 3090 to the speed and usable memory of the 3080, as if you had two 3080s. So it makes no sense using two different tiers of GPUs to train a model, as your model can only train as fast as your slowest GPU allows.
@@AOTanoos22 Thanks! :)
I have one question. For deep learning, which build is better: RTX 3090 SLI or a single RTX 4090? Thanks for your advice.
The 4090, of course, if you're doing a new build. Raw single-card performance always beats going through a bridge. If money is no limiter, then the RTX 6000 Ada.
Although the transfer speed is much higher with NVLink, the transfer time without it is still acceptable compared with the overall training time.
Perfect explanation!
Hi Jeff, I'm very out of date with GPU-based machine learning training. By connecting 2+ cards with NVLink, would we overcome the limitations on model complexity that come with a single GPU's memory size?
So 2x GPU with NVLink could be ~100x faster than 2x GPU without? (on some tasks)
Can PyTorch take advantage of NVLink? Use the cards as one 48 GB GPU?
Really, no software solution can combine two GPUs into the same logical unit. NVLink just provides a very fast conduit to keep the local memories of the GPUs synced. Often, though, the way training is being batched, this can give you 2X speedup for that 2nd GPU.
@@HeatonResearch but I’ve heard nvidia advertising something about TCC or memory pooling which can effectively combine the VRAMs together?
@@nullpointerexception1685 They are using the same memory address space, but you still must divide the processing across all of the GPUs, which is not automatic.
@@HeatonResearch alright, thanks for your reply. I guess it’s better to get a RTX8000 than 2 3090 in that case.
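To make the "not automatic" point above concrete, a tiny sketch (my own, with arbitrary sizes): even with NVLink, each tensor lives on exactly one device at a time, and the split of work plus any cross-GPU gather is yours to write.

```python
import torch

# There is no single pooled 48 GB device; you place data explicitly.
a = torch.randn(8192, 8192, device="cuda:0")  # lives on GPU 0
b = torch.randn(8192, 8192, device="cuda:1")  # lives on GPU 1

out0 = a @ a                       # runs on GPU 0
out1 = b @ b                       # runs on GPU 1 concurrently
total = out0 + out1.to("cuda:0")   # explicit cross-GPU transfer to combine
```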
Hi! What do you think: if there are four 3090 cards NVLinked in pairs, will it be possible to optimize such a scheme when there are two NVLink pairs?
Hey Jeff, I want to figure out if this can be used with Stable Diffusion image generation, such that automatic1111 uses both my GPUs and not just one. Can you make a video please?
3:30 - What hardware do I need, even with one CPU, for three cards, e.g. RTX 3090s?
Does it matter what OS you're using? I can't get one of my 3080s to initialize. Would it be better just to run a threadripper pro for computational power versus a dual GPU setup?
Can you please make a video on training with tensor cores?
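Until such a video exists: on RTX cards, tensor cores are typically engaged through mixed precision. A minimal sketch of the standard PyTorch AMP pattern (generic, not tied to this video; model and sizes are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 1024, device="cuda")
opt.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    # float16 matmuls inside autocast are what map onto the tensor cores
    loss = model(x).square().mean()
scaler.scale(loss).backward()  # scale loss to avoid float16 underflow
scaler.step(opt)
scaler.update()
```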
What is the best way to improve the connection between four 3080 Ti GPUs? Something like NVLink or InfiniBand?
Does this still work with the latest Linux drivers? Will it work with the 3090 Ti cards?
Hi sir, thank you for sharing. Can you also share a link to a motherboard that supports NVLink? (It looks like a normal Z690 ATX board does not.)
I'm interested in what you think about the NVIDIA Jetson Xavier NX or more really the Jetson Mate (cluster).
While it clearly doesn't sit as a direct competitor to modern Ampere GPU's since the GPU onboard is a Volta generation, there are other benefits.
Namely ~50 GB/s memory bandwidth, NVDLA engines onboard, 6 ARM cores, 384 CUDA cores, and 48 tensor cores at just 10-15 W per card.
That's 24 ARM cores, 1536 CUDA cores, 192 tensor cores, and 8 NVDLA engines for about $2000, drawing just 90 W in total.
The SoMs come in two varieties: one with 8 GB and an SD card slot (dev kit, ~$400) and one with 16 GB (~$500).
You get 4 system-on-modules, each with an ARM CPU, Volta GPU, and NVDLA engines on the die, sharing access to the fast LPDDR4X RAM onboard. Onboard gigabit Ethernet and a 5-port switch link them together, making it a tidy little cluster. One reason I think this is an interesting option is that at this price point (full Mate) it provides decent local compute with low ongoing cost, as an alternative to buying cloud time for lower-priority training. And it allows for practice with directing data flow for parallel processing.
Please sir, how can I combine the outputs of two different deep learning models (LSTM and CNN) to get a new third model?
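One common approach (just a sketch of one option, not the only way): treat the two trained models as feature extractors, concatenate their outputs, and train a small head on top.

```python
import torch
import torch.nn as nn

# Stand-ins for the two existing models' outputs (shapes are examples only)
lstm_features = torch.randn(32, 128)  # e.g. the LSTM's final hidden state
cnn_features = torch.randn(32, 256)   # e.g. the CNN's pooled features

# A small trainable "third model" over the concatenated features
head = nn.Sequential(nn.Linear(128 + 256, 64), nn.ReLU(), nn.Linear(64, 1))
pred = head(torch.cat([lstm_features, cnn_features], dim=1))
```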
I am a bit confused: do both GPUs need to be the same? Or can it be, say, a 3070 and a 3060?
Not only do they have to be the same model (3090 to 3090) but they have to be the exact same brand/model as well. So I have two 3090s from EVGA, and they literally stick out from the motherboard at different lengths because one is the ftw3 ultra version. Given that the SLI attachment is a fixed piece of metal and doesn't have any play to it, you have to have two identical cards for it to connect properly.
@Daniel Vachalek is one xc3 and the other ftw3?
@@pavellelyukh5272 Yes, you have to have two of the exact same make/model. Either two XC's or two FTW's. Which, right now, is almost impossible to source at msrp.
How fast can you run Microsoft FSX 2021 with NVLink enabled?
Please professor, make a video showing how to improve Enhance AI (Topaz Labs) performance running on an NVIDIA Quadro card. Thanks.
Does anyone have experience using the 4-slot NVLink bridge from the 3090 series on workstation cards like the A4500/A5000/A6000? Nvidia support says it won't work. However, those cards are the same generation and have exactly the same number of pins and the same NVLink placement. I know people do it the other way around and it's all good. I was wondering if there is really any difference in the NVLink, or if Nvidia just wants me to pay for their own NVLink bridge, which is 2/3-slot and doesn't fit my motherboard.
Thank you for posting this video. I recently assembled a dual-CPU system with three GPUs: two (NVLinked) A6000s and one A4500, used for academic research purposes, and I found your channel a very accessible source of information. As a beginner, may I ask your advice on the following matter: right now, I have one A6000 installed in a slot handled by CPU0, and the other A6000 in a slot handled by CPU1 (the A4500 is handled by CPU1 as well). Would it be better to have both A6000 GPUs (which are connected through NVLink) handled by the same CPU?
I have a hard time understanding how you can connect more than two GPUs via NVLink. The GPU only has one NVLink slot, right? So let's say you have four A6000s: you connect the 1st and 2nd GPU with an NVLink bridge, and the 3rd and 4th with another bridge, right? So now the 1st/2nd and the 3rd/4th GPUs are not connected? An explanation would be very appreciated!
You cannot. He does not explain this, so it deserves a thumbs down.
NVLink bridges for the RTX 30 series and A series are different from the NVLink bridges for the RTX 2080 Ti. Depending on the NVLink type you buy, they allow up to 4 GPUs to be connected, or even more.
Nice explanation! Both GPUs should be on x16 PCIe slots, right? Also, since the slot spacing varies according to the motherboard manufacturer, some models won't be suited to do this, right?
Thank you :)
Dual kit (32 x 2 = 64) vs. two single kits (32 + 32 = 64): if I use two single kits, will this affect performance?
I would love to have a million x million x million point grid, where every point is a 3x3x3x3 matrix. Still not enough RAM.
Prox was here
Is anyone successfully using NVLink on two 3090s running Ubuntu? Please share your configuration below. I'm currently building my own DL box, originally with the idea of using NVLink with exactly two 3090s, but I'm not sure if it will work out with PyTorch.
I did a series of videos on a dual 3090 Ubuntu workstation from Exxact. Pytorch did fine. th-cam.com/video/4071A1lu2yo/w-d-xo.html&ab_channel=JeffHeaton
@@HeatonResearch Thanks man for the video, this helps a lot! I think I'll build that as a clone :-) (with Ubuntu and PyTorch running). Great channel, so much value for all of us 👍
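For anyone verifying a similar build: besides `nvidia-smi nvlink --status` and `nvidia-smi topo -m` on the driver side, a quick PyTorch check (a sketch, nothing exotic) confirms both cards are visible and peer-to-peer access works:

```python
import torch

print(torch.cuda.device_count())                # expect 2
print(torch.cuda.get_device_name(0))            # e.g. "GeForce RTX 3090"
print(torch.cuda.can_device_access_peer(0, 1))  # True means P2P (NVLink or PCIe) is usable
```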
SLI could not transfer data
SLI is one GPU with the other GPU under it?
SLI was 2 GPUs working together; it didn't scale linearly.
NVLink: another way of reinventing RDMA, but with extra licensing fees.
Bezoz