What Exactly Does NVLink do for Machine Learning (featuring Exxact Workstation w/dual 3090s)

  • Published on Jun 3, 2024
  • NVLink allows two GPUs to directly access each other's memory. This allows much faster data transfers than would normally be allowed by the PCIe bus. This video discusses the NVLink architecture from a dual-GPU computer system up to an advanced 8-GPU HPC system. The two NVIDIA GeForce RTX 3090 GPUs featured come from a loaner Exxact high-performance machine learning workstation.
    ** System Used **
    * TRX40 Motherboard
    * Threadripper 3960x
    * 128GB Memory (16GBx8)
    * 2x 4TB PCIe 4.0 NVME
    * 2x NVIDIA GeForce RTX 3090
    * NVLINK Bridge
    2:01 HPC, Scientific, Rendering
    3:25 High-End HPC NVLink
    6:52 Unified Virtual Addressing (UVA)
    7:14 CUDA Code Example
    For more information about the machine featured in this video, please visit:
    www.exxactcorp.com/Deep-Learn...
  • Science & Technology

Comments • 60

  • @MagnumCarta
    @MagnumCarta 3 months ago +1

    That's pretty neat that Exxact sent you a loaner system to test out! I never worked there but drive past it on my commute. Small world!

  • @dinoscheidt
    @dinoscheidt 2 years ago +23

    Would love to see you try things out with CUDA. Not only from the perspective of what those GPUs can do, but also to show how much abstraction there actually is between a Python library and the GPU, and what it actually means to go “low level”.

    • @HeatonResearch
      @HeatonResearch 2 years ago +4

      Okay, adding that to my list. I rather enjoy accessing CUDA directly.

  • @fredrikhansen75
    @fredrikhansen75 2 years ago +1

    Always inspiring and educational - thank you!

  • @richarddow8967
    @richarddow8967 1 year ago +1

    Thanks for explaining a lot of this.

  • @derwolf9668
    @derwolf9668 1 year ago +2

    God sent this video !!! Thanks Jeff!

  • @weylandsmith5924
    @weylandsmith5924 2 years ago +11

    The bottom line, from the point of view of an ML practitioner who's not going to access CUDA directly (or at least not so often), is:
    1. NVLink won't make a big difference with data parallelization (although some slight advantage will still be appreciated).
    2. NVLink *will* make a substantial difference with MODEL parallelization, for obvious reasons.
    @Jeff: you should definitely do a video in which you show this practically.

    • @Tsardoz
      @Tsardoz 2 years ago +2

      I have been scratching my head over this. I agree. Most modelling I see at surface level allows data parallelization and not model parallelization (unless it is some custom thing), so NVLink will make no difference at all. This video does not explain this, so I gave it the thumbs down. Please correct me if I am wrong.
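
The data-parallel case discussed in this thread can be sketched in plain Python. This is only a toy illustration and not something shown in the video: the one-parameter model, the sample data, and the function names (`grad_mse`, `data_parallel_grad`) are all hypothetical. It shows what has to cross the GPU-to-GPU link under data parallelism: each replica computes a gradient on its own shard of the batch, and only those gradients are averaged, once per step.

```python
# Toy data-parallel step for a one-parameter model y = w * x.
# Everything here is illustrative; no GPU or framework is involved.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_grad(w, xs, ys, replicas=2):
    """Split the batch into equal shards, compute one gradient per
    'replica', then average them (the all-reduce step)."""
    assert len(xs) % replicas == 0, "equal shard sizes assumed"
    shard = len(xs) // replicas
    grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(replicas)
    ]
    return sum(grads) / replicas

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = grad_mse(w, xs, ys)               # gradient on the whole batch
sharded = data_parallel_grad(w, xs, ys)  # averaged per-shard gradients
print(full, sharded)
```

With equal shard sizes the averaged gradient matches the full-batch gradient, so the replicas only need to exchange one gradient-sized message per step. That once-per-step exchange is why the commenters above expect NVLink to matter less for data parallelism; how much it matters in practice depends on model size and step time.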

  • @vtrandal
    @vtrandal 10 months ago

    Excellent video. I also have two RTX 3090 GPU cards connected with NVLink. My goal is to use pycuda as you have done, but I also want to scale up to the cloud (probably using AWS as you have shown). I think I am on a good learning path. I want the experience of prototyping my code using my two RTX 3090s and NVLink and then scaling it up to the cloud to see how the speed scales with more 3090s. As you have done in this video, I will not be using TensorFlow or PyTorch.

  • @TheFarmacySeedsNetwork
    @TheFarmacySeedsNetwork 25 days ago

    Thanks for the great explanation... I have 1 Quadro M5000 in... 2nd coming and SLI.... planning to switch to Nvlink eventually... Mostly do editing and big number crunching stuff. Installed a game to try it.... got bored playing in 5 minutes and went back to building... lol

  • @xpim3d
    @xpim3d 2 years ago

    Nice explanation! Both GPUs should be on x16 PCI-E slots, right? Also, since the spacing varies according to the MB manufacturer, some models won’t be suited to do this, right?
    Thank you :)

  • @fourteen_ljw
    @fourteen_ljw 1 year ago

    Hi sir, thank you for sharing. Can you also share a link to a motherboard that supports NVLink? (It looks like a normal Z690 ATX board does not support it.)

  • @FisicoAlexandreBonatto
    @FisicoAlexandreBonatto 1 year ago

    Thank you for posting this video. I recently assembled a dual-CPU system with three GPUs, two NVLinked A6000s and one A4500, which is being used for academic research, and I found your channel a very accessible source of information. As a beginner, may I ask your advice on the following matter: right now, I have one A6000 installed in a slot handled by CPU0 and the other A6000 in a slot handled by CPU1 (the A4500 is handled by CPU1 as well). Would it be better to have both A6000 GPUs (which are connected through NVLink) handled by the same CPU?

  • @wagmi614
    @wagmi614 1 year ago

    Hey Jeff, I want to figure out whether this can be used with Stable Diffusion image generation, so that automatic1111 uses both my GPUs and not just one. Can you make a video, please?

  • @RebelBreed888
    @RebelBreed888 9 months ago

    Does it matter what OS you're using? I can't get one of my 3080s to initialize. Would it be better just to run a threadripper pro for computational power versus a dual GPU setup?

  • @artemsult
    @artemsult 2 months ago

    Hi! What do you think: if there are four 3090 cards NVLinked in pairs, would it be possible to optimize such a setup with its two NVLink pairs?

  • @skinnyboy996
    @skinnyboy996 2 years ago

    Can you please make a video on training with tensor cores?

  • @yaminadjoudi4357
    @yaminadjoudi4357 2 years ago

    Please, sir, how can I combine the outputs of two different deep learning models (an LSTM and a CNN) to get a new third model?

  • @nealschoeler6463
    @nealschoeler6463 2 years ago

    I'm interested in what you think about the NVIDIA Jetson Xavier NX or, really, the Jetson Mate (cluster).
    While it clearly isn't a direct competitor to modern Ampere GPUs, since the onboard GPU is Volta generation, there are other benefits.
    Namely 50Gb/s memory, onboard NVDLA engines, 6 Arm cores, 384 CUDA cores, and 48 tensor cores at just 10/15 W per card.
    That adds up to 24 Arm cores, 1536 CUDA cores, 192 tensor cores, and 8 NVDLA engines for about $2000, drawing just 90 W.
    The SoMs come in two varieties, one with 8GB and an SD card slot (dev kit, ~$400) and one with 16GB (~$500).
    You get 4 system-on-modules with an Arm CPU, Volta GPU, and NVDLA engines on the die, sharing access to the fast onboard LPDDR4x RAM. Onboard gigabit Ethernet and a 5-port switch link them together, making it a tidy little cluster. One reason I think this is an interesting option is that at the price point (full Mate) it provides decent local compute with a low ongoing cost, as an alternative to buying cloud time for, say, low-priority training. It also allows for practice with directing data flow for parallel processing.

  • @jonfe
    @jonfe 1 year ago

    What is the best way to improve the connection between four 3080 Ti GPUs? Something like NVLink or InfiniBand?

  • @Edward-un2ej
    @Edward-un2ej 1 year ago

    Although the transfer speed with NVLink is much higher, the transfer time without NVLink is still acceptable compared with the training time.

  • @ProjectPhysX
    @ProjectPhysX 1 year ago +7

    It's a pity that NVLink/SLI are entirely inaccessible to OpenCL. Makes it useless for non-proprietary software. At least PCIe bandwidth is rapidly increasing and becoming a good alternative, yet PCIe peer-to-peer transfer for Nvidia GPUs is also not accessible to OpenCL, so everything has to go through CPU memory once.
    PS: 8:30 20MB is not nearly enough to saturate PCIe/NVLink transfer. What you're seeing here is only the transfer latency.
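
The point about small transfers can be illustrated with a simple cost model in which each transfer pays a fixed overhead plus a size-dependent term. This is a rough sketch under stated assumptions, not a measurement from the video: the function names and the `OVERHEAD_S` and `PEAK` values are placeholders chosen only for illustration.

```python
# Simple cost model: time = fixed overhead + bytes / peak bandwidth.
# The overhead and bandwidth values below are placeholders, not measurements.

def transfer_time_s(size_bytes, overhead_s, peak_bytes_per_s):
    """Modelled wall-clock time for one transfer."""
    return overhead_s + size_bytes / peak_bytes_per_s

def effective_bandwidth(size_bytes, overhead_s, peak_bytes_per_s):
    """Bytes per second actually achieved under the model."""
    return size_bytes / transfer_time_s(size_bytes, overhead_s, peak_bytes_per_s)

OVERHEAD_S = 1e-3   # assumed per-transfer overhead (driver, Python, sync)
PEAK = 50e9         # assumed peak link bandwidth in bytes/s

for size in (20e3, 20e6, 2e9):
    eff = effective_bandwidth(size, OVERHEAD_S, PEAK)
    print(f"{size:>14,.0f} bytes: {eff / PEAK:6.1%} of assumed peak")
```

Under this model the achieved bandwidth only approaches the peak once the size-dependent term dominates the fixed overhead. Whether a 20MB transfer is overhead-bound on a given machine depends on the real overhead and link speed, which would have to be measured.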

  • @IntenseGrid
    @IntenseGrid 4 months ago

    Does this still work with the latest linux drivers? Will it work with the 3090TI cards?

  • @user-wi3id2si8g
    @user-wi3id2si8g 2 years ago +1

    So 2x GPU with NVLink could be ~100x faster than 2x GPU without (on some tasks)?

  • @YaYa-qg5vb
    @YaYa-qg5vb 1 year ago +5

    Thanks for your dual-GPU series. Since the NVLink port was dropped on the RTX 4090, do you think it is still efficient to build a 2x 4090 workstation for deep learning?

    • @amanda.collaud
      @amanda.collaud 1 year ago +4

      Good question! Well, NVIDIA's developers superseded SLI with NVLink, and then NVLink with PCIe 5 memory allocation (it starts with the Lovelace Quadro cards and is expected to be implemented on RTX Blackwell consumer cards). Don't hold me to the name of this tech; I read it in some article but forgot its actual name. PCIe 5 motherboards don't need SLI bridges; they are super fast anyway. I got one for just 180€ and am using two RTX 3090s in multi-GPU mode, super fast.

    • @shiro836_
      @shiro836_ 1 year ago

      @amanda.collaud What motherboard is it?

    • @arogov
      @arogov 9 months ago

      @amanda.collaud But the RTX 3090 supports PCIe 4.0 only, so it wouldn't work any faster in a PCIe 5.0 slot.

  • @AOTanoos22
    @AOTanoos22 2 years ago

    I have a hard time understanding how you can connect more than one GPU with another via NVLink. The GPU only has one NVLink slot, right? So let's say you have 4 A6000s: you connect the 1st and the 2nd GPU with an NVLink bridge and connect the 3rd and the 4th with another NVLink bridge, right? So now the 1st/2nd pair and the 3rd/4th pair are not connected? An explanation would be much appreciated!

    • @Tsardoz
      @Tsardoz 2 years ago +1

      You cannot. He does not explain this, so it deserves a thumbs down.

    • @ajey214
      @ajey214 2 years ago +1

      NVLink bridges for the RTX 30 series and A series are different from the NVLink bridges for the RTX 2080 Ti. Depending on the NVLink type you buy, they allow up to 4 GPUs to be connected, or even more.

  • @anamayasullerey
    @anamayasullerey 2 years ago

    Can you please share the code used in this video?

  • @Haley2077
    @Haley2077 9 months ago +2

    I have one question. For deep learning, which build is better: two RTX 3090s in SLI or a single RTX 4090? Thanks for your advice.

    • @robertobokarev439
      @robertobokarev439 3 months ago +1

      The 4090, of course, if you're making a new build, right? Raw single-GPU performance is always better than going through a bridge. If money is no object, then the RTX 6000 Ada.

  • @vb433
    @vb433 2 years ago

    Dual kit (32 x 2 = 64) vs. two single kits (32 + 32 = 64): will using two single kits affect performance?

  • @nullpointerexception1685
    @nullpointerexception1685 2 years ago +2

    Can PyTorch take advantage of NVLink? Can it use the cards as one 48GB GPU?

    • @HeatonResearch
      @HeatonResearch 2 years ago +4

      Really, no software solution can combine two GPUs into the same logical unit. NVLink just provides a very fast conduit to keep the local memories of the GPUs synced. Often, though, the way training is being batched, this can give you 2X speedup for that 2nd GPU.

    • @nullpointerexception1685
      @nullpointerexception1685 2 years ago +1

      @HeatonResearch But I’ve heard NVIDIA advertising something about TCC or memory pooling, which can effectively combine the VRAM?

    • @HeatonResearch
      @HeatonResearch 2 years ago +1

      @nullpointerexception1685 They are using the same memory address space, but you still must divide the processing across all of the GPUs, which is not automatic.

    • @nullpointerexception1685
      @nullpointerexception1685 2 years ago

      @HeatonResearch Alright, thanks for your reply. I guess it’s better to get an RTX 8000 than two 3090s in that case.

  • @marcelocoi
    @marcelocoi 2 years ago

    Please, professor, make a video showing how to improve Enhance AI (Topaz Labs) performance when running on an NVIDIA Quadro card. Thanks.

  • @Miesiu
    @Miesiu 2 days ago

    3:30 - What hardware do I need, even with one CPU, for three cards, e.g. RTX 3090s?

  • @wentworthmiller1890
    @wentworthmiller1890 2 years ago +2

    Some questions (naïve probably):
    1 - They look like custom built 3090s - what are the temperatures (GPU, Mem, Hotspot) when both are under full load?
    2 - Any impact of the lower GPU blocking the upper one's airflow?
    3 - Will 3090 and 3080 on the same system help in sharing the training load?

    • @AOTanoos22
      @AOTanoos22 2 years ago

      I can only answer 3. You cannot NVLink two different GPUs, and if you use a 3090 and a 3080 as two independent GPUs via PCI Express slots, your 3090 will throttle its speed/power and memory down to those of the 3080, as if you had two 3080s. So it makes no sense to use two different tiers of GPU to train a model, as your model can only train as fast as your slowest GPU allows.

    • @wentworthmiller1890
      @wentworthmiller1890 2 years ago +1

      @AOTanoos22 Thanks! :)

  • @jasb78
    @jasb78 2 years ago

    How fast can you run Microsoft FSX 2021 with NVLink enabled?

  • @talha_anwar
    @talha_anwar 2 years ago +1

    I am a bit confused: do both GPUs need to be the same? Or can it be, say, a 3070 and a 3060?

    • @D12075
      @D12075 2 years ago +1

      Not only do they have to be the same model (3090 to 3090) but they have to be the exact same brand/model as well. So I have two 3090s from EVGA, and they literally stick out from the motherboard at different lengths because one is the ftw3 ultra version. Given that the SLI attachment is a fixed piece of metal and doesn't have any play to it, you have to have two identical cards for it to connect properly.

    • @pavellelyukh5272
      @pavellelyukh5272 2 years ago

      @Daniel Vachalek Is one an XC3 and the other an FTW3?

    • @D12075
      @D12075 2 years ago

      @pavellelyukh5272 Yes, you have to have two of the exact same make/model. Either two XCs or two FTWs. Which, right now, is almost impossible to source at MSRP.

  • @igorchurakov5585
    @igorchurakov5585 1 year ago

    Does anyone have experience using the 4-slot NVLink bridge from the 3090 series on workstation cards like the A4500/5000/6000? Nvidia support says it won't work. However, those cards are the same generation and have exactly the same number and placement of pins on the NVLink connector. I know people do it the other way around and it's all good. I was wondering whether there is really any difference between the NVLink bridges, or whether Nvidia just wants me to pay for their own 2/3-slot NVLink bridge, which doesn't fit my motherboard.

  • @thewizardsofthezoo5376
    @thewizardsofthezoo5376 1 year ago

    Is SLI one GPU with another GPU under it?
    SLI was 2 GPUs working together; it didn't scale linearly.

  • @returncode0000
    @returncode0000 1 year ago

    Is anyone successfully using NVLink on two 3090s running Ubuntu? Please share your configuration below. I’m currently building my own DL box, originally with NVLink and exactly two 3090s in mind, but I’m not sure if it will work with PyTorch.

    • @HeatonResearch
      @HeatonResearch 1 year ago +1

      I did a series of videos on a dual 3090 Ubuntu workstation from Exxact. Pytorch did fine. th-cam.com/video/4071A1lu2yo/w-d-xo.html&ab_channel=JeffHeaton

    • @returncode0000
      @returncode0000 1 year ago

      @HeatonResearch Thanks, man, for the video, this helps a lot! I think I'll build that as a clone :-) (with Ubuntu and PyTorch running). Great channel, so much value for all of us 👍
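
For anyone checking a similar dual-GPU build, one quick sanity test from Python is to ask PyTorch whether each GPU can directly access the other's memory. This is a minimal sketch and not the setup from the linked video: `peer_access_matrix` is a hypothetical helper name, and it relies only on PyTorch's `torch.cuda.can_device_access_peer`. A positive result means peer-to-peer access is possible, which can also happen over PCIe, so it does not by itself prove that the NVLink bridge is active; `nvidia-smi nvlink --status` and `nvidia-smi topo -m` report the link state and topology.

```python
def peer_access_matrix():
    """Return {(i, j): bool} for every ordered GPU pair, or None if
    PyTorch or CUDA is unavailable on this machine."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    n = torch.cuda.device_count()
    return {
        (i, j): torch.cuda.can_device_access_peer(i, j)
        for i in range(n) for j in range(n) if i != j
    }

matrix = peer_access_matrix()
if matrix is None:
    print("PyTorch with CUDA not available")
else:
    for (i, j), ok in sorted(matrix.items()):
        print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```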

  • @whoseai3397
    @whoseai3397 1 year ago

    SLI could not transfer data

  • @orthodoxNPC
    @orthodoxNPC 2 years ago

    nvlink, another way of reinventing RDMA but with extra licensing fees

  • @dakshitjyani337
    @dakshitjyani337 2 years ago

    Bezoz

  • @goutamsarkar1918
    @goutamsarkar1918 1 year ago

    Always inspiring and educational - thank you!
