Building a GPU cluster for AI

  • Published Feb 8, 2025

Comments

  • @randahan215
    1 year ago +14

    Extraordinary presentation. Covered all the important topics in depth and with real teaching talent. Many thanks!!

  • @peterxyz3541
    1 year ago +44

    Thanks. I’m planning on building a “massive” 2 GPU system for home use.

    • @fundoo203
      1 year ago +5

      How did it go, man? I also want to build something like that and then stumbled on this video, which is excellent.

  • @carlschumacher5510
    3 years ago +22

    It's nice to see a holistic explanation of designing / building / installing a complex multi-rack system... As someone who has spent years working on both sides of the "analog/digital divide" (the physical data center world / the digital world's various segments), I find the un-sexy physical aspects of available rack space / power / cooling / floor loading / network uplink bandwidth are often overlooked (often assumed)... A semi arrives with a pallet: "Hey Carl, you can have this online in a couple days, right?"

    • @lambdacloud
      3 years ago +8

      Hey Carl, thanks for the kind comment. Glad you like the video. It's always funny how difficult it can be to 'bridge the divide' between the physical world and virtual world. Many SWEs expect to be able to "spin up" 1000 servers with an API call and forget that there are actual physical objects and tons of people that actually make that happen when you're on-prem.

  • @AjaySimha-s2y
    5 months ago +1

    What an amazing presentation - one of the better videos I have watched. Great breadth and depth.

  • @fundyourhustle
    7 months ago +1

    One of the best presentations on GPU cluster design, even at 3 years old. Great teaching skills!

  • @fanyang2295
    22 days ago

    I work at a network vendor and found this video very informative and practical in listing and explaining all the tasks involved in building a GPU cluster. Just two thoughts if you ever get a chance to build an updated version: 1) include Ethernet as another option to an InfiniBand network, and 2) talk about the cost (at least CAPEX) of the cluster, since the pain points in the intro were heavily about the cost of GPU instances in the cloud. Putting myself in the customer's shoes, after watching the video, building a cluster seems very complex, and I am not sure the benefits were clearly called out. Just my 2 cents, great video!

    • @samyogdhital
      20 days ago

      I was referring to xAI's new datacenter; a guide on how to design a datacenter at that level would be really helpful. So I'm waiting for that.

    • @lambdacloud
      5 days ago

      Stay tuned, a refresh is in the works and will cover InfiniBand vs. RoCE.

  • @randahan215
    1 year ago +2

    Most professional and holistic explanation I heard about this topic.
    Thank you so much!!

  • @yassinebouchoucha
    1 year ago +4

    Thank you for highlighting an underrated topic and options that companies should reconsider for their compute infrastructure.

  • @onlooker365
    6 months ago

    Ground-level details, with all the critical aspects of a GPU cluster covered nicely, down to the last cable-length calculation.

  • @dr.mikeybee
    1 year ago +3

    Thank you. You got me started years ago with your lambda stack -- the only way I could get TensorFlow installed on Linux.

  • @lovanda2000
    7 months ago

    The best of the best presentations on server clusters. The author has a deep understanding of server clusters, so he can explain things in an easy way. Thank you!!!

  • @ProjectPhysX
    3 years ago +27

    Lots and lots of A100 GPUs. Every single one of them is a monster, almost 2x faster memory than the next best GPU. An entire room full of A100 racks... holy cow.

  • @NSPK-
    1 year ago

    Very expert suggestions for HPC and compute sizing.

  • @austynr
    2 years ago +2

    Genius bait and switch. Props!

    • @metal_mo
      2 years ago +1

      Lambda needs an explanation on the difference between "building" and "designing".

  • @ilyboc
    1 year ago

    Really good analysis and presentation!

  • @zicarconsultancy
    3 months ago

    This is super insightful!

  • @cyberspider78910
    9 months ago

    Highly appreciated... YouTube should have a separate category called Founder's videos.

  • @julianfiacconi709
    2 years ago

    Still most relevant today, 2 years later. Thanks.

  • @samyogdhital
    28 days ago

    Hello, team Lambda Labs. Please make an updated video on this topic. I think a lot has changed. Obviously, the basics are the same, but an updated, in-depth guide video from the CEO himself would be best. Please consider doing that. Waiting for the video.

    • @lambdacloud
      23 days ago

      Stay tuned! Videos on building the latest GPU clusters are in the works!

    • @samyogdhital
      20 days ago

      @lambdacloud This was very useful and packed with knowledge. Please make an updated video on this with more information on how these designs will look in the near future, because a lot has changed, in scale and otherwise. So yeah, waiting for this.

  • @uzairqarni7782
    1 year ago

    This was amazing. Thank you.

  • @brianwesley28
    4 years ago +4

    Thanks for the video.

  • @WorldMover
    6 months ago

    What a remarkable video

  • @sanaullah-qureshi
    2 years ago +2

    Very informative, thank you.

  • @anatolystrashkevich7621
    2 years ago +1

    Very informative, thanks!

  • @rosenangelow6082
    1 year ago +2

    "Tell me how difficult it is so I can buy your solution" kind of talk.

  • @loadmastergod1961
    9 months ago

    I want to build a multi-node, dual EPYC 7742-based system for goofing around while learning this stuff.

  • @HarishN.J
    9 months ago

    Hey Stephen, this is highly informative. I work on this kind of clustering, and now I'm able to connect the dots and get the bigger picture.
    Where can I read about the relationship between NUMA topology and GPU peering capability?

  • @vtrandal
    3 years ago +2

    Excellent.

  • @thePyiott
    2 years ago +1

    Great insight!

  • @eyadmufti
    2 years ago

    It is a lecture more than a tutorial. Thx.

  • @natexetan5732
    3 years ago +1

    thanks for the inspiration

  • @glennisholcomb592
    1 year ago +1

    I have three computers, a NAS, and an external hub. I think I don't need another server because of the NAS. As far as my architecture goes, is there anything else you can advise?

  • @programmingwiththotho4641
    1 year ago

    You are insane, thank you.

  • @HankGallows
    5 months ago

    My machine learning team consists of me baby

  • @mengxu2026
    3 years ago +2

    Our group ordered around 10 Lambda PCs 1 year ago. Right now, more than 5 have problems. Some of them do not start up. Mine gets stuck randomly....

    • @yugr
      3 years ago

      Have you tried looking into the reasons?

    • @lambdacloud
      3 years ago +3

      Meng Xu, you can email support@lambdalabs.com 24/7 or call +1 (866) 711-2025 during business hours. Sorry to hear you're having issues, I'm sure we'll be able to resolve them quickly.

    • @danielleza908
      1 year ago +1

      Our team has 5 Lambda laptops; they have worked perfectly for over a year now.
      We also have a workstation with 3 GPUs, and it works great too.

  • @chaoticblankness
    1 year ago

    Very Based

  • @Bloodycub666
    2 years ago

    I just love this kind of thing. How can I start this kind of business? How can I find customers for, say, a small node and start building up?

  • @jleonardoperez5402
    11 months ago

    Looking for work; would love to help.

  • @mayori-engineering-hub
    1 year ago

    Does it work in man????

  • @petevenuti7355
    1 year ago

    What if I have a model that I just want to run as provided, it hasn't really been optimized to run around the cluster and has memory requirements greater than any individual system I have. I feel safe to assume that for that specific case a shared distributed memory model would be the solution to run that specific app, yes? Is there any distribution of Linux that has support for such a memory model? It doesn't have to be a full-blown single system image. Perhaps a patch to the memory management driver so storage can be treated as an extension of system memory and not swap memory?
    Does any such software exist?

  • @xiaofu7883
    7 months ago

    Is this opensourced?

  • @JustPlainRob
    1 year ago

    Now if only I was a billionaire so I could make use of this great information...

  • @ikbo
    2 years ago

    Do you guys have a GPU cluster optimized for 3D rendering?

  • @ravnodinson
    1 year ago

    Hell yes Lambda Lambda Lambda.

  • @micromicro9655
    6 months ago +1

    This guy is smart as fak

  • @nathanthomas9395
    3 years ago

    Do Lambda products (GPU clusters) ship with a manual to help you set up the servers for use?

  • @thinkinginsomething1859
    1 year ago

    Half Life man!

  • @harshikamahesh9459
    9 months ago

    Talk about what you're an expert in. Don't talk about useless stuff without knowing all the facts.

  • @huaveihuavei1045
    3 years ago

    headeggs

  • @mikepict9011
    1 year ago

    This dude's in full submission mode. Sad.

  • @orthodoxNPC
    3 years ago

    speak UP