How does Groq LPU work? (w/ Head of Silicon Igor Arsovski!)

  • Premiered Dec 11, 2024

Comments • 51

  • @TheAIEpiphany
    @TheAIEpiphany  9 months ago +8

    Let me know how you like this one! If you need some GPUs check out Hyperstack: console.hyperstack.cloud/?Influencers&Aleksa+Gordi%C4%87

    • @AR-iu7tf
      @AR-iu7tf 9 months ago +1

      Thank you so much for this interview. This was very helpful. It's not clear why they can't do training - is it the additional memory needed? That can't be right, because he said they can connect multiple chips to get a large amount of SRAM. When you asked him that question, his answer was essentially that Nvidia already does it well anyway.

    • @BillKatz
      @BillKatz 9 months ago +1

      @@AR-iu7tf I think you mean: why don't they target training instead of inference? It's probably because the LPU architecture is particularly better than GPUs at the smaller tensor sizes you'd see for inference (see the slide at 54:24). So they're focusing on the LPU's competitive advantage, and in the coming years the need for inference compute will skyrocket as these systems permeate society.

    • @AR-iu7tf
      @AR-iu7tf 9 months ago

      @@BillKatz Sorry, yes! Thank you! Fixed my brain-freeze typo!

    • @AR-iu7tf
      @AR-iu7tf 9 months ago

      @@BillKatz Ah - so large-batch training would be a challenge?

    • @BillKatz
      @BillKatz 9 months ago +1

      @@AR-iu7tf From the slide, you see it's more that LPU bandwidth isn't significantly better than GPU bandwidth at larger batch sizes. It's also possible that the data flow (given how the loss backpropagates) is less conducive to the LPU architecture, but their software stack might still be able to handle that as well.

  • @rembautimes8808
    @rembautimes8808 months ago +2

    Great talk, and thanks Igor for sharing this excellent insight. Impressive to be able to sequence things down to the nanosecond. Reminds me of a chip from a company called Transmeta 20 years ago 😂

  • @couldntfindafreename
    @couldntfindafreename 9 months ago +7

    In short: "We store everything in SRAM and use an optimized stack to drive it." But you cannot run a full model on a single card; you need two full racks of them. Good for LLM hosting at scale, but forget about buying such a card.

    • @autripat
      @autripat 9 months ago +1

      Per Dylan Patel's analysis and Groq's own publication, 576 chips for sure (maybe 576 cards) - a marvel of orchestration.

  • @tchlux
    @tchlux 9 months ago +2

    Fantastic talk! Loved all the information. Any software engineer who truly cares about performance eventually wants to implement custom hardware! 😆
    I was a little disappointed he dodged the question about how many chips were used to run Llama 70B. But I get it, because that metric isn't what matters for infrastructure. What matters is cost per token generated. That means roughly minimizing:
    (Cost of chips × power required × token latency) / (Lifetime of chips)
    Even if the chips cost 10× more, if the setup is 10× faster at half the power consumption it could still halve the total lifetime cost of operating inference infrastructure.
    Can’t wait to see what the Groq team does in the future.
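
The back-of-envelope heuristic in the comment above can be sketched in a few lines of Python. All numbers below are hypothetical placeholders, not vendor figures:

```python
# Rough lifetime-cost-per-token comparison, per the heuristic above.
# All inputs are hypothetical placeholders, not real vendor numbers.

def cost_score(chip_cost, power_watts, token_latency_s, lifetime_years):
    """Lower is better: (cost x power x latency) / lifetime."""
    return (chip_cost * power_watts * token_latency_s) / lifetime_years

# Baseline accelerator vs. one that costs 10x more but runs
# 10x faster at half the power, over the same lifetime.
baseline = cost_score(chip_cost=1.0, power_watts=100.0,
                      token_latency_s=0.10, lifetime_years=5)
contender = cost_score(chip_cost=10.0, power_watts=50.0,
                       token_latency_s=0.01, lifetime_years=5)

print(contender / baseline)  # 0.5 -> half the lifetime cost
```

This matches the comment's claim: 10× price × 0.5× power × 0.1× latency multiplies out to 0.5× total cost under this simple model.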

  • @BillKatz
    @BillKatz 9 months ago

    Very impressive unified software/hardware stack! Thanks so much for doing this interview. Please do schedule a follow-up so we can hear how graph neural networks might be handled given the way the LPUs do software-controlled data flow.

  • @sabaokangan
    @sabaokangan 9 months ago +3

    Thank you so much for sharing this with us on YouTube ❤️‍🔥 from SeoulNatU

  • @abudhabi9850
    @abudhabi9850 9 months ago +2

    Aleksa, thank you, you are an inspiration to me.

  • @sucim
    @sucim 9 months ago +2

    Really nice technical overview! I was hesitant to click at first because I expected some marketing blabla.

    • @TheAIEpiphany
      @TheAIEpiphany  9 months ago +5

      Never on this channel :)

    • @svenvanwier7196
      @svenvanwier7196 6 months ago

      @@TheAIEpiphany I don't have a lot of technical know-how, but just subbed because of this comment. No marketing blabla - let's see what you can teach me!

  • @couragefox
    @couragefox 9 months ago +2

    Just what I wanted to know!

  • @autripat
    @autripat 9 months ago

    Thanks! Can anyone shed some light? At 18:10 and 22:05, the speaker talks about 'all-reduce', which is a training primitive; Groq is an inference chip.

  • @DailySFY
    @DailySFY 9 months ago

    Thanks for sharing this!!

  • @Johnassu
    @Johnassu 4 months ago

    incredible!

  • @DynestiGTI
    @DynestiGTI 4 months ago

    Please do another interview in the future on Groq!

  • @JMeyer-qj1pv
    @JMeyer-qj1pv 9 months ago +6

    It's mind-boggling how disruptive this is going to be to GPU-based AI inference. Nvidia must be looking at it and thinking OMG. Nvidia is talking about 1,000-watt, ultra-expensive B100 chips in 2025, and maybe those will be needed for training, but most of the money in AI is in inference. If Groq can produce enough chips quickly, they will be in a very good spot.

    • @punk3900
      @punk3900 4 months ago

      NVIDIA will buy them next year

  • @stefisha
    @stefisha 9 months ago

    Bravo Coa, keep it up!

  • @TopGunMan
    @TopGunMan 3 months ago

    I suppose in the even-further domain-specific direction, one would have a true ASIC for a specific, mature model, right? Groq still maintains some flexibility that an ASIC built for a specific family of models would give up for vastly increased performance?

  • @canadianrepublican1185
    @canadianrepublican1185 8 months ago +1

    Get prices: while Groq seems awesome, no one asked how much a Groq system that compares with an H100 (for LLMs) would cost.

  • @backacheache
    @backacheache 9 months ago +1

    Hearing its original name was "Tensor Streaming Unit" makes it sound like an evolution of their work at Google on Tensor Processing Units. If that is the case, I wonder what the story is behind them leaving Google?

  • @autripat
    @autripat 9 months ago +2

    Why does it take N=5 days to get a compiled version of a model like Llama? Is the speaker talking about optimizing and tuning?

    • @tchlux
      @tchlux 9 months ago +1

      Probably the time required to look at all the operators it's using (softmax, matmul, dot, ...) and ensure they're mapped to the best sequences of byte codes. They also might consider quantization at various levels to make things fit nicely in memory. And biggest of all, they need to figure out how to "bin pack" the parameters into their distributed SRAM (across ~10s of chips) so that the execution minimizes network hops (they're not all-to-all beyond ~8 chips w/ ~2GB) and optimizes pipelining of data between core operators.
      Knowing they've only got ~10s of engineers to do it, their compiler tooling for deciding how to place data on a cluster of cards given a directed compute graph must be pretty heavily automated. Impressive.
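
The "bin pack" step that comment describes can be illustrated with a toy first-fit-decreasing placement of weight tensors into per-chip SRAM. The capacity and tensor sizes below are illustrative assumptions, and the real compiler also has to respect dataflow and network topology, not just capacity:

```python
# Toy first-fit-decreasing packing of weight tensors into per-chip SRAM.
# The 230 MB capacity and tensor sizes are illustrative assumptions only;
# the real placement problem also accounts for dataflow and topology.

SRAM_PER_CHIP_MB = 230  # assumed per-chip capacity

def pack_layers(layer_sizes_mb, capacity_mb=SRAM_PER_CHIP_MB):
    """Return a list of chips, each a list of the tensor sizes it holds."""
    chips = []
    for size in sorted(layer_sizes_mb, reverse=True):  # largest first
        for chip in chips:
            if sum(chip) + size <= capacity_mb:  # first chip with room
                chip.append(size)
                break
        else:
            chips.append([size])  # no chip fits: open a new one
    return chips

layers = [180, 120, 100, 90, 60, 40, 30]  # hypothetical tensor sizes in MB
placement = pack_layers(layers)
print(len(placement), "chips:", placement)
```

First-fit-decreasing is just the simplest greedy baseline; minimizing network hops on top of capacity turns this into a much harder graph-partitioning problem, which is presumably where the compile days go.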

  • @monicasun7797
    @monicasun7797 9 months ago +1

    Is there any way we can download the presentation?

  • @ericchang9568
    @ericchang9568 9 months ago

    Thanks for the great talk. Does the LPU compilation optimization still work with mixed workloads, i.e. serving different models/experts in the same data centers?

  • @JayDee-b5u
    @JayDee-b5u 9 months ago

    Does this architecture work just as well with the 1.58-bit networks (instead of floating-point values for weights, only -1, 0, or 1)?
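
For context on that question: in 1.58-bit (BitNet b1.58-style) networks each weight is constrained to {-1, 0, +1}, so the core matrix-vector product needs no multiplies at all, only adds and subtracts. A minimal sketch:

```python
# Minimal sketch of a ternary (-1/0/+1) matrix-vector product, the core
# operation in 1.58-bit networks: no multiplies, only adds and subtracts.

def ternary_matvec(W, x):
    """W: rows of weights, each in {-1, 0, 1}; x: input vector."""
    out = []
    for row in W:
        acc = 0.0
        for w, v in zip(row, x):
            if w == 1:
                acc += v
            elif w == -1:
                acc -= v
            # w == 0 contributes nothing
        out.append(acc)
    return out

W = [[1, 0, -1],
     [-1, 1, 1]]
x = [2.0, 3.0, 5.0]
print(ternary_matvec(W, x))  # [-3.0, 6.0]
```

Ternary weights also shrink the memory footprint by roughly 10× versus fp16, which matters a lot for an SRAM-resident architecture.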

  • @nhtna4706
    @nhtna4706 9 months ago

    So, no need for DDR RAM or a GPU anymore? Or does this act as an accelerator on top of existing RAM and GPU devices? What would be the cost of 10 TB of capacity? Is there a product-usability matrix that covers capacity planning?

  • @ChrisPadron
    @ChrisPadron 9 months ago

    So, do they have any customers? How much debt?

  • @web3devp
    @web3devp 7 months ago

    Are these chips compatible with an ordinary computer, in case I want to use the chip in an offline setup? And are they on sale?

  • @kozlovskyi
    @kozlovskyi 9 months ago

    Looks similar to the Hailo product. Does Groq hardware use the same approach?

  • @heelspurs
    @heelspurs 9 months ago

    The H100 SXM is 700 TDP watts per 2,000 int8 TOPS while Groq is 215 per 750, ratios of 0.35 and 0.29 joules per tera-op, so I don't get how they can be 10× lower in joules per token. The A100 is about like the H100 on this metric.
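
The arithmetic in that comment is easy to check: watts divided by TOPS is equivalently picojoules per int8 op, using the spec figures the comment quotes:

```python
# Watts per int8 TOPS (= picojoules per int8 op), from the spec figures
# quoted in the comment above.
h100_w_per_tops = 700 / 2000   # H100 SXM: 700 W TDP, 2,000 int8 TOPS
groq_w_per_tops = 215 / 750    # Groq LPU: 215 W, 750 int8 TOPS

print(round(h100_w_per_tops, 2))  # 0.35
print(round(groq_w_per_tops, 2))  # 0.29
print(round(h100_w_per_tops / groq_w_per_tops, 2))  # ~1.22x on raw spec
```

By raw peak-TOPS-per-watt the gap is only ~1.2×, so any larger per-token energy advantage would have to come from utilization, memory traffic, and batch-size effects rather than peak spec sheets, which is essentially the question the commenter is raising.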

  • @dmtap1768
    @dmtap1768 9 months ago +1

    A simple question: does latency really matter? If ChatGPT can already produce output faster than a human can read, it makes no sense to prioritize reducing latency further. However, when it comes to throughput, Groq's solution is not competitive with Nvidia's.

    • @zhurst
      @zhurst 8 months ago +2

      Agentic systems currently require lots of looping which is one area where latency really matters atm. We will also see smaller, open-source models run on Groq soon that out-benchmark their larger counterparts when orchestrating at super high token rates. NPCs in games being truly interactive and responding quickly, real-time language translation driving speech synthesis systems, robots navigating the world, etc. require super low latency. Awesome time to be alive!

    • @davidkey4272
      @davidkey4272 7 months ago +1

      It does. Our application needs to speak back and forth with the latency of the conversation contained in this video. Nothing else even gets close right now to being able to do that. That's not even taking into account the class.
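
The agentic-latency point in the replies above is easy to quantify: sequential model calls chain their latencies, so per-call speed compounds. The call counts and token rates below are hypothetical illustrations:

```python
# Why latency compounds in agentic pipelines: N sequential model calls,
# each generating T tokens at R tokens/second. Numbers are hypothetical.

def pipeline_seconds(n_calls, tokens_per_call, tokens_per_second):
    """End-to-end wall time for a chain of sequential generation calls."""
    return n_calls * tokens_per_call / tokens_per_second

slow = pipeline_seconds(n_calls=10, tokens_per_call=300, tokens_per_second=50)
fast = pipeline_seconds(n_calls=10, tokens_per_call=300, tokens_per_second=500)

print(slow, fast)  # 60.0 vs 6.0 seconds end-to-end
```

A single reply at 50 tok/s is faster than reading speed, but ten chained calls turn that into a full minute of waiting, which is where raw generation speed starts to matter.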

  • @bengineer_the
    @bengineer_the 8 months ago

    Has anyone run a ray tracer on this yet? Could be interesting.

  • @anurag-vishwakarma
    @anurag-vishwakarma 9 months ago

    I need the slides, please.

  • @sureshchandrapatel3925
    @sureshchandrapatel3925 months ago

    Where is the parallel-computing revolution? I think AI is not possible without the parallel-computing revolution. There have already been training algorithms and data, but not the parallel-computing revolution. Also, AI is parallel computing itself. So it's a "Patellution": The Parallel Computing Revolution. Hope I am not wrong.

  • @mzzzz3
    @mzzzz3 9 months ago

    What's to stop Nvidia from making their own LPU?

    • @autripat
      @autripat 9 months ago

      token, language, grammar, all the same, everything is converted to embeddings, NVDA already is...what's in a name?

  • @rominmanojchittettu5073
    @rominmanojchittettu5073 9 months ago

    The numbers are so off for the competing company 🤣🤣🤣🤣🤣

  • @rick-kv1gl
    @rick-kv1gl 9 months ago +1

    r u my caucasian

    • @Phils_Guide
      @Phils_Guide 5 months ago

      CRAZY EYEZ KILLA!
