GPT-Fast - blazingly fast inference with PyTorch (w/ Horace He)

  • Published 20 Oct 2024

Comments • 20

  • @TheAIEpiphany
    @TheAIEpiphany  7 months ago +1

    Horace He joined us to walk us through what one can do with native PyTorch when it comes to accelerating inference! Also, if you need some GPUs, check out Hyperstack: console.hyperstack.cloud/?Influencers&Aleksa+Gordi%C4%87 who are sponsoring this video! :)

  • @xl0xl0xl0
    @xl0xl0xl0 7 months ago +4

    Wow, this presentation was excellent. Straight to the point. No over-complicating, no over-simplifying, no trying to sound smart by obscuring simple things. Thank you Horace!

  • @orrimoch5226
    @orrimoch5226 7 months ago +1

    Wow! It was very educational and practical!
    I liked the graphics in the presentation!
    Great job by both of you!
    Thanks!

  • @Cropinky
    @Cropinky 5 months ago +1

    I love this guy so much, it's unreal

  • @kaushilkundalia2197
    @kaushilkundalia2197 18 days ago

    It was so informative

  • @nikossoulounias7036
    @nikossoulounias7036 7 months ago

    Super interesting talk!! Do you guys have any idea how the compilation-generated decoding kernel compares against custom kernels like Flash-Decoding or Flash-Decoding++?

  • @xmorse
    @xmorse 7 months ago +1

    Your question about why gpt-fast is faster than the CUDA version: kernel fusion. Merging kernels into one is faster than launching multiple hand-written ones.
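
A minimal sketch of the kernel-fusion point, using a toy chain of pointwise ops (an illustration, not actual gpt-fast code): in eager mode each op launches its own kernel and round-trips its intermediate tensor through GPU memory, whereas torch.compile can fuse the chain into a single generated kernel.

```python
import torch
import torch.nn.functional as F

def gelu_bias_residual(x, bias, residual):
    # Eager mode runs this as separate kernels (add, gelu, add),
    # writing each intermediate back to global GPU memory.
    return F.gelu(x + bias) + residual

# torch.compile can fuse these pointwise ops into one generated kernel,
# so the data is read from and written to global memory only once.
fused = torch.compile(gelu_bias_residual)

# Toy shapes; requires a CUDA GPU.
x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
residual = torch.randn(4096, 4096, device="cuda")
out = fused(x, bias, residual)
```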

  • @SinanAkkoyun
    @SinanAkkoyun 7 months ago

    How does PPL look with int4 quants? Also, given GPTQ, how high is the tps with gpt-fast?

  • @xl0xl0xl0
    @xl0xl0xl0 7 months ago

    One thing that was not super clear to me: are we loading the next weight matrix (assuming there is enough SRAM) while the previous matmul+activation is being computed?

    • @Chhillee
      @Chhillee 7 months ago

      Within each matmul, the loading of data from main memory into registers happens at the same time as the values are being computed.
      So the answer to your question is "no, but it also wouldn't help, because the previous matmul/activation is already saturating the memory bandwidth" (a rough back-of-the-envelope follows this thread).

    • @xl0xl0xl0
      @xl0xl0xl0 7 months ago

      @@Chhillee Thank you, makes sense.
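
A rough back-of-the-envelope for the bandwidth-saturation point (the 7B model size and ~2 TB/s figure below are assumptions for illustration, not numbers from the talk): during decoding, every generated token streams essentially all the weights from HBM once, so tokens/sec is capped at memory bandwidth divided by model size in bytes.

```python
# Upper bound on decoding speed for a memory-bandwidth-bound model.
# All numbers are illustrative assumptions, not measurements from the talk.
params = 7e9                  # assume a 7B-parameter model
bytes_per_param = 2           # fp16/bf16 weights
hbm_bandwidth = 2.0e12        # ~2 TB/s, roughly A100-class HBM (assumed)

model_bytes = params * bytes_per_param
print(f"~{hbm_bandwidth / model_bytes:.0f} tok/s ceiling at fp16")     # ~143 tok/s

# Shrinking bytes-per-weight (int8, int4) raises the ceiling proportionally,
# which is why weight-only quantization helps decode latency so much.
print(f"~{hbm_bandwidth / (params * 0.5):.0f} tok/s ceiling at int4")  # ~571 tok/s
```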

  • @XartakoNP
    @XartakoNP 7 months ago

    I didn't understand one of the points made. On a couple of occasions Horace mentions that we are loading all the weights (into the registers, I assume) with every token - that's also what the diagram shows at th-cam.com/video/18YupYsH5vY/w-d-xo.html. Is that what's happening? Can the registers hold all the model weights at once? If that were the case, why do you need to load them every time instead of leaving them untouched? I hope that's not too stupid of a question.

    • @Chhillee
      @Chhillee 7 months ago

      This is a good question! The big problem is that GPUs do not have enough registers (i.e. SRAM) to load all the model weights at once. A GPU has on the order of megabytes of registers/SRAM, while the weights require 10s of gigabytes to store (a quick size comparison follows this thread).
      Q: But what if we used hundreds of chips so there was enough SRAM to store the entire model? Would generation be much faster then?
      A: Yes, and that's what we have with Groq :)

    • @XartakoNP
      @XartakoNP 7 months ago

      @@Chhillee Thanks!! I appreciate the answer. I assume the diagram has been simplified for clarity, then.
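
A quick size comparison to make the SRAM-vs-weights gap concrete (the capacities below are rough, assumed ballpark figures): on-chip storage is on the order of tens of MB, while even a mid-sized model's fp16 weights run to tens of GB, so the weights have to be re-streamed from HBM/DRAM on every token.

```python
# Why the weights cannot simply stay resident in registers/SRAM between tokens.
# All figures are rough, assumed ballpark numbers for illustration only.
on_chip_sram_bytes = 50e6        # registers + shared memory + L2 on an A100-class GPU (assumed)
weights_fp16_bytes = 13e9 * 2    # a 13B-parameter model at 2 bytes per weight = 26 GB

print(f"on-chip SRAM : {on_chip_sram_bytes / 1e6:.0f} MB")
print(f"fp16 weights : {weights_fp16_bytes / 1e9:.0f} GB")
print(f"weights are ~{weights_fp16_bytes / on_chip_sram_bytes:.0f}x larger than on-chip storage")
```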

  • @mufgideon
    @mufgideon 7 months ago +1

    Is there any Discord for this channel's community?

    • @TheAIEpiphany
      @TheAIEpiphany  7 months ago +2

      Yes sir! Pls see vid description

  • @tljstewart
    @tljstewart 7 months ago

    Awesome talks, can Triton target TPUs?

  • @kyryloyemets7022
    @kyryloyemets7022 7 months ago

    But ctranslate2, as I understand it, is still faster?

  • @kimchi_taco
    @kimchi_taco 7 months ago

    Speculative decoding is a major thing, right? If so, it's not a very fair comparison...

    • @Chhillee
      @Chhillee 7 months ago

      None of the results use speculative decoding, except the ones we specifically mentioned as using it. I.e., we hit ~200 tok/s with int4 without spec-dec, and 225 or so with spec-dec.
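
For readers who haven't seen it, below is a minimal greedy-verification sketch of speculative decoding (draft_model and target_model are placeholder callables, not gpt-fast APIs, and real implementations sample with an acceptance rule rather than plain argmax): a small draft model proposes k tokens cheaply, the large model scores all of them in a single forward pass, and the matching prefix plus one corrected token is kept.

```python
import torch

def speculative_decode_step(draft_model, target_model, prefix, k=4):
    """One step of (greedy) speculative decoding.

    draft_model / target_model are placeholders that map a 1-D token
    sequence to next-token logits for every position, shape [seq, vocab].
    """
    # 1) Draft model proposes k tokens autoregressively (cheap per step).
    draft = prefix.clone()
    for _ in range(k):
        logits = draft_model(draft)                      # [seq, vocab]
        draft = torch.cat([draft, logits[-1].argmax().view(1)])

    # 2) Target model scores every proposed position in ONE forward pass.
    target_preds = target_model(draft).argmax(dim=-1)    # [seq]

    # 3) Accept proposed tokens while they match the target's own choice;
    #    on the first mismatch, keep the target's correction and stop.
    accepted = prefix.clone()
    for i in range(prefix.numel(), draft.numel()):
        corrected = target_preds[i - 1]   # target's prediction for position i
        accepted = torch.cat([accepted, corrected.view(1)])
        if draft[i] != corrected:
            break
    return accepted
```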