Deep Dive: Optimizing LLM inference

  • Published Jan 4, 2025

Comments • 31

  • @cybermanaudiobooks3231 • 9 months ago +2

    Thanks Julien. Your recent series of videos has been top quality.
    A future video you might consider making is one about the different prompts required when fine-tuning. Why does llama2 differ from mistral? Left-padding for some, right-padding for others; how does trl help simplify things? What is the history of this? What is ChatML? Guanaco? Etc.
    A video laying a solid foundation for navigating this area would be helpful, to say the least!

    • @juliensimonfr • 9 months ago +1

      Hi Cyberman, thank you for the kind words. I have a bit of an inference obsession at the moment, but I'll come back to training and fine-tuning after that. Your suggestions sound good, I'll add them to the list :)

  • @jiegong529 • 5 months ago

    Thanks so much for the crystal clear explanations! You understand them so well and it's even more amazing how you show them in bullet points and graphs to make your audience understand as well!

    • @juliensimonfr • 5 months ago

      Glad you liked it, thank you!

  • @I_like_this_sports_tv • months ago

    This is too good!

  • @sheikhshafayat6984 • 4 months ago

    The explanation was excellent. Thanks a lot!

    • @juliensimonfr • 4 months ago

      Glad it was helpful!

  • @justwest • 9 months ago +1

    How does the big LLM handle the "predicted" tokens? I mean, how does it check whether they are good or not?

    • @juliensimonfr • 9 months ago +1

      Detailed explanation in huggingface.co/blog/assisted-generation. In a nutshell, you can run a single forward pass with the large model on a speculative sequence and retrieve the logits (i.e., the scores that determine the next-token probabilities) at each position in the sequence. If a speculative token has the highest logit at its position, then it's the token the large model would have generated.
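
      As an illustration, here's a minimal sketch of that verification step, assuming the transformers library and using gpt2 / gpt2-large as stand-ins for the draft and large models; the drafted continuation is hard-coded here rather than sampled from a small model:

      ```python
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")               # shared vocabulary
      large = AutoModelForCausalLM.from_pretrained("gpt2-large")
      large.eval()

      # Prompt plus a "drafted" continuation (normally proposed by the small model).
      prompt_ids = tok("The quick brown fox", return_tensors="pt").input_ids
      draft_ids = tok(" jumps over the lazy dog", return_tensors="pt",
                      add_special_tokens=False).input_ids

      # One forward pass over prompt + draft returns logits for every position.
      input_ids = torch.cat([prompt_ids, draft_ids], dim=-1)
      with torch.no_grad():
          logits = large(input_ids).logits                      # (1, seq_len, vocab)

      # Logits at position i predict the token at position i + 1, so compare the
      # large model's argmax with each drafted token and stop at the first mismatch.
      n_prompt = prompt_ids.shape[-1]
      accepted = 0
      for i in range(draft_ids.shape[-1]):
          predicted = logits[0, n_prompt + i - 1].argmax()
          if predicted.item() == draft_ids[0, i].item():
              accepted += 1
          else:
              break
      print(f"accepted {accepted} of {draft_ids.shape[-1]} drafted tokens")
      ```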

    • @justwest • 9 months ago

      @@juliensimonfr Thanks a lot (for the vids in general too, of course). I think I'm still missing a point here: why do you get the logits *at each position in the sequence*? Isn't the output of the model just the probabilities for the *next* token? If I wanted them *at each position*, wouldn't I have to run multiple forward passes? Thanks!

  • @billykotsos4642 • 6 months ago

    very informative as always !

    • @juliensimonfr • 5 months ago

      Glad it was helpful!

  • @QorQar • 2 months ago

    Thank you for the video, but is there a complete mod to use the ideas in the video, especially for beginners?

    • @juliensimonfr • 2 months ago

      All these techniques are implemented in open-source inference servers and models. I'd recommend reading the relevant papers, then exploring the implementation in TGI or vLLM.
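
      For a concrete starting point, here's a minimal sketch of running a model with vLLM, which implements techniques like PagedAttention and continuous batching out of the box (the model name is just an example, swap in your own):

      ```python
      from vllm import LLM, SamplingParams

      # The engine handles paged KV-cache memory and continuous batching internally.
      llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
      params = SamplingParams(temperature=0.7, max_tokens=128)

      prompts = [
          "Explain KV caching in one paragraph.",
          "What is continuous batching?",
      ]
      # Requests are batched together automatically by the engine.
      for output in llm.generate(prompts, params):
          print(output.outputs[0].text)
      ```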

  • @alexis91459 • 4 months ago

    Super cool! But why is the validation step performed by the bigger model faster in speculative decoding? I don't understand how validation works.

    • @juliensimonfr • 4 months ago +1

      Good question. The main reason is that verification by the larger model only requires a single forward pass per candidate sequence. This is much faster than the usual text generation process, which requires one forward pass per new token.
      If the larger model disagrees on a particular token, it generates a better one itself and takes over for the next ones. However, all the tokens proposed by the smaller model up to that point are used as is. So, in the end, we get large-model generation quality, only quicker :)
      Makes sense? Here's a detailed example: huggingface.co/blog/whisper-speculative-decoding
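
      For completeness, here's a minimal sketch of end-to-end speculative (assisted) decoding with the transformers generate() API, using gpt2-large / gpt2 as stand-ins for the large and draft models:

      ```python
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2-large")
      large = AutoModelForCausalLM.from_pretrained("gpt2-large")
      draft = AutoModelForCausalLM.from_pretrained("gpt2")   # small assistant model

      inputs = tok("Speculative decoding works because", return_tensors="pt")
      with torch.no_grad():
          # The draft model proposes tokens and the large model verifies them in a
          # single forward pass; with greedy decoding the output matches what the
          # large model alone would have produced, just faster.
          out = large.generate(**inputs, assistant_model=draft, max_new_tokens=40)
      print(tok.decode(out[0], skip_special_tokens=True))
      ```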

  • @DED_Search • 7 months ago +1

    Would you mind sharing the slides please Sir? Thank you!

    • @juliensimonfr • 4 months ago +2

      Hi, you can find the slides on Slideshare at fr.slideshare.net/slideshow/julien-simon-deep-dive-model-merging/270921708

  • @mourady5588 • 4 months ago

    Thank you very much Julien for this high-quality excerpt!
    Could you please attach the slides in the description, as well as under the other videos?

    • @juliensimonfr • 4 months ago

      Hi, you'll find the slides at fr.slideshare.net/slideshow/julien-simon-deep-dive-optimizing-llm-inference/270920916. I'll share the other ones in the next week or so.

    • @mourady5588 • 4 months ago

      @@juliensimonfr thanks a lot!

  • @RoyAAD • 8 months ago

    Very interesting. Can we batch with a single copy of the LLM, or do we need multiple copies of it loaded on the GPU in order to batch? And how can we estimate throughput, say for a 70B model on 2 A100s?

    • @juliensimonfr • 8 months ago +2

      This works even on a single GPU. Here's the paper if you want to dive deeper: www.usenix.org/conference/osdi22/presentation/yu. Regarding benchmarks, I suggest you run your own to find the right latency/throughput trade-off for your application. This should help: www.databricks.com/blog/llm-inference-performance-engineering-best-practices
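
      As a rough feasibility check (assumptions only, not a benchmark), you can start from the weight memory of a 70B model at different precisions and compare it to the HBM of 2 x A100 (80 GB variant assumed; adjust for 40 GB cards) to see what's left for the KV cache:

      ```python
      # Back-of-the-envelope sizing for a 70B-parameter model on 2 x A100-80GB.
      params = 70e9
      bytes_per_param = {"fp16/bf16": 2, "int8": 1, "int4": 0.5}
      hbm_total_gb = 2 * 80

      for precision, nbytes in bytes_per_param.items():
          weights_gb = params * nbytes / 1e9
          kv_budget_gb = hbm_total_gb - weights_gb
          print(f"{precision:9s}: weights ~{weights_gb:5.0f} GB, "
                f"~{kv_budget_gb:5.0f} GB left for KV cache and activations")

      # Actual tokens/s depends on batch size, sequence lengths and the serving
      # stack, so measure with your own workload as suggested above.
      ```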

    • @RoyAAD • 8 months ago

      @@juliensimonfr Thanks. Do you have a link that explains how to calculate the feasibility for an LLM?

  • @徐迟-i2t • 7 months ago +1

    Explained it all clearly in no time, so impressive! Love from China.

  • @EkShunya • 2 months ago

    5star ⭐

  • @rbrowne4255 • 9 months ago

    Thanks for the video, great job!!! In terms of speculative decoding, can you provide any additional feedback on its impact on GPU performance/memory, i.e., KV-cache usage or overall GPU memory resources?

    • @juliensimonfr • 9 months ago

      The only overhead is the assistant model, which can share layers with the large model. For example, see huggingface.co/blog/whisper-speculative-decoding, which reports only an 8% RAM overhead.
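
      If you want to quantify that overhead for your own model pair, here's a small sketch (stand-in checkpoints, weight memory only, ignoring the assistant's KV cache):

      ```python
      from transformers import AutoModelForCausalLM

      large = AutoModelForCausalLM.from_pretrained("gpt2-large")  # stand-in "big" model
      draft = AutoModelForCausalLM.from_pretrained("gpt2")        # stand-in assistant

      n_large = sum(p.numel() for p in large.parameters())
      n_draft = sum(p.numel() for p in draft.parameters())
      print(f"assistant adds ~{100 * n_draft / n_large:.0f}% extra parameters, "
            f"i.e. roughly that much extra weight memory at the same precision")
      ```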

  • @Gerald-xg3rq • 8 months ago

    Hi, great video! How should WAITING_SERVED_RATIO, MAX_BATCH_SIZE, MAX_BATCH_TOTAL_TOKENS, MAX_BATCH_PREFILL_TOKENS, etc. be set for highest throughput? I'm looking at llama2-7b-chat and llama3-8b-instruct on an NVIDIA A10.

    • @juliensimonfr • 8 months ago

      Hi, the doc is available at huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher. I would increase batch size and measure.
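
      To "increase batch size and measure" in practice, here's a minimal sketch (assumed endpoint and parameters) that sweeps the client-side concurrency against a TGI server already running on localhost:8080 and reports an approximate throughput:

      ```python
      import time
      from concurrent.futures import ThreadPoolExecutor

      import requests

      TGI_URL = "http://localhost:8080/generate"   # adjust to your deployment
      MAX_NEW_TOKENS = 128

      def one_request(prompt: str) -> int:
          payload = {"inputs": prompt,
                     "parameters": {"max_new_tokens": MAX_NEW_TOKENS}}
          requests.post(TGI_URL, json=payload, timeout=300).raise_for_status()
          return MAX_NEW_TOKENS                    # upper bound; TGI may stop at EOS

      for concurrency in (1, 4, 8, 16, 32):
          prompts = ["Summarize the benefits of continuous batching."] * concurrency
          start = time.time()
          with ThreadPoolExecutor(max_workers=concurrency) as pool:
              tokens = sum(pool.map(one_request, prompts))
          elapsed = time.time() - start
          print(f"concurrency {concurrency:3d}: ~{tokens / elapsed:6.1f} tokens/s")
      ```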