Thanks Julien. Your recent series of videos has been top quality.
A future video you might consider making is one about the different prompt formats required for fine-tuning. Why does Llama 2 differ from Mistral? Left-padding for some, right-padding for others; how does TRL help simplify things? What is the history of this? What is ChatML? Guanaco? etc.
A video that builds a solid foundation for navigating this area would be helpful, to say the least!
Hi Cyberman, thank you for the kind words. I have a bit of an inference obsession at the moment, but I'll come back to training and fine-tuning after that. Your suggestions sound good, I'll add them to the list :)
Thanks so much for the crystal clear explanations! You understand them so well and it's even more amazing how you show them in bullet points and graphs to make your audience understand as well!
Glad you liked it, thank you!
This is too good!
The explanation was excellent. Thanks a lot!
Glad it was helpful!
How does the big LLM handle the "predicted" tokens? I mean, how does it check whether these are good or not?
Detailed explanation in huggingface.co/blog/assisted-generation. In a nutshell, you can run a forward pass with the large model on a speculative sequence and retrieve the logits (i.e. the probabilities) for the next token at each position in the sequence. If a speculative token does have the highest logit, then it's the token the large model would have generated.
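Here's a minimal sketch of that verification step, just to illustrate the idea (this is not the exact transformers implementation, and the names are made up):

import torch

def verify_candidates(large_model, prompt_ids, candidate_ids):
    # prompt_ids: (1, p) tokens accepted so far; candidate_ids: (1, k) tokens proposed by the draft model.
    # A single forward pass over prompt + draft tokens yields logits at every position.
    input_ids = torch.cat([prompt_ids, candidate_ids], dim=-1)
    with torch.no_grad():
        logits = large_model(input_ids).logits              # shape: (1, p + k, vocab_size)

    # The logits at position i predict the token at position i + 1, so the predictions
    # for the k draft tokens live at positions p - 1 .. p + k - 2.
    p, k = prompt_ids.shape[-1], candidate_ids.shape[-1]
    greedy_choices = logits[0, p - 1 : p + k - 1].argmax(dim=-1)

    accepted = []
    for draft_token, large_token in zip(candidate_ids[0], greedy_choices):
        if draft_token == large_token:
            accepted.append(draft_token)      # the large model agrees: keep the draft token for free
        else:
            accepted.append(large_token)      # disagreement: take the large model's token and stop
            break
    return torch.stack(accepted)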
@@juliensimonfr Thanks a lot (for the videos in general too, of course). I think I'm still missing a point here: why do you get the logits *at each position in the sequence*? Isn't the output of the model just the probabilities for the *next* token? If I wanted to have this *at each position*, wouldn't I have to run the forward pass multiple times? Thanks!
Very informative as always!
Glad it was helpful!
Thank you for the video, but is there a complete example showing how to use the ideas in the video, especially for beginners?
All these techniques are implemented in open-source inference servers and models. I'd recommend reading the relevant papers, then exploring the implementation in TGI or vLLM.
Super cool! But why is the validation step performed by the bigger model faster in speculative decoding? I don't understand how validation works.
Good question. The main reason is that the input verification by the larger model only requires a single forward pass per candidate sequence. This is much faster than the usual text generation process, which requires one forward pass per new token.
If the larger model disagrees on a particular token, it will generate a better one itself and continue from there. However, all the tokens generated by the smaller model up to that point are used as is. So, in the end, we get large-model generation quality, only quicker :)
Makes sense? Here's a detailed example: huggingface.co/blog/whisper-speculative-decoding
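If you want to try it yourself, here's a short sketch using the assistant_model argument of generate() in transformers (the checkpoints are just examples; any main/draft pair sharing a tokenizer works):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")           # main model
assistant = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")       # small draft model

inputs = tokenizer("Speculative decoding speeds up inference because", return_tensors="pt")
# The draft model proposes tokens, the main model verifies them in a single forward pass.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))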
Would you mind sharing the slides please Sir? Thank you!
Hi, you can find the slides on Slideshare at fr.slideshare.net/slideshow/julien-simon-deep-dive-model-merging/270921708
Thank you very much Julien for this high-quality excerpt!
Could you please attach the slides in the description, as well as under the other videos?
Hi, you'll find the slides at fr.slideshare.net/slideshow/julien-simon-deep-dive-optimizing-llm-inference/270920916. I'll share the other ones in the next week or so.
@@juliensimonfr thanks a lot!
Very interesting. Can we batch with a single copy of the LLM, or do we need multiple copies loaded on the GPU in order to batch? And how can we estimate the throughput? Say I have a 70B model on 2 A100s?
This works even on a single GPU. Here's the paper if you want to dive deeper: www.usenix.org/conference/osdi22/presentation/yu. Regarding benchmarks, I suggest you run your own to find the right latency/throughput trade-off for your application. This should help: www.databricks.com/blog/llm-inference-performance-engineering-best-practices
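As a very rough starting point, here's a sketch of how you could measure aggregate throughput against a running TGI endpoint (the URL, prompt, and request counts are placeholders to adapt to your setup):

import time
from concurrent.futures import ThreadPoolExecutor
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")     # assumed local TGI endpoint
N_REQUESTS, MAX_NEW_TOKENS = 32, 128

def one_request(_):
    # details=True returns generation metadata, including the number of generated tokens.
    out = client.text_generation("Explain continuous batching.",
                                 max_new_tokens=MAX_NEW_TOKENS, details=True)
    return out.details.generated_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
    total_tokens = sum(pool.map(one_request, range(N_REQUESTS)))
elapsed = time.time() - start
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} tokens/s")

Vary the number of concurrent requests and the output length to see where latency starts to degrade for your workload.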
@@juliensimonfr Thanks. Do you have a link that explains how to calculate the feasibility for an LLM?
Explained so clearly and so quickly, amazing! Love from China.
5 stars ⭐
Thank you!
Thanks for the video, great job! In terms of speculative decoding, can you provide any additional feedback on its impact on GPU performance and memory, i.e. KV-cache usage or overall GPU memory resources?
The only overhead is the assistant model, which can share layers with the large model. For example, see huggingface.co/blog/whisper-speculative-decoding, which reports only about an 8% RAM overhead.
Hi, great video! How should I set WAITING_SERVED_RATIO, MAX_BATCH_SIZE, MAX_BATCH_TOTAL_TOKENS, MAX_BATCH_PREFILL_TOKENS, etc. for the highest throughput? I'm looking at llama2-7b-chat and llama3-8b-instruct on an NVIDIA A10.
Hi, the doc is available at huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher. I would increase batch size and measure.
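As a purely illustrative starting point (the values are guesses for a 24 GB A10 and definitely need tuning), the launcher flags corresponding to those variables can be passed like this:

import subprocess

subprocess.run([
    "text-generation-launcher",
    "--model-id", "meta-llama/Meta-Llama-3-8B-Instruct",
    "--max-batch-prefill-tokens", "4096",    # cap on prompt tokens processed in one prefill batch
    "--max-batch-total-tokens", "16384",     # cap on prefill + decode tokens across the running batch
    "--waiting-served-ratio", "1.2",         # when waiting requests get folded into the running batch
    "--max-concurrent-requests", "128",
])

Then increase --max-batch-total-tokens (and measure) until you run out of GPU memory or latency becomes unacceptable.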