The KV Cache: Memory Usage in Transformers

  • Published Dec 15, 2024

Comments • 103

  • @michaelnguyen1724
    @michaelnguyen1724 1 year ago +7

    You explained KV cache so well in an easy to understand way.

  • @TL-fe9si
    @TL-fe9si 1 year ago +15

    This is so clear! Thanks for the explanation!

  • @mohitlamba117
    @mohitlamba117 6 months ago +5

    Thanks a ton for this crisp and precise explanation of why we use caching in transformers.

  • @mamotivated
    @mamotivated 1 year ago +4

    This was a beautiful, simple video. Great job. Feeding this YouTube algo.

  • @alexandretl
    @alexandretl 1 year ago +8

    Really great video! Funny that I searched for "transformer kv cache" on Google and your video was uploaded only 8 hours ago

    • @EfficientNLP
      @EfficientNLP 1 year ago +3

      Thanks! Last week I looked for a video on this topic and didn't find one, so I decided to make it :)

    • @MacProUser99876
      @MacProUser99876 9 months ago

      @@EfficientNLP - Necessity is the mother of invention. Please keep up the good work!

  • @zifencai2135
    @zifencai2135 1 year ago +3

    Awesome explanation! Looking forward to more videos.

  • @jow7814
    @jow7814 4 months ago +1

    This is the best video for kv cache! Thx

  • @shashank3165
    @shashank3165 7 months ago +1

    A really concise explanation. Thanks a lot.

  • @UMN-CSCI5541-p9i
    @UMN-CSCI5541-p9i 1 month ago

    Amazing explanation. Love it. Thank you so much.

  • @voncolborn9437
    @voncolborn9437 6 months ago

    Great video. Now I understand the importance of 'time to first token'. I like the short ones that are to the point on a topic. Learning in smaller chunks works well for me. Thanks!

  • @forrest-forrest
    @forrest-forrest 5 months ago

    Amazing. Some of my colleagues work on KV cache, and this video was a great introduction to the topic. Thank you!

  • @boi_doingthings
    @boi_doingthings 8 months ago

    Excellent Video. Just Brilliant.

  • @sarabolouki
    @sarabolouki 2 months ago

    This was a great tutorial, thank you for sharing.

  • @yuanhu6031
    @yuanhu6031 6 months ago

    Excellent video, great high level overview!

  • @PMX
    @PMX 11 months ago +1

    For running an LLM locally, a batch size of 1 is enough and would reduce the KV cache to just 1.4 GB in the OPT-30B example.
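
A quick back-of-the-envelope check of this figure, written as a hedged Python sketch. It uses the memory equation discussed in the video (2 matrices, K and V, times bytes per value, layers, embedding width, sequence length, and batch size); the OPT-30B dimensions (48 layers, d_embed = 7168), seqlen = 1024, batch size 128, and fp16 are assumptions reconstructed from this thread.

    # KV cache bytes = 2 (K and V) * bytes_per_value * n_layers * d_embed * seqlen * batch
    def kv_cache_bytes(n_layers, d_embed, seqlen, batch, bytes_per_value=2):  # fp16 -> 2 bytes
        return 2 * bytes_per_value * n_layers * d_embed * seqlen * batch

    opt30b = dict(n_layers=48, d_embed=7168)                       # OPT-30B dimensions (assumed)
    print(kv_cache_bytes(**opt30b, seqlen=1024, batch=128) / 1e9)  # ~180 GB, the video's example
    print(kv_cache_bytes(**opt30b, seqlen=1024, batch=1) / 1e9)    # ~1.4 GB with batch size 1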

  • @omgitsbenhayes
    @omgitsbenhayes 4 months ago

    Can you explain why at 5:23 the self-attention layer is using x, x' and not q, q'?

    • @EfficientNLP
      @EfficientNLP 4 months ago

      Here, x represents the embeddings generated by the previous layer, and it is required in self-attention because q = W_q x. The purpose of the diagram is to show how self-attention uses the KV cache and the previous layer's embeddings to update the cache and generate the inputs for the next transformer layer.
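
For readers who want the reply above in code: a minimal single-head sketch of one decoding step, assuming x_new is the newest token's embedding from the previous layer and W_q, W_k, W_v are that layer's projection matrices (all names here are illustrative, not the video's implementation).

    import torch

    def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
        """One self-attention step for the newest token, reusing cached K/V rows."""
        q     = x_new @ W_q                          # (1, d_k)  query for the new token only
        k_new = x_new @ W_k                          # (1, d_k)
        v_new = x_new @ W_v                          # (1, d_v)
        K = torch.cat([k_cache, k_new], dim=0)       # cache grows by one row per step
        V = torch.cat([v_cache, v_new], dim=0)
        attn = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)  # (1, seqlen)
        out = attn @ V                               # attention output for the new token
        return out, K, V                             # K and V become the next step's cache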

  • @vijaypaul5774
    @vijaypaul5774 3 months ago

    Trying to understand the KV cache memory requirement (7:17). Why would we use a batch size here, given this concept applies during inference? Is it common/typical to run inference workloads in batches as well?

    • @EfficientNLP
      @EfficientNLP 3 months ago

      Yes, when running on GPUs you achieve much higher throughput by batching rather than processing one at a time.

  • @huiwei-ed1ip
    @huiwei-ed1ip 1 year ago +5

    I hear that vLLM is an optimization for the KV cache; it uses continuous batching and PagedAttention

  • @徐迟-i2t
    @徐迟-i2t 7 months ago

    Very clear. Thank you!

  • @einsteinsapples2909
    @einsteinsapples2909 1 year ago

    great video! thank you, subscribed!

  • @YoucefKacer
    @YoucefKacer 11 months ago +3

    Thanks, simplifying to one layer and one head makes things very clear. Now suppose we have n_layers and n_heads, so we have the memory usage per token M = n_layers*d_embed (d_embed = d_k * n_heads). But what about the compute per token? Some online references claim that C = n_layers * d_embed², and I'm very surprised that this formula does not depend on the past tokens (context + already generated): I mean that the 2nd layer expects the embedding vector x2 output by the 1st layer to compute its KV cache, and x2 depends on past tokens (see the attention formula). What do you think?

    • @EfficientNLP
      @EfficientNLP 11 months ago +1

      Sorry, I didn't quite understand your question. The KV caches of the different layers of a transformer do not affect each other; that is, each layer has its own KV cache, and it does not depend on whether the KV cache is used for the other layers.

  • @ArmenJeddi
    @ArmenJeddi 1 month ago

    Great video! Does this also apply to vision models? Are there any recent papers that explore caching for diffusion/vit models?

    • @EfficientNLP
      @EfficientNLP 1 month ago

      The KV cache is only relevant for models that do autoregressive generation, which is not often the case for vision transformer or diffusion models; but if the architecture involves autoregressive generation, then it can potentially be useful.

  • @RyanLynch1
    @RyanLynch1 5 months ago

    8:00 I believe the industry jargon for this first computation is "prefill"

  • @wolpumba4099
    @wolpumba4099 1 year ago +3

    *Video Summary: The KV Cache: Memory Usage in Transformers*
    - *Introduction*
    - Discusses the memory limitations of Transformer models, especially during text generation.
    - *Review of Self-Attention*
    - Explains the self-attention mechanism in Transformers.
    - Highlights how query, key, and value vectors are generated.
    - *How the KV Cache Works*
    - Introduces the concept of the KV (Key-Value) cache.
    - Explains that the KV cache stores previous context to avoid redundant calculations.
    - *Memory Usage and Example*
    - Provides an equation for calculating the memory usage of the KV cache.
    - Gives an example with a 30 billion parameter model, showing that the KV cache can take up to 180 GB.
    - *Latency Considerations*
    - Discusses the latency difference between processing the prompt and subsequent tokens due to the KV cache.
    The video provides an in-depth look at the KV cache, a crucial component that significantly impacts the memory usage and efficiency of Transformer models. It explains how the KV cache works, its role in self-attention, and its implications for memory usage and latency.

  • @bnglr
    @bnglr 1 year ago +3

    Is my understanding correct: in your example, “chill” has already been generated, and you are demonstrating the preparation work after you got “chill” and before generating the token after “chill”?

    • @EfficientNLP
      @EfficientNLP 1 year ago

      It's showing the work done to generate the word 'chill.' We assume that some work has already been done to generate the previous tokens; that's what is cached and can be used to generate this word.

    • @bnglr
      @bnglr 1 year ago

      @@EfficientNLP So Q_new, K_new, and V_new have to be for the token "a", not "chill"?

    • @sookinoby
      @sookinoby 1 month ago +1

      This is exactly the part that confused me.

  • @evaadam5382
    @evaadam5382 12 days ago

    Thanks! Is it right that the previous embeddings don't change because the attention layers are unidirectional? For a bidirectional encoder, the previous tokens would also change when a new word is appended, so the KV cache won't work? Or should the decoder just not be bidirectional in the first place?

    • @EfficientNLP
      @EfficientNLP 11 days ago +1

      That is correct. The KV cache is not useful for bidirectional encoder models like BERT or encoders in general, since it is only in autoregressive decoders that the model is run multiple times with one more token at a time.
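
A small numerical check of this point, as a sketch with a single attention head and random weights: under a causal mask, appending a token leaves the attention outputs of earlier positions unchanged, which is what makes the KV cache safe to reuse; without the mask (the bidirectional case) those outputs would change.

    import torch

    torch.manual_seed(0)
    d = 8
    W_q, W_k, W_v = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

    def causal_self_attention(x):                        # x: (seqlen, d)
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        scores = q @ k.T / d ** 0.5                      # (seqlen, seqlen)
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf")) # each position sees only its past
        return torch.softmax(scores, dim=-1) @ v

    x = torch.randn(5, d)                                # 5 tokens
    x_plus_one = torch.cat([x, torch.randn(1, d)])       # append a 6th token
    print(torch.allclose(causal_self_attention(x),
                         causal_self_attention(x_plus_one)[:5]))  # True: earlier outputs unchanged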

  • @thoughtbox
    @thoughtbox 1 year ago +1

    What is the consequence when the KV cache grows to such a point that two GPUs (2x the amount of memory) are needed to continue to calculate the next token? How is the KV cache partitioned across two (or more) GPUs? My guess is that as the context length increases and the KV cache grows, the amount of compute to calculate each token also continues to expand; is that correct?

    • @EfficientNLP
      @EfficientNLP 1 year ago

      The answer depends on which parallelization strategy is used to distribute across multiple GPUs. Assuming the simplest setup, data parallelism, the model is replicated on each GPU, with each one handling a different part of the input batch. In this scenario, the KV cache is distributed across GPUs as well, and each GPU must store the KV cache for its respective portion of the batch.

    • @thoughtbox
      @thoughtbox 1 year ago

      @@EfficientNLP If the input batch (which contains the prompts from multiple users) is split across multiple (let's say 2) GPUs (let's call these sub-batches), the resulting KV cache on each GPU would end up being different. I have 2 questions. 1. How is this different than simply running a smaller batch size in the first place? 2. Does this mean that responses from the transformer based on the prompts from user #1 and user #2 (that were processed in the same batch) will in some way be impacted by each other's prompts, as these responses will be determined by a shared KV cache?

  • @Nishant-xu1ns
    @Nishant-xu1ns 7 months ago

    excellent video sir

  • @billykotsos4642
    @billykotsos4642 1 year ago +1

    nice info

  • @thoughtbox
    @thoughtbox 1 year ago +1

    I have a question regarding the KV cache and multi-tenancy. If the KV cache for a single “inference” fills a GPU or two worth of memory, what happens when the next user inputs a sequence? Does the previous user's KV cache need to be flushed, and a new KV cache generated for the new user? Where does that KV cache go? To system memory? Only then to have to be brought back into GPU memory for the next sequence?

    • @EfficientNLP
      @EfficientNLP 1 year ago +1

      In a production deployment, this inference would be batched, so you'll always be handling multiple inputs from different users and generating them simultaneously. This approach utilizes the GPU more efficiently than processing one sentence at a time. Once the outputs are generated to completion, there's no need to keep the KV cache in memory. If you're asking about multiple chat turns, as in ChatGPT, then each turn is treated as a new input. The conversation history is provided as a prompt, and the KV cache from previous turns is not retained in memory.

    • @thoughtbox
      @thoughtbox 1 year ago

      @@EfficientNLP If we consider a single GPU (with 180 GB of HBM), and our first “batch #1” of user inputs generates a KV cache of 180 GB as in your example, how does the system handle the second “batch #2”? Does it retain the 60 GB of model weights and simply write over the 120 GB of memory as it is needed? Then if a user input that was previously handled in “batch #1” has a follow-up chat turn, as in ChatGPT, the entire chat history will now be handled as a single input?

    • @EfficientNLP
      @EfficientNLP 1 year ago

      That's correct - Batch 2 will clear out the KV cache for Batch 1. The system won't keep the KV cache of a user across chat messages since you might wait a long time before the user sends another message.

    • @thoughtbox
      @thoughtbox 1 year ago

      @@EfficientNLP Thanks for clarifying! Great channel you have here.

  • @grilledcheeze101
    @grilledcheeze101 1 year ago +2

    Great video! Btw, can we use the KV cache during training?

    • @EfficientNLP
      @EfficientNLP 1 year ago +4

      No, it's only useful for inference because we are generating tokens one at a time, and previous K and V matrices can be cached. During training the entire sequence is processed in parallel, not sequentially, so there is no KV cache.

    • @grilledcheeze101
      @grilledcheeze101 1 year ago +2

      @@EfficientNLP Thanks a lot for the explanation ❤.

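To make the training/inference contrast above concrete, here is a hedged sketch using the Hugging Face Transformers API with a small OPT checkpoint chosen purely for illustration.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("facebook/opt-125m")       # small model, for illustration
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
    input_ids = tok("The KV cache is", return_tensors="pt").input_ids

    # Training-style step (teacher forcing): the whole sequence goes through one
    # forward pass under a causal mask, so there is nothing to cache across steps.
    out = model(input_ids, labels=input_ids)        # logits: (batch, seqlen, vocab)
    loss = out.loss

    # Inference: tokens are produced one at a time, so cached K/V entries are reused.
    model.eval()
    with torch.no_grad():
        generated = model.generate(input_ids, max_new_tokens=16, use_cache=True)
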
  • @SoheePark-k8d
    @SoheePark-k8d 1 month ago

    Thx for this great video!! Btw I have a question: I understand that each layer caches K and V independently, so why do we multiply by the number of layers when calculating the memory usage for K and V?

    • @EfficientNLP
      @EfficientNLP 1 month ago

      That's correct: each layer has its own independent K and V values, so the size of the KV cache increases with the number of layers.

  • @gamroogamesgalore321
    @gamroogamesgalore321 2 months ago

    Is the runtime in charge of bringing up and growing the cache?

    • @gamroogamesgalore321
      @gamroogamesgalore321 2 months ago

      Also, for the tokens not yet generated, does the cache store 0s or does it not exist yet? Are we dynamically determining the size of K and V?

    • @EfficientNLP
      @EfficientNLP 2 months ago

      Yes, this is typically handled by the inference library, such as Hugging Face Transformers or similar.
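
As an illustration of the library managing the cache (and of its size growing dynamically rather than being pre-filled with zeros), here is a hedged sketch of greedy decoding with Hugging Face Transformers, again using a small OPT checkpoint purely as an example. The first pass is the prefill over the whole prompt; each later pass feeds only the newest token and reuses past_key_values.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").eval()

    ids = tok("The KV cache stores", return_tensors="pt").input_ids
    past = None
    with torch.no_grad():
        for _ in range(16):
            # Prefill on the first pass (all prompt tokens), then one token per step.
            step_input = ids if past is None else ids[:, -1:]
            out = model(step_input, past_key_values=past, use_cache=True)
            past = out.past_key_values                      # library-managed cache, grows each step
            next_id = out.logits[:, -1].argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)
    print(tok.decode(ids[0]))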

  • @snehotoshbanerjee1938
    @snehotoshbanerjee1938 4 months ago

    Great video!!

  • @UMN-CSCI5541-p9i
    @UMN-CSCI5541-p9i 1 month ago

    Can you explain why the memory the model takes up is 2*30B, instead of just 30B?

    • @EfficientNLP
      @EfficientNLP 1 month ago +1

      It is multiplied by 2 because there are 2 matrices: one for K and one for V.

  • @webliu
    @webliu 1 month ago

    Really clear explanation! But I still have some questions:
    1. After the prefill phase, do the embeddings of all tokens except the last one no longer update? Intuitively, the meaning of earlier tokens may also be influenced by the newly added token.
    2. What does the batch size mean? What do we use batching in LLMs for?
    I would be very grateful if you would answer my questions. qaq

    • @EfficientNLP
      @EfficientNLP 1 month ago +1

      1. That is correct: the KV cache works because the embeddings for previous tokens do not change when future tokens are added; this is due to the autoregressive architecture.
      2. Batching is frequently used in LLMs to increase throughput, as GPUs are more efficient at processing multiple inputs in parallel than one at a time.

    • @webliu
      @webliu 1 month ago

      @@EfficientNLP Thanks for answering my questions! Your reply helps me a lot!😸😸😸

  • @Best9in
    @Best9in 1 year ago +1

    GJ!

  • @1PercentPure
    @1PercentPure 1 year ago +1

    cheers

  • @chuanjiang6931
    @chuanjiang6931 4 months ago

    Is the KV cache used in the LLM training process?

    • @EfficientNLP
      @EfficientNLP 4 months ago

      No, it's only useful for inference because we are generating tokens one at a time, and previous K and V matrices can be cached. During training the entire sequence is processed in parallel, not sequentially, so there is no KV cache.

    • @chuanjiang6931
      @chuanjiang6931 4 months ago

      @@EfficientNLP So in training, the seq_len of the input tensor is usually greater than 1, just like normal training of the GPT series? And during inference, we intentionally forward the model with 1 token at a time (seq_len == 1 always)?

  • @kitgary
    @kitgary 10 months ago

    I am a bit confused: why is a token represented as a column in the K matrix but a row in the V matrix?

    • @EfficientNLP
      @EfficientNLP 10 months ago

      This is to illustrate the shapes of matrix operations in the self-attention mechanism. The matrix K is transposed during the dot product operation, so each token is a column.

  • @akhileshgotmare9812
    @akhileshgotmare9812 6 months ago

    Isn't batched decoding a bit impractical to assume when estimating the KV cache footprint of OPT-30B? I'd say for bsz = 1 (online decoding) this is still not that significant, ~1.4 GB.

    • @EfficientNLP
      @EfficientNLP 6 months ago

      The ideal batch size depends on the size of the model and the memory available in your GPU hardware. You are correct that the KV cache would not take up much memory in the case of a batch size of 1; however, it would result in poor throughput and would not utilize the parallelism capabilities of the GPUs.

  • @talis1063
    @talis1063 1 year ago

    If you didn't want the cache part of the 'KV cache', could you save VRAM? I understand it would be much slower. Like deallocating K before calculating V or something.

    • @EfficientNLP
      @EfficientNLP 1 year ago

      Yes, indeed, it is possible to disable the KV cache, which would save memory at the expense of increased compute. Theoretically, you can also enable it for some layers and not for others (although I'm not sure if this is done in practice).

    • @talis1063
      @talis1063 1 year ago

      @@EfficientNLP Thanks for the answer and the video.

  • @יאירשי
    @יאירשי 11 months ago

    Didn't you miss the number of attention heads in the multiplication?

    • @EfficientNLP
      @EfficientNLP 11 months ago

      I simplified the formula to ignore multi-headed attention. With MHA you have d_k = d_embed / nheads. However, since there are nheads different K and V matrices, the memory requirement becomes seqlen * d_k * nheads, which is the same as seqlen * d_embed. So we can ignore MHA in the memory calculation.
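
A quick numeric sanity check of that equivalence, using assumed OPT-30B-like dimensions (d_embed = 7168, 56 heads) for illustration.

    d_embed, n_heads, seqlen = 7168, 56, 1024           # OPT-30B-like dimensions (assumed)
    d_k = d_embed // n_heads                            # 128 per head
    assert n_heads * d_k == d_embed                     # heads drop out of the memory formula
    print(seqlen * d_k * n_heads == seqlen * d_embed)   # True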

  • @klstudio9
    @klstudio9 1 year ago

    I am confused about the last part. Why is the prompt part slower? It could also generate and append embeddings one by one, right?

    • @EfficientNLP
      @EfficientNLP 1 year ago

      It is slower during the first iteration, because the model must generate the full K and V matrices (for all the prompt tokens). In subsequent iterations, it only needs to generate a single row or column of the matrices, corresponding to the next token.

    • @klstudio9
      @klstudio9 1 year ago

      OK, so for prompt tokens the time complexity is also linear; because of the length, the generation will take longer.

  • @stasgurevich7786
    @stasgurevich7786 1 year ago +1

    The batch size seems strange for inference. In a chatbot scenario you can have a batch size of 1, not 128. This would reduce the memory for the KV cache to about 1 GB.

    • @EfficientNLP
      @EfficientNLP 1 year ago +2

      It depends on the scenario. For interactive use, if the batch size is 1, you probably don't need to use the KV cache at all -- the memory usage won't be very high but this isn't a very efficient use of GPU memory. Large cloud providers like OpenAI will batch together many requests when serving their ChatGPT or API to make more efficient use of GPU resources.

  • @RoyAAD
    @RoyAAD 7 months ago

    Awesome.

  • @DeepTylerDurden
    @DeepTylerDurden 1 year ago

    At the beginning the memory usage is not 180 GB, right? The total context_length of the model is 1024, but that is not the current length. Let's say we have a prompt with 20 tokens and we run inference on that. The model needs to store the KV cache only for seqlen = 20, then 21, 22, ..., up to 1024. So we will typically see the memory usage of the KV cache grow during inference. Am I correct?

    • @EfficientNLP
      @EfficientNLP 1 year ago +2

      That's right, the kv cache will contain the embeddings for 20 tokens at the beginning, then grow as more tokens are generated.

    • @wolpumba4099
      @wolpumba4099 1 year ago

      @@EfficientNLP You mention at th-cam.com/video/80bIUggRJf4/w-d-xo.html that one typically uses fp16 for inference. However, I could imagine that 2 bits per entry should be plenty to represent the phase and frequency of a sinusoid in a vector. Can you clarify what parts of the K and V matrices may be quantized?

    • @EfficientNLP
      @EfficientNLP 1 year ago

      @@wolpumba4099 In the example, we assume all of the weights and computations in the network are in fp16.

  • @SushilDubey171
    @SushilDubey171 9 months ago

    Great expl

  • @BlockDesignz
    @BlockDesignz 1 year ago

    Good video!

  • @SinanAkkoyun
    @SinanAkkoyun 1 year ago

    Thank you so so much! I have one question: why is the logit generation for the already existing prompt necessary? I want to understand how the prediction of a new token is directly related to the already generated logits. I hope my question makes sense.
    Again, thank you so much, your videos are the best explanations on YouTube!

    • @EfficientNLP
      @EfficientNLP 1 year ago

      I'm not sure what you mean - I didn't discuss logits anywhere in this video. If you're referring to the key and value vectors for the previous prompt tokens, they are required because the embeddings of all the previous tokens in the sequence need to be computed and multiplied by the matrices Wk and Wv to generate the k and v vectors necessary to perform self-attention.

  • @jokmenen_
    @jokmenen_ 9 months ago

    Wow, I'm starting to get it. Still blows my mind though, how does it learn...

  • @nimatajbakhsh999
    @nimatajbakhsh999 1 year ago

    Modern GPUs such as the A100 or H100 have at most 80 GB of RAM, so how would one run inference for a large language model with KV caching? In your example, KV caching takes about 180 GB. Is model parallelism the only option?

    • @EfficientNLP
      @EfficientNLP 1 year ago

      Yeah, that is more than you can fit on a single GPU currently, so to run a 60 GB model you will need to split it across several GPUs (e.g., pipeline or tensor parallelism).

  • @mrinalde
    @mrinalde 1 year ago

    Can you add more details on the dimensions of each of K, Q, V? For example, when we compute QK^T: here Q is 1xD (D is the hidden dimension) and the output is 1 column added to K, which means the output is 1xD as well. Given this output I try to fit the equation as Q @ K^T ==> 1xD = 1xD @ DxD?? Is K^T DxD?

    • @EfficientNLP
      @EfficientNLP 1 year ago

      That's not quite right, the dimension of K^T is D x seqlen, not DxD. A good way to figure out the dimensions is by setting up a breakpoint in any transformers library, such as Hugging Face, and print out the dimensions of the tensors.
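
Following the suggestion above, a hedged sketch that simply prints the cached tensor shapes using a small OPT checkpoint (chosen only for illustration). In older Transformers versions past_key_values is a tuple of (key, value) pairs per layer; newer versions wrap it in a cache object, but indexing a layer gives the same shapes.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").eval()

    ids = tok("hello world", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)

    k, v = out.past_key_values[0]        # layer 0 of the cache
    print(k.shape, v.shape)              # (batch, n_heads, seqlen, d_k) for both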

  • @senapatiashok
    @senapatiashok 1 year ago

    Great video. Do you have a notebook implementing the KV cache? It would be really helpful.
    Memory optimization is one of the key solutions for on-device deployment. Keep posting insightful optimizations.

    • @EfficientNLP
      @EfficientNLP 1 year ago +1

      Sure, this blog post contains a minimal implementation: www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/

  • @Roger11265
    @Roger11265 1 year ago

    Good video!! ❤ But I have a question: when you calculate the memory usage, you use d_embed (the dimension of the embeddings); why not d_k, which is the output dimension of W_k (d_embed*d_k)? I think the K cache should be (seqlen*d_k), which comes from the product of the input matrix (seqlen*d_embed) and W_k (d_embed*d_k).

    • @EfficientNLP
      @EfficientNLP 1 year ago +1

      You are correct that the dimension of the k/v matrix should be d_k and not d_embed. The version I showed is a simplification that ignores the multi-headed attention. With MHA you have d_k = d_embed / nheads. However, since there are nheads different K and V matrices, the memory requirement becomes seqlen * d_k * nheads, which is the same as seqlen * d_embed. So we can ignore MHA in the memory calculation.

    • @Roger11265
      @Roger11265 1 year ago

      @@EfficientNLP Got it! Thx a lot! ❤❤

  • @svkchaitanya
    @svkchaitanya 9 months ago

    Hats off dude, you rock...

  • @punchster289
    @punchster289 22 days ago

    The layer norm retroactively modifies previous tokens, right?

    • @EfficientNLP
      @EfficientNLP 22 days ago

      Not sure if I understood this question, but no, LayerNorm (or any other part of the architecture) does not affect previously generated tokens when generating the current token.

    • @punchster289
      @punchster289 22 days ago

      @@EfficientNLP From my understanding, when we apply layer norm to an embedding sequence, we find the average of the sequence and center it at zero. Then we find the stdev and scale it to 1. If I append a new embedding, this changes the mean and stdev, which changes the orientation of all the vectors after the layer norm. Did I misunderstand?