As always, you are one of the few people covering this topic on YouTube.
Sam, I love your videos but this one takes the cake. Thank you!!!
Talked to its core developer; they don't have plans to support quantized models yet, hence you really need powerful GPU(s) to run it.
Finally AI models that don't take a year to give a response.
Cheers for sharing this Sam.
Uhh... you already get instant responses from GGML/llama.cpp (apart from the model weights loading time, but that is not something PagedAttention improves on).
The deal with PagedAttention is that it prevents the KV cache from wasting memory: instead of allocating the entire context length up front, it allocates in chunks as the sequence grows (and possibly shares chunks among different inference beams or users).
This allows the same model to serve more users (throughput) - of course, provided that they generate sequences shorter than the context length. It should not affect the response time for any individual user (if anything, it makes it worse because of the overhead of mapping virtual to physical memory blocks).
So if it improves on HF in that respect, it just demonstrates that either HF's implementation of the KV cache sucks, or Sam is comparing non-KV-cached generation with KV-cached generation.
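To make the idea concrete, here is a toy sketch of block-wise KV-cache allocation (not vLLM's actual code; the block size and names are made up):

```python
# Toy illustration of PagedAttention-style block allocation (not vLLM's real implementation).
# Instead of reserving max_context_len slots per sequence up front, the cache
# hands out fixed-size blocks only as tokens are actually generated.

BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)

class ToyBlockKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, position: int):
        """Allocate a new physical block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:              # need a fresh block
            table.append(self.free_blocks.pop())
        # the KV vectors for `position` would be written into
        # block table[-1], slot position % BLOCK_SIZE

cache = ToyBlockKVCache(num_blocks=1024)
for pos in range(40):           # a 40-token sequence only consumes 3 blocks,
    cache.append_token(0, pos)  # not ceil(max_context_len / BLOCK_SIZE)
print(len(cache.block_tables[0]))  # -> 3
```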
Thanks Sam for this video. It would be interesting to dedicate a video to comparing OpenAI-API emulation options such as LocalAI, Oobabooga, and vLLM.
A very good test for this, which you could make a video about, would be to use the OpenAI-compatible server functionality with a well-performing, optimized local model trained for coding, and test it with great new tools like GPT-Engineer or Aider, to see how it performs compared to GPT-4 in real-world scenarios of writing applications.
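For reference, a rough sketch of what that setup could look like (the model name, port, and flags are placeholders - check the vLLM docs for the exact options in your version; the client snippet assumes the old v0.x-style openai package):

```python
# Rough sketch: point the OpenAI Python client (v0.x style) at a local vLLM
# OpenAI-compatible server. Model name and port below are placeholders.
# Server (separate terminal, flags may differ by vLLM version):
#   python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-7b-v1.3 --port 8000
import openai

openai.api_key = "EMPTY"                      # a local server ignores the key
openai.api_base = "http://localhost:8000/v1"  # vLLM's OpenAI-compatible endpoint

completion = openai.Completion.create(
    model="lmsys/vicuna-7b-v1.3",             # must match the model the server loaded
    prompt="Write a Python function that reverses a string.",
    max_tokens=128,
    temperature=0.2,
)
print(completion.choices[0].text)
```

Tools that let you override the OpenAI base URL could then be pointed at the same endpoint.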
Hmm, I did not know that Red Bull and Verstappen were in the race for turbocharging LLMs 😉 Thanks for demonstrating vLLM in combination with an open-source model 👍
Thanks mate, I am going to try to add that to LangChain so it can integrate seamlessly into my product.
Finally we can achieve fast responses.
This is very interesting. Thanks for sharing this. It would be nicer, I guess, if LangChain could do the same.
My question is: does it increase throughput by freeing up memory to hold more batches? But then how does it achieve the speed-up in latency?
I’m surprised the bottleneck was due to memory inefficiency in the attention mechanism and not the volume of matrix multiplications.
Can this package be used with quantized 4-bit models? I don't see any support for them in the docs..
No, I don't think it will work with that.
Can you run this on AWS SageMaker too? Does it also work with the Llama 2 models with 7 and 13 billion parameters?
@Sam Witteveen do you know by any chance how it compares to other recent techniques for speeding up models? I don't remember exactly, but sometimes it is just a setting, a parameter nobody used until somebody shared it, as well as other techniques. Also, if you happen to know, which ones are better suited for Falcon, Llama, etc.?
For many of the options I have looked at, this compares well for the models it works with, etc.
So can I use this with models downloaded directly from Hugging Face?
Context: in my office setup I can only use model weights that have been downloaded separately.
Yes, totally - the Colab I show was downloading a model from Hugging Face. Not all of the LLMs are compatible, but most of the popular ones are.
@@samwitteveenai In my office setup these models cannot be downloaded (blocked), so I download them separately and use their weights via Hugging Face pipelines as the LLM for LangChain and other use cases.
Will try a similar approach for vLLM, hoping that it works.
@@navneetkrc Yes, totally - you will just need to load it locally, etc.
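Something like this minimal sketch should work (the local path is a placeholder; `LLM(model=...)` accepts a local directory as well as a Hub id):

```python
# Minimal sketch: load weights from a local directory instead of pulling from the Hub.
# The path below is a placeholder for wherever you copied the model files.
from vllm import LLM, SamplingParams

llm = LLM(model="/data/models/vicuna-7b-v1.3")  # local dir with config, tokenizer, weights
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```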
@@samwitteveenai thanks a lot for the quick replies. You are the best 🤗
Where is the model comparison made in terms of execution time wrt HuggingFace?
It looks like vLLM itself is CUDA-only, but I wonder if these techniques could apply to CPU-based runtimes like llama.cpp? Presumably any improvements wouldn't be as dramatic if the bottleneck is processing cycles rather than memory.
Thanks for this video. Although I should mention that, at least on my RTX 3090 Ti, the GPTQ 13B models with the exllama loader are absolutely flying. Faster than ChatGPT-3.5 Turbo.
But I'll definitely take a look
It should be noted that for whatever reason it does not work with CUDA 12.x (yet).
My guess is it's just because their setup is not using that yet, and it will come. I actually just checked my Colab and that seems to be running CUDA 12.0, but maybe that is not optimal.
This looks very useful.
Thank you for this nugget. Very useful information for speeding up inference. Does it support the bitsandbytes library for loading in 8-bit or 4-bit?
Edit: noticed there is no Falcon support.
AFAIK I don't think they are supporting bitsandbytes etc., which doesn't surprise me, as what they are mainly using it for is comparing models, and low-resolution quantization is not ideal for that.
I'm wondering how vLLM compares against conversion to ONNX (e.g. with Optimum) in terms of speed and ease of use. I'm struggling a bit with ONNX 😅
Does ONNX have streaming ability? I can't see any mention of WebSocket or HTTP/2.
@@s0ckpupp3t Not that I know of. I converted bloom-560 to ONNX and got similar latency to vLLM. I guess with ONNX one could optimise it a bit further, but I'm impressed by vLLM because it's much easier to use.
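For anyone curious, roughly what that conversion could look like with Optimum (assuming the commenter means bigscience/bloom-560m; the export flag differs between Optimum versions):

```python
# Rough sketch of the Optimum/ONNX route for comparison.
# Depending on your Optimum version the flag is `export=True` or `from_transformers=True`.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)  # converts to ONNX on the fly

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```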
Not sure, since they compared with HF Transformers, and HF doesn't use Flash Attention to my knowledge, so it is quite slow by default.
They also compared to TGI, which does have Flash Attention (huggingface.co/text-generation-inference), and vLLM is still quite a bit faster.
I am wondering if it works with Hugging Face 8-bit and 4-bit quantization.
If you are talking about bitsandbytes, I don't think it does just yet.
Could you show how to add any Hugging Face model to vLLM? Also, the above Colab isn't working.
vLLM is great but lacks support for some models (and some are still buggy, like mpt-30b with streaming, but MPT was added like 2 days ago, so expect that to be fixed soon). For example, there's little chance it will support Falcon-40B soon. In that case, use huggingface/text-generation-inference, which can load Falcon-40B in 8-bit flawlessly!
Yes, none of these are flawless. I might make a video about hosting with HF text-generation-inference as well.
Great Great Great
Is this usable as a model in LangChain for tool use?
You can use it as an LLM in LangChain. Whether it will work with tools will depend on which model you serve, etc.
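One way to wire it up is via the OpenAI-compatible server and LangChain's OpenAI wrapper - a sketch, assuming the server from the earlier example is running on localhost:8000, and noting that parameter names can differ between LangChain versions:

```python
# Sketch: use a vLLM OpenAI-compatible server as the LLM behind LangChain.
# Assumes the server from the earlier sketch is running on localhost:8000;
# parameter names may vary between LangChain versions.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = OpenAI(
    openai_api_key="EMPTY",                      # local server ignores the key
    openai_api_base="http://localhost:8000/v1",  # vLLM endpoint
    model_name="lmsys/vicuna-7b-v1.3",           # must match the served model
)

prompt = PromptTemplate.from_template("Summarize in one sentence: {text}")
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(text="PagedAttention allocates KV cache in small blocks on demand."))
```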
@@samwitteveenai I assume it doesn't support quants? I don't see any mention.
Now I wonder if it is possible to launch this on a CPU.
Some models will work tolerably.
Is this possible with LangChain and a GUI?
It doesn't work... damn it. I don't want to use Docker to make this work, so I'm stuck.
What model are you trying to get to work? Also, it doesn't support quantized models, if that's what you are trying.
@@samwitteveenai Hi Sam, thanks for sharing (a life-saver for newbies). What would you recommend for quantized models?
What about data privacy?
You are running it on a machine you control. What are the privacy issues?
@@samwitteveenai I thought that it was cloud-based 🎩
Can this be used with GGML models?
No, so far it is for full-resolution models only.
still very slow
It doesn't work on Windows, folks - trash.
They have a Docker image. That's what I'm using right now.