Boost Your AI Predictions: Maximize Speed with vLLM Library for Large Language Model Inference
- Published on Aug 5, 2024
- Discover vLLM, UC Berkeley's open-source library for fast LLM inference, featuring the PagedAttention algorithm that delivers up to 24x higher throughput than HuggingFace Transformers. We'll compare vLLM and HuggingFace Transformers using the Llama 2 7B model and learn how to easily integrate vLLM into your projects.
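For reference, here is a minimal sketch of vLLM's offline inference API, roughly what the quickstart chapter covers. The prompts and sampling settings are illustrative, and the Llama 2 checkpoint name assumes you have access to the gated model on Hugging Face:

```python
# Minimal vLLM offline inference sketch (prompts/settings are assumptions).
from vllm import LLM, SamplingParams

prompts = [
    "What is PagedAttention?",
    "Explain KV-cache memory fragmentation in one sentence.",
]

# Sampling configuration; tweak temperature/top_p/max_tokens to taste.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# vLLM loads the model once and batches requests internally.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```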
vLLM page: blog.vllm.ai/2023/06/20/vllm....
Discord: / discord
Prepare for the Machine Learning interview: mlexpert.io
Subscribe: bit.ly/venelin-subscribe
GitHub repository: github.com/curiousily/Get-Thi...
Join this channel to get access to the perks and support my work:
/ @venelin_valkov
00:00 - What is vLLM?
03:27 - vLLM Quickstart
04:58 - Google Colab Setup (with Llama 2)
07:19 - Single Example Inference Comparison
08:57 - Batch Inference Comparison
10:29 - Conclusion
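As a companion to the single-example and batch comparison chapters above, here is a rough sketch of how a HuggingFace Transformers baseline could be timed for such a comparison. The batch size, prompts, and generation settings are assumptions, not the video's exact setup:

```python
# Rough HF Transformers timing baseline (batch/prompts are assumptions).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation

# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompts = ["What is PagedAttention?"] * 8  # small illustrative batch
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

# Throughput estimate: new tokens per sequence times batch size, over wall time.
new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * out.shape[0]
print(f"~{new_tokens / elapsed:.1f} tokens/s")
```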
#artificialintelligence #llm #mlops #llama2 #chatbot #promptengineering #python
Is there a way to load quantized models using vLLM?
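Recent vLLM versions do support quantized checkpoints via the `quantization` argument (AWQ support landed first, with other formats added later). A minimal sketch, assuming an AWQ-quantized community checkpoint:

```python
# Sketch: loading an AWQ-quantized model in vLLM (checkpoint name illustrative).
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    quantization="awq",
    dtype="half",  # AWQ kernels expect fp16 activations
)
```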
How can I run LLMs that are already downloaded to my disk?
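vLLM accepts a local directory in place of a Hugging Face model ID, as long as it contains a HF-format checkpoint (config.json, tokenizer files, and the weight shards). A sketch, with a hypothetical path:

```python
# Sketch: pointing vLLM at a model already on disk (path is hypothetical).
from vllm import LLM

llm = LLM(model="/path/to/your/local/llama-2-7b")
```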
Did you try TensorRT-LLM with the Triton backend?