Boost Your AI Predictions: Maximize Speed with vLLM Library for Large Language Model Inference

  • Published on Aug 5, 2024
  • Discover vLLM, UC Berkeley's open-source library for fast LLM inference, featuring the PagedAttention algorithm for up to 24x higher throughput than HuggingFace Transformers. We'll compare vLLM and HuggingFace using the Llama 2 7B model, and learn how to easily integrate vLLM into your projects (see the usage sketch after the chapter list below).
    vLLM page: blog.vllm.ai/2023/06/20/vllm....
    Discord: / discord
    Prepare for the Machine Learning interview: mlexpert.io
    Subscribe: bit.ly/venelin-subscribe
    GitHub repository: github.com/curiousily/Get-Thi...
    Join this channel to get access to the perks and support my work:
    / @venelin_valkov
    00:00 - What is vLLM?
    03:27 - vLLM Quickstart
    04:58 - Google Colab Setup (with Llama 2)
    07:19 - Single Example Inference Comparison
    08:57 - Batch Inference Comparison
    10:29 - Conclusion
    #artificialintelligence #llm #mlops #llama2 #chatbot #promptengineering #python
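
    A minimal sketch of the kind of vLLM integration covered in the video, assuming vLLM is installed (pip install vllm), a GPU runtime, and access to the meta-llama/Llama-2-7b-hf checkpoint (the exact repo id here is an assumption):

        from vllm import LLM, SamplingParams

        # Load the model once; vLLM manages the KV cache with PagedAttention.
        llm = LLM(model="meta-llama/Llama-2-7b-hf")  # assumed checkpoint id

        sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

        # generate() accepts a list of prompts, so the same call covers both
        # single-example and batch inference.
        prompts = [
            "What is PagedAttention?",
            "Explain LLM inference throughput in one sentence.",
        ]
        outputs = llm.generate(prompts, sampling_params)

        for output in outputs:
            print(output.prompt)
            print(output.outputs[0].text)

    Passing a whole list of prompts to generate() is the batched pattern that the throughput comparison against HuggingFace Transformers builds on.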

Comments • 3

  • @thevadimb 8 months ago +3

    Is there a way to load quantized models using vLLM?

  • @AliAlias 6 months ago +2

    Awesome ❤
    How do I run LLMs that are already downloaded to my disk?

  • @Gerald-iz7mv 4 months ago

    Did you try TensorRT-LLM with the Triton backend?