vLLM on Kubernetes in Production

  • Published May 16, 2024
  • vLLM is a fast and easy-to-use library for LLM inference and serving. In this video, we go through the basics of vLLM, how to run it locally, and then how to run it on Kubernetes in production with GPU-attached nodes via a DaemonSet. It includes a hands-on demo explaining vLLM deployment in production. (A minimal local-run sketch in Python follows the links below.)
    Blog post: opensauced.pizza/blog/how-we-...
    John McBride(‪@JohnCodes‬)
    ►►►Connect with me ►►►
    ► Kubesimplify: kubesimplify.com/newsletter
    ► Newsletter: saiyampathak.com/newsletter
    ► Discord: saiyampathak.com/discord
    ► Twitch: saiyampathak.com/twitch
    ► YouTube: saiyampathak.com/youtube.com
    ► GitHub: github.com/saiyam1814
    ► LinkedIn: / saiyampathak
    ► Website: / saiyampathak
    ► Instagram: / saiyampathak
  • Science & Technology
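
As a companion to the "run it locally" part of the description, here is a minimal sketch of vLLM's offline Python API. It assumes a machine with a CUDA-capable GPU and vLLM installed (`pip install vllm`); the model name and prompt are illustrative placeholders, not the ones used in the demo.

```python
# Minimal local vLLM run (sketch). Assumes `pip install vllm` and a CUDA GPU;
# the model name and prompt below are illustrative placeholders.
from vllm import LLM, SamplingParams

# Load a small open model; vLLM downloads the weights from Hugging Face on first run.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Explain what a Kubernetes DaemonSet is in one sentence."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

In production, the video instead runs vLLM as a long-lived server on GPU-attached nodes via a DaemonSet and sends it requests over HTTP, but the offline API above is the quickest way to verify a model runs on your hardware.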

Comments • 10

  • @aireddy
    @aireddy 14 days ago +1

    This is an absolutely wonderful session for understanding how we can deploy LLMs in production on a Kubernetes cluster!!

    • @kubesimplify
      @kubesimplify  14 days ago

      @aireddy Glad it was helpful!

  • @JohnCodes
    @JohnCodes 2 months ago +3

    Thanks for having me on, Saiyam!! It was a lot of fun to show you how we use vLLM at OpenSauced!! Happy to answer any questions people might have here!

  • @DaewonSuh
    @DaewonSuh 2 days ago

    Thanks for the wonderful demo!
    I was wondering why you deploy the vLLM pod through a DaemonSet rather than a Deployment.
    With a DaemonSet, you can only deploy one pod per node, and each pod occupies a single GPU.
    Considering that nodes usually have multiple GPUs attached, I am afraid that using a DaemonSet might leave a lot of GPUs idle.

  • @shivangsharma1
    @shivangsharma1 7 days ago +1

    Loved it...❤

    • @kubesimplify
      @kubesimplify  5 days ago

      Glad you found it useful!

  • @umeshjaiswal5298
    @umeshjaiswal5298 2 months ago

    Thanks for this tutorial, Saiyam.

    • @kubesimplify
      @kubesimplify  2 months ago

      Glad it's useful. Are you building something with LLMs?

  • @divyamchandel8734
    @divyamchandel8734 a month ago

    Hi John / Saiyam. In the last part you mentioned that "in a lot of cases it could be cheaper."
    What are the cases where hosting it locally is cheaper versus when using OpenAI is cheaper?
    Is it just dependent on the load we will have (RPD and max RPM)?

    • @matrix9083
      @matrix9083 15 days ago

      OpenAI charges $0.50 per million tokens for GPT-3.5, for example. If you rent a GPU server for that same amount, you can generate tens or hundreds of millions of tokens in an hour, depending on which text generation model you choose. Something like Mistral 7B, the Phi-3 series, Llama 3 8B, Gemma 2B, etc. all deliver about the same results as GPT-3.5, if not better, and all fit on a GPU server that costs 44 cents per hour on RunPod (the A5000 server, for example).
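
For a rough sense of the break-even point implied by the two prices quoted in that reply, the sketch below works out the throughput a self-hosted GPU would need just to match OpenAI's per-token price. It is a back-of-the-envelope calculation, not a measured benchmark; real costs also depend on idle time, batching, and ops overhead.

```python
# Back-of-the-envelope break-even using the prices quoted in the comment above
# (illustrative only; real throughput depends on the model, GPU, and batch size).
openai_price_per_million_tokens = 0.50   # USD per 1M tokens (GPT-3.5 figure)
gpu_price_per_hour = 0.44                # USD per hour (RunPod A5000 figure)

# Tokens per hour the self-hosted model must sustain to match OpenAI's price:
break_even_tokens_per_hour = gpu_price_per_hour / openai_price_per_million_tokens * 1_000_000
print(f"{break_even_tokens_per_hour:,.0f} tokens/hour "
      f"(~{break_even_tokens_per_hour / 3600:.0f} tokens/second)")
# -> 880,000 tokens/hour (~244 tokens/second); sustained throughput above that
#    makes the rented GPU cheaper per token, ignoring idle time.
```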