Serve a Custom LLM for Over 100 Customers

  • Published on Nov 16, 2024

Comments • 54

  • @xdrap1
    @xdrap1 11 months ago +23

    Great video. Almost no one talks about how to create a server and API with a customizable LLM. I'd love to see more videos on this. Your channel is awesome.

    • @93cutty
      @93cutty 11 months ago

      just started watching, excited for the vid!

  • @Paris-vz6uv
    @Paris-vz6uv 8 months ago +2

    This is exactly what I needed; everyone else only covers the super-basic aspects, so it's good to see someone going beyond that.
    Please keep it coming. Thank you again :)

  • @wood6454
    @wood6454 10 months ago +5

    Thank you for this comprehensive and easy-to-understand guide. I will be serving an LLM for my friends.

  • @hickam16
    @hickam16 11 months ago +5

    One of the best and most comprehensive explanations, thank you!

  • @cprashanthreddy
    @cprashanthreddy 10 months ago +2

    Good Video. Explains most of the parameters required to deploy the solution. Thank you. :)

  • @dimknaf
    @dimknaf 11 months ago +2

    Very nice explanations in all the videos I've seen. I subscribed. Keep up the good work!

  • @omarzidan6840
    @omarzidan6840 10 months ago +1

    We love you man! Good job. I was lost, but now I really understand everything.

  • @PunitPandey
    @PunitPandey 11 months ago +4

    Excellent video. Very useful.

  • @hadebeh2588
    @hadebeh2588 6 months ago +1

    Thank you so much for your video and all the great content you put out. Your channel is a gold mine of knowledge.

  • @WinsonDabbles
    @WinsonDabbles 11 months ago +2

    Very, very, very comprehensive and detailed explanation! Could I request a video on calculating how much VRAM is required when trying to fine-tune Mistral, for example?

    • @TrelisResearch
      @TrelisResearch  11 months ago +5

      Well, Mistral is a 7B model; Mixtral is about 45B.
      To fine-tune in 16-bit (bf16), you need at least 2x the model size in VRAM (because 16 bits is two bytes per parameter). So you need about 14 GB or 90 GB. In practice, for Mixtral, you probably need 2x A6000 or 2x A100.
      Now, you can fine-tune with QLoRA (see my earlier vid on that). Actually, you'll notice in this video that there is a line that is commented out when loading the model. If you comment it back in (bitsandbytes nf4), you can cut the VRAM by roughly 3x. So now you could train Mistral on about 5 GB of VRAM, or Mixtral on about 30 GB.
      Last thing: you need some VRAM for activations, which depends on the sequence length you're training with. Maybe add 20% to the VRAM as a buffer.
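
A rough, illustrative sketch of the rule of thumb in the reply above (two bytes per parameter in bf16, roughly a 3x cut with bitsandbytes nf4, plus ~20% headroom for activations); the helper function and numbers are back-of-the-envelope estimates, not measurements:

```python
def estimate_finetune_vram_gb(params_billion: float, qlora_nf4: bool = False, buffer: float = 0.20) -> float:
    """Rough VRAM estimate for fine-tuning, following the rule of thumb above."""
    vram_gb = params_billion * 2.0      # bf16: two bytes per parameter
    if qlora_nf4:
        vram_gb /= 3.0                  # loading in 4-bit (bitsandbytes nf4) cuts VRAM by roughly 3x
    return vram_gb * (1.0 + buffer)     # headroom for activations, which grow with sequence length

print(estimate_finetune_vram_gb(7))                    # Mistral 7B, bf16: ~17 GB
print(estimate_finetune_vram_gb(45, qlora_nf4=True))   # Mixtral ~45B, QLoRA nf4: ~36 GB
```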

    • @WinsonDabbles
      @WinsonDabbles 11 months ago +1

      @@TrelisResearch thank you for the detailed response! Do you have a discord community or something?

    • @TrelisResearch
      @TrelisResearch  11 months ago +2

      @@WinsonDabbles no Discord community. I just use YouTube as the public forum and then offer a paid lifetime membership (and scripts) to an Inference and also a Fine-tuning repo. There are quite a few members that post issues there. There's more info on Trelis.com

    • @WinsonDabbles
      @WinsonDabbles 11 months ago +1

      @@TrelisResearch thank you for your insights and responses! Very helpful and much appreciated!

  • @efexzium
    @efexzium 11 months ago +1

    Fantastic video sir, very informative.

  • @sherpya
    @sherpya 9 months ago

    You can pipe curl API calls that return JSON to the jq utility to colorize/format the output.

  • @danieldemillard9412
    @danieldemillard9412 10 months ago +3

    Have you explored serverless on RunPod? It seems like this would be a good way of minimizing idle time and saving costs in production, as you would only pay for what your customers are actually using. This might bring costs closer to a per-token calculation and be competitive with OpenAI. I think for single concurrent requests it is still much more expensive than OpenAI, but I'm curious about the economics of saturating a serverless GPU server and only paying for when it is active (scale down to 0). It would be great to see a video on this, as well as on what the impact is on startup latency for the overall API call. I have worked with non-GPU serverless and usually it only adds a couple of seconds to go from 0 to 1 instance. I would also be curious how many parallel requests one of these could handle.

    • @TrelisResearch
      @TrelisResearch  10 months ago

      Many thanks, that's a solid idea and I'm going to think through how to make a vid on it.
      Yes, serverless is about 4x more expensive per second, but downtime on a full GPU is very problematic so you're absolutely right.

    • @danieldemillard9412
      @danieldemillard9412 10 months ago

      @@TrelisResearch Fantastic, looking forward to watching it! I'm also probably going to buy your inference repo, so I would love some starter code for this. My specific use case would be running Mixtral in 8-bit precision using RunPod's "48 GB GPU" option, but I can work through a general case too. One other thing I am curious about is how you might pre-bake the models into the Docker image so that the load times are reasonable, since downloading the model every time in serverless is a no-go. 80+ GB seems like a pretty massive Docker image, but they must have figured out how to make that efficient with their "Quick Deploy" models.

    • @TrelisResearch
      @TrelisResearch  10 months ago

      @@danieldemillard9412 yeah, I'm going to dig in on the serverless options

  • @johnade-ojo2917
    @johnade-ojo2917 4 months ago

    Fantastic stuff. Thank you!

  • @jonathanvandenberg3571
    @jonathanvandenberg3571 11 months ago +1

    Great content!

  • @alchemication
    @alchemication 10 months ago +1

    This is super cool!!
    I've tried to check the performance when using a RunPod template with multiple GPUs, but adding the flag `--gpus all` to the docker command as per the vLLM docs did not work. Did you try running even more requests across N GPUs?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      RunPod makes it hard to add flags without updating the image. Can you use a TGI template instead, from here: github.com/TrelisResearch/one-click-llms
      It's faster and supports multi-GPU out of the box.
      If you really want vLLM, I have the `--gpus` flag set on the Vast.AI one-click template.

  • @pavankumarmantha
    @pavankumarmantha 3 months ago

    Awesome tutorial.

  • @sania3631
    @sania3631 7 months ago

    Thank you! Outstanding video, bro!

  • @davidfa7363
    @davidfa7363 5 months ago

    Hi. Very nice explanation. It works pretty well as long as you just want to ask questions from the model's knowledge base. I am wondering: is it possible to use this solution for a RAG system? Can you answer me that?

    • @TrelisResearch
      @TrelisResearch  5 months ago

      Yes! You can.
      This is just an endpoint that you can make queries to.
      When you do RAG, you're just including extra context within the prompt. The prompt (now with the RAG context) would be sent to the endpoint.
      So:
      -- No, this endpoint won't automatically take in prompts AND documents and do RAG for you, but
      -- Yes, if you are doing RAG on your server or locally, you can send those augmented prompts in via the client.
      If you want an endpoint that takes in both documents and prompts, you need to build out a more advanced server (quite a bit more advanced, as you need to allow for document uploads/handling too).
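
A minimal sketch of that flow, assuming the server exposes an OpenAI-compatible chat endpoint (as vLLM does); the base URL, model name, and `retrieved_context` below are placeholders for your own setup:

```python
from openai import OpenAI

# Placeholder URL/model: point these at whatever your vLLM (or TGI OpenAI-style) server is serving.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

retrieved_context = "...chunks returned by your own retriever..."  # the RAG step happens on your side
question = "What does the warranty cover?"

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        # The endpoint just sees a normal prompt; the retrieved context is simply included in it.
        {"role": "user", "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```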

    • @davidfa7363
      @davidfa7363 5 months ago

      @@TrelisResearch I got that, but somehow when I am passing the prompt and the context, the model seems not to understand it properly. Why is that?

    • @TrelisResearch
      @TrelisResearch  5 months ago

      @@davidfa7363 ahh, I wonder if it's because you are using a TGI endpoint.
      TGI requires you to apply the chat template prior to submitting the messages.
      By contrast, vLLM (or TGI, if you hit the OpenAI-style endpoint; see the TGI docs) will apply the chat template to the array of messages you submit.
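
For illustration, a minimal sketch of applying the chat template yourself before calling TGI's raw `/generate` route (the model name and URL are placeholders; the OpenAI-style route does this step for you):

```python
import requests
from transformers import AutoTokenizer

# Placeholder model: use the same model your TGI container is actually serving.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "Context:\n...retrieved chunks...\n\nQuestion: ..."},
]
# TGI's raw /generate endpoint expects an already-templated prompt string.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

resp = requests.post(
    "http://localhost:8080/generate",  # default TGI port; adjust for your deployment
    json={"inputs": prompt, "parameters": {"max_new_tokens": 256}},
)
print(resp.json()["generated_text"])
```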

    • @TrelisResearch
      @TrelisResearch  4 months ago

      @@davidfa7363 hmm, hard for me to say without seeing the exact simple prompt and then the prompt with the context. I wonder if the prompt with context is breaking the syntax or something.

  • @LinkSF1
    @LinkSF1 11 months ago +2

    Do you have some kind of lifetime membership? I've become a fan and want to continue following you as you create more content and tutorials.

    • @TrelisResearch
      @TrelisResearch  11 months ago

      Howdy! A few options:
      - Free option (Trelis LLM Updates Newsletter - get on it at Trelis.Substack.com)
      - Advanced fine-tuning repo (it's a lifetime membership to the fine-tuning scripts I make and regularly update). trelis.com/advanced-fine-tuning-scripts/
      - Advanced inference repo (again, a lifetime membership that includes inference scripts). trelis.com/enterprise-server-api-and-inference-guide/
      Access to either of those repos also allows you to create Issues to get some support.

  • @paolo-e-basta
    @paolo-e-basta 11 months ago +1

    The video content I was looking for, very nice. However, the repo is not accessible anymore.

    • @TrelisResearch
      @TrelisResearch  11 months ago +1

      Howdy. The repo is private so you won’t see it until after the purchase completes. See the top of the description for the link.

  • @AlexBerg1
    @AlexBerg1 11 months ago +2

    The economies of scale for a company like OpenAI, which specializes in efficiently serving a single, general-purpose model, make its serving so cheap. That makes serving a tuned model so much more expensive by comparison, which is unfortunate.

    • @AlexBerg1
      @AlexBerg1 11 months ago +1

      Actually, I see Anyscale has seemingly affordable fine-tuning on a Llama 2 base model.
      "Fine Tuning is billed at a fixed cost of $5 per run and $/million-tokens. For example, a fine tuning job of Llama-2-13b-chat-hf with 10M tokens would cost $5 + $2x10 = $25. Querying the fine-tuned models is billed on a $/million-tokens basis."

    • @TrelisResearch
      @TrelisResearch  11 months ago +2

      Yup, even though this vid is about serving custom models, I felt I had to say (twice) that in most cases it's best to use openai/gemini.
      That said:
      a) If your business has a lot of customers, then you also benefit from economies of scale on serving (once you're doing parallel requests, you can start getting towards good economics).
      b) If you have a high-value use case for a custom model, then it's not a problem paying $0.1/hr or $0.5/hr for your own GPU.

  • @OptaIgin
    @OptaIgin 7 months ago

    Question: if, say, I change servers or migrate to a different provider, are all the session info and logs gone?

    • @TrelisResearch
      @TrelisResearch  7 months ago

      Best options are to:
      1. Push session info and logs elsewhere while running.
      2. Use a persistent volume (either from RunPod or by connecting up another service). I believe you can connect up most cloud services as your data volume.

  • @eric-theodore-cartman6151
    @eric-theodore-cartman6151 8 months ago

    I am looking for something extremely cheap and somewhat fast. A natural-language-to-SQL project. Hardly 30-40 concurrent users, less than 100 visitors a day. What do you suggest?

    • @TrelisResearch
      @TrelisResearch  8 months ago +1

      OpenChat 3.5 7B model! You can run it on an A10 on Vast.AI. Check out one-click-llms on the Trelis GitHub.

  • @jonathancat
    @jonathancat 10 months ago

    Hey, what about Google Cloud?

    • @TrelisResearch
      @TrelisResearch  10 months ago +1

      Initially I tried out Google Cloud, AWS, and Azure, and it was really hard and expensive to get GPU access.
      That could be wrong now. Do you have experience with them? What's the hourly price of an A6000 on demand?

  • @fkxfkx
    @fkxfkx 11 months ago

    Why can't any of these YouTubers afford a decent haircut?

    • @jonathanvandenberg3571
      @jonathanvandenberg3571 11 months ago

      ??

    • @MMABeijing
      @MMABeijing 11 months ago

      Because none of the people talking about AI care about your opinion. Who would have guessed?

    • @anglikai9517
      @anglikai9517 11 months ago +1

      If they thought a haircut was important, they wouldn't be intelligent enough for AI stuff.

    • @TrelisResearch
      @TrelisResearch  11 months ago +1

      @@anglikai9517 🤣

  • @MrQuicast
    @MrQuicast 7 months ago

    great video