Great video. Almost no one talks about how to create a server and API with a customizable LLM. I'd love to see more videos on this. Your channel is awesome.
just started watching, excited for the vid!
This is exactly what I needed; everyone else just covers the super basic aspects. So good to see someone going beyond.
Please keep it coming. Thank you again :)
Thank you for this comprehensive and easy-to-understand guide. I will be serving an LLM for my friends.
Yes, many thanks
One of the best and most comprehensive explanations, thank you!
Good Video. Explains most of the parameters required to deploy the solution. Thank you. :)
Very nice explanation in all the videos I've seen. I subscribed. Keep up the good work!
We love you, man! Good job. I was lost, but now I really understand everything.
Excellent video. Very useful.
Thank you so much for your video and all the great content you put out. Your channel is a gold mine of knowledge.
Very, very comprehensive and detailed explanation! Could I request a video on calculating how much VRAM is required when fine-tuning Mistral, for example?
Well, Mistral is a 7B model; Mixtral is about 45B parameters.
To fine-tune in 16-bit (bf16), you need at least 2 bytes of VRAM per parameter for the weights (because 16 bits is two bytes). So you need about 14 GB for Mistral or 90 GB for Mixtral. In practice, for Mixtral, you probably need 2x A6000 or 2x A100.
Now, you can fine-tune with QLoRA (see my earlier vid on that). You'll actually notice in this video that there is a line in the model-loading code that is commented out. If you comment it back in (bitsandbytes nf4), you can cut the VRAM requirement by roughly 3x. So now you could train Mistral on about 5 GB of VRAM, or Mixtral on about 30 GB.
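For illustration, loading with bitsandbytes nf4 looks roughly like this (just a sketch; the model name is an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# nf4 quantization via bitsandbytes: weights stored in 4-bit, compute in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # example model
    quantization_config=bnb_config,
    device_map="auto",
)
```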
Last thing: you need some extra VRAM for activations, which depends on the sequence length you're training with. Maybe add 20% to the VRAM as a buffer.
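Putting those numbers together, here's a rough back-of-the-envelope estimate (illustrative only; the real figure depends on optimizer, LoRA rank, batch size, and sequence length):

```python
def estimate_finetune_vram_gb(params_billions: float, four_bit: bool = False, buffer: float = 0.20) -> float:
    """Rough estimate: 2 bytes per parameter in bf16, roughly 3x less with nf4,
    plus ~20% buffer for activations at typical sequence lengths."""
    weights_gb = params_billions * 2      # bf16: 2 bytes per parameter
    if four_bit:
        weights_gb /= 3                   # bitsandbytes nf4 cuts this roughly 3x
    return weights_gb * (1 + buffer)

print(estimate_finetune_vram_gb(7))                   # Mistral, bf16 -> ~17 GB
print(estimate_finetune_vram_gb(45))                  # Mixtral, bf16 -> ~108 GB
print(estimate_finetune_vram_gb(7, four_bit=True))    # Mistral, nf4  -> ~6 GB
print(estimate_finetune_vram_gb(45, four_bit=True))   # Mixtral, nf4  -> ~36 GB
```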
@@TrelisResearch thank you for the detailed response! Do you have a discord community or something?
@@WinsonDabbles no Discord community. I just use YouTube as the public forum and then offer a paid lifetime membership (with scripts) to an Inference repo and a Fine-tuning repo. There are quite a few members who post issues there. There's more info on Trelis.com
@@TrelisResearch thank you for your insights and responses! Very helpful and much appreciated!
Fantastic video sir, very informative.
thank you sir
You can pipe curl API calls that return JSON into the jq utility to colorize/format the output.
Have you explored serverless on RunPod? It seems like this would be a good way of minimizing idle time and saving costs in production, since you would only pay for what your customers are actually using. This might bring costs closer to a per-token calculation and be competitive with OpenAI. I think for single concurrent requests it is still much more expensive than OpenAI, but I'm curious about the economics of saturating a serverless GPU server and only paying for when it is active (scale down to 0). It would be great to see a video on this, as well as what the impact is on startup latency for the overall API call. I have worked with non-GPU serverless and usually it only adds a couple of seconds to go from 0 to 1 instance. I would also be curious how many parallel requests one of these could handle.
Many thanks, that's a solid idea and I'm going to think through how to make a vid on it.
Yes, serverless is about 4x more expensive per active second, but idle time on a dedicated GPU is very costly, so you're absolutely right.
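As a very rough rule of thumb (using that ~4x figure and ignoring cold-start latency and per-request overhead), serverless only wins when a dedicated GPU would sit below roughly 25% utilization:

```python
# Rough break-even sketch; the prices are placeholders, not quotes from any provider.
dedicated_per_hour = 0.79                        # e.g. an A6000-class GPU, $/hr (placeholder)
serverless_per_active_hour = 4 * dedicated_per_hour

for utilization in (0.05, 0.25, 0.50):
    serverless_cost = serverless_per_active_hour * utilization
    print(f"{utilization:.0%} busy: dedicated ${dedicated_per_hour:.2f}/hr vs serverless ${serverless_cost:.2f}/hr")
```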
@@TrelisResearch Fantastic, looking forward to watching it! I'm also probably going to buy your inference repo, so I would love some starter code for this. My specific use case would be running Mixtral in 8-bit precision using RunPod's "48 GB GPU" option, but I can work through a general case too. One other thing I am curious about is how you might pre-bake the models into the Docker image so that the load times are reasonable, since downloading the model every time in serverless is a no-go. 80+ GB seems like a pretty massive Docker image, but they must have figured out how to make that efficient with their "Quick Deploy" models.
@@danieldemillard9412 Yeah, I'm going to dig in on the serverless options.
Fantastic stuff. Thank you!
Great content!
This is super cool !!
I've tried to check the performance when using a RunPod template with multiple GPUs, but adding the `--gpus all` flag to the docker command as per the vLLM docs did not work. Did you try running even more requests across N GPUs?
RunPod makes it hard to add flags without updating the image. Could you use a TGI template instead, from here: github.com/TrelisResearch/one-click-llms
It’s faster and supports multi GPU out of the box.
If you really want vLLM, I have the gpus flag set on the Vast.ai one-click template.
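And if you end up launching vLLM yourself rather than via a template, multi-GPU is handled with tensor parallelism. A minimal sketch (the model name and GPU count are just examples):

```python
from vllm import LLM, SamplingParams

# Tensor parallelism splits the model across GPUs; set this to the number of GPUs on the machine.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```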
Awesome tutorial.
Thank you! Outstanding video, bro!
Hi. Very nice explanation. It works pretty well when you just want to ask questions from the model's own knowledge base. I am wondering whether it is possible to use this solution for a RAG system. Can you answer me that?
Yes! You can.
This is just an endpoint that you can send queries to.
When you do RAG, you're just including extra context within the prompt. The prompt (now with RAG) would be sent to the endpoint.
So:
-- No, this endpoint won't automatically take in prompts AND documents and do RAG for you, but
-- Yes, if you are doing the retrieval on your server or locally, you can send those augmented prompts in via the client.
If you want an endpoint that takes in both documents and prompts, you need to build out a more advanced server (quite a bit more advanced, as you need to allow for document uploads/handling too).
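To make the flow concrete, here's a minimal sketch of RAG against an OpenAI-style endpoint (the URL, model name, and retriever are placeholders for whatever you're actually running):

```python
import requests

def answer_with_rag(question: str, retrieve) -> str:
    # 'retrieve' stands in for your retrieval step (vector DB, BM25, ...);
    # it just needs to return a string of context for the question.
    context = retrieve(question)
    messages = [{
        "role": "user",
        "content": f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}",
    }]
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # placeholder: your OpenAI-style endpoint
        json={"model": "mistralai/Mistral-7B-Instruct-v0.2", "messages": messages, "max_tokens": 300},
        timeout=60,
    )
    return resp.json()["choices"][0]["message"]["content"]
```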
@@TrelisResearch I got that, but somehow when I am passing the prompt and the context, the model doesn't seem to understand it properly. Why is that?
@@davidfa7363 ahh, I wonder if it's because you are using a TGI endpoint.
TGI requires you to apply the chat template prior to submitting the messages.
By contrast, vLLM (or TGI, if you hit the OpenAI-style endpoint; see the TGI docs) will apply the chat template to the array of messages you submit.
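For example, with TGI's native /generate route you'd apply the template yourself first. A rough sketch (the endpoint URL and model name are placeholders):

```python
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example model

messages = [{"role": "user", "content": "Using the context below, answer the question...\n\n<your context and question here>"}]

# Apply the model's chat template yourself before hitting TGI's native /generate route.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

resp = requests.post(
    "http://localhost:8080/generate",  # placeholder: your TGI endpoint
    json={"inputs": prompt, "parameters": {"max_new_tokens": 300}},
    timeout=60,
)
print(resp.json()["generated_text"])
```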
@@davidfa7363 hmm, hard for me to say without seeing the exact plain prompt and then the prompt with context. I wonder if the prompt with context is breaking the syntax or something.
Do you have some kind of lifetime membership? I've become a fan and want to continue following you as you create more content and tutorials.
Howdy! A few options:
- Free option (Trelis LLM Updates Newsletter - get on it at Trelis.Substack.com)
- Advanced fine-tuning repo (it's a lifetime membership to the fine-tuning scripts I make and regularly update). trelis.com/advanced-fine-tuning-scripts/
- Advanced inference repo (again, a lifetime membership that includes inference scripts). trelis.com/enterprise-server-api-and-inference-guide/
Access to either of those repos also allows you to create Issues to get some support.
The video content I was looking for, very nice. However, the repo is not accessible anymore.
Howdy. The repo is private so you won’t see it until after the purchase completes. See the top of the description for the link.
Economies of scale make serving so cheap for a company like OpenAI, which specializes in efficiently serving a single, general-purpose model. It makes serving a fine-tuned model of your own look so much more expensive; it's unfortunate.
Actually, I see Anyscale has seemingly affordable fine-tuning on a Llama 2 base model.
"Fine Tuning is billed at a fixed cost of $5 per run and $/million-tokens. For example, a fine tuning job of Llama-2-13b-chat-hf with 10M tokens would cost $5 + $2x10 = $25. Querying the fine-tuned models is billed on a $/million-tokens basis."
Yup, even though this vid is about serving custom models, I felt I had to say it (twice) that in most cases it's best to use OpenAI/Gemini.
That said:
a) If your business has a lot of customers, then you also benefit from economies of scale on serving (once you're doing parallel requests you can start getting towards good economics).
b) If you have a high-value use case for a custom model, then it's not a problem paying $0.10/hr or $0.50/hr for your own GPU.
Question: if, say, I change servers or migrate to a different provider, are all the session info and logs gone?
Best options are to:
1. Push session info and logs elsewhere while running.
2. Use a persistent volume (either from runpod or by connecting up another service). I believe you can connect up most cloud services as your data volume.
I am looking for something extremely cheap and somewhat fast, for a natural-language-to-SQL project. Hardly 30-40 concurrent users, fewer than 100 visitors a day. What do you suggest?
OpenChat 3.5, a 7B model! You can run it on an A10 on Vast.ai. Check out one-click-llms on the Trelis GitHub.
Hey, what about Google Cloud?
Initially I tried out Google Cloud, AWS, and Azure, and it was really hard and expensive to get GPU access.
That could have changed by now. Do you have experience with them? What's the hourly price of an A6000 on demand?
Why can't any of these YouTubers afford a decent haircut?
??
Because none of the people talking about AI care about your opinion. Who would have guessed?
If they thought a haircut was important, they wouldn't be intelligent enough for AI stuff.
@@anglikai9517 🤣
great video