Amazing content! Most YouTube tutorials just try out the outputs of pre-made LLMs but rarely dive into this level of technical detail.
Thanks!
Thanks, didn't want to feel too unproductive on Thanksgiving but didn't want to commit to a full video series. Always releasing timely and great stuff!
Amazing content! I watched the entire video to understand the big picture, and now I will do a step-by-step "play and code" PoC of the methods for different scenarios, as proposed.
Thanks, Maarten! I was searching for how to quantize exactly zephyr-7b-beta, and I realized you used it halfway through the video!
Thanks a lot for clarifying the main differences between quantization methods and also for sharing your code.
Thanks for including the Colab, and I wasn't aware of AWQ before this video.
Would you consider making a video on the efficiency of each, especially when using a GPU with a GGUF model?
Outstanding! (To put things in perspective: I've seen a LOT of praise for wrapping the obvious or marketing-only BS into lengthy videos and I'm not shy to speak my mind there too!)
I was struggling with quantization last weekend! Very timely, thanks!
Well hey, the day you spoke of, when we have 1B-param models, is here. Llama3.2-1B is incredible for its size.
Thanks for the video! What I don't understand is that people always say AWQ is faster than GPTQ, but on my 3060 12GB AWQ models are usually quite slow, around 3 t/s, while with GPTQ I can get from 5 to 20 t/s.
Useful information and well made video.
Thank you for the informative video. I now understand the huge mistake I made using GGUF when I had the VRAM to run primarily on the GPU.
Thanks, Maarten. I wish you could share a performance comparison between the different methods; I have been trying to find one but couldn't. I do know that AWQ is better than GPTQ, but I wish to compare it to GGUF.
"RuntimeError: cutlassF: no kernel found to launch!" is what I'm getting when trying to run your example at step 4:

    outputs = pipe(
        prompt,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.1,
        top_p=0.95
    )
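In case it's useful: a possible workaround, assuming the error comes from PyTorch's scaled-dot-product attention selecting a flash/memory-efficient kernel that the T4 GPU cannot run, is to fall back to the plain math kernel before re-running the cell. A minimal sketch (pipe and prompt are the objects from the notebook):

    import torch

    # Assumption: the cutlassF error is raised by PyTorch's SDPA backends.
    # Disabling the flash and memory-efficient kernels forces the math kernel,
    # which runs on older GPUs such as the T4.
    torch.backends.cuda.enable_flash_sdp(False)
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    torch.backends.cuda.enable_math_sdp(True)

    # Then re-run the generation step as before.
    outputs = pipe(
        prompt,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.1,
        top_p=0.95,
    )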
Thanks for this video - it was a great explanation of the differences between the three methods. How's the support for AWQ now? Also, I would love it if you could make a video on how to deploy these quantized models to production.
Thank you so much for the video. I would like to know which method is faster at inference time.
Hi, which one is faster?
GGUF with CUDA or AWQ?
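Since the answer depends heavily on the hardware, one way to check is a rough tokens-per-second measurement on your own machine. A minimal sketch, assuming pipe is a transformers text-generation pipeline already built for the quantized model (as in the notebook) and the prompt string is only an example:

    import time

    def tokens_per_second(pipe, prompt, max_new_tokens=128):
        """Rough throughput estimate for a text-generation pipeline."""
        start = time.perf_counter()
        out = pipe(prompt, max_new_tokens=max_new_tokens, do_sample=False)
        elapsed = time.perf_counter() - start
        # generated_text contains the prompt as a prefix; strip it so only
        # the newly generated tokens are counted.
        generated = out[0]["generated_text"][len(prompt):]
        n_new_tokens = len(pipe.tokenizer(generated)["input_ids"])
        return n_new_tokens / elapsed

    print(f"{tokens_per_second(pipe, 'Explain quantization in one paragraph.'):.1f} tokens/s")

Running the same function against a GGUF-, GPTQ-, and AWQ-backed pipeline gives a like-for-like comparison on your own GPU.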
Really enjoyed your video. It was very informative. Just wanted to know: can fine-tuning be done on these pre-quantized models?
Really enjoyed this session! Any chance you can continue this by showing how to fine-tune these versions of the models?
Great video and great comparisons. Can you make a video on how to quantize a model oneself as well?
Inference is too slow on the T4 GPU on Colab. I fed a line of football commentary transcript text to the pre-quantized LLM, and it took 3 minutes to get the result.
Thank you for the differences and the code.
Great content! I am wondering whether nowadays we should choose LLMs over BERT models for most tasks, or use them separately based on specific use cases. That could be an interesting topic to discuss!
Thank you for the great explanations :)
Does it make sense to do this before training? Quantize the model with these techniques before doing PEFT, QLoRA, p-tuning, etc.?
It definitely helps with training if the full model does not fit on the GPU. With many of these methods efficiency is important, and quantization is almost always used.
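For context, the common pattern here is the QLoRA-style recipe: load the frozen base model quantized on the fly with bitsandbytes (note this is not one of the GPTQ/GGUF/AWQ file formats from the video) and attach small trainable LoRA adapters on top. A minimal sketch; the model name and LoRA settings are only placeholder choices:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_id = "HuggingFaceH4/zephyr-7b-beta"  # example model used in the video

    # Load the frozen base model in 4-bit; this is the quantization step.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Add small trainable LoRA adapters on top of the quantized weights.
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

The quantized base stays frozen, so only the adapter weights are trained, which is what makes fine-tuning fit on a single consumer GPU.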
My understanding was that GPTQ is recommended when the quantized model can fit entirely in VRAM, and that with GGUF one can still offload layers to a GPU. If I wanted to try the Mixtral Dolphin ~40GB version (on a system with a 4080 16GB VRAM and 64GB RAM), which would be the better choice, GPTQ or GGUF?
Definitely GGUF. Mixtral 4-bit doesn't fit within 16GB VRAM, so offloading layers would be necessary. I remember you could offload around 20 layers, if I'm not mistaken. I do think the quantised variants, including the Dolphin fine-tune, are worth checking out.
@@MaartenGrootendorst Thanks!
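For anyone trying this, a minimal offloading sketch with llama-cpp-python; the GGUF filename and the layer count are placeholders to tune for your own 16GB of VRAM:

    from llama_cpp import Llama

    # Load a 4-bit GGUF quant of the Mixtral Dolphin fine-tune and push only
    # part of the network to the GPU; the remaining layers run on CPU/RAM.
    llm = Llama(
        model_path="dolphin-mixtral-8x7b.Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=20,  # roughly what fits in 16GB VRAM; adjust as needed
        n_ctx=4096,
    )

    output = llm("Summarize the difference between GPTQ, GGUF, and AWQ.", max_tokens=128)
    print(output["choices"][0]["text"])

Setting n_gpu_layers=0 keeps everything on the CPU, while larger values push more layers onto the GPU until the VRAM runs out.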
Small note: bfloat16 is a 1 (sign) + 8 (exponent) + 7 (mantissa) bit format.
Great video. However, it's quite frustrating trying to run this code in production; the dependencies are never correct.
Great video❤
Amazing!
Hi Maarten, you are my only hope :)! I have an Alienware m18 R2 with an Intel i9-14900HX, NVIDIA RTX 4090 (24GB), 64GB RAM, and 8TB storage, but I would like to try to run LLaMA 70B models. Since no one on TH-cam has yet covered this topic (they only seem to cater to the 8B models), could you possibly create a step-by-step video for users like me on optimizing setups (8-bit quantization, mixed precision, etc.) to run large models efficiently? Your help would be greatly appreciated. Fingers crossed and many thanks!
@@theuniversityofthemind6347 With 24GB of VRAM, I don't think you can fit an entire 70B model unless you quantize down to 3 bits, which would also hurt performance significantly. Instead, you could offload some layers to the CPU with a GGUF model, but that would result in a slower model.
@@MaartenGrootendorst This is really sad news :( Do you think it's still worth a try to make it fit using your kindly supplied method anyway? And if I do, what can I expect?
@@theuniversityofthemind6347 I believe you can still expect a couple of tokens per second, which might be enough for your use case. That said, there are many sub-70B models that still perform well. Definitely worth looking into.
@@MaartenGrootendorst For extra information, I don't plan to use this for intensive tasks like model training or other heavy computing; I will mainly be using it for analysing my business documents and writing 20-minute elaborate stories based on a five-step story structure. I wanted to use the 70B model to generate the best possible results for these smaller, less intensive tasks. Does this change things at all?
@@theuniversityofthemind6347 Not much since you will need to load the model in the first place. Lower bits or offloading layers are the main options.
What about ExLlamaV2?
Brilliant video; you have a style that explains things nicely. Thank you. Sub'd.
If you are looking for ideas, I think an overview of what "weights, biases and parameters" mean for models would be great.
Good one..
Thanks
FYI, someone may steal your colab: th-cam.com/video/dbNcKnj6H5Q/w-d-xo.html
@@qizheng5594 Thanks! I'll definitely reach out to see if I can get proper credit here.
Well, what about the title of the video? I still don't know which one is right for me, which was the point of watching this video. All you did was explain what each method is, not which one we should use :s
You have good information in this video, but you still missed the entire point of the title, in my opinion.
Well, never mind. I paused the video because it was about to finish and move on to the next one in the playlist, and I guess you do talk about it at the end, my bad. That's what I get for not finishing the video before coming in with stupid questions.
@@kiiikoooPT Well, I came here for similar reasons, and although this is indeed very well explained, I still miss something I was hoping for (kind of like you): which quant works best for which mode? I read the other day that, for example, GPTQ is less suitable for RP than EXL2. Why? I don't know; that's why I came here. Do the modes chat, cai, and instruct have their preferred quantizations? If so, why?
I'm leaving without those answers but with further base knowledge, which is more important in the long run. But hey, there is one answer I was interested in: why do some models assume both the character and the user role and progress the RP story themselves instead of letting me guide it? Because now I know that's sort of typical of the AWQ quant.
From the little I understand, the fact that a model's prompt format differs from model to model has nothing to do with it being AWQ or GGUF or whatever type of file; that is just about loading the model with different loaders. What you are talking about is another topic: you need to watch videos about something like Mistral vs. Mistral Instruct, or the difference between those, because what you are describing has to do with how the model learned things, or how it was trained. The instruct models, like the name says, are based on instructions, so your prompt cannot be set up as a role play, because it will give step-by-step answers instead of a simple conversation. What I really wanted to know is what kind of file is better for low-end hardware, since my laptop has an NVIDIA graphics card with CUDA cores but is so old that there aren't even up-to-date drivers for it, so I can't install PyTorch with CUDA, only CPU mode. I thought the file types (GGUF and so on) had different ways of loading the models, so I could manage to load one without the libraries everyone is using because they have recent hardware. @@CitizUnReal