It's so sad you abandoned your channel. Your explanations are gems
You really put a lot of your time and effort into these highly informative videos. Thank you so much
this is exactly the level of explanation that I need being able to pick up on key concepts and dive deeper in other ways at my own pace. keep it up!
I sincerely appreciate your willingness to share the results of your research and understanding!
Thanks for publishing this. I am glad someone is breaking it down, as I have been talking over people's heads quite a lot about this for the last three weeks.
Thanks for the love. If you're doing the deep dive you should definitely touch on PEFT and how block windowing is achieved in GPTQ's kernel code with transformation matrices and diagonals. This is how we're able to define block size and make training 65B+ models possible by loading only what's being worked on into VRAM as a transform block and freezing the rest of the weights. The HuggingFace docs lovingly touch on PEFT in the transformers library. Great strides have been taken to make this accessible to the everyday person. Occ34n's fork contains the kernel code with PEFT for the LLaMA variants by Mosaic. I had a wet-math moment when I dug through it, since I work with transformation matrices in 3D graphics acceleration. I didn't think you could do that to tensors, but you can once they've been quantized. :D Credit to him for thinking outside the box and making training happen on CPU only and such. He forked GPTQ and did absolutely magic things to it. :)
I did a quick video on LoRA PEFT, already! Though, I did intentionally keep the rank decomposition on the LoRA matrices a bit higher level and only discussed attaching them to the Feedforward layer.
I think a very technical deep dive series would be a lot of fun, though I'm still trying to find a balance between technical depth and keeping the videos generally consumable. It's challenging to find the balance that engages software engineers like you and me but can also be enjoyed by enthusiasts.
Thanks for the comment and thanks for watching!
@AemonAlgiz of course. Glad someone's able to get it out in a digestible fashion.
Bonus: get yourself signed up for Microsoft Build if you haven't already. They will be granting Copilot X and GPT-4 plugin access to RSVPs. I'm running this stuff in Azure. They're going to be discussing a lot of AI news, handing out MCA subscription credits, and all that good stuff to play around with.
Really nice. As someone who barely knows how matrices and such work, you made these quantization concepts easy to understand.
Thanks for the clear and concise explanation, it was perfect.
Your intelligence is impressive as it compensates for my lack of understanding 😅, but thanks to your articulate explanations, I believe I'm grasping it. I'm grateful to you for imparting such incredible content.
Nice explanation, thank you!!
This is ridiculously well explained and easy to understand for someone only beginning to explore this rabbit hole. Whatever motivates you to keep making these videos, I hope it continues. I'm gonna go ahead and check out the rest of your library. I also hope you continue to explain concepts around the subject of these models. Thank you.
Thank you! I'm glad it was helpful and I am definitely here to stay!
@@AemonAlgiz 👀 No activity on github since June either. Really appreciate this video and hope you make your way back soon. Thanks!
Fantastic explanation and great tutorial! Hoping this channel grows a lot in the future!
this channel is a gold mine
Thank you so much!
@@AemonAlgiz I should be thanking you, lol. This is wonderful education, explained really well.
Thanks for all the effort that went into making this video. Very informative indeed.
Dude keep these great videos up. We appreciate you
Wow, really good explanation. The part about encoding the 16-bit floats as 8-bit integers by scaling is pretty intuitive, but the process of redistributing the error so that small values are less likely to be lost is mind-blowing. I didn't expect it to work, but if it's being implemented right now, it's because it does.
It took me a bit to realize that's why the inverse Hessian was there! It blew my mind when I realized it.
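For anyone who wants the scaling step spelled out in code: below is a minimal NumPy sketch of absmax quantization with a naive left-to-right error-feedback pass. The function name and the greedy neighbour hand-off are illustrative only; the actual GPTQ update spreads the error using the inverse Hessian discussed above.

```python
import numpy as np

def quantize_row_with_error_feedback(row, bits=8):
    """Absmax-quantize one weight row, pushing each rounding error onto the next weight."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(row).max() / qmax           # float value of one integer step
    residual = row.astype(np.float64).copy()
    q = np.zeros(len(row), dtype=np.int32)
    for i in range(len(residual)):
        q[i] = int(np.clip(np.rint(residual[i] / scale), -qmax, qmax))
        err = residual[i] - q[i] * scale       # rounding error for this weight
        if i + 1 < len(residual):
            residual[i + 1] += err             # hand the error to the next weight
    return q, scale

w = np.array([0.31, -0.02, 0.007, -0.29], dtype=np.float32)
q, s = quantize_row_with_error_feedback(w)
print(q)                                       # integer codes
print(q * s)                                   # dequantized approximation of w
```

Dequantizing is just q * scale; carrying the rounding error forward is what keeps the small weights from being silently rounded away.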
Thank you so much for simplifying this to such extent. Subscribed
What’s your view on bitsandbytes NF4 versus GPTQ for quantisation?
Great video as always. Thanks for sharing your knowledge.
Thanks, Jonathon!
this was a great explanation, thank you
Underrated channel. You, sir, deserve more subs.
P.S.: Could you do the same for GGML? And if you already did, a playlist with GGML, GPTQ, LoRA, QLoRA, 4-bit vs 8-bit, and performance based on parameter count (3B, 7B, etc.) would be nice to have. A lot of channels cover the model as a whole, but most of them never cover the process behind the models. Your video was easy to follow and helped me understand the basics behind LLM quantization. Keep it up.
Thank you! I will be covering GGML in the video after the next one! I think it’s an incredibly powerful tool.
Amazing, loved it
Wow this is a great video
this is so cool!
Thanks. Where can I find the model 'lmsys_vicuna-7b-delta-v1.1' that you mentioned in your demonstration?
Thank you for this nice introduction to GPTQ. Can you also explain how these quantized parameters are finally run on the GPU? I am more interested in the inference process: what types of variables and operations are used on the GPU, whether the quantized params are dequantized before use or used in their quantized state, and how the scaling factors are saved and restored.
This is a great question! If you want to dig in, AutoGPTQ's CUDA kernels are a great example of how these values are cached and used.
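As a rough picture of the inference side: the usual pattern is to store the weights as low-bit integers plus per-row (or per-group) scales and zero points, then reconstruct float values right before the matmul. This is only a hedged PyTorch sketch of that idea; the shapes, names, and per-row grouping are illustrative and do not mirror AutoGPTQ's actual packed kernel layout.

```python
import torch

def dequantize(q: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor) -> torch.Tensor:
    """Recover float weights from integer codes: w ~ (q - zero) * scale.
    q:     (out_features, in_features) integer codes
    scale: (out_features, 1)           per-row scale factors
    zero:  (out_features, 1)           per-row zero points
    """
    return (q.float() - zero) * scale

# Toy example: a 4x8 matrix of 4-bit-range codes with per-row scales.
q = torch.randint(-7, 8, (4, 8))
scale = torch.rand(4, 1) * 0.05
zero = torch.zeros(4, 1)

x = torch.randn(2, 8)                  # a batch of activations
w = dequantize(q, scale, zero)         # dequantize just before the matmul
y = x @ w.t()                          # ordinary float GEMM
print(y.shape)                         # torch.Size([2, 4])
```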
Very helpful! Also, if you could sync your voice to the video more precisely, it would improve the overall quality.
Thanks a lot for this explanation. How can you even out the errors via the next weights when you do not know in advance what activation values the weights will be multiplied with?
Great explanation.
Some questions:
When we are quantising and computing the quantisation loss, do we not need to supply some data for it to compute the loss against? If not, how exactly is this loss computed? (Surely we need some inputs and expected outputs to compute this loss; is this why all of the weight errors were 0 when you quantised?)
If we do, could this be interpreted as a form of post-training quantisation 'fine-tuning'? By this I mean that we could use domain data in the quantisation process to help preserve the emergent features in the model that are most useful for our specific domain dataset?
Thanks!
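On the data question raised above: GPTQ is post-training, but it does use a small calibration set. The samples are run through the model and each layer is quantized to minimize the change in that layer's own output, ||WX - W_qX||^2, so the data only supplies activation statistics (which is where the Hessian proxy XX^T comes from); no labels or gradient updates are involved. A toy sketch with made-up shapes:

```python
import torch

torch.manual_seed(0)
X = torch.randn(16, 128)        # calibration activations feeding one linear layer
W = torch.randn(8, 16)          # that layer's full-precision weights

def naive_quantize(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale) * scale    # round-trip through the integer grid

W_q = naive_quantize(W)
dW = W - W_q

loss_direct  = ((dW @ X) ** 2).sum()                    # change in the layer's output
loss_hessian = torch.trace(dW @ (X @ X.t()) @ dW.t())   # same number, via the Hessian proxy
print(loss_direct.item(), loss_hessian.item())
```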
Thanks. How to run the converted model?
At 1:45, where you defined the range of values for the 8-bit zero-point quantization: the way I understand it, we only use 8 bits to store our weight values, so that gives us an interval of 256 values. So wouldn't it be [-128, 127] instead of [-127, 127]?
Hey! Thank you so much for the video! I wanted to ask: what exact role does the dataset play while quantizing? The code you showed uses wikitext2 as the dataset for quantization.
I am very much looking forward to your response!
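For context on the mechanics (hedged, since the exact settings in the video's script may differ): the wikitext2 text is just tokenized and sliced into fixed-length calibration samples that get run through the model so each layer's input activations can be recorded during quantization; it is never used for gradient updates. A rough sketch of that preparation step, with an arbitrary tokenizer, sample count, and sequence length:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # stand-in tokenizer for the sketch
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Concatenate a chunk of the text and slice it into fixed-length calibration samples.
text = "\n\n".join(t for t in raw["text"][:2000] if t.strip())
ids = tokenizer(text, return_tensors="pt").input_ids[0]

seq_len, n_samples = 512, 32                             # arbitrary choices for illustration
calibration = [ids[i * seq_len:(i + 1) * seq_len].unsqueeze(0)
               for i in range(n_samples)]
print(len(calibration), calibration[0].shape)            # 32 samples of shape (1, 512)
```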
Do you know what the weight distribution looks like for LM transformers? For convolutional neural networks the weight distribution tends to be roughly Gaussian/Laplacian, meaning there are many smaller weights and increasingly fewer larger ones. This has implications for the compressibility of said weights and more.
I suspect the weight distribution follows some normal curve, since we can detect outlier features. There is also a new paper, Hyena, which describes a way to find what they call "Hyena matrices" for the weights, where the weights can be diagonalized. So we may be looking at O(n*ln(n)) computational (time and memory) complexity alongside 100k+ token contexts very soon.
Paper: arxiv.org/pdf/2302.10866v3.pdf
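If anyone wants to eyeball the distribution empirically, it only takes a few lines to summarize a transformer layer's weights; gpt2 and the specific layer below are arbitrary choices, picked only because the model is small:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")           # small model, quick to inspect
w = model.transformer.h[0].mlp.c_fc.weight.detach().flatten()  # one MLP projection's weights

print(f"mean={w.mean().item():.4f}  std={w.std().item():.4f}  max|w|={w.abs().max().item():.3f}")
hist = torch.histc(w, bins=21, min=float(w.min()), max=float(w.max()))
print(hist)   # most of the mass sits near zero, with long thin tails (outliers)
```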
@@AemonAlgiz Thanks!
Do you divide by the largest number to get the scaling factor, or do you divide by the modulus?
For zero-point you use the largest value, though there are other techniques like binning.
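For anyone wanting to see the zero-point variant concretely, here is a small NumPy sketch (illustrative names, mirroring the general technique rather than any particular library): the observed [min, max] range is mapped onto the unsigned integer grid, and a stored zero point records which integer represents 0.0.

```python
import numpy as np

def zeropoint_quantize(w: np.ndarray, bits: int = 8):
    """Asymmetric quantization: map [w.min(), w.max()] onto [0, 2**bits - 1]."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero = round(-w.min() / scale)                       # integer that represents 0.0
    q = np.clip(np.round(w / scale) + zero, qmin, qmax).astype(np.int32)
    return q, scale, zero

def zeropoint_dequantize(q, scale, zero):
    return (q - zero) * scale

w = np.array([-0.4, -0.1, 0.0, 0.25, 0.9], dtype=np.float32)
q, s, z = zeropoint_quantize(w)
print(q, zeropoint_dequantize(q, s, z))                  # close to the original values
```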
I tried to duplicate this. Everything seems fine up to the point when I run it and get killed by the system. My GPU is an A2000 8 GB; I guess you need at least 14 GB. I tried to reduce 32 bits to 16 but it got killed on that too. Any ideas?
You wrote -127 to 127. 8-bit integers are -128 to 127. The quantization method you described loses slightly more precision than necessary.
One point I am not clear on: you divide by 127, which corresponds to 8 bits, but then you say you're going to do 4-bit quantization. Do you actually divide the matrices by 63 for 4 bits, or still by 127?
What is the significance of 127?
It's the maximum of the signed 8-bit integer range.
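To make the divisor question above concrete: for the symmetric signed scheme described in the video, the divisor is 2^(bits-1) - 1, which is 127 at 8 bits and 7 at 4 bits (63 would correspond to 7 bits). A tiny sketch:

```python
def symmetric_qmax(bits: int) -> int:
    """Largest positive level on a signed b-bit integer grid: 2**(bits-1) - 1."""
    return 2 ** (bits - 1) - 1

for bits in (8, 4, 3):
    print(f"{bits}-bit -> divide by {symmetric_qmax(bits)}")   # 127, 7, 3

def scale_for(row, bits):
    """Absmax scale for one row at a given bit width."""
    return max(abs(x) for x in row) / symmetric_qmax(bits)

print(scale_for([0.5, -0.2, 0.1], 8))   # abs-max / 127
print(scale_for([0.5, -0.2, 0.1], 4))   # abs-max / 7, not 63
```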
OMG the maths! 🤣
It took me a whole day to realize why the inverse Hessian was there, haha.
@@AemonAlgiz Thank God for Mr. Wolfram... 🤣
@@HostileRespite I didn’t even think of checking wolfram for something on it! Do they have some documentation on it?
My PhD is in Physics, so I tend to just stare at papers until I figure it out haha
@@AemonAlgiz REALLY? Respect! Ex nuclear munitions tech here. Glorified torque wrench twister, nothing so glamorous as you, but I know more of the... uh... impractical side... of your studies. I went into it as a dumb kid who loved science and came out a lot wiser, but I still love science. Got interested in zero-point energy back in the day before RL sidetracked me. Anyway, enjoy your stuff! I should have figured you were a bit like me; birds of a feather stubbornly die-hard together... when someone has already done the math for us. 🤣
Thanks. You speak fast; do you mind slowing down a little bit? The background sound needs to be removed. Also, please zoom in on your code-writing sections. It is impossible to see what you write.
Hey there! Which sections are difficult for you to see? This isn’t a complaint I’ve gotten before, I set the font pretty large.
Edit: Rewatched this one and I did keep the font too small. Sorry, this was a mistake on this one!
I’m really digging your mathematical explanations! Keep it up and subscribed for the mafs!🦾🤓
Thanks, the one coming out today is pretty math heavy!