Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training

  • Published on 1 Jul 2024
  • In this video I will introduce and explain quantization: we will start with a short introduction to the numerical representation of integers and floating-point numbers in computers, then see what quantization is and how it works. I will explore topics like Asymmetric and Symmetric Quantization, Quantization Range, Quantization Granularity, Dynamic and Static Quantization, Post-Training Quantization and Quantization-Aware Training. (A short Python sketch of asymmetric vs. symmetric quantization follows this description.)
    Code: github.com/hkproj/quantizatio...
    PDF slides: github.com/hkproj/quantizatio...
    Chapters
    00:00 - Introduction
    01:10 - What is quantization?
    03:42 - Integer representation
    07:25 - Floating-point representation
    09:16 - Quantization (details)
    13:50 - Asymmetric vs Symmetric Quantization
    15:38 - Asymmetric Quantization
    18:34 - Symmetric Quantization
    20:57 - Asymmetric vs Symmetric Quantization (Python Code)
    24:16 - Dynamic Quantization & Calibration
    27:57 - Multiply-Accumulate Block
    30:05 - Range selection strategies
    34:40 - Quantization granularity
    35:49 - Post-Training Quantization
    43:05 - Quantization-Aware Training
  • Science & Technology
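    As a quick companion to the description above, here is a minimal sketch of the two quantization schemes covered in the video (my own illustration, not the code from the linked repository), mapping a float32 tensor to 8 bits and back:

    ```python
    import numpy as np

    x = np.array([-1.7, -0.3, 0.0, 0.9, 4.2], dtype=np.float32)

    # Asymmetric quantization: map [min, max] onto [0, 255] using a scale and a zero-point.
    scale_a = (float(x.max()) - float(x.min())) / 255
    zero_point = round(-float(x.min()) / scale_a)
    x_q_asym = np.clip(np.round(x / scale_a) + zero_point, 0, 255).astype(np.uint8)
    x_deq_asym = (x_q_asym.astype(np.float32) - zero_point) * scale_a

    # Symmetric quantization: map [-max|x|, +max|x|] onto [-127, 127], zero-point fixed at 0.
    scale_s = float(np.abs(x).max()) / 127
    x_q_sym = np.clip(np.round(x / scale_s), -127, 127).astype(np.int8)
    x_deq_sym = x_q_sym.astype(np.float32) * scale_s

    print(x_q_asym, x_deq_asym)  # quantized values and the (lossy) reconstruction
    print(x_q_sym, x_deq_sym)
    ```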

Comments • 72

  • @zendr0
    @zendr0 6 months ago +32

    If you are not aware, let me tell you: you are helping a generation of ML practitioners learn all this for free. Huge respect to you, Umar. Thank you for all your hard work ❤

    • @savvysuraj
      @savvysuraj 4 months ago

      The content made by Umar is helping me a lot. Kudos to Umar.

  • @vik2189
    @vik2189 2 months ago +3

    Fantastic video! Probably the best 50 minutes I've spent on AI-related concepts in the past year or so.

  • @dariovicenzo8139
    @dariovicenzo8139 2 months ago +3

    Great job, in particular the examples covering the conversion to/from integers not only with formulas but with actual numbers too!

  • @ankush4617
    @ankush4617 6 months ago +10

    I keep hearing about quantization so much; this is the first time I have seen someone go so deep into this topic and come up with such clear explanations! Keep up all your great work, you are a gem to the AI community!!
    I'm hoping that you will have a video on Mixtral MoE soon 😊

    • @umarjamilai
      @umarjamilai  6 months ago

      You read my mind about Mistral. Stay tuned! 😺

    • @ankush4617
      @ankush4617 6 months ago

      @umarjamilai ❤

  • @user-rk5mk7jm7r
    @user-rk5mk7jm7r 5 months ago +1

    Thanks a lot for the fantastic tutorial. Looking forward to more videos in the series on LLM quantization! 👏

  • @jiahaosu
    @jiahaosu 5 months ago +1

    The best video about quantization, thank you very much!!!! It really helps!

  • @myaseena
    @myaseena 6 months ago +1

    Really high-quality exposition. Also, thanks for providing the slides.

  • @AbdennacerAyeb
    @AbdennacerAyeb 6 months ago +4

    Keep Going. This is perfect. Thank you for the effort you are making

  • @asra1kumar
    @asra1kumar 3 months ago +1

    This channel features exceptional lectures, and the quality of explanation is truly outstanding. 👌

  • @Aaron-hs4gj
    @Aaron-hs4gj 3 months ago +1

    Excellent explanation, very intuitive. Thanks so much! ❤

  • @user-qo7vr3ml4c
    @user-qo7vr3ml4c 1 month ago +1

    Thank you for the great content, especially the explanation of how QAT aims for a wider (flatter) minimum of the loss function, and how that makes the model robust to errors introduced by quantization. Thank you.

  • @mandarinboy
    @mandarinboy 5 months ago

    Great introductory video! Looking forward to GPTQ and AWQ

  • @user-lg3jo6ih1t
    @user-lg3jo6ih1t 3 months ago +1

    I was searching for quantization basics and could not find relevant videos... this is a life-saver!! Thanks, and please keep up the amazing work!

  • @user-td8vz8cn1h
    @user-td8vz8cn1h 3 months ago +1

    This is one of the few channels I have subscribed to after watching a single video. Your content is very easy to follow, and you cover topics holistically with additional clarifications. What a man!

  • @jaymn5318
    @jaymn5318 4 months ago +1

    Great lecture. A clean explanation of the field that gives an excellent perspective on these technical topics. Love your lectures. Thanks!

  • @krystofjakubek9376
    @krystofjakubek9376 6 months ago +7

    Great video!
    Just a clarification: on modern processors, floating-point operations are NOT slower than integer operations. It very much depends on the exact processor, and even then the difference is usually extremely small compared to the other overheads of executing the code.
    HOWEVER, the reduction in size from a 32-bit float to an 8-bit integer does by itself make operations a lot faster. The cause is twofold:
    1) Modern CPUs and GPUs are typically memory-bound, so, simply put, if we reduce the amount of data the processor needs to load by 4x, we expect the time the processor spends waiting for the next set of data to shrink by roughly 4x as well.
    2) Pretty much all machine learning code is vectorized. This means that instead of executing each instruction on a single number, the processor grabs N numbers and executes the instruction on all of them at once (SIMD instructions).
    However, most processors don't fix N directly; instead, they fix the total number of bits all N numbers occupy (for example, AVX2 can operate on 256 bits at a time), so if we go from 32 bits to 8 bits we can process 4x more numbers per instruction! This is likely what you mean by operations being faster.
    Note that CPUs and GPUs are very similar in this regard, only GPUs have many more SIMD lanes (many more bits). (See the sketch after this thread.)

    • @umarjamilai
      @umarjamilai  6 months ago +2

      Thanks for the clarification! I was even going to talk about the internal hardware of adders (the carry-lookahead adder) to show how a simple operation like addition works and compare it with the many steps required for floating-point numbers (which also involve normalization). Your explanation nailed it! Thanks again!
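      To put rough numbers on point 1) of this thread, here is a small illustrative snippet (not from the video): quantizing a float32 weight matrix to int8 with a symmetric per-tensor scale cuts the bytes that have to travel through the memory hierarchy by 4x.

      ```python
      import numpy as np

      w_fp32 = np.random.randn(1024, 1024).astype(np.float32)

      # Symmetric per-tensor quantization to int8 (illustrative scale choice).
      scale = float(np.abs(w_fp32).max()) / 127
      w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

      print(w_fp32.nbytes // 1024, "KiB as float32")  # 4096 KiB
      print(w_int8.nbytes // 1024, "KiB as int8")     # 1024 KiB: 4x less data to move
      ```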

  • @HeyFaheem
    @HeyFaheem 6 months ago +1

    You are a hidden gem, my brother

  • @RaviPrakash-dz9fm
    @RaviPrakash-dz9fm 1 month ago +1

    Legendary content!!

  • @NJCLM
    @NJCLM 5 months ago +1

    Great video ! Thank you !!

  • @sebastientetaud7485
    @sebastientetaud7485 4 months ago +1

    Excellent video! Grazie (thank you)!

  • @koushikkumardey882
    @koushikkumardey882 6 months ago

    becoming a big fan of your work!!

  • @ojay666
    @ojay666 3 months ago +1

    Fantastic tutorial!!!👍👍👍I’m hoping that you will post a tutorial on model pruning soon🤩

  • @manishsharma2211
    @manishsharma2211 6 months ago

    beautiful again, thanks for sharing these

  • @bluecup25
    @bluecup25 6 months ago +1

    Thank you, super clear

  • @Youngzeez1
    @Youngzeez1 6 months ago +1

    Wow, what an eye-opener! I read lots of research papers but they are mostly confusing; your explanation just opened my eyes! Thank you. Could you please do a video on the quantization of vision transformers for object detection?

  • @ngmson
    @ngmson 6 months ago +1

    Thank you for sharing.

  • @aminamoudjar4561
    @aminamoudjar4561 6 months ago +1

    Very helpful, thank you so much.

  • @user-pe3mt1td6y
    @user-pe3mt1td6y 4 months ago

    We need more videos about advanced quantization!

  • @TheEldadcohen
    @TheEldadcohen 5 months ago

    Umar, I've seen many of your videos and you are a great teacher! Thank you for your effort in explaining all of these complicated topics in plain (Italian-accented) English.
    Regarding the content of the video: you showed quantization-aware training and were surprised by the worse result it gave compared to post-training quantization in the concrete example you made.
    I think it is because you calibrated the post-training quantization on the same data that you tested on, so the learned parameters (alpha, beta) are overfitted to the test data; that's why the accuracy was better. I think that if you had tested on truly held-out data, you would probably have seen the result you anticipated.

  • @andrewchen7710
    @andrewchen7710 5 months ago +2

    Umar, I've watched your videos on Llama, Mistral, and now quantization. They're absolutely brilliant and I've shared your channel with my colleagues. If you're in Shanghai, allow me to buy you a meal haha!
    I'm curious about your research process. During the preparation of your next video, I think it would be neat if you documented the timeline of your research/learning and shared it with us in a separate video!

    • @umarjamilai
      @umarjamilai  5 months ago +1

      Hi Andrew! Connect with me on LinkedIn and we can share our WeChat. Have a nice day!

    • @Patrick-wn6uj
      @Patrick-wn6uj 3 months ago

      Glad to see fellow Shanghai people here hhhhhhh

  • @amitshukla1495
    @amitshukla1495 6 months ago +1

    wohooo ❤

  • @user-kg9zs1xh3u
    @user-kg9zs1xh3u 6 months ago +1

    Very good

  • @ziyadmuhammad3734
    @ziyadmuhammad3734 29 days ago

    Thanks!

  • @tetnojj2483
    @tetnojj2483 5 months ago

    Nice video :) A video on the .gguf file format for models would be very interesting :)

  • @asra1kumar
    @asra1kumar 3 months ago

    Thanks

  • @lukeskywalker7029
    @lukeskywalker7029 3 months ago

    @Umar Jamil you said most embedded devices don't support floating-point operations at all? Is that right? What would be an example, and what is that chip architecture called? Does a Raspberry Pi or an Arduino operate only on integer operations internally?

  • @tubercn
    @tubercn 6 months ago

    Thanks, great video 🐱‍🏍🐱‍🏍
    But I have a question: since we already dequantize the output of the last layer using the calibration parameters, why do we need another "torch.quantization.DeQuantStub()" layer in the model to dequantize the output? It seems we end up with two consecutive dequantization steps.
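    For context on where the stubs sit in the PyTorch eager-mode workflow, here is a minimal, generic post-training static quantization sketch (my own example, not the exact model from the video). As I understand the API, the DeQuantStub is the single point where the int8 output is converted back to float using the scale/zero-point collected during calibration, so there is only one dequantization of the final output.

    ```python
    import torch
    import torch.nn as nn

    class M(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.quantization.QuantStub()      # float input -> int8
            self.fc = nn.Linear(16, 10)
            self.dequant = torch.quantization.DeQuantStub()  # int8 output -> float

        def forward(self, x):
            return self.dequant(self.fc(self.quant(x)))

    model = M().eval()
    model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
    torch.quantization.prepare(model, inplace=True)   # attach observers
    model(torch.randn(32, 16))                        # calibration pass on representative data
    torch.quantization.convert(model, inplace=True)   # swap in int8 kernels
    print(model)
    ```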

  • @user-hd7xp1qg3j
    @user-hd7xp1qg3j 6 months ago +1

    One request: could you explain mixture of experts? I bet you can break down the explanation well.

  • @pravingaikwad1337
    @pravingaikwad1337 2 months ago

    For one layer Y = XW + b, if X, W and b are quantized so that we get Y in quantized form, why do we need to dequantize this Y before feeding it to the next layer?
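    A toy numeric sketch of that dequantization step (my own illustration, using symmetric per-tensor quantization and leaving the bias out for brevity): the int32 accumulator holds Y at a scale equal to the product of the input scales, so it has to be rescaled back to float (or requantized to the next layer's scale) before the next layer consumes it.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((2, 4)).astype(np.float32)
    W = rng.standard_normal((4, 3)).astype(np.float32)

    s_x = float(np.abs(X).max()) / 127                   # per-tensor symmetric scales
    s_w = float(np.abs(W).max()) / 127
    X_q = np.clip(np.round(X / s_x), -127, 127).astype(np.int8)
    W_q = np.clip(np.round(W / s_w), -127, 127).astype(np.int8)

    Y_acc = X_q.astype(np.int32) @ W_q.astype(np.int32)  # multiply-accumulate in int32
    Y = Y_acc.astype(np.float32) * (s_x * s_w)           # dequantize for the next layer

    print(np.abs(Y - X @ W).max())                       # small quantization error vs. the float matmul
    ```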

  • @AleksandarCvetkovic-db7lm
    @AleksandarCvetkovic-db7lm 2 months ago

    Could the difference in accuracy between static/dynamic quantization and quantization-aware training be because the model was trained for 5 epochs for static/dynamic quantization and only one epoch for quantization-aware training? I tend to think that 4 more epochs make more of a difference than the quantization method.

  • @swiftmindai
    @swiftmindai 6 months ago

    I noticed a small correction needs to be made at timestamp 28:53 [slide: Low-precision matrix multiplication]. The first line describes the dot products between each row of X and each column of Y [instead of Y, it should be W, the weight matrix].

    • @umarjamilai
      @umarjamilai  6 months ago +1

      You're right, thanks! Thankfully the diagram of the multiply block is correct. I'll fix the slides.

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 6 months ago

    Thanks! (Valeu!)

  • @Erosis
    @Erosis 6 months ago +1

    You're making all of my lecture materials pointless! (But keep up the great work!)

  • @venkateshr6127
    @venkateshr6127 6 months ago

    Could you please make a video on how to build tokenizers for languages other than English?

  • @bamless95
    @bamless95 4 months ago

    Be careful: CPython does not do JIT compilation; it is a pretty straightforward stack-based bytecode interpreter.

    • @umarjamilai
      @umarjamilai  4 months ago

      Bytecode has to be converted into machine code somehow. That's also how .NET works: first C# gets compiled into MSIL (an intermediate representation), and then it just-in-time compiles the MSIL into the machine code for the underlying architecture.

    • @bamless95
      @bamless95 3 months ago

      Not necessarily; bytecode can just be interpreted in place. In a loose sense it is being "converted" to machine code, meaning that we are executing different snippets of machine code through branching, but JIT compilation has a very different meaning in the compiler and interpreter field. What CPython is really doing is executing a loop with a switch branching on every possible opcode. By looking at the interpreter implementation in the CPython GitHub repo, in `Python/ceval.c` and `Python/generated_cases.c.h` (alas, YouTube is not letting me post links), you can clearly see there is no JIT compilation involved. (See the `dis` example after this thread.)

    • @bamless95
      @bamless95 3 months ago

      What you are saying about C# (and, for that matter, Java and some other languages like LuaJIT or V8 JavaScript) is indeed true: they typically JIT the code either before or during interpretation. But CPython is a much simpler (and thus slower) bytecode interpreter that implements neither JIT compilation nor any form of serious code optimization (aside from a fairly rudimentary peephole optimization step).

    • @bamless95
      @bamless95 3 months ago

      Don't get me wrong, I think the video is phenomenal. I just wanted to correct a little imperfection that, as a programming language nerd, I feel is important to get right. Also, greetings from Italy! It is good for once to see a fellow Italian producing content that is worth watching on YT 😄
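      For anyone curious, the standard-library `dis` module makes the point of this thread easy to see: CPython compiles a function to bytecode once, and the interpreter loop then dispatches on those opcodes one by one; no machine code is generated for the function at runtime in the CPython builds discussed here. A tiny, self-contained illustration:

      ```python
      import dis

      def quantize(x, scale, zero_point):
          # Plain Python arithmetic; CPython turns this into a handful of opcodes.
          return round(x / scale) + zero_point

      dis.dis(quantize)  # prints the opcodes (e.g. LOAD_FAST, RETURN_VALUE) the interpreter loop dispatches on
      ```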

  • @dzvsow2643
    @dzvsow2643 6 months ago

    Assalamu alaikum, brother.
    Thanks for your videos!
    I have been working on game development with pygame for a while and I want to start deep learning in Python, so could you make a roadmap video? Thank you again!

    • @umarjamilai
      @umarjamilai  6 months ago +1

      Hi! I will do my best! Stay tuned

  • @theguyinthevideo4183
    @theguyinthevideo4183 4 months ago

    This may be a stupid question, but what's stopping us from just setting the weights and biases to be in integer form? Is it due to the nature of backprop?

    • @umarjamilai
      @umarjamilai  4 months ago +1

      Forcing the weights and biases to be integers means adding more constraints to the gradient descent algorithm, which is not easy and is computationally expensive. It's as if I asked you to solve the equation x^2 - 5x + 3 = 0 but only for integer x: you can't just use the formula you learned in high school for quadratic equations, because it returns real numbers.
      Hope it helps
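      To make that concrete, here is a minimal sketch (my own, not code from the video) of the trick quantization-aware training uses instead: weights are rounded to the integer grid in the forward pass, while the backward pass pretends the rounding is the identity (the straight-through estimator), so ordinary gradient descent keeps updating the underlying float weights.

      ```python
      import torch

      class FakeQuant(torch.autograd.Function):
          """Fake-quantize to int8 in the forward pass; straight-through gradient in the backward pass."""

          @staticmethod
          def forward(ctx, w, scale):
              w_q = torch.clamp(torch.round(w / scale), -127, 127)
              return w_q * scale               # dequantized ("fake quantized") weights

          @staticmethod
          def backward(ctx, grad_out):
              return grad_out, None            # treat round() as the identity w.r.t. w

      w = torch.randn(4, 4, requires_grad=True)
      scale = w.detach().abs().max() / 127
      loss = FakeQuant.apply(w, scale).pow(2).sum()
      loss.backward()
      print(w.grad is not None)                # True: gradients flow despite the rounding
      ```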

  • @elieelezra2734
    @elieelezra2734 6 months ago +1

    Umar, thanks for all your content. I have improved a lot thanks to your work! But there is something I don't get about quantization. Let's say you quantize all the weights of your large model. The prediction is not the same anymore! Does it mean you need to dequantize the prediction? If yes, you don't talk about it, right? Could I have your email to get more details, please?

    • @umarjamilai
      @umarjamilai  6 months ago +1

      Hi! Since the output of the last layer (the matrix Y) will be dequantized, the prediction will be "the same" (very similar) as that of the non-quantized model. The Y matrix of each layer is always dequantized, so the output of each layer is more or less equal to that of the non-quantized model.

    • @alainrieger6905
      @alainrieger6905 6 months ago

      Hi, thanks for your answer @umarjamilai.
      Does it mean, for post-training quantization, that the more layers a model has, the greater the difference between the quantized and the original model, since the error accumulates at each new layer? Thanks in advance.

    • @umarjamilai
      @umarjamilai  6 months ago

      @alainrieger6905 That's not necessarily true, because the error in one layer may be "positive" and in another "negative", and they may compensate for each other. For sure, the number of bits used for quantization is a good indicator of its quality: if you use fewer bits, you will have more error. It's like having an image that is originally 10 MB and trying to compress it to 1 MB or to 1 KB: of course, in the latter case you would lose much more quality than in the former.

    • @alainrieger6905
      @alainrieger6905 6 months ago

      @umarjamilai Thank you, sir! Last question: when you talk about dequantizing a layer's activations, does it mean that the values go back to 32-bit format?

    • @umarjamilai
      @umarjamilai  6 months ago +1

      @alainrieger6905 Yes, it means going back to floating-point format.

  • @sabainaharoon7050
    @sabainaharoon7050 4 months ago

    Thanks!

    • @umarjamilai
      @umarjamilai  4 months ago

      Thanks for your support!

  • @007Paulius
    @007Paulius 6 months ago

    Thanks