Great explanation with detailed notation. Most of the videos I found on YouTube were some kind of oral explanation, but this kind of symbolic notation is very helpful for grasping the real picture, especially if anyone wants to re-implement it or add new ideas to it. Thank you so much. Please continue helping us by making these kinds of videos.
Can't stress enough how easy to understand you made it.
Great explanation! Good for you! Don't stop making ML guides!
These are some of the best, hands-on and simple explanations I've seen in a while on a new CS method. Straight to the point with no superfluous details, and at a pace that let me consider and visualize each step in my mind without having to constantly pause or rewind the video. Thanks a lot for your amazing work! :)
Clear, concise, and overall easy to understand for a newbie like me. Thanks!
The best video so far. The animation is easy to follow and the explanation is very straightforward.
The best ViT explanation available. Understanding this is also key to understanding DINO and DINOv2.
Amazing, I am in a rush to implement a vision transformer as an assignment, and this saved me so much time!
lol, same
Amazing video. It helped me to really understand vision transformers. Thanks a lot.
Man, you made my day! These lectures were golden. I hope you continue to make more of these
15 minutes of heaven 🌿. Thanks a lot, understood clearly!
Thank you. Best ViT video I found.
Very good explanation, better than many other videos on YouTube, thank you!
Good video, what a splendid presentation. Wang Shusen is the GOAT.
This was a great video. Thanks for your time producing great content.
This reminds me of Encarta encyclopedia clips when I was a kid lol! Good job mate!
Thank you for your Attention Models playlist. Well explained.
Thank you, your video is way underrated. Keep it up!
This is a great explanation video.
One nit: you are misusing the term 'dimension'. If a classification vector is linear with 8 values, that's not '8-dimensional' -- it is a 1-dimensional vector with 8 values.
Best ViT explanation ever!!!!!!
Wonderful explanation!👏
Nicely explained. Appreciate your efforts.
Very nice job, Shusen, thanks!
Thank you so much for this amazing presentation. You have a very clear explanation, I have learnt so much. I will definitely watch your Attention models playlist.
Excellent explanation 👌
Amazing, precise explanation.
Thank you for the clear explanation!!☺
You have explained ViT in simple words. Thanks
Amazing video. Please do one for Swin Transformers if possible. Thanks a lot.
Awesome Explanation.
Thank you
Very clear, thanks for your work.
Very good explanation
subscribed!
CNN on images + positional info = Transformers for images
Brilliant explanation, thank you.
thank you so much for the clear explanation
Great explanation. Thank you!
If we ignore the outputs c1 ... cn, what do c1 ... cn represent then?
Nice video!! Just a question: what is the argument behind getting rid of the vectors c1 to cn and keeping only c0? Thanks
The class token output c0 is in the embedding dimension; does that mean we should add a linear layer from the embedding dimension to the number of classes before the softmax for classification?
Awesome explanation, man, thanks a ton!!!
Great explanation
Really great explanation, thank you!
@9:30 Why do we discard c1 ... cn and use only c0? How is it that all the necessary information from the image gets collected and preserved in c0? Thanks
Hey, did you get an answer to your question?
Brilliant. Thanks a million
Wonderful talk
How is A trained? I mean, what is the loss function? Does it use only the encoder, or both the encoder and decoder?
Great video, thanks. Could you please explain the Swin Transformer too?
that was educational!
The simplest and most interesting explanation, many thanks. I am asking about object detection models; have you explained them before?
In the job market, do data scientists use transformers?
Great explanation :)
great video!
Good job! Thanks
Really good, thx.
Super clear explanation! Thanks! I want to understand how attention is applied to the images. I mean, with a CNN you can "see" where the neural network is focusing, but how does that work with transformers?
Not All Heroes Wear Capes
Can you please explain the paper “Your classifier is secretly an energy based model and you should treat it like one”? I want to understand these energy-based models.
Amazing video. It helped me to really understand vision transformers. Thanks a lot. But I have a question: why do we only use the CLS token for the classifier?
It looks like, thanks to the attention layers, the CLS token is able to extract all the information it needs for a good classification from the other tokens. Using all tokens for classification would just unnecessarily increase computation.
@@NeketShark that’s a good answer. At 9:40, any idea how a softmax function was able to increase (or decrease) the dimension of vector “c” into “p”? I thought softmax would only change the entries of a vector, not its dimension.
@@Darkev77 I think it first goes through a linear layer and then through a softmax, so it's the linear layer that changes the dimension. In the video this was probably omitted for simplicity.
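To make this concrete, here is a minimal PyTorch sketch of the classification head being discussed. The sizes and names (embed_dim, num_classes) are assumed for illustration, since the video omits these details; note the ViT paper itself uses an MLP head during pre-training but a single linear layer for fine-tuning.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed sizes): the [CLS] output c0 lives in embed_dim,
# and a linear layer maps it to num_classes before the softmax.
embed_dim, num_classes = 768, 1000

head = nn.Linear(embed_dim, num_classes)

c0 = torch.randn(1, embed_dim)      # encoder output for the [CLS] token
logits = head(c0)                   # (1, num_classes): the linear layer changes the dimension
p = torch.softmax(logits, dim=-1)   # softmax only normalizes the entries; the shape is unchanged
```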
The concept has similarities to the TCP protocol in terms of segmentation and positional encoding. 😅😅😅
Actually, I think it would be better if the uploader spoke Chinese 🥰🤣
There is also a Chinese version ( th-cam.com/video/BbzOZ9THriY/w-d-xo.html ); different languages have different audiences.
Why does the transformer require so many images to train?? And why doesn't ResNet keep improving with more training data the way ViT does?
Very good explanation! Can you please explain how we can fine-tune these models on our own dataset? Is it possible on a local computer?
Unfortunately, no. Google has TPU clusters. The amount of computation is insane.
@@ShusenWangEng Actually, I have my project proposal due today. I was proposing this on the FOOD-101 dataset, which has 101,000 images.
So it can’t be done?
What size of dataset can we train on a local PC?
Can you please reply?
I'm stuck at the moment...
Thanks
@@parveenkaur2747 If your dataset is very different from ImageNet, Google's pretrained model may not transfer well to your problem. The performance can be bad.
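For reference, this is roughly what fine-tuning from a released checkpoint (as opposed to pre-training from scratch, which is what needs the TPU clusters) looks like in code, using the timm library. The model name, the `train_loader`, and the hyperparameters below are my own assumptions for illustration, not something from the video, and whether it fits your hardware is a separate question.

```python
import timm
import torch

# Fine-tuning sketch (assumptions: timm is installed, and `train_loader`
# is a DataLoader over FOOD-101, which has 101 classes).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=101)
# Passing num_classes=101 makes timm replace the pretrained ImageNet head
# with a freshly initialized linear layer of the right size.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:  # hypothetical DataLoader
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```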
If you remove the positional encoding step, the whole thing is almost equivalent to a CNN, right?
I mean, those dense layers act just like the filters of a CNN.
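There is some truth to this for the patch-embedding step: applying the same dense layer to every patch is mathematically identical to one convolution whose kernel size and stride both equal the patch size, which is how many implementations write it. A small sketch (the shapes are my own example values, not from the video):

```python
import torch
import torch.nn as nn

# The shared per-patch dense layer is the same as one strided convolution.
patch_size, embed_dim = 16, 768
x = torch.randn(1, 3, 224, 224)  # dummy image

patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = patch_embed(x)                      # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): one embedding per patch
print(tokens.shape)
```

The attention layers that follow are still global rather than local, though, so even without positional encodings the model as a whole is not a CNN.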
Why do the authors evaluate and compare their results with the old ResNet architecture? Why not use EfficientNets for comparison? It looks like ResNet is not the strongest baseline...
ResNet is a family of CNNs. Many tricks are applied to make ResNet work better. The reported numbers are indeed the best accuracies that CNNs can achieve.
great
Great great great
👏
1) You mentioned the pretrained model uses a large-scale dataset and then a smaller dataset for fine-tuning. Does that mean the pipeline up to c0 stays almost the same, except that the last softmax layer is resized to the new number of classes, and then it is trained on the fine-tuning dataset? Or are there other different settings? 2) Another doubt for me: there is completely no mask in ViT, right? Since it comes from MLM... um...
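On question 1, that matches my understanding: the pretrained backbone weights are kept, and only the classification head is re-initialized to the new number of classes before training on the fine-tuning dataset. A hedged sketch of what that looks like with torchvision's ViT (the 101-class target is a made-up example):

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load a ViT-B/16 backbone with pretrained ImageNet weights.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Keep every pretrained layer; only the head is replaced so its output
# matches the new task's class count (hypothetical: 101 classes).
model.heads = nn.Sequential(nn.Linear(768, 101))
```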
This English leaves me speechless.
This is supposed to be English?
That was great and helpful 🤌🏻
Thank you for the clear explanation