Great explanation with detailed notation. Most of the videos I found on YouTube were some kind of oral explanation, but this kind of symbolic notation is very helpful for grasping the real picture, especially if anyone wants to re-implement it or add new ideas to it. Thank you so much. Please continue helping us by making these kinds of videos.
Can't stress enough how easy to understand you made it.
Great explanation! Good for you! Don't stop making ML guides!
These are some of the best, hands-on and simple explanations I've seen in a while on a new CS method. Straight to the point with no superfluous details, and at a pace that let me consider and visualize each step in my mind without having to constantly pause or rewind the video. Thanks a lot for your amazing work! :)
Clear, concise, and overall easy to understand for a newbie like me. Thanks!
The best video so far. The animation is easy to follow and the explanation is very straightforward.
The best ViT explanation available. Understanding this is also key to understanding DINO and DINOv2.
Amazing, I am in a rush to implement a vision transformer as an assignment, and this saved me so much time!
lol, same
Amazing video. It helped me to really understand vision transformers. Thanks a lot.
Man, you made my day! These lectures were golden. I hope you continue to make more of these
15 minutes of heaven 🌿. Thanks a lot, understood clearly!
Thank you. Best ViT video I found.
Very good explanation, better than many other videos on YouTube, thank you!
Good video, what a splendid presentation. Wang Shusen is the GOAT.
This was a great video. Thanks for your time producing great content.
This reminds me of Encarta encyclopedia clips when I was a kid lol! Good job mate!
Thank you for your Attention Models playlist. Well explained.
Thank you, your video is way underrated. Keep it up!
This is a great explanation video.
One nit: you are misusing the term 'dimension'. If a classification vector is linear with 8 values, that's not '8-dimensional' -- it is a 1-dimensional vector with 8 values.
Best ViT explanation ever!!!!!!
Wonderful explanation!👏
Nicely explained. Appreciate your efforts.
Very nice job, Shusen, thanks!
Thank you so much for this amazing presentation. You have a very clear explanation, I have learnt so much. I will definitely watch your Attention models playlist.
Excellent explanation 👌
Amazing, precise explanation.
Thank you for the clear explanation!!☺
You have explained ViT in simple words. Thanks
Amazing video. Please do one for Swin Transformers if possible. Thanks a lot.
Awesome Explanation.
Thank you
Very clear, thanks for your work.
Very good explanation
subscribed!
CNN on images + positional info = Transformers for images
Brilliant explanation, thank you.
thank you so much for the clear explanation
Great explanation. Thank you!
If we ignore the outputs c1 ... cn, what do c1 ... cn represent then?
Nice video!! Just a question: what is the argument behind getting rid of the vectors c1 to cn and keeping only c0? Thanks
The class token output c0 is in the embedding dimension; does that mean we should add a linear layer from the embedding dimension to the number of classes before the softmax for classification?
Awesome explanation, man, thanks a ton!!!
Great explanation
Really great explanation, thank you!
@9:30 Why do we discard c1 ... cn and use only c0? How is it that all the necessary information from the image gets collected and preserved in c0? Thanks
Hey, did you get an answer to your question?
Brilliant. Thanks a million
Wonderful talk
How is A trained? I mean, what is the loss function? Does it use only the encoder, or both the encoder and decoder?
Great video, thanks. Could you please explain the Swin Transformer too?
that was educational!
The simplest and most interesting explanation, many thanks. I am asking about object detection models; have you explained them before?
In the job market, do data scientists use transformers?
Great explanation :)
great video!
Good job! Thanks
Really good, thx.
Super clear explanation! Thanks! I want to understand how attention is applied to the images. I mean, with a CNN you can "see" where the neural network is focusing, but how does that work with transformers?
Not All Heroes Wear Capes
Can you please explain the paper “Your classifier is secretly an energy based model and you should treat it like one”? I want to understand these energy-based models.
Amazing video. It helped me to really understand vision transformers. Thanks a lot. But I have a question: why do we only use the CLS token for the classifier?
It looks like, thanks to the attention layers, the CLS token is able to extract all the information it needs for a good classification from the other tokens. Using all tokens for classification would just unnecessarily increase computation.
@@NeketShark that’s a good answer. At 9:40, any idea how a softmax function was able to increase (or decrease) the dimension of vector “c” into “p”? I thought softmax would only change the entries of a vector, not its dimension.
@@Darkev77 I think it first goes through a linear layer and then through a softmax, so it's the linear layer that changes the dimension. In the video this was probably omitted for simplicity.
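To make this concrete, here is a minimal PyTorch sketch of the classification head being discussed. The sizes and names (embed_dim, num_classes) are assumed for illustration, since the video omits these details; note the ViT paper itself uses an MLP head during pre-training but a single linear layer for fine-tuning.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed sizes): the [CLS] output c0 lives in embed_dim,
# and a linear layer maps it to num_classes before the softmax.
embed_dim, num_classes = 768, 1000

head = nn.Linear(embed_dim, num_classes)

c0 = torch.randn(1, embed_dim)      # encoder output for the [CLS] token
logits = head(c0)                   # (1, num_classes): the linear layer changes the dimension
p = torch.softmax(logits, dim=-1)   # softmax only normalizes the entries; the shape is unchanged
```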
The concept has similarities to the TCP protocol in terms of segmentation and positional encoding. 😅😅😅
Actually, I think it would be better if the uploader spoke Chinese 🥰🤣
There is also a Chinese version ( th-cam.com/video/BbzOZ9THriY/w-d-xo.html ); different languages have different audiences.
Why does the transformer require so many images to train?? And why doesn't ResNet keep improving with more training data the way ViT does?
Very good explanation! Can you please explain how we can fine-tune these models on our own dataset? Is it possible on a local computer?
Unfortunately, no. Google has TPU clusters. The amount of computation is insane.
@@ShusenWangEng Actually, I have my project proposal due today. I was proposing this on the FOOD-101 dataset, which has 101,000 images.
So it can’t be done?
What size of dataset can we train on a local PC?
Can you please reply?
I'm stuck at the moment...
Thanks
@@parveenkaur2747 If your dataset is very different from ImageNet, Google's pretrained model may not transfer well to your problem. The performance can be bad.
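For reference, this is roughly what fine-tuning from a released checkpoint (as opposed to pre-training from scratch, which is what needs the TPU clusters) looks like in code, using the timm library. The model name, the `train_loader`, and the hyperparameters below are my own assumptions for illustration, not something from the video, and whether it fits your hardware is a separate question.

```python
import timm
import torch

# Fine-tuning sketch (assumptions: timm is installed, and `train_loader`
# is a DataLoader over FOOD-101, which has 101 classes).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=101)
# Passing num_classes=101 makes timm replace the pretrained ImageNet head
# with a freshly initialized linear layer of the right size.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:  # hypothetical DataLoader
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```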
If you remove the positional encoding step, the whole thing is almost equivalent to a CNN, right?
I mean, those dense layers act just like the filters of a CNN.
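There is some truth to this for the patch-embedding step: applying the same dense layer to every patch is mathematically identical to one convolution whose kernel size and stride both equal the patch size, which is how many implementations write it. A small sketch (the shapes are my own example values, not from the video):

```python
import torch
import torch.nn as nn

# The shared per-patch dense layer is the same as one strided convolution.
patch_size, embed_dim = 16, 768
x = torch.randn(1, 3, 224, 224)  # dummy image

patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = patch_embed(x)                      # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): one embedding per patch
print(tokens.shape)
```

The attention layers that follow are still global rather than local, though, so even without positional encodings the model as a whole is not a CNN.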
Why do the authors evaluate and compare their results with the old ResNet architecture? Why not use EfficientNets for comparison? It looks like ResNet is not the strongest baseline...
ResNet is a family of CNNs. Many tricks are applied to make ResNet work better. The reported numbers are indeed the best accuracies that CNNs can achieve.
great
Great great great
👏
1) You mentioned the pretrained model uses a large-scale dataset and then a smaller dataset for fine-tuning. Does that mean the pipeline up to c0 stays almost the same, except that the last softmax layer is resized to the new number of classes, and then it is trained on the fine-tuning dataset? Or are there other different settings? 2) Another doubt for me: there is completely no mask in ViT, right? Since it comes from MLM... um...
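On question 1, that matches my understanding: the pretrained backbone weights are kept, and only the classification head is re-initialized to the new number of classes before training on the fine-tuning dataset. A hedged sketch of what that looks like with torchvision's ViT (the 101-class target is a made-up example):

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load a ViT-B/16 backbone with pretrained ImageNet weights.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Keep every pretrained layer; only the head is replaced so its output
# matches the new task's class count (hypothetical: 101 classes).
model.heads = nn.Sequential(nn.Linear(768, 101))
```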
This English leaves me speechless.
This is supposed to be English?
That was great and helpful 🤌🏻
Thank you for the clear explanation