Image Classification Using Vision Transformer | An Image is Worth 16x16 Words

  • Published 27 Jan 2025

Comments • 13

  • @lucaherrmann77
    @lucaherrmann77 1 year ago

    This was the best video on ViTs I have found, and I looked at way too many. Seeing the clearly written code with comments makes everything so much clearer.
    Thank you so much for sharing!

    • @Explaining-AI
      @Explaining-AI 1 year ago +1

      Really happy that the video was of some help to you!

    • @lucaherrmann77
      @lucaherrmann77 1 year ago

      @@Explaining-AI All three ViT-videos were fantastic :D

    • @Explaining-AI
      @Explaining-AI 1 year ago

      @@lucaherrmann77 Thank you :)

  • @BhrantoPathik
    @BhrantoPathik 5 months ago

    The CLS token should be *out[0, :]*, correct? You have selected the row instead.
    Also, the part where you explained how the attention layers are learning patterns doesn't seem clear to me. Would you mind clarifying it? Does this visualization imply that different attention heads learn different components from the image?

    • @Explaining-AI
      @Explaining-AI 5 months ago

      In the implementation, the first index is the batch index and the second indexes the sequence of tokens, which is why selecting the 0th token gives us the CLS token's representation.
      Regarding the attention layers, could you point me to the specific visualization/timestamp you are referring to? Is it the images @6:15?
      Different attention heads do learn to capture different notions of similarity, which are then combined to give the contextual representation for each token. However, in this video I did not get into analyzing each head separately; rather, the goal was to use rollout (arxiv.org/pdf/2005.00928) to visualize which spatial tokens the CLS token was attending to. And as in the rollout paper, we averaged the attention weights over all heads for each layer.
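      The averaging-and-multiplying procedure described in the reply can be sketched roughly as follows. This is a minimal NumPy illustration of attention rollout, not the code from the video; the function name and the toy shapes (2 layers, 3 heads, 5 tokens) are assumptions for the example.

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout (arxiv.org/pdf/2005.00928): average each layer's
    attention over its heads, add the identity to account for the residual
    connection, renormalize rows, and multiply across layers.

    attentions: list of arrays, each shaped (num_heads, seq_len, seq_len).
    Returns a (seq_len, seq_len) array; row 0 shows how much the CLS token
    (index 0) attends to every token across the whole network.
    """
    seq_len = attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for attn in attentions:
        avg = attn.mean(axis=0)                       # average over heads
        avg = avg + np.eye(seq_len)                   # residual connection
        avg = avg / avg.sum(axis=-1, keepdims=True)   # renormalize rows
        rollout = avg @ rollout                       # accumulate layer by layer
    return rollout

# Toy example: 2 layers, 3 heads, 5 tokens (1 CLS + 4 patch tokens)
rng = np.random.default_rng(0)
attns = [rng.random((3, 5, 5)) for _ in range(2)]
attns = [a / a.sum(axis=-1, keepdims=True) for a in attns]  # make rows sum to 1
cls_attention = attention_rollout(attns)[0, 1:]  # CLS attention over patch tokens
```

      Because each renormalized matrix is row-stochastic, the rolled-out matrix stays row-stochastic, so row 0 can be reshaped into a patch-grid heatmap of where the CLS token is looking.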

    • @BhrantoPathik
      @BhrantoPathik 5 months ago

      @@Explaining-AI Correct, I missed the batch part.
      Yeah, I was talking about the attention head itself. I will go through the paper once. Thank you.
      Will you please cover the paper RT-DETR?
      Currently I am working on a project myself, where I am trying to extract text from images. Tesseract OCR on its own wasn't helpful, so I tried object detection models (YOLO, RT-DETR) and passed the bounding boxes through the Tesseract engine for text extraction. It was somewhat successful, although not very accurate. Any suggestions for this?

  • @AshishKumar-ye7dw
    @AshishKumar-ye7dw 9 months ago

    Very nicely explained, kudos!

  • @DrAIScience
    @DrAIScience 8 months ago

    Amazing. The only part I did not understand is classification. Does zero-shot learning achieve that, or do we need to fine-tune the pretrained model with hard labels to make it a classifier? Thanks, amazing transformers series, best best best!!!!

    • @Explaining-AI
      @Explaining-AI 8 months ago

      Yes, you would need to fine-tune/train it on your dataset. Typically you would have an FC layer on top of the CLS token representation, and through training the model (say on MNIST), it will learn to attend to the right patches and build a CLS representation that allows it to correctly classify which digit it is.
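      The FC-layer-on-CLS-token setup described above can be sketched in PyTorch. This is a hypothetical illustration, not the repo's code; `encoder_out`, the shapes (197 tokens = 196 patches + CLS, embedding dim 768), and the dummy labels are assumptions, with random tensors standing in for a real ViT encoder's output.

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 10        # e.g. 10 classes for MNIST digits
head = nn.Linear(embed_dim, num_classes)  # FC layer on top of the CLS token

# Stand-in for the transformer encoder output: (batch, num_tokens, embed_dim).
# The first index is the batch, the second the token sequence; token 0 is CLS.
encoder_out = torch.randn(4, 197, embed_dim)

cls_repr = encoder_out[:, 0]   # (batch, embed_dim): CLS token per example
logits = head(cls_repr)        # (batch, num_classes)

# During fine-tuning, hard labels drive a standard cross-entropy loss;
# gradients flow into the head (and the encoder, when it is not frozen).
labels = torch.tensor([3, 1, 4, 1])    # dummy digit labels
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```

      Note the indexing: `encoder_out[:, 0]` keeps the whole batch and selects the 0th token, which is exactly the batch-first convention discussed in the earlier reply about the CLS token.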

  • @Explaining-AI
    @Explaining-AI 1 year ago

    *Github Code* - github.com/explainingai-code/VIT-Pytorch
    *Patch Embedding* - Vision Transformer (Part One) - th-cam.com/video/lBicvB4iyYU/w-d-xo.html
    *Attention* in Vision Transformer (Part Two) - th-cam.com/video/zT_el_cjiJw/w-d-xo.html
    *Implementing Vision Transformer* (Part Three) - th-cam.com/video/G6_IA5vKXRI/w-d-xo.html

  • @signitureDGK
    @signitureDGK 11 months ago

    Hey, really great video. Could you make a video explaining latent diffusion models (DDIM samplers) and how inpainting works in latent space, etc.? Also, with OpenAI's Sora released, I think diffusion models will be even more popular, and I saw Sora works on a sort of ViT architecture. Thanks!

    • @Explaining-AI
      @Explaining-AI 11 months ago

      Thank you @signitureDGK. Yes, I have a playlist (th-cam.com/play/PL8VDJoEXIjpo2S7X-1YKZnbHyLGyESDCe.html) covering diffusion models, and in my next video in that series (Stable Diffusion Part II) I will cover conditioning in LDMs, in which I will make sure to also go over inpainting. Will soon follow that up with a video on different sampling techniques.