ExplainingAI
India
Joined Sep 8, 2023
Hello, I am Tushar and this channel is a product of two things that I am very passionate about.
Learning something new every day, specifically related to my field, which is ML, Deep Learning & Computer Vision (hoping to expand this list with this channel :) )
Teaching and explaining things in the simplest manner that I can, to people who are interested in knowing what I already know, making their learning process a little bit easier and a lot more fun.
If you are interested in continuously learning and improving, and have any intersection with my interests, then do subscribe (obviously only if you like the content; otherwise just ignore me for now and come back when my content has become worthy of your subscription, which it will someday :) )
Thank you so much for visiting my channel
Tushar,
Ex-Amazon | Last seen building AI for a cooking robot @ Nymble (www.eatwithnymble.com/)
www.linkedin.com/in/tushar-kumar-40299b19/
YOLOv4 Explained | CIOU Loss, CSPDarknet53, SPP, PANet | Everything about it
This video aims to explain YOLOv4, a real-time object detection model, including all the features and techniques used in it. In this video, we thoroughly get into the YOLOv4 architecture and its unique features, such as DropBlock, Cross mini-Batch Normalization, the SPP (Spatial Pyramid Pooling) module, and CSP (Cross Stage Partial connections), and how they all improve object detection performance. We start the video covering all the features that improve backbone performance, like CutMix, Mosaic, label smoothing, and cross stage partial connections. Each of these features is covered in great detail to give you an idea of how YOLOv4 works.
We then dive deep into DropBlock, CIoU loss (Complete IoU loss), self-adversarial training, grid sensitivity, DIoU NMS, and so on.
We end with a complete review of the YOLOv4 architecture and the performance of YOLOv4, to understand how it fares specifically as a real-time object detector, and also compare it to YOLOv3.
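As a quick, concrete illustration of the CIoU idea covered in the video, here is a minimal PyTorch sketch (an illustrative implementation written for this description, not the exact code from the video or the darknet repo):

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes in (x1, y1, x2, y2) format, both of shape (N, 4)."""
    # Overlap (plain IoU)
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Squared distance between box centers
    center_dist = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 \
                + ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4

    # Squared diagonal of the smallest enclosing box
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return 1 - iou + center_dist / diag + alpha * v
```

The three terms mirror the paper: IoU overlap, normalized center distance, and an aspect-ratio penalty weighted by alpha.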
⏱️ Timestamps:
00:00 Intro
01:23 Typical Object Detection Model Architecture
03:03 YOLOv4 - Bag of freebies and Bag of specials
05:15 Cutmix Data Augmentation
07:10 Mosaic Data Augmentation
09:32 DropBlock Regularization in YOLOv4
20:19 Class Label Smoothing in YOLO-v4
23:40 Mish in Backbone
24:53 Cross Stage Partial Connections
29:26 MiWRC
31:27 Cross Mini Batch Normalization in YOLOv4
39:33 CIOU Loss (Complete IOU Loss)
47:47 Self Adversarial Training
49:11 Eliminating Grid Sensitivity in YOLO-v4
53:33 Genetic Algorithm
56:26 Spatial Pyramid Pooling
57:36 Spatial Attention Module for YOLOv4
59:50 Path Aggregation Network in YOLOv4
01:02:33 DIOU NMS
01:04:52 Performance of YOLOv4
01:05:43 YOLOv4 Architecture Explained
📖 Resources:
YOLOv4 Paper - arxiv.org/pdf/2004.10934
YOLOv4 Repo - github.com/AlexeyAB/darknet
Cutmix Paper - arxiv.org/pdf/1905.04899
Spatial Dropout Paper - arxiv.org/pdf/1411.4280
DropBlock Paper - arxiv.org/pdf/1810.12890
Mish Paper - arxiv.org/pdf/1908.08681
Cross Stage Partial Connections Paper - arxiv.org/pdf/1911.11929
EfficientDet Paper - arxiv.org/pdf/1911.09070
Cross Iteration Batch Normalization Paper - arxiv.org/pdf/2002.05712
Generalized IOU Loss Paper - arxiv.org/pdf/1902.09630
DIOU and Complete IOU Loss Paper - arxiv.org/pdf/1911.08287
Grid Sensitivity Issue Link - github.com/AlexeyAB/darknet/issues/3293
Path Aggregation Paper - arxiv.org/pdf/1803.01534
🔔 Subscribe:
tinyurl.com/exai-channel-link
Email - explainingai.official@gmail.com
Views: 250
Videos
YOLOv2 (YOLO9000) and YOLOv3 Explained
529 views · 1 month ago
In this yolo object detection series tutorial, we dive into the details of the YOLOv2 (YOLO9000) and YOLOv3 models for object detection. The video explores how the yolov2 and yolov3 models work, their architectures, the losses for training them, and their advancements over earlier versions like YOLOv1. We will get into the features that make YOLOv2 better, faster, and stronger, as described in the YOLO9000 pap...
Building a Video Generation Model with Diffusion Transformers | Explanation and Implementation
2.1K views · 2 months ago
In this video, we dive deep into Latte, a latent diffusion transformer for video generation. This generative video diffusion model combines diffusion techniques with transformer architecture and is trained on latent frames of videos. We start with a quick recap of diffusion transformers, as the core building block of this latent transformer for video generation is similar to the adaptive layer ...
Single Shot Multibox Detector | SSD Object Detection Explained and Implemented
3.1K views · 3 months ago
In this video, I get into the Single Shot Multibox Detector, or SSD, a popular real-time object detection model. We will understand how the Single Shot Multibox Detector algorithm works, and also do a step-by-step walkthrough of the implementation of SSD in PyTorch. This video is part of my object detection series, where I've previously covered YOLO, and now we're exploring SSD object detection to get an unde...
Scalable Diffusion Models with Transformers | DiT Explanation and Implementation
6K views · 3 months ago
In this video, we'll dive deep into Diffusion with Transformers (DiT), a scalable approach to diffusion models that leverages the transformer architecture. We will first get an overview of the vision transformer, then see the changes the authors make to get to DiT. We will look in detail at the different block designs that the DiT authors explore for Diffusion Transformers and also see the results of e...
YOLO Object Detection | YoloV1 Explanation and Implementation Tutorial
4.4K views · 4 months ago
This video is on YOLO object detection, specifically the yolov1 object detection algorithm. In this tutorial we try to understand how the YOLO algorithm works, from its real-time object detection capabilities to its approach to bounding box predictions. We will also go through a YOLOv1 implementation from scratch in PyTorch. By the end of this video you would be able to get a complete explanation of ...
ControlNet with Diffusion Models | Explanation and PyTorch Implementation
3.5K views · 5 months ago
In this tutorial we get into ControlNet for diffusion models. We delve into the architecture of ControlNet for Stable Diffusion, explaining how it enhances final model performance on conditional datasets. We cover the need for ControlNet and the goal it tries to achieve, and give an architecture overview of ControlNet for a simple block. Then we get into how to use ControlNet for controlling the generation output of di...
Faster RCNN PyTorch Code Walkthrough | Fine-Tuning and Custom Dataset Training
3.8K views · 6 months ago
This tutorial covers all the details of Faster R-CNN with an in-depth PyTorch code walkthrough! It will guide you through the implementation of Faster R-CNN in PyTorch, including training on custom datasets and fine-tuning Faster R-CNN. We first do a walkthrough of Faster R-CNN with a resnet50 FPN backbone, wherein we cover the backbone initialization part, RPN, ROI head, and also dive ...
Faster R-CNN PyTorch Implementation
6K views · 7 months ago
In this tutorial, I go step-by-step into how to implement Faster R-CNN for object detection using PyTorch. I cover everything from building Faster R-CNN from scratch to training the model and running object detection. This video builds the code for Faster R-CNN in Python and provides detailed explanations of the different components involved in implementing Faster R-CNN. We start with building the RPN...
Faster R-CNN Explanation | Region Proposal Network
7K views · 8 months ago
In this tutorial we cover Faster R-CNN for object detection. It's an attempt to provide an in-depth Faster R-CNN explanation. The video covers what Faster R-CNN is, how Faster R-CNN training works, and we also dive deep into its architecture. We start with the difference between Fast R-CNN and Faster R-CNN, then understand anchor boxes and region proposal networks (RPNs) step by step, the two main components of ...
Fast R-CNN Explained | ROI Pooling
4.7K views · 9 months ago
In this tutorial, I dive deep into Fast R-CNN, explaining its architecture, the role of ROI pooling, and how it differs from R-CNN. Through this video you will learn how Fast R-CNN works, understand Region Of Interest (ROI) pooling, and discover the advantages it brings to object detection tasks over previous approaches. I specifically go through how Fast R-CNN compares to R-CNN in terms of p...
Mean Average Precision (mAP) | Explanation and Implementation for Object Detection
4.2K views · 9 months ago
In this video we go over Mean Average Precision (mAP), Non-Maximum Suppression (NMS), and Intersection over Union (IOU) in object detection. We dive deep into understanding these crucial concepts for improving the accuracy of object detection algorithms. We first discuss Intersection over Union (IOU) as a ...
R-CNN Explained
8K views · 10 months ago
This is an R-CNN tutorial video in which I dive deep into what R-CNN is and R-CNN basics. This video is part of the object detection series, and the first one in it is R-CNN for object detection. By the end of this video you would be able to understand the R-CNN algorithm in detail and see clearly how R-CNN works. We start with what selective search is and how R-CNN uses selective searc...
Stable Diffusion from Scratch in PyTorch | Conditional Latent Diffusion Models
12K views · 10 months ago
In this video, we'll cover all the different types of conditioning in latent diffusion and finish the Stable Diffusion implementation in PyTorch, after which you will be able to build and train Stable Diffusion from scratch. This is Part II of the tutorial, where I get into conditioning in latent diffusion models. We dive deep into class conditioning in latent diffusion models, implementing class...
Stable Diffusion from Scratch in PyTorch | Unconditional Latent Diffusion Models
21K views · 11 months ago
In this video, we'll cover everything from the building blocks of stable diffusion to its implementation in PyTorch and see how to build and train Stable Diffusion from scratch. This is Part I of the tutorial, where I explain latent diffusion models, specifically unconditional latent diffusion models. We dive deep into what latent diffusion is, how latent diffusion works, what the component...
DCGAN Tutorial with PyTorch Implementation
1.8K views · 1 year ago
Generative Adversarial Networks | Tutorial with Math Explanation and PyTorch Implementation
3.5K views · 1 year ago
Denoising Diffusion Probabilistic Models Code | DDPM Pytorch Implementation
25K views · 1 year ago
Denoising Diffusion Probabilistic Models | DDPM Explained
53K views · 1 year ago
Image Classification Using Vision Transformer | An Image is Worth 16x16 Words
1.7K views · 1 year ago
ATTENTION | An Image is Worth 16x16 Words | Vision Transformers (ViT) Explanation and Implementation
3.5K views · 1 year ago
PATCH EMBEDDING | Vision Transformers explained
7K views · 1 year ago
I implement DALLE 1 from SCRATCH on MNIST
2.5K views · 1 year ago
VQ-VAE | Everything you need to know about it | Explanation and Implementation
19K views · 1 year ago
Implementing Variational Auto Encoder from Scratch in Pytorch
6K views · 1 year ago
Understanding Variational Autoencoder | VAE Explained
10K views · 1 year ago
Could you please explain why at 7:48 that last term is a constant? Thank you!
Very informative! Step by step approach for OD in CV. Best video so far 👍❤
Thank You :)
This explanation is absolutely fantastic.
Amazing video. Probably the best explanation I've seen on the internet. However, I'm still struggling to understand the encoding and statistical explanation of the model. You say that we need to compute p(z|x), but we can't because it's computationally intractable, so instead we estimate it using q(z|x). However, my question is firstly: how do we calculate the KL-divergence between q(z|x) and p(z|x) if we don't actually know p(z|x) (at 7:34)? If we did, why couldn't we just use that instead?

Next, you say that we sample from the distribution p(z) to generate new pieces of data. This does not make sense to me. Isn't p(z) a standard Gaussian? If we sampled from p(z), wouldn't we just get nonsensical, random results? Why don't we sample from the learned distribution q(z|x) instead?

Here's my thought process, please correct me where my understanding is wrong:
1) Imagine we have images of circles that a VAE must reconstruct.
2) We encode them into a 2-dimensional latent space.
3) The decoder decodes a point sampled from the latent space and generates a new image of a circle.

In step 2 we encode images by estimating p(z|x) through q(z|x). Let's say the encoder learns that the two dimensions of the latent space are radius and position. Then, for every single image, the encoder finds the latent variables z and turns them into a distribution of radius-position combinations which approximates a standard Gaussian p(z) (I imagine this looks something like a joint distribution between two normally spread random variables). We do this enough times and eventually we have a latent space represented by the probability density function q(z|x), which organizes the space of all seen circles into varying regions within q(z|x) (I imagine this looks like approximately standard Gaussian clusters spread throughout a 2D latent space with axes radius and position, where each cluster represents a certain type of circle x).

In step 3 we decode these images by sampling from q(z|x) (which is differentiable through the reparameterization trick). Then, we can compute the reconstruction loss between the generated output x' (or f(x)) and the original input x (this makes sense to me) and calculate the KL-divergence between our estimate of the latent space q(z|x) and a standard Gaussian p(z) (this doesn't make sense to me). In this part, why do we take the KL-divergence between these two terms? To my understanding, q(z|x) is our prediction of the true latent space p(z|x). If we tried to make q(z|x) as similar to p(z) as possible, would we not just see q(z|x) turn into a giant standard Gaussian? Why would we want that?

Am I understanding things correctly? (probably not) But more importantly, could you please correct me on what I'm misunderstanding and answer my questions? Again, excellent video. It just seems like there are some kinks which I have yet to work out because of my inexperience.
Never mind, I figured most of it out.
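For anyone else stuck on the same point, the standard resolution is the ELBO decomposition (a sketch of the usual VAE derivation, not something shown in the video itself):

```latex
\log p(x)
= \underbrace{\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]
  - \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)}_{\text{ELBO (tractable)}}
+ \underbrace{\mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big)}_{\text{intractable, but } \ge 0}
```

Since log p(x) does not depend on q, maximizing the tractable ELBO implicitly minimizes the intractable KL(q(z|x) || p(z|x)) without ever evaluating it; and pulling each q(z|x) toward p(z) is exactly what makes sampling from the simple prior p(z) produce sensible outputs at generation time.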
Man, I love you. You're the person I was looking for. Thanks for the great explanation without omitting details and things that can be unclear.
Thank you for this comment :) Really happy that my videos are of help to you.
Amazing !!! v8 is coming...
Soon :)
what a clean explanation!!!!!!!!!!
Awesome explanation, Thanks!
please make a video on Mask R-CNN sir 🙏🙏
Can I get the parameter file of this DiT model (trained on MNIST) directly?
you have made my literature survey 10 times easier. May I recommend you look into transformer- and attention-based object detection models, starting with DETR. Love your content <3
Happy that my content was helpful to you :) The next one in the detection series is DETR.
God thanks
:) thanks!!!
Does YOLOv7 have a similar architecture?
Hello, there are some similarities in terms of the presence of pyramid pooling and top-down/bottom-up pathways, but the design of those blocks is quite different. Also, YOLOv7 uses E-ELAN rather than CSP residual blocks. If you are interested, do take a look at this paper - arxiv.org/pdf/2304.00501 . It provides the highlights and changes of all the different YOLO versions. For YOLOv7, refer to Figure 16 (Page 21).
@ ☺️👍
yoyo
Can you please share the prediction code?
Hello, the repo (github.com/explainingai-code/SSD-PyTorch/blob/main/tools/infer.py) has prediction as well as evaluation code.
Best explanation of Denoising Diffusion Probabilistic Models!
So much great info in 8 minutes. Thank you so much!
Thank you for the appreciation :)
The quality of the writing is too poor to see the equations.
thank you so much, this video is very helpful to me. you are very generous.
I'm happy that the video ended up being of help to you :)
Hey, I am a new subscriber. Can you explain the implementations of LayoutLMv3 and UDOP and help implement them from scratch?
Can you explain to me why there is a break on line 114 in train_torchvision_frcnn.py? It now looks like, because of the break, it will only use one batch and then break out of the epoch? I really like your videos, thanks
The only explanation is my oversight :D I must have been debugging something before pushing the code, and ended up forgetting to remove the 'break' at the end. My apologies for the confusion, and thank you so much for pointing it out. Have fixed it now in the repo.
I am trying to use this model to train on COCO, but I am having issues using it; it seems the model is very structured to be trained on PASCAL VOC. Any idea how I can adapt it to COCO? Great video
Hello, apologies for the late reply. I think the model should work once you set the right number of classes. But you would need changes in the dataset class. If you are still facing problems after making the dataset class changes (or if you need help with that), can you please open an issue on the repo and I can try to help resolve that.
If you look at the latent space images at 37:26, you cannot believe the decoder can regenerate the original image from them, as there is simply a lot of missing information. Any explanation of how it does it? First I thought this was due to original image information leaking through skip-connections between down and up blocks. But we are not using those in the auto-encoder.
In the sampling algorithm (algo 2), I don't understand why we have to add noise z back in. Can anyone explain this to me?
In the reverse process, at each time step we have a distribution P(x_{t-1}|x_t), which is a Gaussian N(mu_{t-1}, sigma). We use the prediction of noise at each timestep to compute the predicted mean, mu_theta. The adding-noise part is actually the reparameterization trick to sample from the predicted P(x_{t-1}) distribution, which is why we sample a random noise z, scale it by sigma, and shift it by the mean of this predicted distribution. Also, if we straightaway used mu_theta (so always returned the mean instead of using the reparameterization trick to sample from P(x_{t-1})), then the entire reverse process would end up being deterministic.
@Explaining-AI that makes sense, thank you very much!
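As a concrete companion to the reply above, here is a minimal sketch of one reverse (sampling) step; the `model(x_t, t)` signature and the choice sigma² = beta_t are assumptions for illustration, not the video's exact code:

```python
import torch

def ddpm_reverse_step(model, x_t, t, alphas, alphas_cumprod, betas):
    """Sample x_{t-1} ~ P(x_{t-1} | x_t) for an integer timestep t."""
    eps_pred = model(x_t, t)  # predicted noise eps_theta(x_t, t)

    # Predicted mean mu_theta of the Gaussian P(x_{t-1} | x_t)
    mean = (x_t - betas[t] / torch.sqrt(1 - alphas_cumprod[t]) * eps_pred) \
           / torch.sqrt(alphas[t])
    if t == 0:
        return mean  # final step: return the mean, no noise added

    # Reparameterization trick: x_{t-1} = mu + sigma * z, with z ~ N(0, I)
    sigma = torch.sqrt(betas[t])  # one common choice for the variance
    z = torch.randn_like(x_t)
    return mean + sigma * z
```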
Hi, great explanation of RCNN with very useful insights which are often skipped. I am especially grateful for answering questions like "Why SVM? Why different IoU thresholds?" etc.
Happy that you found the explanation helpful!
Brother so good keep it up
Can we train our dataset with vq-gan (without transformer) and then use it in train_ddpm_vqve?
Hello, I might be misunderstanding your question, so do let me know if that's the case. But for stable diffusion we don't need the transformer. In the repo (github.com/explainingai-code/StableDiffusion-PyTorch?tab=readme-ov-file#training), what you are mentioning is exactly what I implemented. Train VQVAE + Perceptual Loss + Discriminator (same as VQ-GAN without the transformer, which is only needed for generating new latent images) on a dataset. Once the auto-encoder part of VQGAN is trained, we then save the latent representations for all the training images, using the trained encoder of VQGAN. Finally, we use these latent representations to train the LDM. We don't need to train the transformer, as that's for generating new latent images, for which we are using DDPM.
@Explaining-AI Thank you so much for your kind reply. Yeah, you are right.
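To make the two-stage recipe from the reply above concrete, here is a rough sketch; `encoder`, `ddpm.add_noise`, and `ddpm.model` are hypothetical stand-ins for the repo's actual classes:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def encode_dataset(encoder, dataloader):
    """Stage 1.5: cache latents with the frozen, already-trained VQGAN encoder."""
    latents = [encoder(images).cpu() for images in dataloader]
    return torch.cat(latents)

def train_ldm(ddpm, optimizer, latents, num_timesteps=1000, batch_size=16):
    """Stage 2: standard DDPM noise-prediction training, but on latents."""
    for z0 in latents.split(batch_size):
        t = torch.randint(0, num_timesteps, (z0.size(0),))
        noise = torch.randn_like(z0)
        zt = ddpm.add_noise(z0, noise, t)   # forward diffusion q(z_t | z_0)
        loss = F.mse_loss(ddpm.model(zt, t), noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The VQ-GAN transformer prior is never needed here, because the DDPM takes over its role of generating new latents.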
Can you explain to us how the Multi-view Diffusion Base Model works, please?
Hello, have added this to my list. But since I am not familiar with it as of now, it will take me some time to cover this.
@Explaining-AI thank you, you are amazing
Can you please share the notes for all the object detection videos?
Just what I was needing. Thanks 🙏🏻
Please answer me: how do they train a CLASS-SPECIFIC bounding box regressor? Do they feed the class as an input to one model and regress the bounding box, or do they build multiple models (if they detect 6 classes, then build 6 models), each trained as a class-specific bounding box regressor? Please answer me.
Hello, I have tried to explain a bit below; do let me know if this does not clarify everything for you. This is how the official RCNN repo does it: we create as many box regressor models as there are classes. Then we train each of these regressors separately, using the proposals assigned to the respective classes. github.com/rbgirshick/rcnn/blob/master/bbox_regression/rcnn_train_bbox_regressor.m#L76 During inference, given the predicted classes for the proposals, we use the trained regressor for that class to modify the proposal. github.com/rbgirshick/rcnn/blob/master/bbox_regression/rcnn_test_bbox_regressor.m#L58-L65
Btw, you could also do this with one fc layer (see the sketch after this thread). Let's say you have 10 classes. Then your bounding box regressor fc layer predicts 10 times 4 = 40 values. These are tx, ty, tw, th for all 10 classes. During training, the bounding box regression loss is computed between the ground truth transformation targets and the predicted values at the indexes corresponding to the ground truth class. At inference, you take the class index with the highest predicted probability value. The predicted tx, ty, tw, th are then the 4 values (out of 40) corresponding to this highest-probability class.
@Explaining-AI thank you a lot!!!! I fully understand it now. So they do train multiple models and choose the model based on class. That's crazy though!
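A minimal sketch of the single-fc-layer variant described two comments above (shapes and dimensions are hypothetical):

```python
import torch
import torch.nn as nn

num_classes, feat_dim, n_rois = 10, 1024, 8       # hypothetical sizes

bbox_head = nn.Linear(feat_dim, num_classes * 4)  # (tx, ty, tw, th) per class
cls_head = nn.Linear(feat_dim, num_classes)

feats = torch.randn(n_rois, feat_dim)             # ROI features
deltas = bbox_head(feats).view(n_rois, num_classes, 4)
scores = cls_head(feats)

# Inference: take the regression output of the highest-scoring class.
best_cls = scores.argmax(dim=1)                   # (n_rois,)
chosen = deltas[torch.arange(n_rois), best_cls]   # (n_rois, 4)

# Training would instead index `deltas` with the ground-truth class and
# apply a smooth L1 loss against the ground-truth transformation targets.
```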
Please create yolo panoptic sir, it would be a huge help and it has so many applications
Added this to my list. Will try to get to this as soon as I can.
It was great! Good luck!
Why is Gaussian noise the only noise added, and not Rician, Laplacian, etc.? There are so many other probability distributions.
Hello, have replied to something similar here(highlighted comment) - th-cam.com/video/H45lF4sUgiE/w-d-xo.html&lc=Ugznn1UksOPa3NfWLXR4AaABAg
Bro, no one on this platform explained as clearly as you. Thank you for providing these lectures free of cost. I think even in paid courses no one can explain this much. Thank you again.
Really happy that you found the explanation helpful :)
Subscribed
one of the finest videos on yolo available on the internet. contains an intuitive as well as detailed explanation (right from the research paper). concepts like these are hard to explain in so much detail. thanks a lot for the amazing work, cheers!
Thank you for this comment :)
Fantastic video! You’re undoubtedly on your way to becoming one of the top lecturers in Generative AI. I’m excited to see more of your work in the future!
Thank you so much for your words of encouragement and support :)
Thank you very much for the video, it is very interesting. Though I have one question, at the 15:40 timestamp. You mention that there may be a situation where the ground truth box doesn't have a big IoU with any of the anchor boxes. How do we pick these anchor boxes (I just can't get which methodology we have to follow when picking the dimensions for anchor boxes)?
Thank You! For Faster R-CNN, that is the reason why we add low-overlap anchor boxes as well (if they are indeed the best anchor box available). Here the authors did not tune anchor box selection at all for a dataset; they just pick ones which capture a large enough variation in terms of scale and aspect ratio. In models like YOLOv2, they use the anchor box strategy but use k-means to pick the best anchor boxes. So once you run k-means on your ground truth box dimensions, you end up with cluster centres that are good representatives of the box dimensions in your dataset. These cluster centres then become a good choice for your anchor box widths and heights.
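A small sketch of that YOLOv2-style k-means (distance = 1 − IoU between width/height pairs, assuming boxes share a center); the function names are my own:

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between (w, h) pairs, assuming all boxes share the same center."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs with distance = 1 - IoU."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, centers), axis=1)  # min (1 - IoU)
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers  # anchor (w, h) priors

# Usage: anchors = kmeans_anchors(np.array(all_gt_wh), k=5)
```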
How does self attention work in convnets (instead of transformers)? 😊
After a reshape of the input, the self-attention works exactly the same as in transformers. Assume you have a BxCxHxW feature map at a certain stage of the network. During self-attention you reshape it into Bx(H*W)xC. Now it becomes very similar to what you would have seen in transformers: H*W is the number of grid cells (tokens) and C is the embedding dimension of each token. We just compute attention between all spatial grid cells.
@ Thank you 😊
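A minimal PyTorch sketch of that reshape-then-attend pattern (the module name and norm choice are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Self-attention over the spatial grid of a conv feature map."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)  # channels assumed divisible by 8
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):                                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = self.norm(x).flatten(2).transpose(1, 2)     # (B, H*W, C): cells as tokens
        out, _ = self.attn(seq, seq, seq)                 # attention between all cells
        out = out.transpose(1, 2).reshape(b, c, h, w)     # back to feature-map layout
        return x + out                                    # residual connection

# Usage (hypothetical shapes):
# y = SpatialSelfAttention(64)(torch.randn(2, 64, 16, 16))
```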
Great video, but why do you add 1E-6 when calculating your IOU?
Thank You! That is just to ensure the IOU method never ends up doing a division by 0, like, say, in some degenerate case where the bounding box area is zero (for both gt and prediction). It just makes the IOU computation numerically stable no matter what the predicted and ground truth boxes are.
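A tiny sketch of that epsilon guard (illustrative, not the repo's exact function):

```python
def iou(box1, box2, eps=1e-6):
    """IoU of two boxes in (x1, y1, x2, y2) format; eps guards the division."""
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter + eps)

# Degenerate case: both boxes have zero area -> returns 0.0 instead of dividing by 0.
print(iou([0, 0, 0, 0], [0, 0, 0, 0]))  # 0.0
```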
What do you mean by top-k proposals 2000? Is it that from a single image we take 2000 proposals?
Hello @raihanpahlevi6870, yes that's correct. 2000 proposals are taken from a single image.
Thanks for this wonderfully intuitive video! It provided a fantastic breakdown of the fundamentals of diffusion models. Let me try to answer your question about why the reverse process in diffusion models is also a (reverse) diffusion with Gaussian transitions.

Why Reverse Diffusion Uses Gaussian Transitions

1. Forward Diffusion Introduces Noise Gradually
Remember the β term? In the forward process, β is chosen to be very small (close to 0). This ensures that Gaussian noise is added gradually to the data over many steps. Each step introduces only a tiny amount of noise, meaning the transition from the original image to pure noise happens slowly and smoothly. This gradual noise addition is crucial because it preserves the structure of the data for longer, making it easier for the reverse process to reconstruct high-quality images. If we added large amounts of noise in one go, like in VAEs, the original structure would be harder to recover, leading to blurrier reconstructions.

2. Reverse Diffusion Needs "Gaussian-Like" Inputs
The forward process only involves adding isotropic Gaussian noise at each step. This means the model learns to work with samples that are progressively noised in a Gaussian way. However, in the reverse process, when the model predicts the noise at each step, the resulting sample isn't guaranteed to remain Gaussian-like. To fix this, after subtracting the model's predicted noise, we add a small Gaussian noise with a carefully chosen variance. This step helps "Gaussianize" the sample, ensuring it aligns with what the model expects at the next time step. This small added noise smoothens any irregularities and makes the reverse process more stable, resulting in higher-quality outputs.

3. Step-by-Step Noise Removal
The reverse process works by removing noise step-by-step, moving from pure noise back to a clean image (closer to x0). This gradual approach is crucial because predicting small changes (i.e., removing a little noise at a time) is much easier for the model than trying to reconstruct the clean image in one big jump. This is why diffusion models produce sharper and more realistic images compared to VAEs, where predictions often result in blurry outputs due to the lack of such gradual refinement.
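In standard DDPM notation (not taken from the video), the Gaussian reverse transition this comment describes is:

```latex
p_\theta(x_{t-1} \mid x_t)
= \mathcal{N}\!\big(x_{t-1};\; \mu_\theta(x_t, t),\; \sigma_t^2 I\big),
\qquad
\mu_\theta(x_t, t)
= \frac{1}{\sqrt{\alpha_t}}
  \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)
```

with each β_t kept small, so every reverse step only has to undo a tiny, near-Gaussian perturbation.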
Excellent. Complete, clear and to the point
I am new to this field; can anyone provide me with the prerequisites to understand this video?
Hello @GouravJoshi-z7j, I think this list covers the pre-requisites:
- Gaussian distribution and its properties (mean/variance of adding two independent Gaussians)
- Reparameterization trick
- Maximum Likelihood Estimation
- Variational Lower Bound
- Bayes theorem, conditional independence
- KL Divergence, KL divergence between two Gaussians
- VAE (because the video incorrectly assumes knowledge about it)
I may have missed something, so in case there is some aspect of the video that you aren't able to understand even after that, please do let me know.
thank you very much!!! great explanation ❤
Thank You :)
Thank you! It was amazing. While there is limited content available on diffusion models, you did a pretty nice job. ❤
Thank you for your kind words :)
Excellent Work ! Please Don't Stop :)
Thank you so much for your support
Please don't use music in the background, it's very distracting, thanks.
Thank you for the feedback. Have taken care of this in my recent videos.