Thank you very much for this series, and the overall amazing content, genuinely appreciated !
Thank you so much for this comment :)
Man, I love you. You're the person I was looking for. Thanks for the great explanation that doesn't skip details or the things that can be unclear.
Thank you for this comment :) Really happy that my videos are of help to you.
Bro no one in this platform explained clearly as much as you Thankyou for providing these lectures for free of cost . I think for paid courses also no one can explain this much thankyou again
Really happy that you found the explanation helpful :)
Very informative! Step by step approach for OD in CV. Best video so far 👍❤
Thank You :)
Hi, great explanation of RCNN with very useful insights which often are skipped. I am especially grateful for answering questions like "Why SVM, Why different IOU Thr, etc."
Happy that you found the explanation helpful!
Your channel is super highly underrated. Keep doing these videos on different CV topics. You are really helping out a lot of people.
Thank you for these words of appreciation Manan :)
Dude, a legit nice explanation of the topic, keep it up and give us more such content!
Thank You!
Gotta tell you man, amazing content and presentation. Also, the background music is very soothing 🍃. Waiting for the YOLO series
Thank you! I'm working on the YOLO video as of now. Agree on the background music, I too find it calming.
Great video as always. Appreciate the way you logically break down the reasons for architectural choices and smoothly transition to successive steps
Eagerly waiting for the next video in the series!
Just wondering if you intend to cover MobileNetV2 and EfficientNetV2 in this series
Thank you so much for that! Actually those two won't be covered in this series. I plan to do a separate one on popular backbone architectures like vgg/inception/resnet/mobilenet/efficientnet/darknet/swin etc., so I will cover them in that series.
Can you please share the notes of all the object detection videos?
Nice tutorial, keep it up
Thank You!
Please answer me, how can they train a CLASS SPECIFIC bounding box regressor? Do they feed the class as an input into one model that regresses the bounding box, or do they build multiple models (if they detect 6 classes then they build 6 models), with each model trained as the bounding box regressor for a specific class?
Hello, I have tried to explain a bit on this, do let me know if this does not clarify everything for you.
This is how the official rcnn repo does it. We create as many box regressor models as there are classes.
Then we train each of these regressors separately using proposals assigned to the respective classes.
github.com/rbgirshick/rcnn/blob/master/bbox_regression/rcnn_train_bbox_regressor.m#L76
During inference, given the predicted classes for the proposals, we use the trained regressor for that class to modify the proposal.
github.com/rbgirshick/rcnn/blob/master/bbox_regression/rcnn_test_bbox_regressor.m#L58-L65
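If pseudocode helps, here is a rough sketch of that per-class setup. This is not the official MATLAB code, just an illustrative PyTorch analogue with made-up names (the actual repo fits ridge regressors on pool5 features):

```python
import torch
import torch.nn as nn

NUM_CLASSES = 20      # e.g. the 20 PASCAL VOC categories
FEAT_DIM = 4096       # dimensionality of the CNN features fed to the regressors

# One independent regressor per class, each predicting (tx, ty, tw, th)
regressors = nn.ModuleList([nn.Linear(FEAT_DIM, 4) for _ in range(NUM_CLASSES)])

def train_class_regressor(class_idx, feats, targets, lr=1e-3, steps=100):
    # feats: (N, FEAT_DIM) features of proposals assigned to this class
    # targets: (N, 4) ground-truth transformation targets for those proposals
    opt = torch.optim.SGD(regressors[class_idx].parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(regressors[class_idx](feats), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

def refine_proposal(feat, predicted_class):
    # At inference, pick the regressor of the class the classifier predicted
    with torch.no_grad():
        return regressors[predicted_class](feat)   # (tx, ty, tw, th)
```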
Btw, you could also do this with one fc layer.
Let's say you have 10 classes. Then your bounding box regressor fc layer predicts 10 times 4 = 40 values. These are tx, ty, tw, th for all 10 classes.
Then during training, the bounding box regression loss is computed between the ground truth transformation targets and the predicted values at the indexes corresponding to the ground truth class.
At inference, you take the class index with the highest predicted probability value. The predicted tx, ty, tw, th are then the 4 values (out of 40) corresponding to this most probable class.
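A minimal sketch of this single fc layer variant (illustrative names only, assuming 10 classes and 4096-d features; the particular loss here is just for the sketch):

```python
import torch
import torch.nn as nn

NUM_CLASSES = 10
FEAT_DIM = 4096

# Single fc layer predicting 4 transform values for every class: 10 * 4 = 40 outputs
bbox_head = nn.Linear(FEAT_DIM, NUM_CLASSES * 4)

def bbox_regression_loss(feats, gt_targets, gt_classes):
    # feats: (N, FEAT_DIM), gt_targets: (N, 4), gt_classes: (N,) long tensor
    pred = bbox_head(feats).view(-1, NUM_CLASSES, 4)
    # loss is computed only on the 4 values at the ground-truth class index
    pred_at_gt = pred[torch.arange(pred.size(0)), gt_classes]
    return nn.functional.smooth_l1_loss(pred_at_gt, gt_targets)

def predict_transforms(feats, class_scores):
    # At inference, take the 4 values corresponding to the highest scoring class
    pred = bbox_head(feats).view(-1, NUM_CLASSES, 4)
    best = class_scores.argmax(dim=1)
    return pred[torch.arange(pred.size(0)), best]
```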
@@Explaining-AI Thank you a lot!!!! I fully understand it now. So they do train multiple models and choose the model based on the class. That's crazy though!
Can you also explain the PyTorch code for RCNN?
Hello, I will soon be doing a video on the implementation of Faster R-CNN, in which I will cover the PyTorch code as well.
Subscribed
Will you also do a video on EfficientDet?
Can you please do a video on YOLO object detection and do a code implementation from scratch?
Hello, yes, that's the video that I am working on right now. I will first do a YOLOv1 explanation and implementation video and then later follow up with the other YOLO versions.
Very well done
Thank You!
Will you cover a Mamba implementation later? I think there's currently no video with a clear explanation. It would be very nice if you did it.
Hello,
I indeed plan to cover it but it won't be part of this series.
I have 3-4 topics that I intend to cover first, and after that I will do a video on Mamba.
Is DETR covered in this series?
Yes, it will cover DETR as well. After Faster R-CNN, I plan to do YOLO/SSD/FPN and then I will get into DETR.
Great video, but the background music is a bit distracting, imo
Thank you for this feedback. Will take care of this in future videos of this series.
I think your object detection series is awesome but you should not put background sound :D
Thank you for this feedback. I assume the background music becomes a distraction. Is that right?
Do you think reducing the background sound would work fine, or would you prefer not having it altogether?
@@Explaining-AI Background is fine, I guess
@@asutoshrath3648 Thank you for this input
Hi, thanks again for the lectures. I wish to ask you: @26:02 what's the thinking behind training such a bounding box regressor? Like, what will the regressor learn in general? I am asking because I don't understand how the BB regressor will be able to correct the input proposals during inference. During training it kind of learns how to shift a bounding box to a better-fitting one because we have a loss function there.
But how will it do the same thing well during inference? I'm not able to fully understand the learning and explainability of this regressor. Can you please help with this? Thanks again.
Thanks for the appreciation @anshumansinha5874 :)
Regarding the BB regressor, during training, like you said, using the loss, the regressor gets better at modifying the starting proposal box into the ground truth box using the pool5 features for the proposal box.
So intuitively, taking the person example in the video, the regressor would be learning how to use the features of the person, for example the location of the legs, eyes, hands etc., to better estimate the size/extent of the person and hence better estimate a tightly fitting box (the ground truth). To be precise, rather than estimating a box, it estimates the parameters for converting the input proposal box into the ground truth box.
During inference, this learning allows it to achieve the same. At test time it will use these detected features and the learnt function to predict the output box (one which should better fit the underlying object).
Does this help clarify things a bit?
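To make "estimating the parameters for converting the proposal box to the ground truth box" concrete, this is the parameterization from the R-CNN paper, sketched with boxes expressed as centre coordinates plus width/height:

```python
import math

def transform_targets(px, py, pw, ph, gx, gy, gw, gh):
    # What the regressor is trained to predict: the shift/scale that maps
    # the proposal box P onto the ground truth box G
    tx = (gx - px) / pw
    ty = (gy - py) / ph
    tw = math.log(gw / pw)
    th = math.log(gh / ph)
    return tx, ty, tw, th

def apply_transform(px, py, pw, ph, tx, ty, tw, th):
    # What happens at inference: the predicted (tx, ty, tw, th) are applied
    # to the proposal to get the refined, hopefully tighter, box
    return (pw * tx + px,
            ph * ty + py,
            pw * math.exp(tw),
            ph * math.exp(th))
```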
Hi @14:39, you said if our images have 2 classes then the network would have 3 outputs. But how would you know that all the images have only these 2 classes? Or is this network only made for a specific set of images which only have cars and persons as the 2 distinct objects?
Is the selective search mechanism fine-tuned for a specific set of images? (Like: 1. (Person, Car), 2. (Bird, Buildings, Lights), etc.) But would that not need a different network for a different set?
Hello @@anshumansinha5874, the number of categories is predefined and the network is only trained for detecting these predefined categories. So the hypothetical example that I was mentioning refers to some dataset that has annotations only for person and car, and post training you will end up with a network which, given an image, can detect car or person (only these two objects) in it. This model will ignore any other categories, say buildings/birds, in the image and will basically predict regions having such objects as background.
Regarding the selective search question, it is neither trained nor fine-tuned. It's a proposal generation algorithm that latches onto hints like the presence of edges, changes of texture etc. to divide the image into different regions and bound those regions within bounding boxes to give us region proposals.
So it does not really depend on your dataset or the kind of categories you have in your images.
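If it helps, you can see this class-agnostic behaviour by running OpenCV's selective search implementation on any image (this sketch assumes opencv-contrib-python is installed; the filename is just a placeholder):

```python
import cv2

image = cv2.imread("any_image.jpg")    # no category/annotation information involved

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()       # faster, coarser mode; a quality mode also exists

rects = ss.process()                   # array of (x, y, w, h) region proposals
print(f"Generated {len(rects)} proposals")

# R-CNN keeps roughly the first ~2000 proposals, warps each crop to 227x227
# and runs every one of them through the CNN backbone
for (x, y, w, h) in rects[:2000]:
    crop = image[y:y + h, x:x + w]
```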
@Explaining-AI Perfect, thanks a lot for the answers. I had one follow-up question. From my understanding, we train K binary SVMs after we have fine-tuned the CNN backbone with the multi-class classification objective.
I'm a bit confused about what the SVM will pass as +ve. Will it only give a +ve label for a perfect ground truth input image (input image = a ground truth bounding box adjusted to the 227x227 input dimension), i.e. an IOU = 1.0? What happens to the instances which lie between IOU of 1.0 and 0.3? What does the SVM classify them into?
Lastly, if the SVM only gives +ve to the input image with IOU = 1.0, would it not be better to correct the images for localisation error as soon as we get the region proposals? i.e. having a trained bounding box regressor (as it's already being trained separately) and then passing the corrected image on to the CNN+SVM model for training/predictions?
I'm a bit confused because @26:12 you've mentioned that if selective search performs badly and doesn't give any proposal with IOU = 1.0, then our predicted region will be this itself. However, since the SVM only gives a +ve result for IOU = 1.0, this should not be the case.
@@anshumansinha5874 what the SVM will pass as +ve ->
During training, the SVM is going to get the following data points for each class (let's say car).
Positive - ALL ground truth boxes for the car class
Negative - Selective search region proposals with < 0.3 IOU with the ground truth boxes belonging to the car class
The rest are all ignored
Then the SVM, in the 4096-dimensional feature space, learns a boundary that separates these positive and negative labelled data points. So during inference, even regions that do not exactly capture the object (IOU = 1) but capture a large enough part of it (IOU = 0.8) will still be predicted to be on the positive side of the decision boundary.
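In rough pseudocode, the per-class label assignment could look like this (hypothetical helper names, sketched with scikit-learn; the actual repo additionally does hard negative mining on top of this):

```python
import numpy as np
from sklearn.svm import LinearSVC

def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def svm_training_set(gt_boxes, gt_feats, prop_boxes, prop_feats):
    # Positives: the 4096-d features of the ground truth boxes themselves
    X, y = list(gt_feats), [1] * len(gt_feats)
    # Negatives: proposals whose max IOU with every car ground truth box is < 0.3
    for box, feat in zip(prop_boxes, prop_feats):
        if max(iou(box, gt) for gt in gt_boxes) < 0.3:
            X.append(feat)
            y.append(0)
        # proposals with IOU between 0.3 and 1.0 are simply ignored here
    return np.array(X), np.array(y)

car_svm = LinearSVC()   # one binary SVM per class
# X, y = svm_training_set(...); car_svm.fit(X, y)
```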
should it not be better to correct the images for localisation error as soon as we get the region proposals ->
There are two parts to this. The first is that the SVM is going to give a score, so during inference, even if a region proposal is not the perfect box containing the car (but contains a large enough part of it), it will still return a higher score for 'car' than for background.
The second part is regarding modifying regions prior to feeding them to the SVM. It's theoretically correct, but rather than trying to first modify the proposals (because then you would have to feed ALL 2000 proposals to the bbox regression layers), the authors instead get the SVM score, get the newly predicted box, and then try to rescore again (feed again to the SVM) using the newly predicted box. However, that doesn't lead to any benefits.
From paper "In principle, we could iterate this procedure (i.e., re-score the newly predicted bounding
box, and then predict a new bounding box from it, and so on). However, we found that iterating does not improve results."
@@Explaining-AI 1. SVM: Oh, okay, I get it. I think you're talking about the SVM margin, which can help the model include some samples with a slightly lower IOU as well. Do you think this is one of the advantages of using a margin-based method like SVM? (Honestly, I'm not able to recollect any other method with a max-margin/hinge loss.) I mean, they could've used any other binary classifier as well.
2. Makes sense after I got the margin concept of SVM, thanks for the help. And great videos.