Thank you very much for this series, and the overall amazing content, genuinely appreciated !
Thank you so much for this comment :)
Man, I love you. You're the person I was looking for. Thanks for the great explanation that doesn't skip details or the things that can be unclear.
Thank you for this comment :) Really happy that my videos are of help to you.
Bro no one in this platform explained clearly as much as you Thankyou for providing these lectures for free of cost . I think for paid courses also no one can explain this much thankyou again
Really happy that you found the explanation helpful :)
Very informative! Step by step approach for OD in CV. Best video so far 👍❤
Thank You :)
Hi, great explanation of RCNN with very useful insights which often are skipped. I am especially grateful for answering questions like "Why SVM, Why different IOU Thr, etc."
Happy that you found the explanation helpful!
Your channel is super highly underrated. Keep doing these videos on different CV topics. You are really helping out a lot of people.
Thank you for these words of appreciation Manan :)
Dude, a legit nice explanation of the topic, keep it up and give us more such content!
Thank You!
Gotta tell you man, amazing content and presentation. Also, the background music is very soothing 🍃. Waiting for the YOLO series
Thank you! I'm working on the YOLO video as of now. Agree on the background music, I too find it calming.
Great video as always. Appreciate the way you logically break down the reasons for architectural choices and smoothly transition to successive steps
Eagerly waiting for the next video in the series!
Just wondering if you intend to cover MobileNetV2 and EfficientNetV2 in this series
Thank you so much for that! Actually those two won't be covered in this series. I plan to do a separate one on popular backbone architectures like vgg/inception/resnet/mobilenet/efficientnet/darknet/swin etc., so I will cover them in that series.
Can you please share the notes of all the object detection videos?
Nice tutorial, keep it up
Thank You!
Please answer me, how can they train a CLASS SPECIFIC bounding box regressor? Do they feed the class as an input into one model that regresses the bounding box, or do they build multiple models (if they detect 6 classes then they build 6 models), with each model trained as the bounding box regressor for a specific class?
Hello, I have tried to explain a bit on this, do let me know if this does not clarify everything for you.
This is how the official rcnn repo does it. We create as many box regressor models as there are classes.
Then we train each of these regressors separately using proposals assigned to the respective classes.
github.com/rbgirshick/rcnn/blob/master/bbox_regression/rcnn_train_bbox_regressor.m#L76
During inference, given the predicted classes for the proposals, we use the trained regressor for that class to modify the proposal.
github.com/rbgirshick/rcnn/blob/master/bbox_regression/rcnn_test_bbox_regressor.m#L58-L65
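If pseudocode helps, here is a rough sketch of that per-class setup. This is not the official MATLAB code, just an illustrative PyTorch analogue with made-up names (the actual repo fits ridge regressors on pool5 features):

```python
import torch
import torch.nn as nn

NUM_CLASSES = 20      # e.g. the 20 PASCAL VOC categories
FEAT_DIM = 4096       # dimensionality of the CNN features fed to the regressors

# One independent regressor per class, each predicting (tx, ty, tw, th)
regressors = nn.ModuleList([nn.Linear(FEAT_DIM, 4) for _ in range(NUM_CLASSES)])

def train_class_regressor(class_idx, feats, targets, lr=1e-3, steps=100):
    # feats: (N, FEAT_DIM) features of proposals assigned to this class
    # targets: (N, 4) ground-truth transformation targets for those proposals
    opt = torch.optim.SGD(regressors[class_idx].parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(regressors[class_idx](feats), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

def refine_proposal(feat, predicted_class):
    # At inference, pick the regressor of the class the classifier predicted
    with torch.no_grad():
        return regressors[predicted_class](feat)   # (tx, ty, tw, th)
```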
Btw, you could also do this with one fc layer.
Let's say you have 10 classes. Then your bounding box regressor fc layer predicts 10 times 4 = 40 values. These are tx, ty, tw, th for all 10 classes.
Then during training, the bounding box regression loss is computed between the ground truth transformation targets and the predicted values at the indexes corresponding to the ground truth class.
At inference, you take the class index with the highest predicted probability value. The predicted tx, ty, tw, th are then the 4 values (out of 40) corresponding to this most probable class.
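A minimal sketch of this single fc layer variant (illustrative names only, assuming 10 classes and 4096-d features; the particular loss here is just for the sketch):

```python
import torch
import torch.nn as nn

NUM_CLASSES = 10
FEAT_DIM = 4096

# Single fc layer predicting 4 transform values for every class: 10 * 4 = 40 outputs
bbox_head = nn.Linear(FEAT_DIM, NUM_CLASSES * 4)

def bbox_regression_loss(feats, gt_targets, gt_classes):
    # feats: (N, FEAT_DIM), gt_targets: (N, 4), gt_classes: (N,) long tensor
    pred = bbox_head(feats).view(-1, NUM_CLASSES, 4)
    # loss is computed only on the 4 values at the ground-truth class index
    pred_at_gt = pred[torch.arange(pred.size(0)), gt_classes]
    return nn.functional.smooth_l1_loss(pred_at_gt, gt_targets)

def predict_transforms(feats, class_scores):
    # At inference, take the 4 values corresponding to the highest scoring class
    pred = bbox_head(feats).view(-1, NUM_CLASSES, 4)
    best = class_scores.argmax(dim=1)
    return pred[torch.arange(pred.size(0)), best]
```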
@@Explaining-AI Thank you a lot!!!! I fully understand it now. So they do train multiple models and choose the model based on the class. That's crazy though!
Can you also explain the PyTorch code for RCNN?
Hello, I will soon be doing a video on the implementation of Faster R-CNN, in which I will cover the PyTorch code as well.
Subscribed
Will you also do a video on EfficientDet?
Can you please do a video on YOLO object detection and do a code implementation from scratch?
Hello, yes, that's the video that I am working on right now. I will first do a YOLOv1 explanation and implementation video and then later follow up with the other YOLO versions.
Very well done
Thank You!
Will you cover a Mamba implementation later? I think there's currently no video with a clear explanation. It would be very nice if you did it.
Hello,
I indeed plan to cover it but it won't be part of this series.
I have 3-4 topics that I intend to cover first, and after that I will do a video on Mamba.
Is DETR covered in this series?
Yes, it will cover DETR as well. After Faster R-CNN, I plan to do YOLO/SSD/FPN and then I will get into DETR.
Great video, but the background music is a bit distracting, imo
Thank you for this feedback. Will take care of this in future videos of this series.
I think your object detection series is awesome but you should not put background sound :D
Thank you for this feedback. I assume the background music becomes a distraction. Is that right?
Do you think reducing the background sound would work fine, or would you prefer not having it altogether?
@@Explaining-AI Background is fine, I guess
@@asutoshrath3648 Thank you for this input
Hi, thanks again for the lectures. I wish to ask you: @26:02 what's the thinking behind training such a bounding box regressor? Like, what will the regressor learn in general? I am asking because I don't understand how the BB regressor will be able to correct the input proposals during inference. During training it kind of learns how to shift a bounding box to a better-fitting one because we have a loss function there.
But how will it do the same thing well during inference? I'm not able to fully understand the learning and explainability of this regressor. Can you please help with this? Thanks again.
Thanks for the appreciation @anshumansinha5874 :)
Regarding the BB regressor, during training, like you said, using the loss, the regressor gets better at modifying the starting proposal box into the ground truth box using the pool5 features for the proposal box.
So intuitively, taking the person example in the video, the regressor would be learning how to use the features of the person, for example the location of the legs, eyes, hands etc., to better estimate the size/extent of the person and hence better estimate a tightly fitting box (the ground truth). To be precise, rather than estimating a box, it estimates the parameters for converting the input proposal box into the ground truth box.
During inference, this learning allows it to achieve the same. At test time it will use these detected features and the learnt function to predict the output box (one which should better fit the underlying object).
Does this help clarify things a bit?
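To make "estimating the parameters for converting the proposal box to the ground truth box" concrete, this is the parameterization from the R-CNN paper, sketched with boxes expressed as centre coordinates plus width/height:

```python
import math

def transform_targets(px, py, pw, ph, gx, gy, gw, gh):
    # What the regressor is trained to predict: the shift/scale that maps
    # the proposal box P onto the ground truth box G
    tx = (gx - px) / pw
    ty = (gy - py) / ph
    tw = math.log(gw / pw)
    th = math.log(gh / ph)
    return tx, ty, tw, th

def apply_transform(px, py, pw, ph, tx, ty, tw, th):
    # What happens at inference: the predicted (tx, ty, tw, th) are applied
    # to the proposal to get the refined, hopefully tighter, box
    return (pw * tx + px,
            ph * ty + py,
            pw * math.exp(tw),
            ph * math.exp(th))
```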
Hi @14:39, you said if our images have 2 classes then the network would have 3 outputs. But how would you know that all the images have only these 2 classes? Or is this network only made for a specific set of images which only have cars and persons as the 2 distinct objects?
Is the selective search mechanism fine-tuned for a specific set of images? (Like: 1. (Person, Car), 2. (Bird, Buildings, Lights), etc.) But would that not need a different network for a different set?
Hello @@anshumansinha5874, the number of categories is predefined and the network is only trained for detecting these predefined categories. So the hypothetical example that I was mentioning refers to some dataset that has annotations only for person and car, and post training you will end up with a network which, given an image, can detect car or person (only these two objects) in it. This model will ignore any other categories, say buildings/birds, in the image and will basically predict regions having such objects as background.
Regarding the selective search question, it is neither trained nor fine-tuned. It's a proposal generation algorithm that latches onto hints like the presence of edges, changes of texture etc. to divide the image into different regions and bound those regions within bounding boxes to give us region proposals.
So it does not really depend on your dataset or the kind of categories you have in your images.
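If it helps, you can see this class-agnostic behaviour by running OpenCV's selective search implementation on any image (this sketch assumes opencv-contrib-python is installed; the filename is just a placeholder):

```python
import cv2

image = cv2.imread("any_image.jpg")    # no category/annotation information involved

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()       # faster, coarser mode; a quality mode also exists

rects = ss.process()                   # array of (x, y, w, h) region proposals
print(f"Generated {len(rects)} proposals")

# R-CNN keeps roughly the first ~2000 proposals, warps each crop to 227x227
# and runs every one of them through the CNN backbone
for (x, y, w, h) in rects[:2000]:
    crop = image[y:y + h, x:x + w]
```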
@Explaining-AI Perfect, thanks a lot for the answers. I had one follow-up question. From my understanding, we train K binary SVMs after we have fine-tuned the CNN backbone with the multi-class classification objective.
I'm a bit confused about what the SVM will pass as +ve. Will it only give a +ve label for a perfect ground truth input image (input image = a ground truth bounding box adjusted to the 227x227 input dimension), i.e. an IOU = 1.0? What happens to the instances which lie between IOU of 1.0 and 0.3? What does the SVM classify them into?
Lastly, if the SVM only gives +ve to the input image with IOU = 1.0, would it not be better to correct the images for localisation error as soon as we get the region proposals? i.e. having a trained bounding box regressor (as it's already being trained separately) and then passing the corrected image on to the CNN+SVM model for training/predictions?
I'm a bit confused because @26:12 you've mentioned that if selective search performs badly and doesn't give any proposal with IOU = 1.0, then our predicted region will be this itself. However, since the SVM only gives a +ve result for IOU = 1.0, this should not be the case.
@@anshumansinha5874 what the SVM will pass as +ve ->
During training, the SVM is going to get the following data points for each class (let's say car).
Positive - ALL ground truth boxes for the car class
Negative - Selective search region proposals with < 0.3 IOU with the ground truth boxes belonging to the car class
The rest are all ignored
Then the SVM, in the 4096-dimensional feature space, learns a boundary that separates these positive and negative labelled data points. So during inference, even regions that do not exactly capture the object (IOU = 1) but capture a large enough part of it (IOU = 0.8) will still be predicted to be on the positive side of the decision boundary.
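In rough pseudocode, the per-class label assignment could look like this (hypothetical helper names, sketched with scikit-learn; the actual repo additionally does hard negative mining on top of this):

```python
import numpy as np
from sklearn.svm import LinearSVC

def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def svm_training_set(gt_boxes, gt_feats, prop_boxes, prop_feats):
    # Positives: the 4096-d features of the ground truth boxes themselves
    X, y = list(gt_feats), [1] * len(gt_feats)
    # Negatives: proposals whose max IOU with every car ground truth box is < 0.3
    for box, feat in zip(prop_boxes, prop_feats):
        if max(iou(box, gt) for gt in gt_boxes) < 0.3:
            X.append(feat)
            y.append(0)
        # proposals with IOU between 0.3 and 1.0 are simply ignored here
    return np.array(X), np.array(y)

car_svm = LinearSVC()   # one binary SVM per class
# X, y = svm_training_set(...); car_svm.fit(X, y)
```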
should it not be better to correct the images for localisation error as soon as we get the region proposals ->
There are two parts to this. The first is that the SVM is going to give a score, so during inference, even if a region proposal is not the perfect box containing the car (but contains a large enough part of it), it will still return a higher score for 'car' than for background.
The second part is regarding modifying regions prior to feeding them to the SVM. It's theoretically correct, but rather than trying to first modify the proposals (because then you would have to feed ALL 2000 proposals to the bbox regression layers), the authors instead get the SVM score, get the newly predicted box, and then try to rescore again (feed again to the SVM) using the newly predicted box. However, that doesn't lead to any benefits.
From paper "In principle, we could iterate this procedure (i.e., re-score the newly predicted bounding
box, and then predict a new bounding box from it, and so on). However, we found that iterating does not improve results."
@@Explaining-AI 1. SVM: Oh, okay, I get it. I think you're talking about the SVM margin, which can help the model include some samples with a slightly lower IOU as well. Do you think this is one of the advantages of using a margin-based method like SVM? (Honestly, I'm not able to recollect any other method with a max-margin/hinge loss.) I mean, they could've used any other binary classifier as well.
2. Makes sense after I got the margin concept of SVM, thanks for the help. And great videos.