3. How RPN (Region Proposal Networks) Works

  • Published on Sep 28, 2024

Comments • 127

  • @ductranminh3824
    @ductranminh3824 6 years ago +7

    Thank you for your video, it helped me a lot to understand this better. However, I am still confused about the RPN and I have some questions that I hope you could explain:
    1. At 5:30, how can we choose 512 anchors from the 2400 anchors? I mean, what is the criterion for choosing them? Also, in the paper the authors write that each sliding window is mapped to a lower-dimensional feature by a convolution over the sliding window, not over anchors as you said. So are the features the authors mention and your anchors the same thing? And if they are features, I can't understand how a convolution gives us a 512-d feature. (Actually, is the 512-d feature a vector or a matrix?)
    2. The sliding window slides over the feature map and contains 9 anchor boxes. So are Xa, Ya, Wa, Ha defined in image coordinates or in feature-map coordinates? And if they are in image coordinates, how do we know Xa and Ya (the anchor-box coordinates) to compute Tx and Ty (the predicted-box coordinates) when we only know the location of the sliding window on the feature map?

    • @ArdianUmam
      @ArdianUmam  6 years ago +5

      Hi,
      Find my replies here.
      1. Yes, the last feature map is down-sampled to 512-d (in the case of VGG-16) per anchor position. With a 40x60 last feature map, using a 3x3 window in standard convolution fashion, it will generate 40x60 = 2400 output positions, right? So how do they get 512-d features? They didn't mention the details in the paper. Maybe a stride greater than 1 is used to down-sample? We don't really know until we dive into the code directly (which I didn't do, since this video is only a paper review). "How can we choose 512 anchors from 2400 anchors?" --> Sorry if this part created some ambiguity. To clarify: (i) the 512-d output features are down-sampled from the last feature map, not chosen from the 2400 anchors; (ii) the choosing of 512 anchors (256 each for positive and negative anchors) happens in the loss-function calculation. Since negative anchors would dominate if we used all 2400 anchors, they decided to use only 256 positive and 256 negative anchors out of the 2400, to make the positive-to-negative ratio 1:1. In the paper they say these are chosen randomly.
      2. In my understanding, Xa, Ya, Wa, Ha are in input-image coordinates, since they are compared with the bbox ground truth, which is also in the image (not in the feature map), when calculating the loss function. So how do we compute Tx, Ty? In the paper, Ty = (Y - Ya)/Ha. Think of it like this: say the heights of the input image and the last feature map are 600 and 40 respectively, so one stride on the last feature map equals a 15-pixel stride in the input image (600/40 = 15). Using this mechanism, we can obtain the value of Ya. Ha is known from the anchor size, and Y is taken from the bbox prediction output during the training phase.
      Hope this reply helps and makes things clear.
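
      For illustration of point 2, here is a minimal NumPy sketch (my own, not the authors' code; the sliding-window position, anchor size, and box values below are made up). It maps a feature-map location to an anchor center in input-image coordinates using the stride (600/40 = 15 in this example) and computes the regression targets from the paper: tx = (x - xa)/wa, ty = (y - ya)/ha, tw = log(w/wa), th = log(h/ha).

      import numpy as np

      stride = 600 / 40                                   # one feature-map cell spans 15 input-image pixels
      fy, fx = 10, 25                                     # hypothetical sliding-window position on the 40x60 feature map
      xa, ya = (fx + 0.5) * stride, (fy + 0.5) * stride   # anchor center in input-image coordinates
      wa, ha = 128.0, 256.0                               # one of the 9 anchor shapes (scale and aspect ratio)

      x, y, w, h = 390.0, 170.0, 140.0, 230.0             # hypothetical box (center x, center y, width, height) in image coordinates

      tx, ty = (x - xa) / wa, (y - ya) / ha
      tw, th = np.log(w / wa), np.log(h / ha)
      print(tx, ty, tw, th)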

    • @yakkayakka360
      @yakkayakka360 4 years ago +2

      I think the 512 comes from the shape of the VGG16 network itself. In the Faster R-CNN paper, sec 3.1, the authors state:
      "Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following)"
      In Python, you can print a summary of the network. Using a size of 800x800 for the input image and VGG16, we get:
      Model: "vgg16"
      _________________________________________________________________
      Layer (type) Output Shape Param #
      =================================================================
      input_1 (InputLayer) [(None, 800, 800, 3)] 0
      _________________________________________________________________
      block1_conv1 (Conv2D) (None, 800, 800, 64) 1792
      _________________________________________________________________
      block1_conv2 (Conv2D) (None, 800, 800, 64) 36928
      _________________________________________________________________
      block1_pool (MaxPooling2D) (None, 400, 400, 64) 0
      _________________________________________________________________
      block2_conv1 (Conv2D) (None, 400, 400, 128) 73856
      _________________________________________________________________
      block2_conv2 (Conv2D) (None, 400, 400, 128) 147584
      _________________________________________________________________
      block2_pool (MaxPooling2D) (None, 200, 200, 128) 0
      _________________________________________________________________
      block3_conv1 (Conv2D) (None, 200, 200, 256) 295168
      _________________________________________________________________
      block3_conv2 (Conv2D) (None, 200, 200, 256) 590080
      _________________________________________________________________
      block3_conv3 (Conv2D) (None, 200, 200, 256) 590080
      _________________________________________________________________
      block3_pool (MaxPooling2D) (None, 100, 100, 256) 0
      _________________________________________________________________
      block4_conv1 (Conv2D) (None, 100, 100, 512) 1180160
      _________________________________________________________________
      block4_conv2 (Conv2D) (None, 100, 100, 512) 2359808
      _________________________________________________________________
      block4_conv3 (Conv2D) (None, 100, 100, 512) 2359808
      _________________________________________________________________
      block4_pool (MaxPooling2D) (None, 50, 50, 512) 0
      _________________________________________________________________
      block5_conv1 (Conv2D) (None, 50, 50, 512) 2359808
      _________________________________________________________________
      block5_conv2 (Conv2D) (None, 50, 50, 512) 2359808
      _________________________________________________________________
      block5_conv3 (Conv2D) (None, 50, 50, 512) 2359808
      _________________________________________________________________
      block5_pool (MaxPooling2D) (None, 25, 25, 512) 0
      =================================================================
      Ignoring the last pooling layer, this is the backbone architecture used to generate the feature maps that are fed into the RPN. So you can see that the VGG16 network is defined in such a way that there are 512 different "features" (channels). I used an example image of a cat and was able to show the 512 features the VGG16 conv layers extracted (with pre-loaded weights) here: imgur.com/a/w4F5IJb
      In short, I think the answer is as simple as: "It's just the way the network was designed"
      Hope my understanding is correct. I am still new to this topic so take my learning with a grain of salt!
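
      For anyone who wants to reproduce the summary above, here is a minimal sketch assuming TensorFlow/Keras is installed (weights=None simply skips downloading the pretrained weights; pass weights="imagenet" to get the pre-loaded ones mentioned above):

      from tensorflow.keras.applications import VGG16

      # VGG16 backbone without its fully connected head, on an 800x800 RGB input
      backbone = VGG16(include_top=False, weights=None, input_shape=(800, 800, 3))
      backbone.summary()  # the block4/block5 conv layers have 512 channels, hence the 512-d RPN feature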

  • @layssi
    @layssi 4 years ago

    This is the clearest explanation of F-RCNN. Excellent job.

  • @nishtakk
    @nishtakk 7 years ago

    Thank you very much for all of the videos, helped me a lot.
    Two things that I've interpreted differently.
    1. The 512-d vector of the intermediate layer has 512 dimensions because we run the sliding window over every channel at the same location.
    Every channel gives us 1 number (dimension).
    Since the last VGG conv layer has 512 channels, we get a 512-d vector.
    Notice that in the ZF architecture they get 256-d.
    2. The part where the vector/matrix is 512 * 9:
    it is basically the same vector, but in the last layer we have (4+2) * 9 output nodes, cls + reg for every anchor,
    and each is fully connected to every node in the intermediate layer.
    So this is why you have 512 * ((4+2) * 9) connections there.
    Again, my interpretation of the paper.
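
    For illustration of point 2 above, here is a minimal tf.keras sketch (my own layout, assuming TensorFlow is installed; the original release is Caffe) of the RPN head described in the paper: a 3x3 conv producing the 512-d intermediate feature, then two sibling 1x1 convs producing 2*9 objectness scores and 4*9 box values, which gives per-position weight counts of 3*3*512*512 and 512*(2+4)*9.

    import tensorflow as tf

    k = 9                                            # anchors per sliding-window position
    feat = tf.keras.Input(shape=(40, 60, 512))       # last VGG-16 conv feature map
    inter = tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu")(feat)
    cls = tf.keras.layers.Conv2D(2 * k, 1)(inter)    # object vs. not-object score per anchor
    reg = tf.keras.layers.Conv2D(4 * k, 1)(inter)    # tx, ty, tw, th per anchor
    rpn_head = tf.keras.Model(feat, [cls, reg])
    rpn_head.summary()                               # weight counts match 3*3*512*512 and 512*6*9 (plus biases)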

    • @ArdianUmam
      @ArdianUmam  7 years ago

      Glad to hear it helps you a lot :)
      1. Yes, I was thinking about that, since the last conv layer of VGG16 has 512 channels and ZF-net has 256. But when I discussed it with my friend who also does research on deep learning (but for NLP), we concluded that it would be a weird mechanism, because convolution is usually done with a filter whose depth matches the input depth. So we took the interpretation shown in this video. If we want to know which one is right, we can explore the code directly, but that's not the purpose of this paper review; we only want to take away the insight behind the idea of the algorithm :)
      2. Yes, exactly. I explain that in the 5th video, on how to train Faster R-CNN, at about 3:35 :)
      Are you a PhD student doing research on object detection?

    • @nishtakk
      @nishtakk 7 years ago

      Hi Ardian,
      Thanks for the reply. I agree that it is weird; that's why it took me a really long time to understand section 1.
      In the paper they say that they map this window to a lower dimension, so that is why I think they meant running the same window over every channel.
      Not a PhD student; I'm a BIOS programmer, a rather remote field of work :)
      But I'm doing a Masters in my free time, took a seminar course, and was asked to present this paper.
      Since I had no expertise on the subject, I was searching the net for a simpler but thorough explanation, so this is why it was very helpful.
      Really interesting subject

    • @ArdianUmam
      @ArdianUmam  7 years ago

      I thought you were a student of someone who commented on this video with the name Слава Мулюкин..xD
      Because at a glance your names all looked Russian, and I have a Russian friend here in my class too.
      Yes, deep learning is an interesting and hot topic now!

    • @nishtakk
      @nishtakk 7 years ago +1


      Yes, Russian, but from Israel.

    • @ArdianUmam
      @ArdianUmam  7 years ago

      Greetings from Taiwan, from an Indonesian 😁

  • @taoshatoo8680
    @taoshatoo8680 6 years ago

    Hi, your presentation is great. But should the anchor number be 256 for ZFNet and 512 for VGG, i.e. 256 anchors labelled object and 256 labelled not-object (ZF), and 512 object and 512 not-object (VGG)? Your presentation confused me; I thought the 512-d output of the ConvNet consisted of 256 object anchors and 256 non-object anchors. Am I right? If not, please correct me!

  • @ssy892
    @ssy892 5 years ago

    Hi, that was a good presentation.
    I don't know if this is an old video, but I will ask with my best hope.
    When the Faster R-CNN region proposal network classifies whether something is an object or background, I understood that the RPN takes the feature map from the conv layer and passes it through the 3x3 convolution irrespective of the shape of the anchor box. However, if we do not select an anchor box here, wouldn't the 2k objectness results all be the same? If there is one loss and one network for the 9 anchors, but the anchor shape is not applied, how do we get 9 different objectness results?

    • @ArdianUmam
      @ArdianUmam  5 years ago

      What do you mean by "don't select anchor box" and "anchor shape is not applied"? An anchor is basically a ground-truth labelling mechanism in the training process, used to decide whether an anchor location is an object or not.

    • @ssy892
      @ssy892 5 years ago

      Ardian Umam thank you for answer. ^^

  • @georgeyuan4867
    @georgeyuan4867 6 years ago +2

    I think it's a good video, but my questions weren't answered.
    The part going from the last conv layer slices (3x3x512) to the 512-D feature (512x512x9, right?) is not clear.
    The paper says a mini-network is used to achieve that, but gives no details, and the video is not clear either.
    I also don't think the 512 is selected from the 2400, because VGG uses 512 and ZF uses 256 as the output dimension,
    yet they should have the same shape in the last conv layer (40x60). So why would ZF-net also select 512 from 2400?

  • @zchenyu9797
    @zchenyu9797 6 years ago

    I have a question about the predicted probability p. In the inference phase, since we don't have GT objects for each image, how do we calculate the predicted probability in that case? Thanks a lot for your answer!

    • @ArdianUmam
      @ArdianUmam  6 years ago

      GT is used to calculate the loss function, which is generally a measure of how much the current prediction output differs from the GT; the training process uses this value to update the network weights so that the loss is lower in the next iteration. So to produce the prediction output itself we don't need a ground truth; we only need the trained weights and the input (test data).

    • @zchenyu9797
      @zchenyu9797 6 years ago

      Thanks for your response. I know that for a classifier CNN the prediction scores are always calculated by a softmax function, so for object detection are these scores always calculated the same way? That means that in the inference phase, if I get a bounding box labelled 'cat: 0.98', the object detection network predicts that this object has a 98% chance of being a cat?

    • @ArdianUmam
      @ArdianUmam  6 years ago

      Again, Faster R-CNN consists of two stages: (i) a region proposer whose outputs are bboxes and a binary class prediction (object vs. not; the object can be a cat, dog, car, any object), and (ii) Fast R-CNN as the classifier for each bbox obtained in (i), which also refines those bboxes. For stage (ii), yes, cat: 98% means a 98% chance of being a cat for the corresponding bbox from (i).

    • @zchenyu9797
      @zchenyu9797 6 years ago

      Thanks for your patience. So for stage (i), the RPN decides object or not object based on IoU (like >0.7 is considered an object and

    • @ArdianUmam
      @ArdianUmam  6 years ago

      Yes, in the inference phase we only need to get the bbox (4 values: the center coordinates x, y, plus width and height) and the class prediction. To get those values we don't need GT at inference time. After the training phase we already have the weights of our network, so it can already make predictions on a test image. Only if you want to calculate the accuracy do you of course need to compare the prediction output with the GT.
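
      For illustration of the inference direction, here is a minimal NumPy sketch (my own example values, not the release code): given an anchor and the predicted deltas (tx, ty, tw, th), we recover the predicted box without any ground truth, by inverting the encoding from the paper.

      import numpy as np

      def decode(anchor, deltas):
          """Turn predicted deltas plus an anchor (xa, ya, wa, ha) into a box (x, y, w, h)."""
          xa, ya, wa, ha = anchor
          tx, ty, tw, th = deltas
          x = tx * wa + xa            # predicted center x
          y = ty * ha + ya            # predicted center y
          w = wa * np.exp(tw)         # predicted width
          h = ha * np.exp(th)         # predicted height
          return x, y, w, h

      # hypothetical anchor (center x, center y, width, height) and network outputs
      print(decode((375.0, 157.5, 128.0, 256.0), (0.12, 0.05, 0.09, -0.11)))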

  • @zhishuaifeng3342
    @zhishuaifeng3342 7 years ago

    Nice tutorials. Could you please give more details about the 512-dimensional output? Where does the number 512 come from?

    • @rachanadesai7984
      @rachanadesai7984 7 years ago

      i have the same question :(

    • @ArdianUmam
      @ArdianUmam  7 years ago

      The 512-d comes from the paper, meaning that it is determined by the authors. In the paper they investigate two architectures, VGG-16 and ZF-net, whose last convolution layers have depth 512 and 256 respectively. For VGG-16, the output connected to the FC layer (fully connected layer) is 512-d, and for ZF-net it is 256-d. They don't explain in detail where those numbers come from. As I already discussed with Dennis in a comment below, there are several possible interpretations unless you jump directly into the code to investigate. Here are some possibilities:
      1. The 512-d comes from a convolution applied to each channel of the last conv layer of VGG-16. Since the last conv layer of VGG-16 has depth 512, and each channel convolved with a kernel window yields one value, 512 channels * 1 value = a 512-d output. Likewise for ZF-net, with depth 256 in its last conv layer. BUT this is weird, since a standard convolution in a CNN uses the same depth as the input. Hence the second possibility below.
      2. The convolution uses the same depth as the input. Then we have 40x60 = 2,400 possible convolution locations on the last layer of VGG-16. Since the output is 512-d, one would somehow have to reduce the 2,400 convolution outputs to 512-d, for example by averaging groups of about 2,400/512 outputs into one value each.
      3. The interpretation I explain in this video.
      Again, if you want to know the exact process, just go directly to their official code. Since this video series is a paper review, it is not intended for that. Hope this answer is helpful :)
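
      For what it's worth, the reading that matches @nishtakk's and @yakkayakka360's comments above is that a standard 3x3 convolution (filter depth equal to the 512 input channels, with 512 such filters) simply produces a 512-d vector at every one of the 40x60 = 2,400 positions, with no selection or pooling step. A minimal sketch, assuming TensorFlow/Keras and a made-up random feature map:

      import numpy as np
      import tensorflow as tf

      feature_map = np.random.rand(1, 40, 60, 512).astype("float32")   # fake 40x60x512 last conv feature map
      rpn_conv = tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu")
      out = rpn_conv(feature_map)
      print(out.shape)   # (1, 40, 60, 512): a 512-d vector at each of the 2,400 sliding-window positions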

    • @wqtianjin
      @wqtianjin 7 years ago

      could you please post a link to the code? Thanks!

    • @ArdianUmam
      @ArdianUmam  7 years ago

      You can just go to the paper author github here:
      github.com/rbgirshick/py-faster-rcnn

  • @marxman1010
    @marxman1010 6 years ago

    I wonder whether the RPN is very much like the last-stage R-CNN, except for the classification.

  • @eungsockkim3865
    @eungsockkim3865 7 years ago

    I can't hear what the blue boxes indicate (00:50~00:51). Please let me know. Thanks for your videos.

    • @ArdianUmam
      @ArdianUmam  7 years ago

      The blue boxes there are the GT (ground-truth) boxes from the dataset.

    • @eungsockkim3865
      @eungsockkim3865 7 years ago

      Thanks a lot for your quick answer :)

  • @alfandosavant4639
    @alfandosavant4639 7 years ago

    Ooh, so this is the video 😁
    Just a suggestion: besides plain PPT slides, it might be better if you used some kind of blank canvas you can scribble on. That is more flexible than slide after slide packed with content, which can get overwhelming, hehehe. Good job anyway, bro!

    • @ArdianUmam
      @ArdianUmam  7 years ago

      Yeah, Ndo. I actually did intend to use a drawing pen; maybe later when I have some free time 😁
      This one was made right after a lab presentation, so I didn't change anything and just recorded the voice-over directly, so it didn't take much time.

  • @igorlfc
    @igorlfc 7 years ago

    Who defines these ground-truth boxes, or how are they defined? Are they the result of Selective Search?

    • @ArdianUmam
      @ArdianUmam  7 years ago

      GT boxes are already available in the dataset; they use the Pascal VOC and MS COCO datasets. The RPN learns to propose object regions from the dataset.
      As for Faster R-CNN, it no longer needs an external proposer like Selective Search, because that functionality is replaced by the RPN.

    • @igorlfc
      @igorlfc 7 years ago

      Thanks Ardian, I got it :) But can I apply Faster R-CNN to a continuous stream of pictures of the same (but different :) ) objects, e.g. vehicles, to detect them? And how can I create a dataset from a set of pictures?

    • @ArdianUmam
      @ArdianUmam  7 years ago +1

      Of course. But don't forget to consider the trade-off (accuracy vs. fps/computing resources).
      Faster R-CNN runs at 5 fps on an NVIDIA Tesla M40 GPU (roughly on a par with a GeForce GTX Titan X). Using SSD on a GTX Titan X you can get 22 fps with lower accuracy, and YOLOv2 can run at 40 fps with lower accuracy still.
      FYI: FB AI Research has already published a new paper, Mask R-CNN. For detection they use Faster R-CNN with the RoI-pooling layer replaced by so-called RoI-Align, and get better results (currently the best).
      For each dataset image, you can create your own GT with 4 values (width, height, and the x-y coordinates of the GT box's top-left corner).

    • @igorlfc
      @igorlfc 7 years ago

      Soon I will get my GeForce 1050 Ti (my old one was too slow; I could not even start a demo), and I want to try to implement Faster R-CNN with my own pictures.
      I would be very happy if you could give some info or advice about the things I need, or about how to start with my own pictures. The only things I have are the code from GitHub and, well, pictures))

    • @ArdianUmam
      @ArdianUmam  7 years ago

      Maybe you can start with the original code from their GitHub and then retrain with your own dataset on binary classes (car vs. not_car). You can try training only the FC layers plus the last conv layer while keeping the rest fixed at first, and see the result.

  • @MrBrij2385
    @MrBrij2385 7 years ago

    Have you trained a faster RCNN? Could you replicate the results the authors could achieve? If yes, do you have the code on github or somewhere else?

    • @ArdianUmam
      @ArdianUmam  7 years ago

      Brij Malhotra: I haven't. This is actually only a paper review for my regular lab meeting. Instead of only presenting it in the lab meeting, I added a voice-over and uploaded it to YT so that it may help others understand the concept. And my thesis research is not on object detection, but on stereo vision.

    • @MrBrij2385
      @MrBrij2385 7 years ago

      I would really like to know your work in stereo vision.

    • @ArdianUmam
      @ArdianUmam  7 years ago

      Brij Malhotra: of course, I'll let you know when the work is done. It will be similar to what I'm doing now for my master's thesis.
      th-cam.com/video/EEqCf_eno5c/w-d-xo.html

    • @MrBrij2385
      @MrBrij2385 7 years ago

      Thank You so much!

  • @sukritipaul9499
    @sukritipaul9499 6 years ago

    Thank you so much! This is super helpful :)

  • @TheBlackPenguin
    @TheBlackPenguin 6 years ago

    Awesome, bro. If you can, please also explain the whole working scheme from input all the way to prediction. Great stuff!

  • @viacheslavmuliukin1017
    @viacheslavmuliukin1017 7 years ago

    Please add English subtitles. I would be happy to show this video to my students

    • @ArdianUmam
      @ArdianUmam  7 years ago

      Glad to hear that :)
      Actually, English isn't my native language, by the way. But is my English pronunciation here not clear enough to understand? If it's still difficult to catch, I'll try to add subs when I have free time during my studies.

    • @ex42k2
      @ex42k2 7 years ago

      hehehehehe

    • @ArdianUmam
      @ArdianUmam  7 years ago

      Oh, Mas Eka is here!

    • @ex42k2
      @ex42k2 7 years ago

      Image processing is not my main field, but watching your videos gave me some insight into how things are done. Nice, bro.

    • @ArdianUmam
      @ArdianUmam  7 years ago +2

      If you want to learn more from the basics, I'd recommend this video series, Mas :) It's a great lecture series on deep learning from Stanford University.
      --> th-cam.com/video/NfnWJUyUJYU/w-d-xo.html

  • @alifwicaksanaramadhan6358
    @alifwicaksanaramadhan6358 4 years ago

    you're Indonesian right?

  • @haianhhoang6901
    @haianhhoang6901 2 years ago

    CovidImages need to be invested more than half19

  • @ademord
    @ademord 4 years ago

    the sound is so loud omg regulate your videos before upload please

  • @ujjaldas8805
    @ujjaldas8805 6 years ago +1

    Hello, I really enjoyed this video. Just a query: at 4:13, shouldn't the dimension of the convolution output be 40x60x1 (I understand the depth of 1 but didn't get how it becomes 512-d)? Similarly, at 9:14, for the 9 possible anchor-box ratios, shouldn't the dimension be 40x60x9 rather than the 512-d x 9 that is shown? Can you please explain why you chose this value?

  • @gelenasomez8641
    @gelenasomez8641 4 years ago +1

    it sounds like you recorded the audio while inside a cave

  • @georgeyuan4867
    @georgeyuan4867 6 years ago

    Still not sure how to generate the 512-d from the last conv layer. Can you give more detail on this 512-d? Is it a vector? A matrix?
    Let's say each slice you get from the last conv layer is 3x3x512.
    What is the output shape of the 512-d? Is it 512x1x1? How do we do that?
    As far as I understand from your video, each row of this 512-d is a positive or negative anchor, and we pick them from the 2400 sliding windows, right?
    But each anchor has 9 different sizes, so where is the information (x, y, w, h) in the 512-d that can be sent to the reg layer?

  • @AnushaShenoy
    @AnushaShenoy 3 years ago

    Great explanation!
    Could you please provide the slides for reference?

  • @emymimi7240
    @emymimi7240 6 years ago +1

    Hey, thanks a lot, but I think for each 3x3 sliding window we use 9 anchors?

    • @ArdianUmam
      @ArdianUmam  6 years ago

      Yes, you are right.

  • @georgeyuan4867
    @georgeyuan4867 6 years ago

    Sorry, it's not clear how you generate the 512-D output sliding window from the last conv layer and how you generate the anchors on the original image. What do you mean by 3x3 on the last conv layer? Do you generate the anchor from an area or from a point? I don't think you really understand it.

    • @ArdianUmam
      @ArdianUmam  6 years ago

      Please see the reply on the pinned comment, because this same question is asked a lot.

  • @SzymonKlepacz
    @SzymonKlepacz 7 years ago +1

    Thanks for the videos!
    However, I am a bit confused about the anchors. So to use different anchors, do we change the 3x3 convolution to 3x6, 6x3, 6x6, 6x12 and so on? Or how does it work?

    • @ArdianUmam
      @ArdianUmam  7 years ago

      The paper doesn't mention/explain this. It only says that the output for each anchor is 512-d for VGG-16 (and 256-d for ZF-net) using a 3x3 conv. Maybe for the other anchors it still uses 3x3. If you want, just ask the paper authors :)

    • @StarzzLAB
      @StarzzLAB 7 years ago

      Szymon, the sizes and scales of the anchors do not matter, because an RoI (region of interest) pooling layer converts every feature-map input to a fixed-size tensor.

    • @StarzzLAB
      @StarzzLAB 7 years ago +1

      @Ardian Umam Well, this is a very important stage of Faster R-CNN's pipeline, and you should know that if you want to make 'lectures' about object detection ;)

    • @ArdianUmam
      @ArdianUmam  7 years ago

      @StarzzLAB: Yup, for Faster R-CNN there is an RoI-Pool layer that resizes any last-feature-map region to a fixed size for the FC layer. There is also an improvement in their latest paper (Mask R-CNN), so-called RoI-Align.

    • @alexandra-stefaniamoloiu2431
      @alexandra-stefaniamoloiu2431 6 years ago +2

      "It only relies on images and feature maps of a single scale, and uses filters (sliding windows
      on the feature map) of a single size." - the convolution is only computed using 3x3 filter.
      The number of parameters is "3X3X512X512 + 512X6X9" - 3X3 is the filter size, 512 the depth of the 13th layer of VGG (the last 3 are fully connected), 521 for the output dimension - 6 the output (2 for background / object classification, 4 for bounding box regressors), 9 (3 scales, 3 aspect ratios)
      "In our formulation, the features used for regression are of the same spatial size (3 X 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights."
      The citations are from: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
      subsection 3.1.1 and subsection 3.1.2 arxiv.org/abs/1506.01497

  • @LovedbyGod4ever
    @LovedbyGod4ever 3 years ago

    2:00

  • @zchenyu9797
    @zchenyu9797 7 years ago

    I have a question about the RPN loss function. In the first part of the equation, is p_i* already given? And if I understand correctly, is p_i calculated from the IoU? Thanks for your response!

    • @ArdianUmam
      @ArdianUmam  7 years ago

      p_i* is the ground truth and p_i is the predicted probability. The classification branch in the RPN is a binary classification (object vs. not_object), and in the original paper they used a softmax function (two output nodes) instead of a sigmoid function (one output node). So, by how softmax works, p_i is the predicted probability, which is pushed toward the ground truth during backprop/learning. Here is a more detailed reference: cs231n.github.io/linear-classify/#softmax
      Read also the explanation under "Information theory view"; it makes the concept clearer and gives a stronger foundation.
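
      For illustration, here is a minimal NumPy sketch (my own simplification, not the release code) of the RPN loss being discussed: L = (1/N_cls) * sum_i L_cls(p_i, p_i*) + lambda * (1/N_reg) * sum_i p_i* * L_reg(t_i, t_i*), where L_cls is the log loss over object vs. not_object and the smooth-L1 regression term only counts positive anchors (p_i* = 1). The sampled anchors, deltas, and lambda = 10 below are example values.

      import numpy as np

      def smooth_l1(x):
          x = np.abs(x)
          return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

      def rpn_loss(p, p_star, t, t_star, lam=10.0, n_reg=2400):
          # p: predicted object probability per sampled anchor; p_star: 1 for positive, 0 for negative anchors
          cls = -(p_star * np.log(p) + (1 - p_star) * np.log(1 - p)).mean()    # normalized by N_cls (mini-batch size)
          reg = (p_star[:, None] * smooth_l1(t - t_star)).sum() / n_reg        # only positive anchors contribute
          return cls + lam * reg

      # hypothetical mini-batch of 4 sampled anchors (2 positive, 2 negative)
      p      = np.array([0.9, 0.8, 0.2, 0.1])
      p_star = np.array([1.0, 1.0, 0.0, 0.0])
      t      = np.array([[0.1, 0.0, 0.2, -0.1], [0.0, 0.1, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]])          # predicted (tx, ty, tw, th)
      t_star = np.array([[0.2, 0.1, 0.1, -0.2], [0.1, 0.0, 0.1, 0.1],
                         [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]])          # target deltas (only positives matter)
      print(rpn_loss(p, p_star, t, t_star))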

  • @rabianaseem6190
    @rabianaseem6190 4 years ago

    very clear explanation.

  • @PriteshGohil44
    @PriteshGohil44 4 years ago

    What is the difference between x and x* in the RPN loss @17:45?

    • @alhasanalshaebi
      @alhasanalshaebi 3 years ago

      x refers to the location of the predicted box and x* to the location of the actual box (the ground-truth box).

  • @knowhowww
    @knowhowww 6 years ago

    i find this extremely helpful! way to go my friend!

  • @GoneGoner
    @GoneGoner 5 years ago

    Okei?... aah... okei? okei?

  • @tauha_azmat
    @tauha_azmat 7 years ago

    Dear Ardian Umam, I have a question: can I detect a person's lips as an object, and can just the region containing the lips then be convolved further, in order to build a VSR system?

    • @ArdianUmam
      @ArdianUmam  7 years ago

      What is a VSR system?
      Yes, you can train with your own dataset.

    • @tauha_azmat
      @tauha_azmat 7 years ago

      VSR is Visual Speech Recognition, i.e. an automatic lipreading system. Can I focus on the lip region of a person with the RPN? Actually, I want to focus on a particular region of the image and then process only that region. Is this possible? If yes, kindly guide me.

    • @ArdianUmam
      @ArdianUmam  7 years ago

      Yes, it should be able to do that. Just train with your own dataset to locate lips vs. not_lips. But consider that this is a deep learning-based approach, which uses a GPU and runs at 5 fps. I think locating lips can be done more simply, i.e. with a non-deep-learning method that can run on a CPU, just like the face detection in OpenCV that can detect only eyes, etc. And yes, if you use deep learning for the task, you may get better accuracy if you train with a good and large dataset.

    • @tauha_azmat
      @tauha_azmat 7 years ago

      Thanks dear

  • @一地-n4x
    @一地-n4x 7 years ago

    Nice job!

  • @ramanathanarun6032
    @ramanathanarun6032 7 years ago

    Can you do a series on YOLO9000

    • @ArdianUmam
      @ArdianUmam  7 years ago +1

      Yes, YOLO9000 is better than Faster R-CNN in both accuracy and FPS now.
      Ummm... I'll put it on my list; maybe I'll do it during the summer holiday :)

  • @曹俊年
    @曹俊年 6 years ago

    Great video! But I have a question about 4:36: how do we create the correspondence between the last-layer (3x3) windows and the anchors in the original (input) image?

    • @ArdianUmam
      @ArdianUmam  6 years ago +1

      See the loss function formula: the corresponding anchor is judged positive or negative based on IoU, and P* = 1 only when the anchor is positive. So, to make the correspondence, you just check the anchor area in the input image (positive or negative), with its movement following the movement of the kernel window over the last feature map.
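
      For illustration, here is a minimal sketch (my own, not the authors' code) of the labelling rule being described: compute the IoU between an anchor, placed in input-image coordinates, and the GT boxes, then mark it positive (P* = 1) if the best IoU is above 0.7 and negative if it is below 0.3; anchors in between are ignored by the loss. (The paper additionally marks the highest-IoU anchor for each GT box as positive; that case is omitted here for brevity.)

      def iou(box_a, box_b):
          # boxes given as (x1, y1, x2, y2) in input-image coordinates
          x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
          x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
          inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
          area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
          area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
          return inter / (area_a + area_b - inter)

      def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
          best = max(iou(anchor, gt) for gt in gt_boxes)
          if best > pos_thr:
              return 1      # positive anchor: P* = 1 in the loss
          if best < neg_thr:
              return 0      # negative anchor: P* = 0
          return -1         # ignored: contributes nothing to the loss

      gt = [(100, 100, 300, 260)]                       # hypothetical GT box
      print(label_anchor((120, 110, 310, 250), gt))     # large overlap -> positive
      print(label_anchor((400, 400, 500, 500), gt))     # no overlap -> negative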

    • @曹俊年
      @曹俊年 6 years ago

      Thank you! But I'm still a little puzzled; maybe I should look at the loss function again.

    • @abdulhannan9771
      @abdulhannan9771 3 years ago

      I know it's late, but as far as I have learned from the Matterport implementation, what he did was quite simple: he just normalized the coordinates in the feature domain, and then you can easily establish the correspondence, because normalized coordinates in the feature domain correspond to normalized coordinates in the original image.

  • @thanasis2002
    @thanasis2002 7 years ago

    Great video, just a quick but fundamental question. Positive/negative region proposals do have a ground-truth label (e.g. car, bird, aeroplane, etc.) in case we have more than 2 classes, right? I mean during the training process of the Fast R-CNN detector.

    • @ArdianUmam
      @ArdianUmam  7 years ago

      As for the dataset, of course it does.
      In Faster R-CNN there are two parts: the RPN (proposer) and Fast R-CNN (detector). For the RPN, in the training process, it only proposes objects, so there are only two classes (object vs. not object), even if the dataset has more than two classes. The detector training, on the other hand, is like standard classification.

    • @thanasis2002
      @thanasis2002 7 years ago

      Hm.. If that's the case, then I have one more question. Region proposals take part in training the Faster R-CNN detector, so they need to have a ground-truth label. If the ground-truth label these region proposals have is only object or not-object, then how do we know the error needed to adjust the weights so that we get a better classifier?

    • @ArdianUmam
      @ArdianUmam  7 years ago

      They use a softmax loss function to calculate the error at the two-node classification branch output. Read more here if needed: cs231n.github.io/linear-classify/#softmax
      The GT is simple: every object GT in the dataset belongs to class_object, and an anchor becomes positive_object if its IoU > 0.7.

    • @thanasis2002
      @thanasis2002 7 years ago

      I am quite familiar with the softmax classifier. To be clearer, let me give you a specific example regarding training time. Let's say that we have a region proposal and the RPN classifies it as an object (and we have three classes, e.g. cat, dog, and background). Then this region proposal is fed into the Fast R-CNN detector to predict what kind of object we have (according to the softmax classifier). But we need a ground truth for this region proposal to check how much the predicted output diverges from the ground truth and to minimize the error. Right?

    • @ArdianUmam
      @ArdianUmam  7 years ago

      Have you already read the training stages in the paper, or watched the video part about them? There are 4 training stages/phases: (1) train the RPN to generate proposed regions, (2) train the detector using the proposed regions generated in step (1). Steps (3) and (4) are for fine-tuning, in order to get a shared CNN between (1) and (2). (1) and (2) use different loss functions for classification: (1) calculates a binary cross-entropy (object vs. not), (2) calculates a cross-entropy over all classes in the dataset. For the bbox regression you can read the paper, because it's simple.
      * "Let's say that we have a region proposal and the RPN classifies it as an object" --> the RPN itself produces the region proposals; because of this, Faster R-CNN doesn't need any external proposer anymore.

  • @ZakrzewxD
    @ZakrzewxD 5 years ago +1

    Thank you for this series of videos; it helped me a lot in writing my own Faster R-CNN model. However, it would be nice if in future videos you put a crib sheet of the most useful abbreviations in the description, like:
    ROI - Region of Interest
    RPN - Region Proposal Network
    IoU - Intersection over Union
    etc. It's nice to be able to take a quick look at what an acronym means if you are a total newbie. It's great that you keep repeating them throughout the video; that's very useful for following along!

  • @reinforcer9000
    @reinforcer9000 6 years ago

    how is it that the only tutorials on r-cnn's on youtube are always done by foreigners? are there any english speaking people who understand this in-depth?