DAB-DETR (Dynamic Anchor Boxes)

  • Published on 17 Oct 2024

Comments • 10

  • @eliaweiss1 · 6 months ago · +1

    Great content!
    Some remarks:
    * You say that the residual connection makes 'half of the gradient flow to the anchor box', but that is not precise: the + operation (residual connection) passes the gradient through as is (i.e. not halved), and this is the key feature that helps the gradient propagate to the anchor (i.e. the lower layers), essentially avoiding vanishing gradients (see the sketch after these remarks).
    * The division by H, W is indeed strange; I would have expected a multiplication. Still, it is not intuitive how any operation on a sine embedding should reflect the width and height. Anyway, I just want to suggest that, for the network, it doesn't actually matter whether it's a division or a multiplication, since it will learn to treat it as needed according to the loss function. Maybe they chose division to keep the numbers in a limited range, and the learnable weights are just scalars that the network learns in order to (again) keep the numbers in a reasonable range.
    I'm no expert, so take these remarks with a grain of salt :)
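    A minimal PyTorch sketch of the gradient point (illustrative only; the variable names are made up): the identity path of y = x + f(x) delivers the upstream gradient to x unchanged, regardless of what f contributes.

    import torch

    anchor = torch.ones(4, requires_grad=True)   # stands in for the anchor box / lower-layer input
    layer = torch.nn.Linear(4, 4)                # stands in for the residual branch f(x)

    out = anchor + layer(anchor)                 # residual connection: out = x + f(x)
    out.sum().backward()

    # d(out)/d(anchor) = I + d(layer(anchor))/d(anchor), so the identity term
    # passes the upstream gradient to the anchor as is (not halved).
    print(anchor.grad)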

    • @eliaweiss1 · 6 months ago

      To me the query sine-embedding code seems a bit messy and ad hoc; I wouldn't be surprised if future research improves on this point.

    • @makgaiduk · 6 months ago

      Thanks for the clarification about the gradient!
      I think I understand the height/width modulation better after David's comment.
      The modulation happens because of Softmax properties:
      import torch

      x = torch.tensor([0, 0.25, 0.5, 0.75, 1, 0.75, 0.5, 0.25, 0])
      s = torch.nn.Softmax(dim=0)
      s(x)
      # tensor([0.0675, 0.0867, 0.1113, 0.1429, 0.1834, 0.1429, 0.1113, 0.0867, 0.0675])
      s(x / 2)  # dividing the logits flattens the distribution
      # tensor([0.0878, 0.0995, 0.1127, 0.1277, 0.1447, 0.1277, 0.1127, 0.0995, 0.0878])
      s(x * 2)  # multiplying the logits sharpens it
      # tensor([0.0369, 0.0609, 0.1004, 0.1655, 0.2728, 0.1655, 0.1004, 0.0609, 0.0369])
      Without Softmax, multiplying or dividing a tensor by a constant wouldn't change the relative ratios between coordinates. With Softmax it does, and to achieve the correct modulation we indeed need the height/width in the denominator: with a large height/width in the denominator, the relative ratios between coordinates become smaller after Softmax, effectively making the attention more spread out around the same center; with a small height/width in the denominator, the ratios become starker, making the attention more focused around the central point.

  • @davidro00 · 8 months ago · +1

    Regarding the width and height modulation: I think of it a bit like they obtain a relational vector between the w and h of the content query and of the anchor box via the division. They then multiply it element-wise to scale the attention map (or, more specifically, the positionally embedded input). Essentially, by dividing the content-query height by the anchor-box height, you get a scaling factor that is used to modulate the positional embeddings in the transformer block BEFORE the softmax. This can increase or decrease the similarity between key and query and thus can be optimized during training (rough sketch below).
    At least this is how I think about it, but it could have been fleshed out a bit more in their paper 😅
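    A rough sketch of that idea (illustrative only; the actual DAB-DETR formulation differs in its details): the positional dot-product logits are scaled by a hypothetical width ratio before the softmax, which changes how spread out the attention is around the same center.

    import math
    import torch

    def sine_embed(pos, dim=64, temp=10000):
        # standard 1-D sine/cosine positional embedding (DETR-style 2*pi scaling)
        i = torch.arange(dim // 2, dtype=torch.float32)
        freq = temp ** (2 * i / dim)
        angle = 2 * math.pi * pos / freq
        return torch.cat([torch.sin(angle), torch.cos(angle)], dim=-1)

    query_x = torch.tensor(0.5)                  # anchor center, normalized to [0, 1]
    keys_x = torch.linspace(0, 1, 9)             # candidate key positions
    logits = torch.stack([sine_embed(query_x) @ sine_embed(k) for k in keys_x])

    softmax = torch.nn.Softmax(dim=0)
    for w_ratio in (0.5, 1.0, 2.0):              # hypothetical scaling factor, e.g. w_ref / w
        attn = softmax(logits * w_ratio)
        print(w_ratio, attn)                     # smaller ratio -> flatter attention, same peak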

    • @makgaiduk · 8 months ago

      Nice point! So the softmax seems to be the key here.

    • @davidro00 · 8 months ago · +1

      @makgaiduk Yeah, I think that makes the most sense 👍🏼

  • @A.El-Taher · 7 months ago · +1

    Great content ❤
    DETR can be used for instance segmentation if we add a mask head ... can the same be done with Deformable DETR and DAB-DETR?

    • @makgaiduk · 7 months ago · +1

      Great question!
      Looks like it can: arxiv.org/pdf/2206.02777.pdf
      DINO is a well-known object detection model that held SOTA status for some time. It uses both dynamic anchor boxes and deformable attention, as well as a new technique that I am about to cover in the next video: query denoising.
      Mask DINO builds on top of that by adding a mask head and modifying some components slightly to make them fit the segmentation task better. It still uses both dynamic anchor boxes and deformable attention as key components in the decoder.

    • @A.El-Taher · 7 months ago · +2

      @makgaiduk Exciting 🎉
      I'm waiting for the next video 😊

  • @makgaiduk · 6 months ago

    Check out my next video, reading the DAB-DETR source code: th-cam.com/video/eClBoEnn9k4/w-d-xo.html