Thank you for this amazing series of DETR videos. It really helps me understand how transformers work for the object detection task.
Hey there, I'm really inspired by your work. There is currently a new SOTA DETR model called RT-DETR. Could you make a video about it?
Certainly, RT DETR is next on my list!
Though I believe RT-DETR is better in terms of the inference speed/model size to quality trade-off, its absolute accuracy is not better than Co-DETR's.
Dope!
Hey man! Great work.
I have a question that I searched online but couldn't find any intuitive answers to. Why do DETRs use separate backbones? Why not use a transformer-based backbone as both the backbone and the encoder?
I could think of the following reasons:
- Co-DETR input is multi-scale. A backbone encoder (like Swin-L) projects the image into smaller spatial dimensions in later layers, while for object detection we need all the little pixel-wise details. For this purpose, we take later outputs of the backbone as well as earlier ones, flatten them, concatenate them together and pass them to the DETR encoder, thus allowing information exchange between scales.
- Scale of the data: ViT was pretrained on the 300M-image JFT dataset, which probably cost millions of dollars. DETRs train on the much smaller COCO dataset with around 100k train images. In this regard, the DETR encoder can be seen as a smaller "adapter" to quickly finetune on a different target.
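The flatten-and-concatenate step from the first point can be sketched in a few lines. This is a shapes-only NumPy illustration (the map sizes assume a hypothetical 640x640 input and a 256-dim hidden size, not Co-DETR's actual configuration):

```python
import numpy as np

# Hypothetical multi-scale backbone outputs for a 640x640 image,
# with channels already projected to a common hidden size of 256.
hidden = 256
feature_maps = {
    8:  np.zeros((hidden, 80, 80)),   # stride-8 map keeps small-object detail
    16: np.zeros((hidden, 40, 40)),
    32: np.zeros((hidden, 20, 20)),
}

# Flatten each (C, H, W) map into (H*W, C) tokens and concatenate all
# scales into one sequence, so the encoder can attend across scales.
tokens = np.concatenate(
    [fmap.reshape(hidden, -1).T for fmap in feature_maps.values()], axis=0
)
print(tokens.shape)  # (80*80 + 40*40 + 20*20, 256) = (8400, 256)
```

The point of the concatenation is that a single attention layer over `tokens` mixes information between strides, which a per-scale CNN head would not do.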
Great answer. I think the fusion of CNNs and transformers is really hyped right now because you get benefits from both worlds: the inductive bias plus smaller & faster models from CNNs, and then the unbiased refinement of those features by (computationally expensive) transformers. You should have a look at parameter counts and training-time benchmarks for a ResNet-50 versus a ViT - the AP stays in a relatively moderate range while the cost differs a lot.
@@makgaiduk Actually that can't be the sole reason, because you can also extract intermediate outputs of the transformer encoder and create an FPN by downsampling later layers and upsampling earlier ones to get those 4/8/16/32-stride features. I have actually done that; it works pretty well. IMO the only intuitive reason I can come up with is that CNNs are better at aggregating local neighbourhood features, while transformers reason better globally. But then Faster R-CNN produces its best results with a Swin transformer backbone in documented experiments. I will try someday to train DETR without a backbone and see for myself how it pans out.
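For readers unfamiliar with the trick described above: a plain ViT emits a single-stride feature map, and a simple FPN (in the spirit of ViTDet) rebuilds the 4/8/16/32-stride pyramid from it by up/down-sampling. A minimal shapes-only NumPy sketch, with sizes assumed for a 640x640 input (real models would use learned deconv/conv layers rather than nearest-neighbour resizing):

```python
import numpy as np

# A plain ViT encoder emits one stride-16 map; rebuild the pyramid from it.
c, h, w = 256, 40, 40            # stride-16 map for an assumed 640x640 input
vit_out = np.zeros((c, h, w))

def upsample2x(x):
    # nearest-neighbour 2x upsampling along both spatial axes
    return x.repeat(2, axis=1).repeat(2, axis=2)

pyramid = {
    4:  upsample2x(upsample2x(vit_out)),  # 160x160
    8:  upsample2x(vit_out),              # 80x80
    16: vit_out,                          # 40x40
    32: vit_out[:, ::2, ::2],             # 20x20, strided downsample
}
for stride, fmap in pyramid.items():
    print(stride, fmap.shape)
```

Since all pyramid levels are derived from one stride-16 map, the fine levels contain no extra high-frequency detail - which is exactly why the comment's "CNNs aggregate local detail better" intuition is plausible.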
@@davidro00 The question was not about replacing ResNet with a transformer. The question is: why use ResNet at all, why not use the transformer encoder as both backbone and encoder?
@@saeedahmad4925 I see, my thought about this was:
Encoder (but with more layers) = ViT + encoder
But if you remove the backbone and don't scale up the encoder, that intuitively seems like way too low representational power to me. If you can prove this wrong, please tell me!
👍👍
Can I ask for your ppt?
github.com/adensur/blog/blob/main/computer_vision_zero_to_hero/28_CoDetr/presentation.key