Involution: Inverting the Inherence of Convolution for Visual Recognition (Research Paper Explained)

  • Published 27 Sep 2024

Comments • 41

  • @YannicKilcher
    @YannicKilcher  3 years ago +6

    OUTLINE:
    0:00 - Intro & Overview
    3:00 - Principles of Convolution
    10:50 - Towards spatial-specific computations
    17:00 - The Involution Operator
    20:00 - Comparison to Self-Attention
    25:15 - Experimental Results
    30:30 - Comments & Conclusion

  • @yoperator8712
    @yoperator8712 3 years ago +18

    i love your channel yannic! keep up your good work!

  • @Neptutron
    @Neptutron 3 years ago +1

    These guys are presenting at CVPR tomorrow at 6am!

  • @bluel1ng
    @bluel1ng 3 years ago +4

    Regarding position-specific computations: we could always concatenate position-encoding feature maps, so the computed kernels would depend not only on content but also on actual position.
    The video contains a great 10-minute CNN intro/recap! Some notes for completeness:
    - There are {1,2,...,N}-D convolutions; 4D weight tensors correspond to the 2D conv case (the most popular, given the image use case).
    - The center-to-output mappings only stay at the same position with proper padding; otherwise the output feature maps will be (w-kw+1)x(h-kh+1) (e.g. a 4x4 input with a 3x3 kernel -> 2x2 output without padding).
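    A minimal PyTorch sketch of the padding point above (the channel counts and variable names are just illustrative):

        import torch
        import torch.nn as nn

        x = torch.randn(1, 3, 4, 4)                  # 4x4 input, 3 channels

        # Without padding the output is (w - kw + 1) x (h - kh + 1) -> 2x2.
        conv_valid = nn.Conv2d(3, 8, kernel_size=3, padding=0)
        print(conv_valid(x).shape)                   # torch.Size([1, 8, 2, 2])

        # With padding 1 for a 3x3 kernel the output stays 4x4, so each
        # output position maps back onto the same input position.
        conv_same = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        print(conv_same(x).shape)                    # torch.Size([1, 8, 4, 4])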

  • @NilabhraRoyChowdhury
    @NilabhraRoyChowdhury 3 years ago +11

    14:35 - "Whatever, we don't do slidey slidey anymore"

  • @sayakpaul3152
    @sayakpaul3152 3 years ago

    I got a pretty good convolution refresher out of this. Such a lovely one

  • @spiritcrusha
    @spiritcrusha 3 years ago +2

    The idea here seems like a straightforward combination of fast weight memory networks and locally connected layers

  • @CristianGarcia
    @CristianGarcia 3 years ago +4

    I don't know if standard convolution operations in tensor frameworks support per-location kernels, which might be a barrier for practitioners in the short term. That said, I really like the idea.
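    For reference, per-location kernels can be emulated with standard tensor ops via unfold; a minimal sketch under that assumption (the shapes and names below are illustrative, not the paper's):

        import torch
        import torch.nn.functional as F

        B, C, H, W, K = 2, 16, 8, 8, 3
        x = torch.randn(B, C, H, W)
        # One K x K kernel per spatial position (shared across channels here).
        kernels = torch.randn(B, K * K, H, W)

        # Gather K x K neighborhoods: (B, C*K*K, H*W) -> (B, C, K*K, H, W).
        patches = F.unfold(x, K, padding=K // 2).view(B, C, K * K, H, W)

        # Weight each neighborhood by its position-specific kernel and sum.
        out = (patches * kernels.unsqueeze(1)).sum(dim=2)    # (B, C, H, W)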

  • @nitikanigam287
    @nitikanigam287 3 years ago

    Love your channel and the way you explain things.
    Please also talk about how to do research: how to get started and what one needs to do in deep learning for computer vision.

  • @JoshBrownKramer
    @JoshBrownKramer 3 years ago +2

    Where do new channels come from and how does information from different channels get fused together?

  • @rubenpartono
    @rubenpartono 3 years ago +1

    About your comment at 24:00, if the pixel also contained spatial information (e.g. RGBXY), wouldn't this then be spatial-specific?

  • @jfno67
    @jfno67 3 years ago +3

    At 24:00 you mention that this "involution kernel" is also spatially agnostic, since it will generate the same kernel for two different pixels if they have the same channel components. Do you think it would be worthwhile to add a positional encoding to the channels to make each "involution kernel" truly position-specific?

    • @linminhtoo
      @linminhtoo 3 years ago +3

      That's interesting, but is there really a need to/do we want that? It will enforce the idea that pixels in the top left corner of an image are semantically different from pixels somewhere else in the image, when in reality that is not true since the location of a pixel is more an artifact of whoever took the photograph than actual semantic meaning. Won't this make it lose translation invariance?
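      One way to try the positional-encoding idea (a CoordConv-style sketch, not something from the paper or the video) is to concatenate normalized coordinate channels before the kernel-generating layers, which would make the generated kernels position-dependent at the cost of strict translation invariance:

          import torch

          def add_coord_channels(x):
              # Append normalized y/x coordinate maps as two extra channels.
              B, _, H, W = x.shape
              ys = torch.linspace(-1, 1, H, device=x.device).view(1, 1, H, 1).expand(B, 1, H, W)
              xs = torch.linspace(-1, 1, W, device=x.device).view(1, 1, 1, W).expand(B, 1, H, W)
              return torch.cat([x, ys, xs], dim=1)   # (B, C + 2, H, W)

          x = torch.randn(4, 16, 32, 32)
          print(add_coord_channels(x).shape)         # torch.Size([4, 18, 32, 32])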

  • @priyamdey3298
    @priyamdey3298 3 years ago +4

    2:32 - 2:59 This reminded me of Schmidhuber 😀

  • @albertwang5974
    @albertwang5974 3 years ago

    Maybe we could just create a bunch of kernels manually and apply every kernel to every channel; after several layers we would get a channel tree network, a network that needs no training. We would mark every active cell as a connection to the target: more connections, more credit to the target.

  • @usamanavid2044
    @usamanavid2044 3 years ago

    Where can I learn about transformers & self-attention?

  • @anishbhanushali
    @anishbhanushali 3 years ago

    Thanks to their pseudo-code, I got that the kernels are not directly learnable weights (in contrast to the normal convolution convention).
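    Right — in the paper's pseudo-code the kernel is produced on the fly by a small generator applied to the input, and only the generator's weights are learned. A simplified sketch of that idea (the group count, reduction ratio, and class/attribute names are illustrative, not the official implementation):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class SimpleInvolution(nn.Module):
            def __init__(self, channels, k=3, groups=4, reduction=4):
                super().__init__()
                self.k, self.groups = k, groups
                # The only learnable weights live in this small kernel generator.
                self.reduce = nn.Conv2d(channels, channels // reduction, 1)
                self.span = nn.Conv2d(channels // reduction, groups * k * k, 1)

            def forward(self, x):
                B, C, H, W = x.shape
                k, G = self.k, self.groups
                # Kernels are computed from the input, per position and per group.
                kernel = self.span(F.relu(self.reduce(x))).view(B, G, 1, k * k, H, W)
                patches = F.unfold(x, k, padding=k // 2).view(B, G, C // G, k * k, H, W)
                return (kernel * patches).sum(dim=3).view(B, C, H, W)

        out = SimpleInvolution(16)(torch.randn(2, 16, 8, 8))   # -> (2, 16, 8, 8)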

  • @ACArchangels
    @ACArchangels 3 years ago

    What do you think about Kervolutions?

  • @kimchi_taco
    @kimchi_taco 3 years ago

    I heard there are multiple instances on AWS whose names are yannic_xxxx.

  • @billykotsos4642
    @billykotsos4642 3 years ago +2

    Doesn't channel = feature... roughly?

    • @pensiveintrovert4318
      @pensiveintrovert4318 3 years ago +1

      Only if it turns out to be useful, but these nets are not intended to prune useless, random "features." Ever wondered why BERT transformers have 8 heads? They throw enough compute/storage at any problem and hope some useful feature would float to the top.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      on a per-layer basis, yes more or less. channel is the technical name for the dimension, while feature is a more conceptual thing

  • @edeneden97
    @edeneden97 3 years ago

    isn't the first part just depthwise convolution?

  • @yoheikao490
    @yoheikao490 3 years ago +4

    Too bad that the emphasis is on the number of parameters or FLOPs; these are known to be poor proxy measures of the things that really matter: generalization and computation time. The latter point is a huge disappointment (at least for those still believing in the relevance of FLOPs), as RedNets are actually *slower* (Table 2 of the paper) than ResNets. Oops, the x-axis of those comparison graphs is now irrelevant: how does RedNet then *really* compare to the old ResNet, not to mention newer variants?

    • @datrumart
      @datrumart 3 years ago +2

      This could be explained by the fact that PyTorch calls a highly optimized NVIDIA cuDNN implementation for the convolution operation used in ResNet, whereas their new operation is written in pure PyTorch. Using computation time is a bad idea for theoretical papers, as results would be even more sensitive to the hardware lottery.
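      A rough way to see this effect (an illustrative micro-benchmark, not from the paper) is to time a built-in depthwise convolution against an unfold-based version doing roughly the same work; the pure-PyTorch path usually loses despite a similar FLOP count:

          import time
          import torch
          import torch.nn as nn
          import torch.nn.functional as F

          x = torch.randn(8, 64, 56, 56)
          conv = nn.Conv2d(64, 64, 3, padding=1, groups=64)   # depthwise, optimized native kernel

          def unfold_version(x, weight, k=3):
              # The same sliding-window computation, spelled out with unfold.
              B, C, H, W = x.shape
              patches = F.unfold(x, k, padding=k // 2).view(B, C, k * k, H, W)
              return (patches * weight.view(1, C, k * k, 1, 1)).sum(dim=2)

          w = torch.randn(64, 9)
          for name, fn in [("conv2d", lambda: conv(x)), ("unfold", lambda: unfold_version(x, w))]:
              t0 = time.perf_counter()
              for _ in range(20):
                  fn()
              print(name, time.perf_counter() - t0)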

  • @nabi7600
    @nabi7600 3 years ago

    what do you use to highlight on the pdf?

  • @yoperator8712
    @yoperator8712 3 years ago +1

    first comment!

  • @MstProper
    @MstProper 3 years ago +1

    I like turtles

  • @freemind.d2714
    @freemind.d2714 3 years ago +2

    Free Hong Kong!!!

  • @nigelwan2841
    @nigelwan2841 3 years ago

    內卷 (nèijuǎn, Chinese for "involution")

  • @EinsteinNewtonify
    @EinsteinNewtonify 3 years ago +13

    Hello Yannic! First of all, thank you for the work you do to present the papers to us. I think it's obvious that you're getting faster and faster at producing the stuff.
    But here's the thing: you might start a job soon, and maybe these videos are also great advertising for landing exciting positions. Nevertheless, I hope you will continue to find the time to keep your channel alive in the future.
    Take care Jürgen

  • @RonaldoMessina
    @RonaldoMessina 3 years ago +1

    that is some convoluted writing!

  • @nauman.mustafa
    @nauman.mustafa 3 years ago +8

    yet another good paper with pytorch pseudo code instead of math heavy latex

    • @Hugo-ms4mx
      @Hugo-ms4mx 3 years ago +2

      @@frazuppi4897 What makes the code bad? I've quickly skimmed through it and it doesn't look that bad to me. Also, I guess the original author was being ironic? Would he have preferred more math in the paper? Trying to learn here :) thanks!

  • @herp_derpingson
    @herp_derpingson 3 years ago

    15:10 The spatial agnosticity of CNNs is counteracted by the fact that the features these kernels extract are propagated further down the layers. Eventually a DNN at the end or a global max pooling does some non-spatially-agnostic stuff.

    Nice idea; it would be interesting to see more of these meta-generated neural networks.

  • @NelsLindahl
    @NelsLindahl 3 years ago

    Keep rocking the papers! You will be at 100k subscribers before you know it.

  • @nocomments_s
    @nocomments_s 3 years ago

    Amazing channel, man! Thank you very much!

  • @01FNG
    @01FNG 3 years ago

    These researchers are going too far!!

  • @gamerx1133
    @gamerx1133 3 years ago

    second

  • @hiransarkar1236
    @hiransarkar1236 3 years ago

    Amazing