MLP-Mixer: An all-MLP Architecture for Vision (Machine Learning Research Paper Explained)

  • Published on 21 Sep 2024

Comments • 105

  • @YannicKilcher
    @YannicKilcher 3 years ago +10

    OUTLINE:
    0:00 - Intro & Overview
    2:20 - MLP-Mixer Architecture
    13:20 - Experimental Results
    17:30 - Effects of Scale
    24:30 - Learned Weights Visualization
    27:25 - Comments & Conclusion

  • @TheOneSevenNine
    @TheOneSevenNine 3 years ago +146

    next up: SOTA on imagenet using [throws dart at wall] polynomial curve fitting

  • @CristianGarcia
    @CristianGarcia 3 years ago +97

    Coming soon to an arxiv near you: Random Forrests are all you need.

  • @patrickjdarrow
    @patrickjdarrow 3 years ago +51

    Log scale graphs are all you need (to make your results look competitive)

  • @adamrak7560
    @adamrak7560 3 years ago +15

    It effectively learns to implement a CNN, but it is more general. So it could implement attention like mechanisms too, on global and local scales.
    This proves that the space of good enough architectures is very large, basically if the architecture is general enough, it can do the job.

  • @YannicKilcher
    @YannicKilcher 3 years ago +22

    ERRATA: Here is their definition of what the 5-shot classifier is: "we report the few-shot accuracies obtained by solving the L2-regularized linear regression problem between the frozen learned representations of images and the labels"
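
    For context, a minimal sketch of what that evaluation could look like (assuming a frozen backbone that maps images to feature vectors; the names here are illustrative, not the authors' code):

        import torch

        def few_shot_linear_probe(train_feats, train_labels, test_feats, num_classes, l2=1e-3):
            """L2-regularized linear regression from frozen representations to one-hot labels,
            as described in the errata above (a sketch, not the paper's implementation)."""
            X = train_feats                                 # (N, D) frozen features
            Y = torch.nn.functional.one_hot(train_labels, num_classes).float()   # (N, C)
            D = X.shape[1]
            # Closed-form ridge regression: W = (X^T X + l2*I)^-1 X^T Y
            W = torch.linalg.solve(X.T @ X + l2 * torch.eye(D), X.T @ Y)         # (D, C)
            return (test_feats @ W).argmax(dim=1)           # predicted class per test image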

  • @mayankmishra3875
    @mayankmishra3875 3 years ago +10

    Your explanation of such papers is very easy to understand and highly engaging. A big thank you for your videos.

  • @pensiveintrovert4318
    @pensiveintrovert4318 3 years ago +35

    I have come to the conclusion that all large networks are expensive hash tables.

  • @rock_sheep4241
    @rock_sheep4241 3 years ago +11

    Congratulations on the PhD and your work :D

  • @mrdbourke
    @mrdbourke 3 years ago +14

    MLPs are all you need!

  • @Rhannmah
    @Rhannmah 3 years ago +14

    Great video as usual. Although,
    15:49 I can't disagree more with this statement. Real scientific research isn't about a quest for results, it's a quest for answers. What I mean by that is that any answer is beneficial for the common knowledge, whether it be a positive or negative result.
    In the context of machine learning, if your results show that a specific type of architecture or algorithm produces worse results than already established methods, great! By publishing, you can now prevent others from pursuing that specific path of research, so research time gets put elsewhere in more productive areas. The modern requirement for positive results to get published is incredibly stupid and stifles progress in a huge manner.

  • @linminhtoo
    @linminhtoo 3 years ago +1

    Thanks for the explanation! It's just a little funny how the abstract claims "We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs)." but the pseudo-code has "nn.Conv" (although yes, it's not really a convolution, just a way to get the "linear embeddings", and probably more efficient in implementation). To me it rather shows how specialized layers still have their place, for reasons of practical efficiency or otherwise.

  • @amrmartini3935
    @amrmartini3935 3 years ago +6

    10:21 are they really patches after MLP1? I feel like we lose the sense of patches, since MLP1 mixes together all patches of a channel. So the "red" patch does not necessarily map to the "red" patch as shown in the diagram. Also, this mixing via transpose reminds me of normalizing flows that "mix" the limited coordinate updates of previous layers with permutation matrices. Here, the transpose is doing the same thing on flattened feature maps. Anyways, congrats on the PhD!

  • @youngseokjeon3376
    @youngseokjeon3376 4 months ago

    This seems to be a good technique for attaining a large receptive field with a cheap operation like an MLP, rather than the expensive self-attention mechanism.

  • @TeoZarkopafilis
    @TeoZarkopafilis 3 years ago +1

    Since different topologies can be equivalent (conv, attention, mlp), both in terms of learning capacity/ability and pure math/computationally-wise, I really like the 'effects of scale' part.

  • @luke.perkin.inventor
    @luke.perkin.inventor 3 years ago +2

    A few-shot classifier is a better measure of generalisation, even if it's more relevant for theoretical academic progress than actual product implementations.

  • @scottmiller2591
    @scottmiller2591 3 years ago +2

    So apparently they're calling linear (technically affine) transformations "fully connected," instead of "fully connected" meaning a perceptron layer.
    I did like that they put a "Things that did not work" section in the paper.

  • @belab127
    @belab127 3 years ago +5

    One question:
    You say that in the first "per patch FC" layers the weights are shared between patches.
    In my opinion this would be the same as using a convolution with kernel size = patch size and stride = patch size.
    In short: if I do a lot of weight sharing I end up building convolutions from scratch; saying it's not a convolution is kind of cheating, as convolutions can be viewed as multiple mini networks with shared weights.
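
    A quick numerical check of that equivalence (a sketch with illustrative sizes, not code from the paper): a per-patch fully-connected layer with shared weights and a convolution with kernel size = stride = patch size give the same output when loaded with the same weights.

        import torch
        import torch.nn as nn

        patch, in_ch, dim = 16, 3, 8
        x = torch.randn(1, in_ch, 64, 64)                            # dummy image

        # Per-patch FC: cut the image into patches, flatten each, apply one shared Linear.
        fc = nn.Linear(in_ch * patch * patch, dim)
        patches = x.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, C, 4, 4, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 16, -1)
        out_fc = fc(patches)                                         # (1, 16 patches, dim)

        # Convolution with kernel = stride = patch size, using the same weights.
        conv = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        with torch.no_grad():
            conv.weight.copy_(fc.weight.view(dim, in_ch, patch, patch))
            conv.bias.copy_(fc.bias)
        out_conv = conv(x).flatten(2).transpose(1, 2)                # (1, 16 patches, dim)

        print(torch.allclose(out_fc, out_conv, atol=1e-5))           # True: same operation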

  • @yangyue5823
    @yangyue5823 3 years ago +2

    One thing is for sure: residual connections are a must no matter what architecture we are playing with.

    • @quAdxify
      @quAdxify 3 years ago +2

      But that's just a technicality due to the nature of backpropagation. If some other kind of optimization were used (and there are alternatives, just not as fast), residual connections would most likely be useless.

    • @slobodanblazeski0
      @slobodanblazeski0 3 years ago

      @@quAdxify I don't think so. I believe that skip connections are something like going from a coarser to a finer resolution.

    • @quAdxify
      @quAdxify 3 years ago

      @@slobodanblazeski0 Huh, that does not really make sense, you need to explain. Residual connections are all about avoiding vanishing gradients. It's the same concept LSTMs use to avoid vanishing gradients. In classic CNNs the pooling/striding/dilation is what allows for multiple resolutions/scales.

    • @slobodanblazeski0
      @slobodanblazeski0 3 years ago

      @@quAdxify OK, in GANs I feel it's something like: the previous result is more or less correct. The network should do some work in a higher-dimensional space, but when you project into a lower-dimensional space and add it to the intermediate result, the next solution shouldn't change that much. For example, you first create 8x8, then 16x16, to which you add the previously generated and upsampled 8x8, etc. If that makes any sense.

  • @andres_pq
    @andres_pq 3 years ago +4

    It's an axial convolution

  • @peterfeatherstone9768
    @peterfeatherstone9768 3 years ago +1

    Do you think this architecture could be used for object detection? So have the final fully connected layer predict Nx(4+1+C) features, where N is some upper bound on the number of possible objects, let's say 100 (like DETR), 4 corresponds to xywh, 1 to "is object" or "is empty", and C to the possible classes. Then use the Hungarian algorithm for matching targets with predictions (bipartite matching), CIoU loss for xywh, and binary cross-entropy loss for the rest?
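
    A rough sketch of how such a head and the matching could look (purely illustrative: mixer_features is a hypothetical backbone output, and a simple L1 box cost stands in for CIoU):

        import torch
        import torch.nn as nn
        from scipy.optimize import linear_sum_assignment

        N, C, feat_dim = 100, 80, 512            # max objects (as in DETR), classes, feature size
        head = nn.Linear(feat_dim, N * (4 + 1 + C))

        def predict(mixer_features):             # mixer_features: (B, feat_dim), hypothetical
            out = head(mixer_features).view(-1, N, 4 + 1 + C)
            boxes = out[..., :4].sigmoid()       # xywh in [0, 1]
            objness = out[..., 4]                # "is object" vs "is empty" logit
            logits = out[..., 5:]                # class logits
            return boxes, objness, logits

        def match(pred_boxes, tgt_boxes):
            """Hungarian (bipartite) matching between N predictions and M targets."""
            cost = torch.cdist(pred_boxes, tgt_boxes, p=1)           # L1 cost, (N, M)
            pred_idx, tgt_idx = linear_sum_assignment(cost.detach().numpy())
            return pred_idx, tgt_idx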

  • @zhiyuanchen4829
    @zhiyuanchen4829 3 years ago +2

    Let MLP Great Again!

  • @Cardicardi
    @Cardicardi 3 years ago +4

    Am I getting something wrong, or does this work only for a fixed patch size and number of patches (i.e. fixed-size images)? Mostly due to the first MLP layer in the mixer layer (MLP 1), which operates on vectors of size "sequence length" (number of patches). It would be interesting to see if something like this can be achieved by replacing the transpose operation with a covariance operation (X.transpose dot X); this would eliminate the dependency on the sequence length and ultimately apply to any number of patches in the image (still a fixed patch size, I guess).

    • @sheggle
      @sheggle 3 years ago +4

      Or just use any type of pyramid pooling
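
    A sketch of the covariance idea from the comment above (hypothetical, not from the paper): replacing the mixing over the variable-length patch dimension with a fixed-size C x C feature covariance makes the layer independent of the number of patches.

        import torch
        import torch.nn as nn

        class CovarianceMix(nn.Module):
            """Hypothetical token-mixing variant: mix via the C x C covariance (X^T X)
            instead of an MLP over the patch dimension, so the weights never depend
            on the sequence length."""
            def __init__(self, channels):
                super().__init__()
                self.proj = nn.Linear(channels, channels)

            def forward(self, x):                           # x: (batch, patches, channels)
                cov = x.transpose(1, 2) @ x / x.shape[1]    # (batch, C, C)
                return x + x @ self.proj(cov)               # broadcast the global statistics back

        mix = CovarianceMix(channels=64)
        print(mix(torch.randn(2, 49, 64)).shape)      # works for 49 patches...
        print(mix(torch.randn(2, 196, 64)).shape)     # ...and for 196, with the same weights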

  • @andres_pq
    @andres_pq 3 years ago +1

    Can't wait for Mixer BERT

  • @shengyaozhuang3748
    @shengyaozhuang3748 3 years ago +2

    Since it can scale up to encode a very long input sequence, I'm very curious how this model could be applied to NLP tasks, as transformers such as BERT and GPT have a strong limitation on input length. I also noticed that Mixer does not use position embeddings because "the token-mixing MLPs are sensitive to the order of the input tokens, and therefore may learn to represent location"? I don't get why this is the case.

    • @dariodemattiesreyes3788
      @dariodemattiesreyes3788 3 years ago

      I don't get it either, can someone clarify this? Thanks!

    • @my_master55
      @my_master55 2 years ago

      I guess it doesn't need positional encoding because we assume that patches are stacked at the beginning in a certain order and then "unstacked" at the end in the same order, which preserves their positions.
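
    One way to see the quoted sentence: the token-mixing MLP has a separate weight for every patch position, so permuting the patches changes its output, whereas self-attention without position embeddings is permutation-equivariant and cannot tell positions apart. A tiny check with illustrative sizes:

        import torch
        import torch.nn as nn

        patches, channels = 16, 8
        x = torch.randn(1, patches, channels)
        perm = torch.randperm(patches)

        # Token-mixing MLP: one weight per patch position, so order matters.
        token_mlp = nn.Linear(patches, patches)
        out = token_mlp(x.transpose(1, 2)).transpose(1, 2)
        out_perm = token_mlp(x[:, perm].transpose(1, 2)).transpose(1, 2)
        print(torch.allclose(out[:, perm], out_perm))          # False: order-sensitive

        # Self-attention without position embeddings, for comparison.
        attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)
        a, _ = attn(x, x, x)
        a_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])
        print(torch.allclose(a[:, perm], a_perm, atol=1e-6))   # True: order-blind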

  • @furkatsultonov9976
    @furkatsultonov9976 3 years ago +13

    "It is not gonna be a long video..." - 28:11 mins

    • @emuccino
      @emuccino 3 years ago +1

      His videos are often closer to an hour.

  • @jeffr_ac
    @jeffr_ac 2 years ago

    Great video!

  • @first-thoughtgiver-of-will2456
    @first-thoughtgiver-of-will2456 3 years ago

    I always thought that if transformers can approximate unrolled LSTMs, then a normal fully connected NN with dropout should be able to learn a similar branching logic. Skip connections are awesome, but recurrent connections seem to be more difficult to implement efficiently than they are expressive. DAG > DCG for all approximators (in my opinion).

  • @hannesstark5024
    @hannesstark5024 3 years ago +8

    How is the per-patch fully-connected layer at the beginning different from a convolution with the patch size as the stride?

    • @JamesAwokeKnowing
      @JamesAwokeKnowing 3 years ago

      I think the point is the opposite: that it can be implemented using only MLP (very old, pre-CNN) concepts.

    • @sheggle
      @sheggle 3 years ago

      @@JamesAwokeKnowing The point Hannes is trying to make is that by dividing it into patches, they didn't actually remove all convs.

    • @hannesstark5024
      @hannesstark5024 3 years ago

      Well, seems that Yann LeCun is asking the same thing twitter.com/ylecun/status/1390419124266938371
      So I'll just guess that there is no difference.

    • @saimitheranj8741
      @saimitheranj8741 3 years ago +2

      @@hannesstark5024 Seems like it; even their pseudocode from the paper has an "nn.Conv", not sure what they mean by it being "convolution-free".

    • @hannesstark5024
      @hannesstark5024 3 years ago

      @@saimitheranj8741 Oh nice spot :D
      Do you have a link to the line in the code?

  • @sau002
    @sau002 3 years ago

    Very nicely presented. Thank you

  • @yimingqu2403
    @yimingqu2403 3 years ago

    "How about 'an RNN with T=1'". I LIKE THIS.

  • @konghong3885
    @konghong3885 3 years ago

    Hype for it being the next big thing in NLP

  • @welcomeaioverlords
    @welcomeaioverlords 3 years ago

    In what ways is this *not* a CNN? It seems the core operation is to apply the same parameters to image patches in a translation-equivariant way. What am I missing?

  • @jonatan01i
    @jonatan01i 3 years ago +5

    So
    MLP 1 only cares about WHERE, and
    MLP 2 only cares about WHAT.
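
    A minimal sketch of one mixer layer along those lines (shapes and names are illustrative, following the paper's description rather than its exact code): MLP 1 mixes across patches ("where"), MLP 2 mixes across channels ("what"), each with LayerNorm and a residual connection.

        import torch
        import torch.nn as nn

        class MixerBlock(nn.Module):
            def __init__(self, patches, channels, token_dim=256, channel_dim=2048):
                super().__init__()
                self.norm1 = nn.LayerNorm(channels)
                self.token_mlp = nn.Sequential(       # MLP 1: mixes across patches ("where")
                    nn.Linear(patches, token_dim), nn.GELU(), nn.Linear(token_dim, patches))
                self.norm2 = nn.LayerNorm(channels)
                self.channel_mlp = nn.Sequential(     # MLP 2: mixes across channels ("what")
                    nn.Linear(channels, channel_dim), nn.GELU(), nn.Linear(channel_dim, channels))

            def forward(self, x):                     # x: (batch, patches, channels)
                y = self.norm1(x).transpose(1, 2)     # (batch, channels, patches)
                x = x + self.token_mlp(y).transpose(1, 2)   # token mixing + skip connection
                x = x + self.channel_mlp(self.norm2(x))     # channel mixing + skip connection
                return x

        block = MixerBlock(patches=196, channels=512)
        print(block(torch.randn(2, 196, 512)).shape)  # torch.Size([2, 196, 512])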

  • @peterfeatherstone9768
    @peterfeatherstone9768 3 years ago

    So it looks like this doesn't work with dynamic size, correct? Or can you have as many patches as you want since the MLP layers are 1x1 convolutions? In which case, the image needs to have dimensions divisible by the patch sizes? I assume layer norm isn't sensitive to image size...

  • @sau002
    @sau002 3 years ago

    Thank you.

  • @billykotsos4642
    @billykotsos4642 3 years ago +18

    WHAT IS THIS? ITS 1987 ALL OVER AGAIN!!!!!

    • @abhishekpawar921
      @abhishekpawar921 3 years ago +1

      History repeats itself.

    • @billykotsos4642
      @billykotsos4642 3 years ago

      @@abhishekpawar921 First as a tragedy

    • @guillaumevermeillesanchezm2427
      @guillaumevermeillesanchezm2427 3 years ago +2

      Care to explain this comment to people that weren't born in 1987?

    • @arthureroberer
      @arthureroberer 3 years ago +1

      @@guillaumevermeillesanchezm2427 I think it's referring to the (first?) AI winter.

  • @JamesAwokeKnowing
    @JamesAwokeKnowing 3 years ago

    I think you lost an opportunity here. Well, maybe a follow-up "for beginners" video, because the code/process is so small you could get new-to-deep-learning folks to see the full process of SOTA vision. You could illustrate the matrices and vectors, channels, GELU and norm (skip backprop) to show what code/math is necessary to take a 2D image and output a class, with SOTA accuracy.

  • @reuvper
    @reuvper 3 years ago

    What a great channel! Thanks!

  • @kaixiao2931
    @kaixiao2931 3 years ago

    I think there is a mistake at 9:34. The first channel corresponds to the upper-left corner of each patch?

  • @mlengineering9541
    @mlengineering9541 3 years ago

    It is a PointNet for image patches...

  • @NeoShameMan
    @NeoShameMan 3 years ago +1

    I'm confused, what does MLP mean here? I thought an MLP used binary neurons, i.e. either output 1 or 0, no fancy curves like sigmoid and ReLU...

    • @YannicKilcher
      @YannicKilcher 3 years ago +3

      Multilayer Perceptron. A neural net consisting mainly of fully connected layers and nonlinearities

    • @NeoShameMan
      @NeoShameMan 3 years ago +1

      @@YannicKilcher thanks!

  • @rishikaushik8307
    @rishikaushik8307 3 years ago

    In the mixer, MLP-1 just sees the distribution of a channel or feature without knowing what that feature is; appending the one-hot index of the channel to the input might help the model learn better.
    Also, can we really say that each channel in the output of MLP-1 after the transpose corresponds to a patch?

    • @arunavaghatak6281
      @arunavaghatak6281 3 years ago +1

      In the output of MLP-1, before the transpose, each row alone contains information about the global distribution of some feature but no information regarding what that feature is. But notice that in the resulting matrix, information regarding the identity of the feature is still there: the first row corresponds to the first feature, the second row to the second feature, and so on. The feature identity is determined by the row number. So, after the transpose, when MLP-2 is applied, it knows that the first element of the input vector contains information about the global distribution of the first feature, the second element that of the second feature, and so on. So, information regarding the feature's identity is not lost. We don't need a one-hot index of the channel for that.
      And we can definitely say that each channel in the output of MLP-1 after the transpose corresponds to a patch. That's because MLP-1 doesn't output a single scalar value. It outputs a vector in which each element corresponds to some patch. So, MLP-1 outputs the information (about the global distribution of the feature) needed by the first patch at the first element of the output vector, that needed by the second patch at the second element, and so on. That's why different patches have different feature vectors after the transpose.
      Hope it makes sense. I have not studied deep learning very deeply.

  • @ecitslos
    @ecitslos 3 years ago

    You can implement an MLP with CNN layers, and due to hardware and CUDA magic it can be faster. They are essentially the same operation.

    • @ghostriley22
      @ghostriley22 3 years ago

      Can you explain more or give any links?

    • @my_master55
      @my_master55 2 years ago

      @@ghostriley22 1x1 convolutions can be applied instead of an MLP.
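
    A small check of that claim (illustrative sizes, not code from the paper): a 1x1 convolution and a per-position Linear layer compute the same thing when given the same weights.

        import torch
        import torch.nn as nn

        c_in, c_out = 64, 128
        x = torch.randn(2, c_in, 14, 14)                    # (batch, C, H, W) feature map

        linear = nn.Linear(c_in, c_out)
        conv1x1 = nn.Conv2d(c_in, c_out, kernel_size=1)
        with torch.no_grad():                               # load the same weights into both
            conv1x1.weight.copy_(linear.weight.view(c_out, c_in, 1, 1))
            conv1x1.bias.copy_(linear.bias)

        out_conv = conv1x1(x)                               # (2, c_out, 14, 14)
        out_lin = linear(x.flatten(2).transpose(1, 2))      # apply the Linear at each position
        out_lin = out_lin.transpose(1, 2).view(2, c_out, 14, 14)
        print(torch.allclose(out_conv, out_lin, atol=1e-5)) # True: same operation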

  • @user-mb3mf2og9k
    @user-mb3mf2og9k 3 years ago +3

    How about replacing the MLPs with CNNs or transformers in this architecture?

  • @XOPOIIIO
    @XOPOIIIO 3 years ago +4

    These patches can't extract data efficiently; they should at least overlap. To do better you don't need a larger dataset, just augment the existing one with translations.

    • @sheggle
      @sheggle 3 years ago

      They do augment. What makes you think that you don't need a larger dataset?

    • @XOPOIIIO
      @XOPOIIIO 3 years ago +1

      @@sheggle Because of how these patches are stacked together, they're not strided as in a CNN.

  • @sahityayadav9606
    @sahityayadav9606 3 years ago

    Always amazing video

  • @sau002
    @sau002 3 years ago

    How does feature detection work if there are no CNN kernels? E.g. at some point you mention corners getting detected in a patch.

  • @slackstation
    @slackstation 3 years ago +2

    Congratulations on the PhD!

  • @talha_anwar
    @talha_anwar 3 years ago

    In the case of RGB, will there be 3 channels?

  • @IoannisNousias
    @IoannisNousias 3 years ago

    Is the “mixer” kinda like a separable filter?
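
    Loosely, yes, in the sense that it factors one big mixing operation into two cheaper ones, similar in spirit to a depthwise-separable convolution (spatial mixing, then pointwise channel mixing). A sketch of that analogy (illustrative only, not the paper's code):

        import torch
        import torch.nn as nn

        channels = 64
        x = torch.randn(1, channels, 28, 28)

        # Depthwise-separable conv: per-channel spatial mixing, then 1x1 channel mixing.
        depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        print(pointwise(depthwise(x)).shape)          # torch.Size([1, 64, 28, 28])

        # Mixer makes the analogous split, but with MLPs:
        #   token-mixing MLP   ~ "spatial" mixing, except across all patches at once
        #   channel-mixing MLP ~ pointwise (1x1) mixing across channels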

  • @herp_derpingson
    @herp_derpingson 3 years ago

    Friendship ended with CNNs. DNN is my new best friend.

  • @bluel1ng
    @bluel1ng 3 years ago

    Mix it baby! Gimme dadda, more more dadda ..

  • @NextFuckingLevel
    @NextFuckingLevel 3 years ago +3

    Reject Attention, embrace Perceptron

  • @droidcrackye5238
    @droidcrackye5238 3 years ago

    Convs share parameters over patches; this paper does not.

  • @timhutcheson8714
    @timhutcheson8714 3 years ago

    What I was doing in 1990...

  • @qidongyang7817
    @qidongyang7817 3 years ago

    Pre-training again ?!

  • @Rizhiy13
    @Rizhiy13 3 years ago

    Now someone needs to try it on NLP)

  • @Mikey-lj2kq
    @Mikey-lj2kq 3 years ago

    per-patch looks like cnn though.

  • @jamgplus334
    @jamgplus334 3 years ago

    next up: empty is all you need

  • @valeria6813
    @valeria6813 3 years ago

    my little pony my little pony AaAaAAaAa

  • @domenickmifsud
    @domenickmifsud 3 years ago +1

    Lightspeed

  • @jameszhang126
    @jameszhang126 3 years ago +1

    these mlps are just another way of saying 1d cnn

  • @swordwaker7749
    @swordwaker7749 3 years ago

    I feel that the mlp could be replaced with an lstm or something.

  • @Adhil_parammel
    @Adhil_parammel 3 years ago

    Can AlphaZero teach Go or chess lessons with the help of GPT-3!?

  • @444haluk
    @444haluk 3 years ago

    They literally proved nothing has changed after 2012.

  • @sangeetamenon8958
    @sangeetamenon8958 3 years ago

    I want to email you, please give me your ID.

  • @WhatsAI
    @WhatsAI 3 years ago

    Hey Yannic! Awesome video as usual, and it's so great that you cover all these new papers so quickly. I love it!
    Also, we share your content on my Discord server, which now has over 10,000 members in the field of AI. If you like chatting with people, I would love it if you joined us! It was built to let people share their projects, help each other, etc., in the field of AI, and I would be extremely happy to see your name in there sometimes! You can also share your new videos and just talk with others on there. I am not sure if I can paste links in a YouTube comment, but the server is Learn AI Together on Discord. I could message you the link on Twitter if you'd like!

  • @peterfeatherstone9768
    @peterfeatherstone9768 3 years ago +1

    It looks like what makes this good is the transposing of the patches, which enlarges the receptive field of the network. Maybe something similarly good could be achieved using pixel shuffle (pytorch.org/docs/stable/generated/torch.nn.PixelShuffle.html) and regular convs. Maybe pixel shuffling and other tricks like that could do the job just as well.

    • @da_lime
      @da_lime 3 years ago

      Sounds interesting
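
    A rough sketch of the pixel-shuffle idea from the comment above (hypothetical, not something tested in the paper): PixelUnshuffle folds each spatial neighbourhood into the channel dimension, a cheap 1x1 conv mixes it, and PixelShuffle folds it back.

        import torch
        import torch.nn as nn

        class ShuffleMix(nn.Module):
            """Hypothetical block: move an r x r neighbourhood into channels,
            mix with a 1x1 conv, then move it back, with a skip connection."""
            def __init__(self, channels, r=4):
                super().__init__()
                self.unshuffle = nn.PixelUnshuffle(r)      # (C, H, W) -> (C*r*r, H/r, W/r)
                self.mix = nn.Conv2d(channels * r * r, channels * r * r, kernel_size=1)
                self.shuffle = nn.PixelShuffle(r)          # back to (C, H, W)

            def forward(self, x):
                return x + self.shuffle(self.mix(self.unshuffle(x)))

        block = ShuffleMix(channels=32, r=4)
        print(block(torch.randn(1, 32, 64, 64)).shape)     # torch.Size([1, 32, 64, 64])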