Linformer: Self-Attention with Linear Complexity (Paper Explained)

  • Published 20 Jan 2025

Comments • 79

  • @mafaldahopfkirch299 · 4 years ago · +9

    It is just amazing how you manage to explain and still fit in a bit of humor. Thank you!

  • @MrVaunorage · 4 years ago · +18

    4:30 best explanation of transformer ever! I finally understood the intuition behind it :)

  • @jg9193 · 4 years ago · +4

    I love the combination of code plus paper explanation. Great work

  • @florianhonicke5448 · 4 years ago · +2

    I like that you explain it in your own words and give examples.

  • @gilshapira3498 · 3 years ago

    Thanks Yannic for the great value you add to the ML community by clearly distilling these interesting papers.

  • @adamtran5747 · 2 years ago · +1

    This is a very good video. Please keep up the good work. I love this content.

  • @RaivoKoot · 4 years ago · +1

    This paper is really impressive. Your explanation was incredibly good and much needed, as this was a more mathematical paper. Thank you. I literally watch all of your paper explanations. They're great.

  • @Anirudh-cf3oc · 2 years ago

    This explanation saved me a lot of time. Amazing work!!

  • @beepbuupbuupbeep · 4 years ago · +1

    That review was fantastic!

  • @samjoel4152 · 4 years ago · +30

    Wow you're fassssssstttttt.....
Yannic: I am SPEED....

  • @siyn007 · 4 years ago · +2

    Great explanation, easy to follow

  • @SpicyMelonYT · 4 years ago · +4

    I'm watching this video with such interest, constantly thinking about the information here, but then I see the word cumsum and can't stop laughing. I feel like I'm two people sometimes haha

  • @DavenH · 4 years ago · +2

    Masterful presentation. Thank you.

  • @bluel1ng · 4 years ago · +11

    This time it felt a bit more like a lecture. Especially cool were the low-rank numpy demos! The dimension reduction from 64k down to 128 seems quite drastic to me; I am not immediately convinced that the "granularity" of the attention will not suffer significantly (see the numpy sketch after this thread). But this is my initial (naive) thought without looking into the paper. Anyway, this is definitely an important and very practical addition to the attention toolbox. I am curious whether they first did the math or just tried the random dimension reduction and then went on to explain it. ;-)

    • @YannicKilcher · 4 years ago · +3

      Yea good point. I guess that's stuff you'd ask at the poster presentation.
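
    A minimal numpy sketch of the low-rank point discussed in this thread (not from the paper or the video; n is scaled down from the commenter's 64k so the SVD stays cheap). It builds a softmax attention matrix from random Q/K and reports how much spectral energy a rank-128 truncation keeps.

      import numpy as np

      n, d, k = 1024, 64, 128                    # sequence length, head dim, target rank
      rng = np.random.default_rng(0)
      Q = rng.standard_normal((n, d))
      K = rng.standard_normal((n, d))

      scores = Q @ K.T / np.sqrt(d)
      P = np.exp(scores - scores.max(axis=1, keepdims=True))
      P /= P.sum(axis=1, keepdims=True)          # row-softmax attention matrix, shape (n, n)

      s = np.linalg.svd(P, compute_uv=False)     # singular values, sorted descending
      energy = np.cumsum(s**2) / np.sum(s**2)
      print(f"spectral energy kept by a rank-{k} truncation: {energy[k-1]:.4f}")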

  • @ProfessionalTycoons · 4 years ago

    Thank you for covering this paper, much appreciated!

  • @ChuanChihChou · 4 years ago · +8

    Now I really wonder what would happen if we make those E & F projection matrices trainable...

  • @blue_bear_music · 4 years ago · +5

    Yannic, nice investigation into why the rank is low :D I wonder if this will replace the transformer, since the Reformer had a similar improvement but apparently nobody cared enough to switch; GPT-3 still has the O(n^2) dot products.

    • @isaackod · 4 years ago · +2

      They argued in the paper that the Reformer is "only more efficient than the vanilla transformer when sequence length is extremely long", so perhaps this new method will gain more momentum? Certainly excited, as this has a lot going for it in terms of both training compute and inference speed.

  • @jasdeepsinghgrover2470 · 4 years ago · +1

    Really amazing video!!

  • @neetikapanwar8866 · 4 years ago · +2

    Nice explanation, very helpful. Thanks :-)

  • @beepbuupbuupbeep · 4 years ago · +2

    I wonder what would happen if one learned the projection matrices. The backprop should automatically adjust them to retain maximal information, right? So we might learn better projection matrices than the precomputed ones.

  • @convolvr · 4 years ago · +1

    11:30 If you order the eigenvalues by magnitude, high-rank matrices have a slower-decaying curve? Couldn't you use a diagonal matrix (with eigenvalues sorted down the diagonal) to produce a full-rank matrix with whatever curve you want? (See the sketch after this thread.)

    • @YannicKilcher · 4 years ago · +1

      Very true. It's more like the intrinsic dimensionality of the data rather than the rank.
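
    A small numpy sketch of the point above (illustrative, not from the video): a matrix can be full-rank in the strict sense while almost all of its spectral energy sits in a few directions, which is what the decay curve / "intrinsic dimensionality" actually measures.

      import numpy as np

      n = 512
      fast = np.diag(np.exp(-np.arange(n) / 40.0))   # full rank, energy concentrated in few directions
      flat = np.diag(np.ones(n))                     # full rank, energy spread evenly

      def effective_rank(M, thresh=0.99):
          """Number of singular values needed to capture `thresh` of the spectral energy."""
          s = np.linalg.svd(M, compute_uv=False)
          energy = np.cumsum(s**2) / np.sum(s**2)
          return int(np.searchsorted(energy, thresh)) + 1

      print(np.linalg.matrix_rank(fast), effective_rank(fast))   # 512, ~93
      print(np.linalg.matrix_rank(flat), effective_rank(flat))   # 512, ~507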

  • @jinlaizhang312 · 4 years ago · +3

    Awesome!

  • @akimtsvigun8783 · 4 years ago · +2

    Yannic, thanks for your amazing video. I would like to ask about the matrix rank: you say many things in the authors' proof come from the fact that n (the sequence length) is much larger than d (the hidden dimensionality). However, d usually equals 512 in BERT-based models, and it is difficult to imagine a sequence length of 512 (let alone one that is "much larger than 512"). Could you please clarify this?

    • @YannicKilcher · 4 years ago · +2

      512 is actually pretty limited if you think of entire paragraphs / chapters / websites, etc., especially since the text is tokenized into word pieces, not entire words. Plus, the d parameter usually gets sub-divided into multiple heads and ends up being more like 64 or 32.

    • @danraviv7393 · 11 days ago

      lol this didn't age well :) The future always surprises us!

  • @TheThirdLieberkind · 4 years ago · +2

    Could this be thought of as a kind of static dropout system, but with essentially no loss of data between the layers?

    • @YannicKilcher · 4 years ago · +4

      *approximately* no loss of data ;)

  • @dwrety · 4 years ago · +3

    I love you Yannic

  • @Xaelum · 4 years ago · +5

    I was going to ask for a video about the Linformer, but you are way too fast.
    Great job man!

  • @stefans.8027 · 4 years ago · +1

    I like your videos going into detail and helping to understand the paper.
    But: you might know 2minutepapers... those are really nice for getting a short look at the key point, and then I can read further into the paper.
    What do you think about some short episodes that just glance at the papers, and other episodes going more into detail? View counts and comments will tell you how to balance.
    Keep it up!

    • @YannicKilcher · 4 years ago · +9

      Yeah, I feel there are already many channels that do that, and I'm not good at keeping it short :)

    • @sucim · 4 years ago

      @@YannicKilcher The kids want long form content! ;) (approximate credit goes to Lex Fridman)

  • @charlesfoster6326 · 4 years ago · +1

    Another great video! Thank you for this service to the community. :)
    One point I wasn't clear on: should we use a different set of random projection matrices for each input sequence (as in Reformer), or are they proposing they can be fixed?

    • @YannicKilcher · 4 years ago · +3

      I think they can just be fixed once. They even experiment with having the same projection throughout the entire network.

  • @bosepukur · 4 years ago · +1

    Thanks for this... Are you saying they are just projecting the self-attention matrix to a lower dimension, and claiming that because it's self-attention this projection is possible? While in reality any matrix can be projected while approximately preserving the original distances?

    • @YannicKilcher · 4 years ago

      Yes, but if you do it with any matrix then the guarantees depend on the original dimension, whereas here they depend on the rank, which is the hidden dimension.
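
    A tiny numpy sketch of the Johnson–Lindenstrauss idea being discussed (illustrative; the Gaussian projection and the specific sizes are my choice): pairwise distances survive a random projection to k dimensions up to a small distortion, regardless of the original dimension.

      import numpy as np
      from itertools import combinations

      rng = np.random.default_rng(0)
      n_points, n_dim, k = 100, 4096, 256
      X = rng.standard_normal((n_points, n_dim))

      R = rng.standard_normal((n_dim, k)) / np.sqrt(k)    # random Gaussian projection
      Y = X @ R

      pairs = list(combinations(range(n_points), 2))
      orig = np.array([np.linalg.norm(X[i] - X[j]) for i, j in pairs])
      proj = np.array([np.linalg.norm(Y[i] - Y[j]) for i, j in pairs])
      ratio = proj / orig                                 # values close to 1 (small distortion)
      print(f"distance ratios after projection: min={ratio.min():.3f}, max={ratio.max():.3f}")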

  • @shaz7163 · 3 years ago

    Hi, amazing presentation and lecture :). Btw, I think the projection matrices are made of learnable parameters and they do not use SVD.

  • @scottmiller2591 · 4 years ago · +4

    That future Yannic is pretty smart.

    • @YannicKilcher · 4 years ago · +2

      I gave him my number. xoxo

  • @srinathtankasala · 1 year ago

    Great video, and I love your channel. I have 3 questions on this:
    How do padding and masking work in this case?
    Since E and F depend on the sequence length (n), wouldn't this make the Linformer a fixed-sequence-length encoder-decoder?
    How would this model work at inference time when you are doing greedy decoding of the output?
    For example, say you generate the input embeddings from the encoder and then start with a token at the decoder. In this case the linear attention in the decoder cannot handle a single output, as it can't project a single output embedding into the k-dimensional space: the decoder's E and F map from n to k dimensions, so they can't be multiplied with an n=1 output embedding. (A shape-level sketch follows below.)
    Any help is appreciated.
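
    A shape-level sketch of Linformer-style attention for a single head (fixed random E and F, as discussed in the video; the sizes are arbitrary). It makes the fixed-length issue concrete: E and F have a second dimension of n, so the layer is tied to one sequence length (shorter inputs would need padding); it does not address the causal-decoding question.

      import numpy as np

      n, d, k = 512, 64, 128
      rng = np.random.default_rng(0)
      Q = rng.standard_normal((n, d))
      K = rng.standard_normal((n, d))
      V = rng.standard_normal((n, d))

      E = rng.standard_normal((k, n)) / np.sqrt(k)    # projects keys along the sequence axis
      F = rng.standard_normal((k, n)) / np.sqrt(k)    # projects values along the sequence axis

      scores = Q @ (E @ K).T / np.sqrt(d)             # shape (n, k) instead of (n, n)
      P = np.exp(scores - scores.max(axis=1, keepdims=True))
      P /= P.sum(axis=1, keepdims=True)
      out = P @ (F @ V)                               # shape (n, d)
      print(scores.shape, out.shape)                  # (512, 128) (512, 64)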

  • @mathematicalninja2756 · 4 years ago · +1

    This is more like using SVD on the routing matrix. Sure, you can use this in production to reduce the model size, but you will have to train it in an O(n^2) way, I think.

    • @charlesfoster6326 · 4 years ago · +2

      You won't. Training is also O(nk). So long as you use a small k, it's effectively linear in time and space complexity with respect to the sequence length.

  • @MatsErikPistol · 4 years ago · +1

    Thanks for the video. Would it be possible to learn the projection? You allude to it in the video.

    • @YannicKilcher · 4 years ago

      I guess, but you'd lose all the good properties.

  • @berin4427 · 3 years ago

    Thanks a lot for the great video(s)! I do think the projection matrices are trained (not pre-fixed) though, given they can be shared among different layers

  • @howardkong8927 · 2 years ago

    I'm not very convinced...
    As I understand it, what this paper is saying is essentially this:
    If I say a very long sentence of n words, where each word carries an information vector of constant length d, you can compress the information into a matrix of size O(d log d) regardless of n.
    In other words, if I say a long enough sentence, most of the words are going to be complete gibberish and can be safely ignored?

  • @souradipchakraborty7071 · 4 years ago

    One question I have: is there any logical/theoretical reason why the softmax delays that stability point? One thing that comes to my mind is the reduction in variance that it causes, and possibly that is the reason, but I'm not sure.

    • @YannicKilcher · 4 years ago · +1

      Good question, I have no idea.

    • @souradipchakraborty7071 · 4 years ago

      @@YannicKilcher Hahaha, true. Actually, I am doing some research on this. The interesting thing is that the JL lemma doesn't say anything about low-rank matrices; its claim holds for any general matrix. I don't know how the JL lemma is relevant in this case either.

    • @shivamraisharma1474 · 4 years ago · +1

      @@souradipchakraborty7071 They probably just saw empirically that the data doesn't need to be that high-dimensional and used the JL lemma to project it into a lower dimension. I naively think this paper was just extensive work to make a simple intuition work.

  • @hannesstark5024 · 4 years ago · +1

    Is that superior in every way to the binning part of the Reformer?

    • @YannicKilcher · 4 years ago

      Probably each has its pros and cons.

  • @herp_derpingson · 4 years ago · +1

    There is no such thing as a free lunch. The big-O notation ignores the constants. In practice, we might find that for some problems the rank is not just ln(n) but c * ln(n), where c is very close to n. Also, this getting rid of higher dimensions reminds me of SqueezeNet.

    Regardless, I think this will be used a lot on mobile devices in the future.

    Very honest Broader Impact Statement. If this gets accepted, I think all papers in the future will simply write, "We see no immediate negative effects beyond what applies to core machine learning" and be done with it.

    • @YannicKilcher · 4 years ago · +8

      That's true, but in the ideal case where the matrix is truly low-rank, there is actual speed-up to be gained for free. It's like there are different ways to implement the same algorithm and some are just faster.

    • @herp_derpingson · 4 years ago

      I am excited to see what Broader Impact Statement other authors are able to conjure.

  • @andreasv9472 · 4 years ago · +1

    Thanks Yannic. Seems interesting. If possible, please help out an AI math-illiterate: what do big and small theta represent in these theorems? For example, in 5Θ(log(n)/ε^2) in equation (8).

    • @YannicKilcher · 4 years ago · +2

      They are various bounds on complexity; look up "big O notation" on Wikipedia.

    • @andreasv9472 · 4 years ago

      @@YannicKilcher Thank you for the quick answer! I'm aware of big O from calculus; I didn't know it was written as Theta. Awesome!

    • @siyn007 · 4 years ago

      @@andreasv9472 Theta means that a constant times the function inside the Theta can serve as both a lower bound and an upper bound of the function in question.
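
    For reference, the standard definition of big-Θ that the Θ(log(n)/ε^2) bound in the theorem relies on, written out as a small LaTeX block:

      % f is bounded both above and below by g, up to constant factors, for large n
      \[
      f(n) = \Theta\bigl(g(n)\bigr)
      \iff
      \exists\, c_1, c_2 > 0,\ n_0 \ \text{such that}\ \forall n \ge n_0:\quad
      c_1\, g(n) \;\le\; f(n) \;\le\; c_2\, g(n).
      \]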

  • @Kerrosene · 4 years ago · +3

    Hi Yannic, one quick question: how long did it take you to read and make sense of this paper? And how long, in general, would it take you to read a paper like this (fairly mathematical)?

    • @YannicKilcher · 4 years ago · +3

      This one I was quite confused by; it took me multiple hours and a night's sleep.

  • @ghostlv4030 · 4 years ago · +1

    So, it only supports input with fixed length right?

    • @YannicKilcher · 4 years ago

      Good question. I'd guess yes, but you can probably always pad.

    • @ghostlv4030 · 4 years ago

      @@YannicKilcher Thanks! Yannic.

  • @taylorchu · 4 years ago · +1

    What do they do with the padding mask?

  • @safoorayousefi3814 · 4 years ago

    Keyeries? ;)

  • @oguzhanercan4701 · 2 years ago

    I think PVT uses this methodology. The paper is not really clear about the spatial reduction.

  • @etopowertwon · 1 year ago

    And they haven't used it in either version of Llama. Sad.

  • @boss91ssod · 4 years ago · +1

    still only 40K subscribers (?)

  • @woshikakadong · 1 year ago · +1

    This paper is so janky.