Longformer: The Long-Document Transformer

  • Published on 14 Oct 2024

Comments • 50

  • @ArnavArora
    @ArnavArora 4 years ago +36

    Great video! Very intuitive.
    Slight correction: when explaining the sliding window attention using the graphs in the paper, either the range should be [i-w/2, i+w/2] or the window size is 2w.

    • @YannicKilcher
      @YannicKilcher 4 years ago +3

      Absolutely true, thank you!
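
A minimal sketch of the convention in the correction above, assuming a total window size w so that each token sees w/2 neighbors on either side. The name `sliding_window_mask` and the sizes used are illustrative only and do not come from the paper or the video.

```python
def sliding_window_mask(n, w):
    """Build an n x n boolean mask for sliding window attention.

    Token i may attend to token j when |i - j| <= w // 2,
    i.e. j lies in the range [i - w/2, i + w/2].
    """
    half = w // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


mask = sliding_window_mask(n=8, w=4)

# Positions that token 3 attends to: two neighbors per side plus itself.
attended = [j for j, allowed in enumerate(mask[3]) if allowed]
print(attended)  # [1, 2, 3, 4, 5]
```

If the range were written as [i-w, i+w] instead, the same token would see 2w neighbors, which is the other reading mentioned in the comment.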

  • @gary9630
    @gary9630 4 years ago +12

    I just finished reading the paper and then watched the illustration you gave in this video. TBH, you give such an informative and intuitive explanation! Nice job! I really love it.

  • @jeonghwankim8973
    @jeonghwankim8973 3 years ago +1

    Great explanation. You literally tap into the main part of the paper and explain it in the most intuitive way possible. Thank you.

  • @shairuno
    @shairuno 3 years ago +3

    I love how you are always skeptical and try to justify the claims from the paper. Great vid!

  • @sagumekishin5748
    @sagumekishin5748 4 years ago +2

    Speaking of convolution and attention, there are papers suggesting self attention can replace convolution completely in vision tasks. I think those papers are worth covering.
    "Stand-alone self-attention in vision models"
    "On the Relationship between Self-Attention and Convolutional Layers"

    • @priyamdey3298
      @priyamdey3298 4 years ago

      Very interesting line of work. Thanks for the info!

  • @DistortedV12
    @DistortedV12 4 years ago +3

    Wow, that was fast! This paper just came out. Will check the vid tomorrow.

  • @mnk6436
    @mnk6436 2 years ago +1

    Awesome video and great explanation! Thank you!

  • @iliemihai949
    @iliemihai949 4 years ago +2

    This video is so awesome and well explained. You are a great teacher! Also, I was wondering if you know whether the document classification datasets that they use for testing are available? Thank you and keep it up.

  • @Arwin_Unbeatable
    @Arwin_Unbeatable 2 months ago

    Great video. A big thumbs up

  • @laveenabachani
    @laveenabachani 3 years ago

    Thank you for making this. This was very helpful. Loved it!

  • @b.jardim2079
    @b.jardim2079 4 years ago +1

    Great video, you made it very clear! Thanks.

  • @OccultDemonCassette
    @OccultDemonCassette 1 year ago

    Hmm, I wonder why "special tokens" are turned off in a lot of the Colab tests I've seen on Longformer. Seems like they would be beneficial?

  • @freemind.d2714
    @freemind.d2714 4 years ago +1

    Great video as always!!
    Do you think the sliding window and dilated sliding window are very similar to the idea behind the WaveNet architecture, i.e. making it more efficient for deeper layers to take in information from the whole input?

    • @YannicKilcher
      @YannicKilcher 4 years ago

      It's certainly related, but the dilated convolutions will then pull the dilations together in the next layer, not sure if that's happening here.

  • @dummyrezajzadeh
    @dummyrezajzadeh 11 months ago

    beautifully explained thank you

  • @chrisber
    @chrisber 2 years ago

    This was excellent, thank you so much for your channel!

  • @GeekProdigyGuy
    @GeekProdigyGuy 4 years ago +1

    If there are 2 layers with kernel size 3, and only the second layer is dilated skipping every other unit, the second layer will not "miss" any local information simply by the adjacent windows overlapping. So I don't think using dilation only at higher layers necessarily goes against the importance of locality.

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Correct. Information can still aggregate with depth. My point was that their argument for the sliding window was the importance of locality and the dilation is directly counter to that. But yes, they solve that by only dilating the higher layers where they argue that locality does not matter as much anymore.
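
The two-layer case discussed in this thread can be checked with a small receptive-field calculation. This is only a sketch under the thread's assumptions (kernel size 3 in both layers, dilation 2 in the second layer only); `receptive_field` is a hypothetical helper, not something from the paper.

```python
def receptive_field(center, layers):
    """Return the sorted input positions that can influence `center`.

    `layers` lists (kernel_size, dilation) pairs from the first (lowest)
    layer to the last (highest) layer.
    """
    positions = {center}
    # Walk from the top layer down, expanding every position into the
    # positions it reads from in the layer below.
    for kernel_size, dilation in reversed(layers):
        half = kernel_size // 2
        positions = {
            p + k * dilation
            for p in positions
            for k in range(-half, half + 1)
        }
    return sorted(positions)


# Only the second layer is dilated: the lower windows overlap, no gaps.
field = receptive_field(center=10, layers=[(3, 1), (3, 2)])
print(field)  # [7, 8, 9, 10, 11, 12, 13]

# Both layers dilated: every other input position is skipped.
gappy = receptive_field(center=10, layers=[(3, 2), (3, 2)])
print(gappy)  # [6, 8, 10, 12, 14]
```

With dilation only in the upper layer, the overlapping lower-layer windows fill in the skipped positions, which matches the point that no local information is missed.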

  • @АлексейТучак-м4ч
    @АлексейТучак-м4ч 4 years ago +1

    Idea inspired by global+sliding window: n1 nodes would connect to c1 previous nodes, n2 to c2, n3 to c3 etc. and they all are randomly shuffled

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Nice idea, but you'd lose the inductive prior that neighbors are important

  • @kikimajo6850
    @kikimajo6850 4 years ago +1

    Love this series!

  • @paramsraman3948
    @paramsraman3948 3 years ago

    Great review of the paper! Very clear and helpful. Mind sharing what tools you use for the presentation (to zoom in, annotate with your markers, get to the whiteboard on the side of the PDF, etc.)? It is pretty cool.

  • @qian2718
    @qian2718 4 years ago +2

    Didn't realize the memory consumption would be the same until I watched the video😭

  • @riasingh2558
    @riasingh2558 4 years ago +1

    Intuitively, how is Longformer different from Transformer-XL? Finally, how do Transformer-XL, Longformer, and Linformer compare with each other if both Longformer and Linformer have linear complexity? Thanks for the great content!

    • @YannicKilcher
      @YannicKilcher 4 years ago

      In Transformer-XL you don't explicitly train the carry-over mechanism, as I understand it. The Linformer projects the sequence length down, while the Longformer only attends to a sub-part of the sequence.
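
To make the "linear complexity" comparison in this thread concrete, here is a rough count of attention scores per layer. It is a back-of-the-envelope sketch only: the values of n, w and k are made up for illustration, the sliding window figure is an upper bound that ignores sequence edges, and `attention_score_counts` is a hypothetical helper.

```python
def attention_score_counts(n, w, k):
    """Rough number of query-key scores per attention layer.

    n: sequence length
    w: total sliding window size (assumed, w/2 tokens on each side)
    k: projected sequence length for a Linformer-style projection (assumed)
    """
    return {
        # Full attention: every token scores against every token.
        "full": n * n,
        # Sliding window: each token scores against at most w + 1 tokens.
        "sliding_window": n * (w + 1),
        # Projection: each token scores against k projected positions.
        "projected": n * k,
    }


counts = attention_score_counts(n=4096, w=512, k=256)
print(counts)
# {'full': 16777216, 'sliding_window': 2101248, 'projected': 1048576}
```

Both reduced variants grow linearly in n for fixed w or k, but they get there differently: one restricts which pairs are scored, the other shortens the sequence being attended to.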

  • @Chr0nalis
    @Chr0nalis 4 years ago +5

    Thx for the paper. Sound level is on the low side, perhaps you could look into normalizing the level before uploading.

    • @YannicKilcher
      @YannicKilcher 4 years ago +2

      Noted, thanks for the feedback.

  • @drdca8263
    @drdca8263 4 years ago +1

    Question: could it work to have the model select on-the-fly some subset (with a small maximum size, like, at most 10 or so, idk) of nodes to treat as special, and as being able to connect to everything?
    (And, which nodes are counted as special would differ between layers)
    Like, have something estimate for each node how useful it would be to have that node be checked against all of the nodes, not just the nearby ones, and then when computing the dot products, and associated matrix, compute the dot products for the nearby pairs and the pairs where at least one of the two was selected as important?
    I imagine maybe that would be hard to train because not differentiable? And also might be slow to compute?
    Disclaimer: I don’t know what I’m talking about.

    • @YannicKilcher
      @YannicKilcher 4 years ago +1

      You're absolutely right. First, yes that could be a very valid idea and second, yes it would be very hard to train, because as soon as you introduce "hard" attention like this, you have no learning signal flowing back.

    • @drdca8263
      @drdca8263 4 years ago

      Yannic Kilcher Thanks!

    • @sauravmukherjeecom
      @sauravmukherjeecom 4 years ago

      Weirdly, your idea is very similar to BigBird. Your idea was definitely very valid.
      For Yannic's skepticism about information flowing back, they kept the subset the same across layers and changed the subsets only across different sequences.
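
The pairing rule proposed at the top of this thread (score nearby pairs, plus any pair where at least one token was selected as important) can be sketched as a mask. This sketch assumes the selected set is simply given as input; it does not model the selection step itself, which is the hard, non-differentiable part raised in the replies. `local_plus_selected_mask` is a hypothetical name.

```python
def local_plus_selected_mask(n, w, selected):
    """Build an n x n boolean mask combining local and selected pairs.

    A pair (i, j) is scored when the tokens are within w // 2 of each
    other, or when at least one of them is in `selected`.
    """
    half = w // 2
    selected = set(selected)
    return [
        [abs(i - j) <= half or i in selected or j in selected
         for j in range(n)]
        for i in range(n)
    ]


mask = local_plus_selected_mask(n=8, w=2, selected=[0])

# Token 5 sees its neighbors 4..6 plus the selected token 0.
row5 = [j for j, allowed in enumerate(mask[5]) if allowed]
print(row5)  # [0, 4, 5, 6]

# The selected token 0 sees every position.
print(all(mask[0]))  # True
```

Passing a different `selected` list per layer or per sequence would correspond to the variants discussed in the thread.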

  • @Timtom0707
    @Timtom0707 4 years ago

    If every way of coloring the matrix is a valid way of cutting down on the attention calculations, maybe it would be interesting to do some kind of architecture search over possible colorings? It seems unlikely that the assumptions they've made are the optimum. Maybe there's room for some kind of hierarchical structure? Hmmm

  • @herp_derpingson
    @herp_derpingson 4 years ago +1

    It would be interesting to see what this model would look like under OpenAI microscope.
    If that is even possible.

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Not directly. The microscope optimizes inputs in continuous space. Here you'd have to optimize the discrete text input. Not entirely clear how that would work.

  • @taku8751
    @taku8751 4 years ago +1

    It just aggregates all the separate parts processed by a normal transformer.

  • @priyamdey3298
    @priyamdey3298 4 years ago

    A thought crossed my mind: Do you think special tokens act more like the cell states of LSTM?

  • @adamtran5747
    @adamtran5747 2 years ago

    i love it

  • @mehdimashayekhi1675
    @mehdimashayekhi1675 4 years ago

    great job! keep it up

  • @josephharvey1762
    @josephharvey1762 2 years ago +1

    If local attention is really what matters, why bother building a model that can attend over entire massive documents? Why do we need to overcome the 512 seq limit if local attention is really most important?

    • @LegoGunshipper
      @LegoGunshipper 1 year ago

      It achieves better results. There is no 'why'

  • @pratik6447
    @pratik6447 3 years ago

    In the Longformer model, the config file has > and >. What do these two parameters mean? Which one is the token size?

    • @ty7521
      @ty7521 2 years ago

      4098 is the token size

  • @zingg7203
    @zingg7203 2 years ago

    The volume is really low. I have to max it out to hear you talk.

  • @johngrabner
    @johngrabner 4 years ago

    A more symmetrical approach: split seq into n. 2 layers of attention, first across n, second first element of n for every n. Am I missing something?

  • @parker1981xxx
    @parker1981xxx 3 years ago

    Yannic, I think it would be nice if you could redo that "Attention is All You Need" video. The reasons are: (1) your video lecture skills are so much better compared with 3 years ago (your voice control, drawings, jokes, etc.), and (2) that paper is at the core of many papers you present, so people visit that video continuously.