DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)

  • Published 12 Nov 2024

Comments • 85

  • @randomisedrandomness 3 years ago +16

    I don't understand most of your videos, yet I keep watching them.

  • @anshul5243 3 years ago +56

    The old format seemed better, mainly because of the space wastage in this one. The title seems redundant, since the YouTube video title already has the name of the paper, and the logo would be better off as a smaller watermark.

  • @CosmiaNebula 2 years ago +2

    Unlike what you stated, the positional encoding vectors do not get added to the token encoding vectors. The new token encoding vectors after each attention layer are merely a weighted sum of the previous token encoding vectors.
    See the paper's Equation (4).
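
    To make Equation (4) concrete, here is a minimal single-head sketch of the disentangled attention this comment describes; the parameter names, the bucketing function delta, and the per-pair loop are illustrative, not the paper's implementation:

    import torch

    def disentangled_attention(Hc, P, Wq, Wk, Wv, Wqr, Wkr, delta):
        # Hc: (n, d) content states; P: relative-position embeddings
        # delta[i, j]: bucketed relative-distance index of j w.r.t. i
        Qc, Kc, Vc = Hc @ Wq, Hc @ Wk, Hc @ Wv   # content projections
        Qr, Kr = P @ Wqr, P @ Wkr                # position projections
        n, d = Hc.shape
        A = torch.empty(n, n)
        for i in range(n):
            for j in range(n):
                A[i, j] = (Qc[i] @ Kc[j]               # content-to-content
                           + Qc[i] @ Kr[delta[i, j]]   # content-to-position
                           + Kc[j] @ Qr[delta[j, i]])  # position-to-content
        W = torch.softmax(A / (3 * d) ** 0.5, dim=-1)
        return W @ Vc  # a weighted sum of *content* values only

    Note that the output is built only from Vc, which is exactly the point above: position enters the attention weights but is never summed into the token vectors.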

  • @timdernedde993 3 years ago +81

    The old layout was better, especially as a mobile user, where the screen is smaller and there is a completely unnecessary black bar on the right.

    • @G12GilbertProduction 3 years ago

      And so sumptuous.

    • @rbain16 3 years ago +1

      I came to the comments to say the same. Not on mobile, but I still would rather not have some of the screen taken up by your Twitter pic (what if it changes, too?).

  • @willemwestra 3 years ago +30

    Hi Yannic, absolutely love your videos, but I find the new recording setup quite a bit worse. The simple paper-only setup was very nice: clean and distraction-free. Moreover, the font rendering is also quite a bit worse. It has varied throughout your videos, possibly because you switched software, so just pick the editor and recording program combination that gives the best PDF rendering quality. Distraction-free viewing, crystal-clear rendering, and a good microphone are the things I appreciate most. Your voice and microphone are great, but I long for the old clean and crisp setup :)

  • @quebono100 3 years ago +29

    I like the old setting more; now it's smaller.

  • @adamrak7560 3 years ago +3

    I always feed the positional information in before every attention stage. That seemed better, and it always converged faster for me.
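
    A minimal sketch of the variant this comment describes, re-adding learned absolute position embeddings before every attention block; the class and all names are illustrative, and a stock PyTorch encoder layer stands in for whatever attention stage is actually used:

    import torch
    from torch import nn

    class PositionEveryLayer(nn.Module):
        def __init__(self, num_layers, max_len, d_model, n_heads):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                for _ in range(num_layers))
            self.pos = nn.Embedding(max_len, d_model)

        def forward(self, x):  # x: (batch, seq, d_model)
            idx = torch.arange(x.size(1), device=x.device)
            for layer in self.layers:
                x = layer(x + self.pos(idx))  # re-inject position at every stage
            return x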

  • @haukurpalljonsson8233 3 years ago +2

    The positional information does not leak into later layers, at least not directly. The positional information is only in the attention weights, which are softmaxed and then only multiplied with the content information.

  • @avatar098 3 years ago

    Thank you so much for doing these videos. Helps me keep current with NLP!

  • @rohankashyap2252 3 years ago

    Really awesome YouTube channel; lucky to get access to such great content. Thanks a lot!

  • @null4598 2 years ago

    Cool. You make it easy. Thank you, Yannic.

  • @biesseti 3 years ago

    Found this video and subscribed. You do it well.

  • @nghiapham1632 1 year ago

    Thanks for your great explanation

  • @rezas2626 3 years ago

    Awesome video! Do RANDOM FEATURE ATTENTION next please!

  • @sergiomanuel2206 3 years ago

    Hello!! First of all, thanks for the video. I don't like the new setup; the image takes up a lot of screen space, although the title is okay.

  • @andres_pq 3 years ago +1

    Next do a video about "A Straightforward Framework for Video Retrieval Using CLIP" 👀

  • @lloydgreenwald954 1 year ago

    Very well done.

  • @etiennetiennetienne 3 years ago +1

    I am not sure that feeding position in at the first layer means their architecture already "agglomerates" position. Values are produced only from content but are "weighted" by the hybrid attention. In themselves, the values are just a clever mix of other values; the positional encoding is not really part of the vector itself, as it would be with direct summation or concatenation.

  • @frenchmarty7446 2 years ago

    I'm not sure if feeding information into the model at the beginning is necessarily better than at the end.
    Like you said yourself, the model would have to learn how to propagate that information through. That might be more of a bottleneck than just waiting until the end.
    There might also be a useful inductive bias here that's close to how humans read (you don't read a word and have both its relative and absolute position in mind).

  • @MrJaggy123 3 years ago

    So GLUE was deprecated when submissions surpassed human performance. This paper's submission has done the same thing for SuperGLUE (alongside another submission which also does so). Is it time for a new benchmark again? What are your thoughts on what "the benchmark after SuperGLUE" would look like?

  • @cerebralm 3 years ago

    The only thing I didn't like about the old layout was that it rendered PDFs in light mode. Not sure if there's a good way to run a vote on YouTube to see which of your audience prefers light mode and which would prefer dark mode, but that would be the only thing I would change if it were up to me :)

  • @CppExpedition 1 year ago

    BRILLIANT! :)

  • @dr.mikeybee 3 years ago +1

    Yannic is all you need!

  • @herp_derpingson 3 years ago

    3:25 THICC vector :)

    10:10 Yeah, this addition thing always felt dirty in the original "Attention Is All You Need" paper (see the sketch after this comment). I am glad I was not the only one who felt so.

    14:13 Never mind, we end up adding them anyway, just with extra steps.

    24:11 Is P learnt? Are we using the same P for all languages? There are two main types of languages, subject-verb-object languages and subject-object-verb languages. I don't think we should use the same learnt values of P for all languages, as position works completely differently in the two types.

    35:30 Never mind, we end up using absolute positions anyway.

    41:35 Fermat's last theorem: "I could prove it but I don't have enough battery"

    I thought you had accidentally recorded your video in the wrong aspect ratio LOL
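
    On the 10:10 and 14:13 points: in "Attention Is All You Need", sinusoidal position encodings are summed directly onto the token embeddings, so position and content share one vector from the first layer on. A minimal sketch (assumes an even d_model):

    import math
    import torch

    def add_sinusoidal_positions(x):  # x: (seq_len, d_model) token embeddings
        n, d = x.shape
        pos = torch.arange(n, dtype=torch.float).unsqueeze(1)  # (n, 1)
        div = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
        pe = torch.zeros(n, d)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return x + pe  # the direct summation the comment calls "dirty"

    On the 24:11 point: in DeBERTa, P is a learned relative-position embedding and, as far as I can tell from the paper, it is shared across layers; whether one learned P suits both SVO and SOV languages is a fair open question.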

  • @Hank-y4u 3 years ago

    Hi Yannic, big fan here. Would you make a video about Meta Pseudo Labels?

  • @alpers.2123 3 years ago +12

    How much do you practice pronouncing author names? :)

  • @drozen214 3 years ago +1

    This paper makes me wonder whether we really need to use a whole vector to represent a position.
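
    One alternative in that direction, used by some later models (e.g. T5's relative attention bias), represents each bucketed relative position as a single learned scalar per head added to the attention logits, rather than a full vector. A minimal sketch with illustrative names:

    import torch
    from torch import nn

    class ScalarRelativeBias(nn.Module):
        def __init__(self, num_buckets, num_heads):
            super().__init__()
            self.bias = nn.Embedding(num_buckets, num_heads)

        def forward(self, scores, bucket):
            # scores: (batch, heads, n, n) raw logits; bucket: (n, n) indices
            b = self.bias(bucket)               # (n, n, heads)
            return scores + b.permute(2, 0, 1)  # broadcast over the batch dim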

  • @mathematicalninja2756 3 years ago +2

    Every day we arrive at the future

  • @Kram1032 3 years ago +2

    3:13 OMG so that's why I am hungry!

  • @Dynidittez 3 years ago

    Hi, looking at the models, they seem to have normal versions and versions fine-tuned on MNLI. Do the ones that are fine-tuned on MNLI perform better in most benchmarks? Also, on their Git repo they show scores like 85.6/86.9. Is the second score there meant to represent the fine-tuned MNLI version's score?

  • @susantaghosh504 2 years ago

    Awesome

  • @sandraviknander7898 3 years ago

    If the context and position were truly disentangled all the way through the network, how would the network be able to learn the transformations to the positional vectors it needs in order to route context information? 🤔

  • @G12GilbertProduction 3 years ago

    New shade of BERT v2, but more metronomical.

  • @andres_pq 3 years ago +3

    Does anyone know an advanced PyTorch course? One that includes things like creating custom layers, custom training loops, and handling weird stuff (see the sketch after this thread). I have some research ideas that I don't know how to implement.

    • @snippletrap 3 years ago

      FastAI. The second half of the course is all custom implementations.

    • @andres_pq 3 years ago

      @@snippletrap thanks

    • @谢安-k6t 3 years ago

      @@snippletrap I was using the fast.ai framework about two years ago but quit because it (the framework) was harder to customize than PyTorch and had a bad API. Do you think its course is worth taking?

    • @snippletrap 3 years ago

      @@谢安-k6t Yes, I agree, I prefer vanilla PyTorch. I don't use the FastAI library because it's difficult to read the source when most functions rely on callbacks. I still highly recommend the course, you will learn a lot.

    • @谢安-k6t 3 years ago

      @@snippletrap Got it, thanks a lot.
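
    For a taste of what such a course covers, here is a minimal self-contained example of a custom layer trained with a hand-written loop; everything here (the layer, the data, the names) is illustrative:

    import torch
    from torch import nn

    class ScaledLinear(nn.Module):  # a custom layer with its own parameters
        def __init__(self, d_in, d_out):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(d_out, d_in) * d_in ** -0.5)
            self.scale = nn.Parameter(torch.ones(1))

        def forward(self, x):
            return self.scale * (x @ self.weight.t())

    model = ScaledLinear(8, 1)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    x, y = torch.randn(64, 8), torch.randn(64, 1)
    for step in range(100):  # a custom training loop
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()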

  • @paveltarashkevich8387 3 years ago +1

    The old layout was better. Text resolution was better. Screen space usage was better.

  • @sajjadayobi688 3 years ago +4

    from youtube import Yannic
    paper = 'any complex architecture'
    easy_to_learn = Yannic(paper)

  • @mschnell75 3 years ago +5

    Why isn't this called ERNIE?

    • @MehrdadNsr 3 years ago +4

      I think we already have a model named ERNIE!

    • @peterrobinson7748 3 years ago +1

      Because then it'll rhyme with BERNIE, one of the greatest enemies of America.

    • @ChlorieHCl 3 years ago

      Because there are already at least two models named ERNIE...

  • @pensiveintrovert4318 3 years ago

    Too many layers pollute information that may have been decisive when pristine.

  • @TechyBen 3 years ago

    I came for the old Amiga game... I stayed for the new AI algorithm.

  • @zhangshaojie9790 3 years ago

    Can anyone explain to me the difference between a transformer encoder and decoder? Other than the bidirectionality, the autoencoding, and the extra feed-forward layer, the two architectures look the same to me (see the sketch after this thread).
    I keep hearing people say the decoder is better at scaling. Do people actually mean BERT and GPT?

    • @frenchmarty7446 2 years ago

      The encoder and decoder are two components of the same model.
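
    Concretely, the main architectural difference behind the question above is the mask: a decoder block is causal (token i attends only to positions j <= i), while an encoder block attends bidirectionally. A minimal sketch:

    import torch

    n = 5
    scores = torch.randn(n, n)  # raw attention logits for one head
    causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

    decoder_attn = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
    encoder_attn = torch.softmax(scores, dim=-1)  # no mask: fully bidirectional

    This is also roughly what people mean when they contrast BERT (encoder-only, bidirectional) with GPT (decoder-only, causal).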

  • @yaaank6725 3 years ago

    Why is the Flash, travelling back through time, talking about a paper from half a year ago?

  • @anonymous6713 3 years ago

    So why did you say in the beginning that "the worst" case is a disentangled embedding (half for position, half for content), when this paper proposes exactly the disentangled one?

    • @frenchmarty7446 2 years ago

      He meant the worst case would be the model learning its own disentangled embedding, which would mean some chunk of the "content" vector is being occupied by position information.

  • @sedenions 3 years ago +1

    You are doing a good job. Talk about biologically plausible neural networks next, please.

    • @GreenManorite 3 years ago +1

      Why is that an interesting topic? Not being snarky, just trying to understand motivation for the biological parallelism.

    • @sedenions 3 years ago +2

      @@GreenManorite I'm biased, I majored in neuroscience and am currently switching careers. I guess this sentiment of mine comes from an interest in how researchers can better build cognitive AI. It seems like many of the early neural networks were 'neural' in name only. We are getting closer and closer to biologically plausible nets, but like you said, they're not that interesting to most.

    • @willrazen 3 years ago +1

      Watch his video on "predictive coding"

  • @kimchi_taco 3 years ago +1

    Disentangled attention is already handled by Transformer-XL, which introduced relative positional embeddings. In my opinion, there is no new contribution here.
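
    For comparison (my reading of the two papers, with simplified notation): Transformer-XL decomposes the attention logit for query i and key j into

    a) q_i . k_j            (content-to-content)
    b) q_i . k_rel(i-j)     (content-to-position)
    c) u . k_j              (global content bias)
    d) v . k_rel(i-j)       (global position bias)

    whereas DeBERTa's Equation (4) keeps (a) and (b) but replaces the global bias terms with a position-to-content term, q_rel(j-i) . k_j, built from a separate query projection of the relative-position embeddings. So the two schemes are related but not identical.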

  • @bishalsantra 3 years ago

    Just curious, what app are you using to annotate?

    • @timdernedde993 3 years ago

      OneNote

    • @florianjug 3 years ago

      @@timdernedde993 Is this also true with the new setup used in this video?

  • @gavinmc5285 3 years ago

    Content: OK. Positioning: OK. What about context?

    • @frenchmarty7446 2 years ago

      What exactly do you mean by "context"? Like some kind of additional information not in the word vectors themselves? That would probably be something the model should learn on its own.

    • @gavinmc5285 2 years ago

      @@frenchmarty7446 OK, well, around the 30-minute mark there is a breakdown of the merits of relative and absolute positioning, and the strength of either technique (or both) seems to be correlated with context. Leaving aside computational cost (if it is even relevant), the paper's analysis highlights the before-or-after options for adding absolute positioning (in this paper, at the end of the process). Nonetheless, the context 'factor', that is, 'solving' the context so as to allow accurate word embedding or prediction, remains. Surely the optimum solution, approximately or precisely, would be to have, in a positional and hierarchical (content) vector or matrix set with relative values, some form of absolute feed from which absolute values could be accessed, without necessarily having to prioritize them before or after the relative-value calculations are processed.

    • @frenchmarty7446 2 years ago

      @@gavinmc5285 You didn't actually answer my question, but OK...
      When you say "hierarchical" information, I assume you mean some kind of graph. Unstructured graphs have actually been tried before (BP-Transformer) with some success.
      If you mean some kind of structured graph based on grammar rules, then that is a bad idea. The entire purpose of the self-attention mechanism is to learn the relationships between tokens. The attention mechanism *creates* its own graph at every layer.
      Transformers are powerful (with large amounts of data) because they impose very little inductive bias. We don't tell the network what is or isn't important; it learns that on its own. Feeding in extra information that isn't in the data itself is just extra effort that only biases the network towards one particular way of looking at the data.

    • @gavinmc5285 2 years ago

      @@frenchmarty7446 OK then, to be more definitive: by 'context' I mean such concepts as 'thrust', 'gist', 'essence', or 'meaning'. To interpret and apply context as relevant to the subject matter is a function of intelligence. To some extent a lack of supervision may be appropriate, depending on the instance, although it is unlikely that any algorithm (unsupervised or reinforced) that wanders too far from the context within which it is operating (or is supposed to be operating) is going to suddenly stumble on the parameters it needs to accurately determine values that require the appropriate context ('store / mall' is used as the example in the paper analysis here). Not consistently, time and again, anyway.

    • @frenchmarty7446 2 years ago

      @@gavinmc5285 That is literally *more* vague than just saying "context". You are being less definitive...
      I also don't know what you mean by "stumble" on the correct parameters. We don't stumble on parameters, we train them. And we do so very consistently.
      What do you mean by "wander outside the context"? Do you mean outside the data distribution? That's a different meaning of "context", and we train for that as well.
      Where exactly are you unsatisfied? You say (paraphrasing) "it is unlikely that any algorithm... is going to stumble on the right parameters to accurately determine the right values". Accurate based on what? What specifically does the network have to output to meet your standard of understanding context?