Pretrained Transformers as Universal Computation Engines (Machine Learning Research Paper Explained)

  • Published 27 Sep 2024

Comments • 74

  • @YannicKilcher
    @YannicKilcher  3 years ago +11

    OUTLINE:
    0:00 - Intro & Overview
    2:00 - Frozen Pretrained Transformers
    4:50 - Evaluated Tasks
    10:05 - The Importance of Training LayerNorm
    17:10 - Modality Transfer
    25:10 - Network Architecture Ablation
    26:10 - Evaluation of the Attention Mask
    27:20 - Are FPTs Overfitting or Underfitting?
    28:20 - Model Size Ablation
    28:50 - Is Initialization All You Need?
    31:40 - Full Model Training Overfits
    32:15 - Again the Importance of Training LayerNorm
    33:10 - Conclusions & Comments

  • @PotatoKaboom
    @PotatoKaboom 3 years ago +5

    As always, very well done! Very clear explanation and great thoughts on the paper. Blows my mind that you seem to do these videos in a single take.

  • @twmicrosheep
    @twmicrosheep 3 years ago +1

    I think the paper "K for the Price of 1: Parameter-efficient Multi-task and Transfer Learning" from ICLR 2019 already demonstrated that it is possible to transfer/fine-tune a model using only the normalization layers' scale and bias parameters. It also shows that pre-trained models can achieve better results than randomly initialized ones.
    Quoting from the abstract: "The basic approach is to learn a model patch - a small set of parameters - that will specialize to each task, instead of finetuning the last layer or the entire network. For instance, we show that learning a set of scales and biases is sufficient to convert a pretrained network to perform well on qualitatively different problems (e.g. converting a Single Shot MultiBox Detection (SSD) model into a 1000-class image classification model while reusing 98% of parameters of the SSD feature extractor)."
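The "model patch" described in that abstract can be sketched as a simple parameter filter: everything is frozen except the normalization scales and biases. A minimal sketch, with illustrative parameter names not tied to any particular library:

```python
# Sketch of the "model patch" idea: given a model's named parameters,
# mark only normalization scales/biases as trainable. Parameter names
# below are illustrative, not from any specific framework.

def select_patch_params(named_params, patch_keys=("norm.weight", "norm.bias")):
    """Return (trainable, frozen) name lists: only norm scales/biases train."""
    trainable = [n for n in named_params if any(k in n for k in patch_keys)]
    frozen = [n for n in named_params if n not in trainable]
    return trainable, frozen

params = [
    "embed.weight",
    "block0.attn.weight", "block0.norm.weight", "block0.norm.bias",
    "block1.mlp.weight", "block1.norm.weight", "block1.norm.bias",
    "head.weight",
]
trainable, frozen = select_patch_params(params)
print(trainable)  # the four norm scale/bias entries
print(f"trainable fraction: {len(trainable)}/{len(params)}")
```

In a real framework the same filter would decide which tensors keep gradients enabled; everything else stays at its pretrained (or random) values.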

  • @norik1616
    @norik1616 3 years ago +14

    I wonder why there are never confidence intervals. From my limited experience, even a ±1% difference can lie within 2 standard deviations just from the choice of seed.

    • @howardkong8927
      @howardkong8927 2 years ago

      Yeah. The improvements in the paper seem pretty marginal; I wonder if seed-to-seed error would just make them disappear.
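The seed-variance concern in this thread is easy to check mechanically: given per-seed accuracies for two methods, compare their mean ± 2·std intervals. A minimal sketch with made-up numbers (not results from the paper):

```python
# Quick sanity check for seed noise: report whether two methods'
# mean ± 2*std intervals overlap. The accuracies are invented
# purely for illustration.
from statistics import mean, stdev

def interval(accs):
    m, s = mean(accs), stdev(accs)
    return m - 2 * s, m + 2 * s

def overlaps(a, b):
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    return hi_a >= lo_b and hi_b >= lo_a

fpt_runs    = [68.2, 67.5, 68.9, 67.8, 68.4]  # hypothetical seed runs
random_runs = [61.7, 62.3, 60.9, 62.0, 61.5]
print(overlaps(fpt_runs, random_runs))  # → False: this gap clears seed noise
```

A one-point gap with the same spread would overlap, which is exactly the commenter's worry about marginal improvements.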

  • @DamianReloaded
    @DamianReloaded 3 years ago +12

    This could open the possibility of having a pre-trained multi-purpose core model to which you can just append input layers and output layers and it could be able to process whatever data you throw at it. Imagine if a company provided this core-model as a service and "clients" only had to provide the input layers, the output layers and the data...

    • @Daniel-ih4zh
      @Daniel-ih4zh 3 years ago

      Yeah, it's basically plug in priors.

    • @RobotProctor
      @RobotProctor 3 years ago

      Isn't that what Google's BERT is?

    • @DamianReloaded
      @DamianReloaded 3 years ago +1

      @@RobotProctor I think Google BERT is intended to be used only for NLP. This paper talks about NNs trained for NLP being used with any kind of data, say images, without re-training them with image datasets.

    • @RobotProctor
      @RobotProctor 3 years ago

      @@DamianReloaded got it

    • @andrewminhnguyen9446
      @andrewminhnguyen9446 3 years ago

      Amazing. You wanna do it? MaaS (model-as-a-service).
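The "frozen core plus client-supplied input/output layers" idea from this thread can be illustrated with a toy stand-in: a fixed random nonlinear core as the feature extractor, with only a linear readout fit to the task (here in closed form via least squares). Shapes, scales, and the task itself are all illustrative:

```python
# Toy sketch of a frozen "core model" with client-trained output layer:
# a fixed random nonlinear core extracts features; only a linear readout
# is fit, here in closed form with least squares.
import numpy as np

rng = np.random.default_rng(0)

# Frozen core: a fixed random projection + tanh, never updated.
# The small scale keeps this toy task well within the core's range.
W_core = 0.05 * rng.normal(size=(4, 64))
core = lambda x: np.tanh(x @ W_core)

# Toy "client" task: predict the sum of the 4 input features.
X = rng.normal(size=(200, 4))
y = X.sum(axis=1)

H = core(X)                                # features from the frozen core
w, *_ = np.linalg.lstsq(H, y, rcond=None)  # train only the linear readout
mse = float(np.mean((core(X) @ w - y) ** 2))
print(f"train MSE with frozen core: {mse:.2e}")
```

The point of the sketch is the division of labor: the "provider" ships `core` once, and each "client" only fits cheap input/output parameters on their own data.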

  • @f14-werto
    @f14-werto 3 years ago +17

    Now that I've heard of this "computational utility", I wonder if a more "artificial" hard task like SAT solving can encode better "utilities" than a natural language task.

    • @norik1616
      @norik1616 3 years ago +1

      I like it!
      Sadly I think it would take longer to converge than GPT-2, and that would (for now) require compute out of reach for mere mortals.

    • @f14-werto
      @f14-werto 3 years ago +1

      @@norik1616 Guess I'll keep wondering then

  • @burhanrashidhussein6037
    @burhanrashidhussein6037 3 years ago +1

    22:10 Your insights could be important here; they should try more complex tasks, for example fine-grained classification. MNIST and CIFAR-10 have been used repeatedly to investigate model behaviors and draw big conclusions, while practical applications involve much more complex tasks.
    Thanks for the videos, we are really benefiting.

  • @hannesstark5024
    @hannesstark5024 3 years ago +1

    ~"Because we are in the low-data domain we are not better than the fully trained transformer." But we also see the same performance on ListOps, where we do have a lot of training data.

  • @pensiveintrovert4318
    @pensiveintrovert4318 3 years ago

    Transformers simply build multi-level look-up tables. It is basically an algorithm for that game: "Is it a plant? Is it an animal? Does it have fur?" etc.
    By combining information about a token and its context, transformers disambiguate the meaning of every token within a unique context.

  • @ce6535
    @ce6535 3 years ago +1

    One of the things that I think supports their hypothesis that language is special, in an indirect way, is that essentially every novel item in an image corresponds to some new noun in the language model. So if you had an image classifier with the same number of classes as some other language model, the language model that 'corresponds' to that image classifier might be larger again.

  • @conduit242
    @conduit242 3 years ago +10

    Embeddings are all you need 🤷🏻‍♂️

  • @christophecerisara8741
    @christophecerisara8741 3 years ago

    Great video, thanks ! My guess is that they talk about zero-shot because they place themselves from the perspective of the pretrained transformer: all they're doing is, in a way, train the minimum required to adapt to a new task: inputs encoding, output classes, and adapt the range of values taken by the hidden representations for the new task, but the pretrained modules stay fixed. But I agree with your point, of course... Thanks !

  • @ahmedmagdy2932
    @ahmedmagdy2932 3 years ago +1

    I like your point about the number of classes. Also, honestly, this randomly initialized model is surprisingly good. I can understand the point about the layer-norm effect, but still, it is randomly initialized, lol. I've seen a couple of cases where people try random initialization and get fairly good results. I have a theory that I feel will be proven one day: the transformer idea is good, but these large models are not necessarily needed. I don't think leaving all of these weights behaving blindly is the solution.

  • @user-rh8hi4ph4b
    @user-rh8hi4ph4b 3 years ago +1

    The comparison to a randomly initialized frozen transformer, while still showing a significant difference, is close enough to FPT to really throw me off. Same thing with random CNN of which only the batchnorms are trained. The idea that many supposedly difficult tasks can be solved reasonably well with random transformations where only the scale and mean of some layers/scalars/etc is tweaked tells me we're really missing something. It's almost like machine intelligence is a matter of trying as many random things as you can and scaling the things that don't work into oblivion. Are there any published experiments that compare a fully trained network to its initialized state and look at how much the parameters have actually changed, and which ones?
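The experiment asked for at the end of this comment, comparing a trained network to its initialization, can be sketched as a per-tensor relative-change measurement. The "checkpoints" below are stand-in arrays, not real model weights:

```python
# Sketch of a drift measurement: compare trained weights to their
# initialization via relative L2 change per named tensor. The two
# dicts stand in for real (init, trained) checkpoints.
import numpy as np

def relative_drift(init, trained):
    """||theta - theta_0|| / ||theta_0|| for each named tensor."""
    return {
        name: float(np.linalg.norm(trained[name] - init[name])
                    / np.linalg.norm(init[name]))
        for name in init
    }

rng = np.random.default_rng(0)
init = {"attn": rng.normal(size=(8, 8)), "ln_scale": np.ones(8)}
trained = {
    "attn": init["attn"] + 0.01 * rng.normal(size=(8, 8)),  # barely moved
    "ln_scale": init["ln_scale"] + 0.5,                     # moved a lot
}
drift = relative_drift(init, trained)
print({k: round(v, 3) for k, v in drift.items()})
```

Run over a real training trajectory, a profile like this would show directly which tensors a "mostly frozen is fine" result is leaning on.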

  • @freemind.d2714
    @freemind.d2714 3 years ago +3

    Your second guess about why pretrained transformers even work is the same as mine.

  • @liammcdevitt7594
    @liammcdevitt7594 3 years ago +6

    I noticed that you made a video at the end of 2017 called, "Attention Is All You Need". There is a new paper called, "Is Attention Better Than Matrix Decomposition?" and it is proposing a new method called Hamburger that uses Matrix Decomposition claiming that this 20-year-old technique is better than Attention. I was wondering if you could read this paper and make a video helping me understand this new method?

    • @liammcdevitt7594
      @liammcdevitt7594 3 years ago

      @@pshyilocibinoaut9433 Maybe, we'll have to see if Yannic does a video on it ;)

    • @piotr780
      @piotr780 3 years ago

      Pretrained networks are essentially a kind of decomposition of the original data, so maybe pretraining is discovering some unknown (or known) decomposition method, or approximating an existing one? I once tried a series of NMFs on data, but I had problems with the architecture itself, e.g.: split the data in two in the second layer and apply two NMFs? Or add noise after the first layer, etc.?

  • @aleph0540
    @aleph0540 3 years ago +2

    Does the number of degrees of freedom have something to do with the complexity of the tasks here? Or rather the success of these tasks?

  • @slavamoiseev803
    @slavamoiseev803 2 years ago

    It triggers some speculation about a new type of hardware that provides some kind of general inference for downstream tasks with minimal tuning.

  • @0102030405Jacky
    @0102030405Jacky 3 years ago

    I think the model pretrained on language outperforms ViT or BiT because language has more complicated syntax, such as causality or recursion. I guess a transformer pretrained on logical reasoning tasks might perform just as well as FPT.

  • @dimonenka
    @dimonenka 3 years ago +1

    19:36 It does not seem to me that the vision transformer lags behind. It performs considerably worse on Homology than FPT, but it also underperforms Random on Homology. I don't know what this task is, but I'm guessing the vision transformer just has bad inductive biases or learns bad patterns for it, and that the task is more similar to learning natural language. Bottom line: it really matters what set of problems one chooses to test transformers on, and based on the choice in this paper the results are inconclusive.

  • @williechen1876
    @williechen1876 3 years ago

    Waiting for their pretrained model

  • @rtluo1546
    @rtluo1546 3 years ago +4

    Hi, Can you provide a paper reference for the adapter between transformer layers? Thank you

    • @Dougystyle11
      @Dougystyle11 3 years ago +1

      Look up adapterhub

  • @louis3195
    @louis3195 3 years ago

    Correct me if I am wrong: these researchers are trying to create software that reproduces the computation that some strange apes learned over millions of years fighting mammoths?
    So can this communication tool, "language", contain all the information needed to understand how to survive in a world surrounded by mammoths (and other apes)?
    Did they try the other way around, from vision to language? Because I think vision is by far the most useful input for human beings (so the one bringing the most important knowledge for surviving among apes and mammoths).

  • @sheggle
    @sheggle 3 years ago

    I loved the paper, interested to see what you think

  • @krzysztofwos1856
    @krzysztofwos1856 3 years ago +1

    These transformers develop a universal basis for understanding the structure of the world. Language is the intellect's way of making sense of sensory perceptions, so the structure of all our sensory perceptions is encoded in the language. This is why we can imagine what certain words, e.g. elated, "feel like". It's a path through a space that activates memories of related sensory perceptions. Words move you from one state to another. So the language model develops fundamental understanding of the reality as perceived by the users of the language.

  • @hailking5588
    @hailking5588 1 year ago

    Why does it make sense to train the input layer when training on new modalities?

  • @silberlinie
    @silberlinie 3 years ago

    I think, Yannic, it is similar to what we call learning to learn in education.

  • @andrewminhnguyen9446
    @andrewminhnguyen9446 3 years ago

    Re: fine-tuning self-attention and feedforward layers resulting in performance degradation, the authors state that they don't "[change] the optimization or learning rate scheme." I'm not sure it's a foregone conclusion that the network is overfitting. I would be curious to see if it continues to degrade even if they make judicious choice of learning rate as they gradually unfreeze the layers from the top. Thanks, Yannic.
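The gradual top-down unfreezing this comment proposes can be sketched as a schedule that pairs each stage with per-block learning rates, smaller for deeper (earlier) blocks. Block names, the base rate, and the decay are all illustrative:

```python
# Sketch of gradual unfreezing with a judicious per-block learning rate:
# unfreeze one block per stage from the top down, decaying the lr for
# deeper blocks. Names and rates are made up for illustration.

def unfreeze_schedule(blocks, base_lr=1e-4, decay=0.5):
    """Yield (stage, unfrozen block names, per-block lr) from top down."""
    for stage in range(1, len(blocks) + 1):
        unfrozen = blocks[-stage:]  # the top `stage` blocks train
        lrs = {b: base_lr * decay ** i
               for i, b in enumerate(reversed(unfrozen))}
        yield stage, unfrozen, lrs

blocks = ["block0", "block1", "block2", "block3"]
for stage, unfrozen, lrs in unfreeze_schedule(blocks):
    print(stage, unfrozen, {b: f"{lr:.0e}" for b, lr in lrs.items()})
```

Comparing this against the paper's single fixed optimization scheme would test the commenter's hypothesis that the degradation is an optimization artifact rather than overfitting.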

  • @adamtran5747
    @adamtran5747 2 years ago

    I love you Yannic.

  • @alonsomartinez9588
    @alonsomartinez9588 3 years ago

    Yannic, could you do a video on "Predicting Video with VQVAE"??

  • @herp_derpingson
    @herp_derpingson 3 years ago

    14:55 Maybe it's not zero shot. It is epsilon shot ;)
    .
    Also, I think you should add the #transformer tag to all videos related to transformers because #pretrainedtransformers != #transformer and will probably not show up in the list/recommendations.

  • @matthieulin335
    @matthieulin335 3 years ago

    Is it possible that the model overfitted because it is smaller (the double descent phenomenon)?

  • @mwdcodeninja
    @mwdcodeninja 3 years ago

    Finance immediately comes to mind. Anyone have an order flow stream they'd like to share? For Science!

  • @jabowery
    @jabowery 3 years ago

    Turing not found

  • @carlosxaviersoto5743
    @carlosxaviersoto5743 3 years ago +3

    If you don't know what an attention layer is, I'm sure you'll find some video on youtube that explains it... 😏

    • @jeremykothe2847
      @jeremykothe2847 3 years ago

      But where???

    • @conduit242
      @conduit242 3 years ago

      Pairwise distance. Done

    • @jeremykothe2847
      @jeremykothe2847 3 years ago

      @@conduit242 If you already know what it means, that works :P I'm hoping everyone watching this probably does by now...

    • @conduit242
      @conduit242 3 years ago

      @@jeremykothe2847 I probably should have said weighted distance 😌

  • @Dendus90
    @Dendus90 3 years ago

    The next exercise that comes to my mind is time-series prediction. For instance, we have a timestamp represented as an n-dim vector. It would be great to see whether pretrained LM models can outperform LSTMs and transformers trained from scratch on such tasks. What do you guys think?

    • @conduit242
      @conduit242 3 years ago

      Absolutely, let’s see it predict hierarchical step functions or simple “linearly growing volatility” non-stationary time series

  • @JTMoustache
    @JTMoustache 3 years ago +1

    That’s definitely not zero shot

  • @adamrak7560
    @adamrak7560 3 years ago

    Their claim is truly massive, and the 19-page article seems very short for supporting it.
    I think somebody will have to try it on much bigger datasets with much bigger networks too. This could be truly significant if it is true, but the article mostly demonstrates it on "toy" examples, so I am not fully convinced.

    • @conduit242
      @conduit242 3 years ago

      Transformers are approximate Turing machines because natural language is Turing complete, so I'm not sure what this paper is proving besides saying that looking at various points in history in combination is valuable to Turing machines. This point is already well understood; there are only so many ways to approximate look-back and fractional windows 🤷🏻‍♂️

    • @adamrak7560
      @adamrak7560 3 years ago

      @@conduit242 They imply that the trained transformers with most of the weight frozen are Universal Turing Machines. That is very different from knowing that the architecture is Turing complete.

    • @conduit242
      @conduit242 3 years ago

      @@adamrak7560 How? ;)

  • @shubhamthapa7586
    @shubhamthapa7586 3 years ago +8

    first

  • @willrazen
    @willrazen 3 years ago

    Neurosymbolic learning ftw

  • @valthorhalldorsson9300
    @valthorhalldorsson9300 3 years ago +6

    Good video! I think they've put their finger on an interesting empirical finding that is worth investigating further. I'm not sure it's worth getting excited about yet, though, considering previous papers showing you can get competitive performance with some architectures by only training the batch norm layers.

  • @piotr780
    @piotr780 3 years ago +1

    OK, but what is your conclusion? Your paper summary is really good, but what is your opinion about the source of the effectiveness of pretrained transformers? (I guess language has the highest entropy, so it is the most challenging modality, and pretraining on it gives better results.)

  • @RandyArdywibowo
    @RandyArdywibowo 3 years ago +3

    Nice overview! Indeed, the layer-norm training somewhat weakens the "Universal Computation Engine" claim. Also, what about ablations against simpler feature extractors like Fourier transforms, DCT, or wavelets replacing the transformer architecture? If the results for these simple feature extractors are only slightly worse, then it's a bit silly to pretrain on a huge language dataset to eke out a few percentage points of performance, don't you think?

    • @conduit242
      @conduit242 3 years ago

      Bingo, enormous value is being hidden in the embedding training and positional encoding. Universal computation shouldn’t require that.
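The simple-feature-extractor ablation proposed in this thread can be sketched with a training-free DCT-II standing in for the frozen transformer; only a readout on the coefficients would then be trained. The DCT is implemented directly here so nothing beyond NumPy is assumed:

```python
# Sketch of the fixed-feature baseline: a deterministic, training-free
# DCT-II as the "frozen backbone". A pure cosine at DCT frequency k
# concentrates its energy in coefficient k.
import numpy as np

def dct2_features(x, k=None):
    """Orthonormal DCT-II of a 1-D signal; keep the first k coefficients."""
    n = len(x)
    idx = np.arange(n)
    # basis[row k, col m] = cos(pi * (m + 0.5) * k / n)
    basis = np.cos(np.pi * (idx[None, :] + 0.5) * idx[:, None] / n)
    coeffs = basis @ x * np.sqrt(2.0 / n)
    coeffs[0] /= np.sqrt(2.0)
    return coeffs[: (k or n)]

# A low-frequency cosine lands almost entirely in one coefficient.
t = np.arange(16)
sig = np.cos(np.pi * (t + 0.5) * 3 / 16)
c = dct2_features(sig)
print(int(np.argmax(np.abs(c))))  # → 3
```

A fair ablation would feed these coefficients to the same linear readout used for the frozen transformer and compare accuracies task by task.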

  • @codeWithBatta
    @codeWithBatta 3 years ago +2

    Hey Yannic, how many books do you read per week, or at least give me some idea (of which genre)? I'm getting a bit competitive with you :). Thanks for replying. Btw, do you know someone doing similar stuff to you on YouTube?

  • @dianchen9083
    @dianchen9083 3 years ago

    Is there a paper showing that "training only the batch norm layers of a randomly initialized CNN gives non-trivial results", which Yannic mentions at around 15:45? Can someone please tell me where I could find a reference for that conclusion? Thanks!

  • @swordwaker7749
    @swordwaker7749 3 years ago

    Hard pre-training task? What about programming language execution? Make a simple instruction set and let the model predict the result.

  • @andres_pq
    @andres_pq 3 years ago

    Thinking a lot about those computational primitives: perhaps they are the ones that do the information routing altogether. It has been somewhat shown that self-attention by itself is not useful; it needs skip connections and MLPs to make a Transformer a superior model. Perhaps there can be another architecture that encompasses those primitive computations.

  • @nathancooper1001
    @nathancooper1001 3 years ago

    I totally agree with some of your intuition as to what these models are learning. However, I'm betting there is a huge overlap between "natural signals" and "computational primitives", especially if you are on the team of the universe itself being computable, since then the natural signal may just be a higher-level computation.

  • @BBorn223
    @BBorn223 3 years ago

    Wow, thanks for sharing.
    I have fine-tuned a Hugging Face model for an NLP task. Watching this, I think that even if I want to classify other languages, I could just use a GPT-2 model and fine-tune only those layer norms.

  • @normalchannel4747
    @normalchannel4747 2 years ago

    This is a hell of a paper. Thank you Yannic for sharing this work

  • @larrybird3729
    @larrybird3729 3 years ago

    if an object has all of its components replaced does it remain fundamentally the same?🤯

  • @techma82
    @techma82 3 years ago

    Loved this one!

  • @MrMIB983
    @MrMIB983 3 years ago

    Amazing