OUTLINE:
0:00 - Intro & Overview
1:10 - Sponsor Spot: Weights & Biases
3:35 - Problem Statement
8:00 - Continuous Attention Mechanism
16:25 - Unbounded Memory via concatenation & contraction
18:05 - Does this make sense?
20:25 - How the Long-Term Memory is used in an attention layer
27:40 - Entire Architecture Recap
29:30 - Sticky Memories by Importance Sampling
31:25 - Commentary: Pros and cons of using heuristics
32:30 - Experiments & Results
Cool fact: The infinite in Infinity Former stands for the number of papers about transformer variations that get published!
Seriously. But I cannot complain; NLP has progressed drastically thanks to the popularity of the Transformer / GPT. A few years back, NLP was slow, tedious, and had a different architecture for each sub-specialized problem.
I thought the infinite stood for the number of variations of attention we have come up with, which aren't really even attention, but we call them attention because it's cool. (Like seriously, why the hell do we call modern "Gated Linear Units" an attention mechanism? It seems anything that involves a sigmoid or softmax applied to some vector and then multiplied by some other vector is called attention these days.)
Learning is compression
17:43 I don't quite agree with this; the basis functions don't seem to be biased towards storing more recent information, it seems like the precision you get is uniform across the entire time domain.
20:00 It might be that the network learns to map embeddings in a way that helps the interpolation perform better, but I think the most important question is how well this approach scales compared to storing the discrete representations themselves, and whether tasks like NLP benefit from it more than general time-series prediction.
I think it would be cool to see a learned interpolator too, although the performance might be very bad 🤔
Exactly, this parameter tau gives you the control needed to weight newly learnt and already known information equally.
Can I ask you for feedback on my master's thesis?
You could argue that when training the model, if the embedding space is learned, the model would learn to map the training sequences to an embedding space where they can be represented with continuous signals. Also, this might just be a limitation of the choice of basis functions, but there is no reason why you couldn't have different types of basis functions in there (sawtooth, triangle, square, Gaussians, etc.) at the same time.
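To make the mechanism this thread is debating concrete, here is a minimal sketch of compressing a sequence of hidden states into a fixed-size "continuous" memory via ridge regression onto Gaussian RBF basis functions. It is written from the description in the video rather than from the authors' code; the basis count, sigma and ridge strength are arbitrary illustrative choices, and note the fit is done per feature dimension with the same design matrix.

```python
# Minimal sketch (not the paper's code): compress a length-L sequence of d-dimensional
# hidden states into num_basis coefficient vectors by ridge regression onto Gaussian RBFs
# over positions mapped to [0, 1]. num_basis, sigma and ridge are illustrative choices.
import numpy as np

def compress_to_rbf(X, num_basis=32, sigma=0.05, ridge=1e-3):
    """X: (L, d) sequence of hidden states -> (num_basis, d) coefficients B."""
    L, d = X.shape
    t = np.linspace(0.0, 1.0, L)                      # token positions mapped to [0, 1]
    mu = np.linspace(0.0, 1.0, num_basis)             # RBF centres
    Phi = np.exp(-((t[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))  # (L, num_basis)
    # Ridge regression: B = (Phi^T Phi + ridge*I)^-1 Phi^T X, solved for all d dims at once
    A = Phi.T @ Phi + ridge * np.eye(num_basis)
    B = np.linalg.solve(A, Phi.T @ X)                 # (num_basis, d)
    return B, mu

def read_memory(B, mu, t_query, sigma=0.05):
    """Evaluate the continuous signal at arbitrary positions t_query in [0, 1]."""
    Phi_q = np.exp(-((np.asarray(t_query)[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))
    return Phi_q @ B                                   # (len(t_query), d)

X = np.random.randn(512, 64)                           # toy "past" hidden states
B, mu = compress_to_rbf(X)                             # memory is now a fixed 32 x 64 block
X_hat = read_memory(B, mu, np.linspace(0, 1, 512))
print(np.mean((X - X_hat) ** 2))                       # reconstruction error of the lossy memory
```

Nothing in this sketch forces the columns of Phi to be Gaussians; as the comment above says, you could mix in other basis shapes, as long as the design matrix stays fixed so the closed-form ridge solution still applies.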
I think this model would work really well in reinforcement learning, so I'm really curious how well it performs in scenarios that weren't described in the paper. I'd love to see you throw some problems at it and see how well it works.
During my PhD studies I came across the little-known and sometimes surprising dark-magic-sorcery fact that Markov chains can have memory and, even more, that it can be infinite. I was wondering when this would reach the AI niche. I am a believer that Information Theory can synergise pretty well with DL.
Markov chains have very strong assumptions. Namely, that the next step depends only on the current step.
I forget which other transformer paper you did, but it brought up the idea of why not use Fourier transforms to define attention. The idea at that point being that it's not the exact form of the attention that matters, since the learning modulates it, but just some mixing in general.
This one gets me thinking: if we want to dabble in heuristic compression hell for a good tradeoff against incalculable backprop, why not use Fourier for the long-term signal memory (instead of RBFs) and also for the attention learner (instead of whatever du jour). Like, signals are all waves anyway; tokens are projections of whatever the "upstream" process was that generated them. It's not too crazy to think that the compression lossiness against tokens might actually overlap well with the generating function making the tokens, or at least with the relevant features of it that you're trying to learn anyway.
Promise I'm not trying to superficially conflate two areas here where "hey look, it's a curvy continuous thing". More a remark on the artificiality of tokens as learnable data.
I guess another thing you could say here is: if not Fourier or another good jack-of-all-trades compression system, what gimmick is supposed to work best? It can't be that we're just hunting for the right gimmick that runs well on our chips.
Forgot to say, totally agree, super shady to call it infinite with such a modest performance bump of a "new architecture". And skeezy outsourcing to the heuristic.
Infinite Memory Transformer -> Informer
Yeah, it would have been a great name for it, but someone already stole the name for a transformer architecture for time series forecasting, published in early 2021.
Feels like the compressing & appending of long-term memory in figure 2 should be applied to the attention.
Why does this paper feel like a 'Perceiver IO, language modelling and the Fourier transform walk into a bar' thing?😛
Though, great video, once again!😄
Can't wait for somebody to come up with the idea to "just learn the attention functions" by using an entire fully connected DNN dedicated to just the attention mechanism to encode arbitrary functions - and then for somebody to switch that out with first a CNN and then a transformer to get a second order attention-attention
... and then just continue from there. Construct the infinite tower of attentions all the way down
Also: try the same but in Fourier space!
More seriously though, the smoothing could work out if you allow, like, deeper lookups. I'm imagining something like the network going "oh, I remember there was something relevant roughly in the first quarter of the book", at which point this information can be unpacked and attended to at higher resolution. A sort of binary-search-ish thing could be possible that way, assuming you can't just hold everything in RAM, but you *can* have access to it on your hard drive and it might be fast enough to retrieve that way.
In that case, smoothing the signal first might make sense. If you can somehow reconstruct the deeper attention once you see what actually lies there again.
Before he said they were using RBFs, I just assumed it would be using Fourier series. I wonder if you could reorder the dimensions of the embeddings to try to get rid of higher-frequency components, basically looking at all of the tokens in your dataset (or some subset) and seeing which dimensions correlate the most.
@@Virsconte This seems to make good intuitive sense as well as being attractive from the engineering perspective. Like, you don't need to remember the exact sentence structure of a story years later, just the major nouns and verbs, and hence a simplified but effective description of events. It's another tradeoff for compression, but seems reasonable.
MLFlow is a good alternative to weights & biases in my opinion :)
Yannic, but you can fit the infinite Universe in your finite head. Even if you have forgotten something, you can read the appropriate paper, or talk to a specialist, or expand your brain with new modules yourself with your own money which you yourself have earned. So we need a model that can read and talk and make money and expand itself with its own money. Positive cash flow.
18:47 The reason (why a learned embedding could be modeled to be continuous) could be that language "feels" mostly fluid and continuous - or wouldn't you say?
I don't think that language either is or feels continuous at all. I think the reason this approximation works here is that the feature width is large enough that there is enough redundancy in the hidden representations that the model can reconstruct some of the lost information when attending to/reading the long-term memory.
Also note how the compression is applied to each feature dimension separately. For any element of the original sequence, the lossiness of the compression might have been particularly unfavorable in one dimension but particularly gentle in another. In total, even with the lossiness and the seemingly arbitrary assumption of continuity, after the compression there's still enough information left for the model to extract relevant features, and that only gets better the larger you make the feature width.
I prefer calling it "nifty-former"
I think the next team who writes a transformer paper needs to donate a dollar to the swear jar.
Rather than using a decomposition based on RBFs, they could have used a more classic and richer wavelet decomposition. The structure of the wavelet output should also make more sense to the network, imho.
Can I ask you for feedback on my master's thesis? 🧐
nice format!
I can't believe no-one spotted the typo in equation 8.
Who or what is the name in the last sentence of the video? I understand "lucid rains" (and so do the automatic subtitles).
I think you meant to say: "brings about the same problems as LSTMs, namely you get angry posts from Schmidhuber".
Basically quantization.
"Informer" would have been the best choice of name IMO 🙂
Hi, I can learn a lot from you, you are great, thanks, have a wonderful day.
Hey Yannic, your videos are awesome! Do you know if you'll be doing one on the AlphaFold 2 paper?
Thank you for this.
It is weird to limit it to Gaussians if you make this model into a BERT-type model, as the BERT attention for the CLS token will often attend to every SEP token...
Why not treat the problem of compressing the embedding vectors as a learnable task, or at least show that their choice (ridge regression using RBFs) is superior to other lossy compression heuristics? Seems arbitrary.
The problem with learning compression is that now you have to learn through time, and you run into the classic problems the LSTMs had to solve, like vanishing/exploding gradients, not to mention that you can't parallelize training anymore. But yeah, I agree their choices seem arbitrary and not justified (as someone who hasn't read the paper myself).
@@bigbuckey7687 > The problem with learning compression is now you have to learn through time
I don't think that's true in this case: you only need to learn something that's good at interpolating points; there's no variation in the training regime that would be caused by the amount of tokens.
@@mgostIH Well, we have plenty of algorithms to simply interpolate points; there's no need to learn that. But if we did make some custom interpolation that was learned, then since this architecture samples the past interpolations, the loss would depend on those previous iterations, a.k.a. learning through time. This is the same idea as what an LSTM does (with major differences, of course). Not sure what you mean by "there's no variation in the training regime that would be caused by the amount of tokens".
@@bigbuckey7687
> Well we have plenty of algorithms to simply interpolate points, there's no need to learn that.
A lot of algorithms for interpolation make specific assumptions about the data. Given that there's a strong link between compression and intelligence (Marcus Hutter), I would start thinking much more about new neural approaches for this sort of problem.
> since this architecture samples the past interpolations the loss would depend on those previous iterations, a.k.a learning through time
But there are no "previous iterations" in this case; when training models like these you only need a single forward pass to give all the tokens in a sentence their continuous representation and then interpolate between them.
> Not sure what you mean by "there's no variation in the training regime that would be caused by the amount of tokens".
An example would be training a SIREN network to fit the points you are given: their amount doesn't change anything about how you train the network, but you still get out an interpolation that doesn't depend on the amount of tokens you had.
@@mgostIH Ah, ok, now I get your point. As long as the learned interpolation only depends on the input and not on hidden states updated through time, the iterations don't depend on each other. Thanks for the clarification.
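To illustrate the point this thread converges on, here is a rough SIREN-style sketch of a learned interpolator: a small coordinate network is fitted to (position, embedding) pairs with plain regression, so the fit depends only on the given points and no hidden state is carried through time. The layer widths, the sine frequency and the training loop length are arbitrary illustrative choices, and the data is random, so this is not meant to show quality, only the training structure.

```python
# Rough sketch of a "learned interpolator": a SIREN-style coordinate network fitted to
# (position -> embedding) pairs. One forward pass per step, no recurrence through time;
# the number of tokens only changes the batch of points, not the training procedure.
import torch
import torch.nn as nn

class Sine(nn.Module):
    def __init__(self, omega=30.0):
        super().__init__()
        self.omega = omega
    def forward(self, x):
        return torch.sin(self.omega * x)

siren = nn.Sequential(
    nn.Linear(1, 64), Sine(),
    nn.Linear(64, 64), Sine(),
    nn.Linear(64, 16),                       # 16-dim toy "embeddings"
)

t = torch.linspace(0, 1, 200).unsqueeze(1)   # positions of the tokens
x = torch.randn(200, 16)                     # their embeddings (toy data)

opt = torch.optim.Adam(siren.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    loss = ((siren(t) - x) ** 2).mean()      # plain regression on the given points
    loss.backward()
    opt.step()

# Query the continuous representation anywhere, independent of how many tokens were fitted.
print(siren(torch.tensor([[0.123]])).shape)  # torch.Size([1, 16])
```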
I don't know if you did it intentionally or not, but the image for the problem statement chapter is the Weights & Biases ad.
Thanks - good work
I like the sound of the helicopter
Man, I love your videos
keep up the good work!
I think it should be called "Continuous Attention is All You Need".
I think the best transformer was the stack former or add former (where you can stack layers without increasing complexity quadratically).
Which papers exactly are you referring to? I can't find them by googling addformer or stackformer.
@@__E__ I am sorry, I have a habit of modifying names (as they sound to me for fun). I actually meant fastformer (the additive attention paper). "Additive attention CAN BE all you need".
@@jawadmansoor6064 It's alright man :)
But I'm not sure I got this paper right, because afaik it's not about stacking layers but rather about having a global key and a global query per layer; the quadratic stuff happens inside a layer, not because you stack them.
@@__E__ Transformers (ordinarily) are expensive due to quadratic complexity. If you stack layers in them, the big-O would still be quadratic (that being the most expensive operation); however, this paper suggests that you only compute K, V and Q once (Q once, K twice and V thrice, I forgot which was which), hence the maximum complexity depends only linearly on the number of tokens.
What I mean by a layer is that all the operations from the "global" key/query computation form one layer, and you can stack the same operation on top of it. (It is difficult to explain in a few words; better yet, please refer to the video by Yannic Kilcher.) If you still don't get it after watching the video, then do ask again and I will write an explanation of it in a few days (motivated by you, though I will make it public), insha ALLAH.
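For reference, here is a simplified single-head sketch of the additive-attention (Fastformer) idea being described, only to show why the cost is linear in the number of tokens: instead of an L x L attention matrix, each token interacts with one pooled global query and one pooled global key. The real model uses multiple heads and a few more linear transformations than this, so treat it as an approximation of the structure, not a faithful reimplementation.

```python
# Simplified single-head sketch of additive attention (Fastformer-style), O(L*d) in
# sequence length: tokens are pooled into a global query, which gates the keys; the
# gated keys are pooled into a global key, which gates the values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.wq = nn.Linear(d, 1)   # scores tokens to pool the global query
        self.wk = nn.Linear(d, 1)   # scores tokens to pool the global key
        self.out = nn.Linear(d, d)

    def forward(self, x):                      # x: (L, d)
        q, k, v = self.q(x), self.k(x), self.v(x)
        a = F.softmax(self.wq(q), dim=0)        # (L, 1) attention weights over tokens
        g_q = (a * q).sum(dim=0)                # (d,)  global query
        p = g_q * k                             # (L, d) element-wise interaction
        b = F.softmax(self.wk(p), dim=0)        # (L, 1)
        g_k = (b * p).sum(dim=0)                # (d,)  global key
        u = g_k * v                             # (L, d)
        return self.out(u) + q                  # residual with the queries

x = torch.randn(1024, 64)
print(AdditiveAttention(64)(x).shape)           # torch.Size([1024, 64])
```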
I was wondering when such an architecture would emerge. Back with the release of AI Dungeon I was like: what if we compressed the previous half of the working memory into a summary, such that the sliding working memory retains more information for continuity? It's a function the language model could already do back then.
Hey, can you clear up my confusion as to why the mathematical model of an artificial neuron is: the input data x is subject to an affine transformation defined by W, followed by a non-linear transformation, i.e. nonlinear_fun(weights*inputs + bias)? Why not the other way around: a non-linear transformation on the input and then an affine transformation on the transformed input, i.e. nonlinear_fun(inputs)*weights + bias?
And also, the mathematical model of an artificial neural net is written as weights*nonlinear_fun(weights*inputs + bias) + bias. Um, isn't that the output of the artificial neural net? Then shouldn't it be nonlinear_fun(weights*nonlinear_fun(weights*inputs + bias) + bias), or is it because the activation function of the output neuron is linear?
EDIT: I mean, shouldn't the mathematical model of a single neuron and of an artificial neural net be the same?
For most problems you do "nonlinear_fun(weights*nonlinear_fun(weights*inputs+bias)+bias)". But for some problems, when the output is a linear regression, then the last nonlinear_fun is the identity function, so you can omit it.
@@rpcruz Thanks, but about the first point, do you have any thoughts on that? I.e. why is the mathematical model of an artificial neuron such that the input data x is subject to an affine transformation defined by W, followed by a non-linear transformation, i.e. nonlinear_fun(weights*inputs + bias), and not the other way around: a non-linear transformation on the input and then an affine transformation on the transformed input, i.e. nonlinear_fun(inputs)*weights + bias?
@@Anujkumar-my1wi So, take the example of a fully connected feed-forward network. The way this is usually done, you apply a matrix to the input vector and then apply the non-linearity to each coordinate separately.
If you instead mean to apply the non-linearity to each coordinate separately first, then the non-linearity doesn't get to use the mixing from the different inputs.
If you add more layers, then the two should be equivalent, but this is basically just the same as “apply this non-linearity at the start of your network, and then continue as normal”,
But that’s kinda pointless I think?
I am not experienced in ML so take this with a grain of salt
@@drdca8263 Thanks
Doing a non-linear operation first will lose information about the inputs: for example, if your inputs are [1., -1.] and you apply a ReLU before doing any matrix multiplication, you will lose the -1, as it'll become a 0.
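A tiny numeric illustration of this point (toy numbers, not anything from the video): with the affine map first, the weight matrix can still see the -1; with the ReLU first, that information is zeroed out before the weights ever act on it.

```python
# Toy example: applying the non-linearity before the affine map destroys information
# that the affine map could otherwise have used.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
x = np.array([1.0, -1.0])
W = np.array([[1.0, 1.0],      # this row computes x1 + x2
              [1.0, -1.0]])    # this row computes x1 - x2
b = np.zeros(2)

print(relu(W @ x + b))   # affine first:         [0. 2.] -- the -1 still influenced the output
print(W @ relu(x) + b)   # non-linearity first:  [1. 1.] -- the -1 became 0 and is gone
```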
Best W&B ad ever
sorry, what was the reference at 36:28 where the implementation will be available?
Also want to know this.
Why are you recording yourself on a roof top? :) Love the location.
He's trying to go to infinity and beyond; he's reforming his thinking and having a watershed moment, bro! Or he's just scared, we all need a rooftop moment.
Amazing
This paper reminds me of VQ-VAE
Nice
This is one of the xformers I'm gonna forget. Continuous representation is welcome but I don't think they justify their choice enough. The model feels too unjustifiably handcrafted.
W&B gang
Cool idea. Really misleading title.
Hello there!!
Nice