ERRATA: Sometimes I say "Switch Transformer" instead of "Feedback Transformer". Forgive me :)
You have switching on the brain😆
And rightly so.
With ReLU nets the switching happens at zero and a gradual change in input never causes a discontinuous change in output. It seems nets where switching causes discontinuity can still be trained, but I can't study everything.
Anyone who watches these videos can, I think, understand at least this much.
Such an easy explanation... you're just awesome
Off topic: That batch training even works suggests that training algorithms only ever search the space of statistical solutions, where no one neuron can be exceptional. Neurons must then work in diffuse statistical groups that are more resistant to damage when moving from batch to batch.
(B) Dot products are statistical summary measures and filters. A net would have to be seriously sparsified for that not to apply.
(C) There is a type of net based on neurons that have forced +c or -c weighted connections to every neuron in the prior layer. Only statistical solutions exist in that case, yet the nets work well. (They use parametric activation functions as the adjustable components.)
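A minimal sketch of what such a net might look like, with PReLU standing in for the parametric activation and all sizes and constants arbitrary (this is my own illustration, not something from the comment):

```python
# Sketch: a layer whose weights are a fixed random +/-c sign pattern (never trained),
# with a learnable parametric activation (PReLU here) as the only adjustable component.
import torch
import torch.nn as nn

class FixedSignLayer(nn.Module):
    def __init__(self, in_features, out_features, c=0.1):
        super().__init__()
        # Every connection is forced to +c or -c and kept frozen.
        signs = torch.randint(0, 2, (out_features, in_features)) * 2 - 1
        self.register_buffer("weight", c * signs.float())
        # The trainable part: one activation slope per output unit.
        self.act = nn.PReLU(num_parameters=out_features)

    def forward(self, x):
        return self.act(x @ self.weight.t())

net = nn.Sequential(FixedSignLayer(64, 128), FixedSignLayer(128, 10))
print(net(torch.randn(4, 64)).shape)  # torch.Size([4, 10])
```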
Great video - thanks!
1:44 Bold of you to assume that I have friends. * cries in a corner *
.
23:26 Kinda reminds me of the scratchpad idea which some Neural Turing Machine papers used to have. Any node can write to its scratchpad and any node can read any scratchpad.
.
29:40 The whole setup reminds me of the full adder from my Electronics 101 course. If I remember correctly, there was also a parallel adder which allowed for some parallelism before ultimately synchronizing at the end. I wonder if we can try something similar to get some speedup.
.
32:00 Yeah, this is very similar to an attention RNN. We are going full circle. :P
.
This looks quite useful for use cases where we are a bit more data constrained.
How long do you practice pronouncing the names before you record the video? ^^
The Swiss are often fluent in French
I take it as a compliment ;)
@@YannicKilcher curious to hear you pronounce African names :D
Very useful explanation. Thank you
This is so extremely well explained, I'm so grateful for this content!
Do you have a Discord or something else where one can interact with you about deep learning?
Sure, there's a discord link in the description of the video
I am hoping this can help improve long-term patterns in music generation, such as from Jukebox
"this is an RNN with an attention mechanism.." don't say that to Schmidhuber :P
Transformer: let's get rid of recurrent connections so we can train very fast!
Feedback Transformer: why don't we add recurrent connections back to transformers so we can do better reasoning!
How do they fare against Performers, speed- and accuracy-wise?
I have no idea, but Performers are probably faster and this is more accurate
Transformers are just becoming more like the idea of neural population coding or reservoir computing. The next step is going to be randomly sampling the connections rather than connecting densely to everything. If you agree - let's collab, if you disagree - let's discuss.
One of the main advantages of transformers over RNNs with attention was the possibility of parallel and effective training. Feedback transformers are a nice model and they are more powerful, but I will have to read the paper to learn how effective/practical they are from a training perspective. BTW, I think "unbounded reasoning depth" is a bit too bold. Also, even vanilla transformers have, per token, the feed-forward "MLP burger" layers (e.g. projecting up to 2x or 4x d_model and then reducing back to d_model) with multiple non-linearities... just to say that some form of general (in the sense of not linearly separable) computation can already be performed even by a single transformer layer.
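For reference, a minimal sketch of that per-token feed-forward ("MLP burger") block; the 4x expansion and ReLU are common choices, not something specified in the comment:

```python
# Position-wise feed-forward block: project up, apply a nonlinearity, project back
# down to d_model. Applied independently at every token position.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, expansion=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),  # up-projection (2x-4x d_model)
            nn.ReLU(),                                # non-linearity
            nn.Linear(expansion * d_model, d_model),  # back down to d_model
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return self.net(x)

ff = FeedForward()
print(ff(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```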
22:50 Why not use the stored memory data only when a prediction is uncertain? That way it can train in parallel but also check against itself whenever it doesn't know for sure.
Good suggestion!
Thank you, Yannic, for the nice explanation! In the case of a language translation task with an encoder-decoder transformer, are these memory units added only in the decoder, or in the encoder as well? In the case where we add memory in both, the keys and values we feed into the second multi-head attention layer of the decoder should be the same ones as in the encoder, since the authors propose to compute them only once. Is this correct?
In general, the decoder has full access to the encoder states, and no causal masking is applied in the encoder, so I'd say this is only a problem in the decoder.
34:00 I think, this being a transformer, it has attention at every layer as well, along with the attention to the memory. It would be more like an RNN with attention at each layer plus memory. Right? Oh, you just said that XD
Love your work, I cannot express in words how informative every video has been. Can you (or anyone) point me to a resource that elaborates further on your statement @11:25 "a single neural network layer can only do Linear Operations" ? Thanks and keep up the great work!
A single layer has the formula z = Wx, where z is the hidden layer and x is the input layer. Without other layers and without a nonlinearity (z = Nonlinearity(Wx)), z is linearly dependent on x, which is the most you can get from one layer.
If you have two layers with a nonlinearity in between, z = Nonlinearity(W_1 x), v = W_2 z, you can perform non-linear operations such as XOR.
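A quick sketch illustrating the point (hyperparameters chosen arbitrarily): a purely linear model cannot fit XOR, while two layers with a nonlinearity in between can.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# XOR dataset
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

linear = nn.Linear(2, 1)                                          # z = Wx + b only
mlp = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1))  # nonlinearity in between

for model in (linear, mlp):
    opt = torch.optim.Adam(model.parameters(), lr=0.1)
    for _ in range(2000):
        opt.zero_grad()
        loss = F.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    print(type(model).__name__, "final loss:", loss.item())

# The linear model stays stuck near 0.25 (it can only predict ~0.5 everywhere);
# the two-layer model typically drives the loss close to zero.
```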
What siyn007 says :)
I removed my like so that I can like it again.
31:38 you said "Switch Transformer" :)
also at 28:15
Crap 😁 thanks
However, a large single-layer net can be used as an external memory bank for a main NN. You have to set things up to meet the mathematical requirements of the dot product: the variance equation for linear combinations of random variables tells you the dot product rather prefers non-sparse inputs when used as a memory, and used under capacity it can even provide error correction (variance reduction). A suitable scheme is a vector-to-vector random projection, bipolar (+1, -1) binarization, and then a dot product with a weight vector. To train, take the recall error and divide by the number of dimensions, then add or subtract that, as indicated by the +1/-1 binarization, to each weight term to make the error zero.
By the CLT that adds a little Gaussian noise to all the prior stored memories. However, by retraining you can squeeze the noise out completely within capacity, and that is the same as a matrix-inverse solution but online.
Hey, you need quite a good understanding of the dot product to see how that works, which I'm not sure so many NN researchers actually have. 😳 You can think about the case where you replace the locality-sensitive hash (RP + binarization) with a full hash, in combination with the CLT.
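A rough sketch of how I read that scheme (all names and sizes here are my own illustration, not from the comment): random-project the address vector, binarize to ±1, read with a dot product, and write by spreading the recall error evenly over the signed terms.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256                                            # width of the bipolar code
R = rng.standard_normal((dim, dim)) / np.sqrt(dim)   # fixed random projection
W = np.zeros(dim)                                    # weight vector holding the memories

def code(x):
    return np.sign(R @ x)        # random projection + bipolar (+1, -1) binarization

def recall(x):
    return W @ code(x)           # dot-product read-out (a scalar value per weight vector)

def write(x, target):
    global W
    c = code(x)
    err = target - W @ c
    W += c * (err / dim)         # add/subtract err/dim per term as the +/-1 code indicates;
                                 # since c*c = 1 per term, the error for this item becomes zero

# Store a few scalar values under random address vectors; retraining over the set
# squeezes out the mutual interference (Gaussian noise) between stored items.
pairs = [(rng.standard_normal(dim), v) for v in (3.0, -1.5, 0.7)]
for _ in range(20):
    for x, v in pairs:
        write(x, v)
print([round(recall(x), 3) for x, _ in pairs])  # approximately [3.0, -1.5, 0.7]
```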
😱👍🏾
What is the representation power of a transformer?
Why does Facebook invest money into ML tasks such as visual recognition? Or do they just assume that all tasks are so interconnected in principle that by solving one problem they get closer to solving the others which are directly useful, e.g. recommender systems or text analysis?
Visual recognition is directly useful and Facebook uses it for its main platform as well as Portal and probably the upcoming smart glasses.
They do that, too
Facebook owns Oculus
The problem of tracking code relationships between layers was well defined; however, the solution they propose is not satisfying.
Soon: RNN is all you need
I'm wondering what the definition of a "Transformer" really is. My impression was that you need both parallel processing with a position input and an attention mechanism to qualify as a transformer. But this paper really tries to extend the definition to anything with something like an attention mechanism. It sounds to me like they are just trying to up-market some obsolete research by using trendy words.
There is no definition. These aren't hundred-year-old terms; they're newly coined jargon with not even a decade of use. You may as well complain about people naming their models things like BERT and RoBERTa and YOLO9000...
Finally, we'll be able to make a realistic GPT-Girlfriend, it will remember about that one time years ago when we made a compliment to some other girl and use that against us passive aggressively 😎
Imagine using this with a BERT-like model. Why not just replace the CLS token with a memory vector?
I think DALL-E used multiple modalities tokenized as input for one transformer. Why not have a masked vector in the input that acts as memory?
Good idea
@@YannicKilcher I'm a new researcher focusing on this area. Here's to hoping I can publish something like this in 2021.
1st
Loss: 0.9765
@@yabdelm ??!
Loss: 1.0125
@@yabdelm what are you training?