8:00 "It's just MLP" yeah, I felt bamboolzed. MLP originally was upscale, relu, downscale; later got beautiful gradient-bearing curves of ReLU replacement and additional multiplication(eg GeGLU instead of GELU). Here during initial explanation I felt that up proj and down proj got renamed to K, V. It doesn't feel like cross attention at all as their K V don't depend on any part of the input. 23:01 It may be because they moved lots parms around: finetuning MLP layers works so much worse than finetuning QKVO. By having MLP inside they have more parms to play a role on multi head attention which is still used. It also might be training. Their results at image classification feel feel more revealing than NLP tasks: dataset is way more fixed than "pick your favorite part of pile, filter bad parts" and models have the same weight class, and there is no different tokenizers(they use same with pythia so comparing ppl against it doesn't raise eyebrows) In ideal world they would also train at least pythia 160M from scratch on the same training data, but considering it's "up to 300B tokens", it's not exactly surprising they didn't. Also in ideal world they would put in ablation study "hey, let's not add non-linearity to Q K V proj and instead of new weights for 3 P-attentions we simply make dimension of heads bigger keeping around the same parm count" Also Pythia has an outdated MLP architecture: it use gelu, not geglu. P attention doesn't use GeGLU as well, but their architecture replaces MLP which already is replaced in modern LLMs. Speaking of their results in images: Their winning model in table 2 image classification has more parameters that ViT. If it's excluded, their small 86M model loses to another 86M model, though their 307M model still wins. (Also I remember trying somewhat similar but more naive idea on imdb dataset: replace Q,K,Vs proj with self attentions. Idea was walk toward super hierarchical nested attention: if you replace one Q projection with attention, you can go deeper and replace projection inside nested self-attn of Q with another attention and once walking over this inception is done the initial layer knows so well what token it has to attend to other token. Worked awful. Results were worse, number of parms went through the roof(as single proj got replaced with several, though will not be surprised if better hyper parms can make it at least bad rather than awful)
Oh, they actually did train transformer from scratch: "Table 6 compares Transformer models trained from scratch and Tokenformer trained by parameter reusing with varying amounts of seen tokens during training. It is evident that Transformers trained with the same number of seen tokens do not reach the performance level of Tokenformer with parameter reusing" Even if modern MLP would improve transformer results, training costs still seems to be much better which is really impressive
I think most of the results only look impressive because they slightly scale the model above the baseline. From the experiments I've done, it seems like the activation they develop is worse than normal GeLU and moving params from the MLP into the Q K V O projections performs the same given the same # of params.
Pattention is mathematically the same as an MLP tho. Pattention has the following formulation: O = f(X K^T) V and a two layer MLP with an intermediate activation looks like the following: O = f(X W_1) W_2 Since we are doing an inner product with the input and keys, "token interaction" is just a function applied on the input space R^d, of which there are n tokens, or rather n functions, if W_1 = K^T is of shape (d x n). Basically cross attention with learnable K,V is an MLP where the up projection is renamed to K, the down projection is renamed to V, and the activation is changed from GeLU to GeLU(Norm())
Really appreciate you explaining papers like this, please keep them coming
8:00 "It's just MLP" yeah, I felt bamboolzed. MLP originally was upscale, relu, downscale; later got beautiful gradient-bearing curves of ReLU replacement and additional multiplication(eg GeGLU instead of GELU). Here during initial explanation I felt that up proj and down proj got renamed to K, V.
It doesn't feel like cross attention at all, as their K and V don't depend on any part of the input.
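For reference, what I mean by that MLP evolution, as a minimal PyTorch sketch (dims and names are made up, not from any paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, d_ff = 512, 2048  # made-up model / hidden dims

class ClassicMLP(nn.Module):
    """Original flavor: upscale -> ReLU -> downscale."""
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(d, d_ff)
        self.down = nn.Linear(d_ff, d)

    def forward(self, x):
        return self.down(F.relu(self.up(x)))

class GeGLUMLP(nn.Module):
    """Modern flavor: smooth GELU gate times an extra linear, then downscale."""
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(d, d_ff)
        self.gate = nn.Linear(d, d_ff)
        self.down = nn.Linear(d_ff, d)

    def forward(self, x):
        return self.down(F.gelu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, d)  # (batch, tokens, dim)
print(ClassicMLP()(x).shape, GeGLUMLP()(x).shape)
```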
23:01 It may be because they moved lots of params around: finetuning MLP layers works so much worse than finetuning QKVO. By having the MLP inside, they have more params playing a role in the multi-head attention, which is still used.
It also might be the training. Their image classification results feel more revealing than the NLP tasks: the dataset is far more fixed than "pick your favorite part of the Pile, filter out the bad parts", the models are in the same weight class, and there are no differing tokenizers (they use the same one as Pythia, so comparing perplexity against it doesn't raise eyebrows).
In an ideal world they would also train at least Pythia 160M from scratch on the same training data, but considering it's "up to 300B tokens", it's not exactly surprising they didn't. Also, in an ideal world they would include an ablation study: "hey, let's not add a non-linearity to the Q, K, V projections, and instead of new weights for 3 Pattentions we simply make the head dimension bigger, keeping roughly the same param count."
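A back-of-the-envelope version of that second ablation, just to show the param matching I have in mind (the number of parameter tokens n_p is my assumption, not the paper's config):

```python
d = 768      # model dim, roughly Pythia-160M scale
n_p = 768    # assumed number of learnable key/value parameter tokens per Pattention layer

# One projection done as Pattention: keys (n_p x d) plus values (n_p x d)
pattention_proj = 2 * n_p * d

# One plain linear projection with its output (total head) dim scaled by s
def linear_proj(s=1.0):
    return d * int(s * d)

# Extra params from replacing the 3 projections (Q, K, V) with Pattention
extra = 3 * (pattention_proj - linear_proj())

# Head-dim scale factor that would spend the same extra budget on wider plain projections
s = 1.0 + extra / (3 * d * d)
print(f"extra params: {extra / 1e6:.1f}M, equivalent head-dim scale: {s:.2f}x")
```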
Also, Pythia has an outdated MLP architecture: it uses GELU, not GeGLU. Pattention doesn't use GeGLU either, but their architecture replaces an MLP that has already been replaced in modern LLMs.
Speaking of their image results: their winning model in the Table 2 image classification has more parameters than ViT. If it's excluded, their small 86M model loses to the other 86M model, though their 307M model still wins.
(Also, I remember trying a somewhat similar but more naive idea on the IMDB dataset: replace the Q, K, V projections with self-attentions. The idea was to walk toward super-hierarchical nested attention: if you replace one Q projection with attention, you can go deeper and replace a projection inside the nested self-attn of Q with another attention, and once the walk through this inception is done, the initial layer knows very well which tokens it should attend to.
It worked awfully. Results were worse and the number of params went through the roof, as a single proj got replaced with several, though I won't be surprised if better hyperparams could make it at least bad rather than awful.)
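To make the picture concrete, the shape of it was roughly this (a simplified sketch, not my actual code, with made-up dims):

```python
import torch
import torch.nn as nn

d, heads = 256, 4  # made-up dims

class NestedQAttention(nn.Module):
    """Attention where the Q 'projection' is itself a small self-attention
    instead of a single linear layer (and the nesting can recurse further)."""
    def __init__(self):
        super().__init__()
        self.inner_q = nn.MultiheadAttention(d, heads, batch_first=True)  # stands in for nn.Linear(d, d)
        self.outer = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):
        q, _ = self.inner_q(x, x, x)   # a whole attention block just to form the queries
        out, _ = self.outer(q, x, x)   # outer attention consumes them as its query input
        return out

x = torch.randn(2, 32, d)
print(NestedQAttention()(x).shape)
# one nn.Linear(d, d) is ~d*d params; the inner attention alone is ~4*d*d, hence the param blow-up
```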
Oh, they actually did train a transformer from scratch: "Table 6 compares Transformer models trained from scratch and Tokenformer trained by parameter reusing with varying amounts of seen tokens during training. It is evident that Transformers trained with the same number of seen tokens do not reach the performance level of Tokenformer with parameter reusing"
Even if a modern MLP would improve the transformer's results, the training cost still seems to be much better, which is really impressive.
I think most of the results only look impressive because they slightly scale the model above the baseline. From the experiments I've done, it seems like the activation they develop is worse than plain GELU, and moving params from the MLP into the Q K V O projections performs the same given the same # of params.
Can I ask what tablet/app you use to produce these videos? Thank you!
Thanks
I think in pattention (cross-attention), the tokens can kinda communicate with each other, so it does not act like an MLP layer.
Pattention is mathematically the same as an MLP tho. Pattention has the following formulation:
O = f(X K^T) V
and a two layer MLP with an intermediate activation looks like the following:
O = f(X W_1) W_2
Since we are doing an inner product between the input and the keys, "token interaction" is just a set of functions applied on the input space R^d: there are n key tokens, so n such functions, if W_1 = K^T has shape (d x n). Basically, cross attention with learnable K, V is an MLP where the up projection is renamed to K, the down projection is renamed to V, and the activation is changed from GELU to GELU(Norm()).
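A quick toy check in PyTorch (my own sketch; I'm plugging in plain GELU for f and skipping their modified-softmax normalization):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n, seq = 8, 32, 5                 # model dim, # of learnable param tokens, sequence length
X = torch.randn(seq, d)              # input tokens
K = torch.randn(n, d)                # learnable "key" parameters
V = torch.randn(n, d)                # learnable "value" parameters

pattention = F.gelu(X @ K.T) @ V     # O = f(X K^T) V

W1, W2 = K.T, V                      # rename: up-projection W_1 = K^T, down-projection W_2 = V
mlp = F.gelu(X @ W1) @ W2            # O = f(X W_1) W_2

print(torch.allclose(pattention, mlp))  # True: same computation, different names
```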
Diffusion forcing? Please