ALiBi - Train Short, Test Long: Attention with linear biases enables input length extrapolation

Comments • 51

  • @YannicKilcher
    @YannicKilcher  3 years ago +5

    OUTLINE:
    0:00 - Intro & Overview
    1:40 - Position Encodings in Transformers
    4:55 - Sinusoidal Position Encodings
    11:50 - ALiBi Position Encodings
    20:50 - How to choose the slope parameter
    23:55 - Experimental Results
    29:10 - Comments & Conclusion

    • @user-zq4oe6ro2w
      @user-zq4oe6ro2w 3 years ago

      NVAX is literally collapsing, how's that for short long

  • @ofirpress
    @ofirpress 3 years ago +33

    Yannic thank you so much for making this amazing summary video!
    Here are a few comments I had:
    1. "There's no reason why this shouldn't work for full (mask-less) self-attention"- correct! I've already heard from people who have gotten this to work for encoder-attention in NMT, and we'll definitely have more on that.
    2. About the different heads having different slopes, I just wanted to add that although I didn't play with this too much, it did seem like when we had multiple heads that used the same slope it would degrade performance. (A small sketch of the slope scheme follows this comment.)
    3. Small note about our baseline for WikiText: it's actually not a model we designed at all, it's just us using the language model from Baevski & Auli's Adaptive Word Embedding paper.
    4. Our ALiBi model actually runs just as fast as the sinusoidal model. Throughout the paper you might see that our model and the sinusoidal one have a tiny difference in training or evaluation speed, but that's just the amount of variance we get on our hardware (so you'll even have the same model showing slightly different speeds sometimes too).
    5. I know our experiments on WikiText-103 seem too good to be true, but we've uploaded our models so anyone can check us! In addition- WikiText-103 is basically a toy dataset at this point, and as we later show in the paper our results on the big dataset are much closer to the sinusoidal model. Since WikiText-103 is so small, a model with a strong inductive bias will achieve amazing results there, but that advantage almost disappears when you train on a much much larger dataset with a much greater compute budget.
    6. Our language modeling results have already been replicated by multiple external groups, and as previously stated, others have managed to make this work for machine translation. Gathering language modeling results for 3 different datasets was a lot of work, and that's why we didn't have experiments on other domains, but now that we have more time we're definitely going to explore different applications, models and domains.
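
    A minimal PyTorch sketch of the slope scheme from point 2 (not the authors' code; helper names and shapes are mine): the per-head slopes form a geometric series, which for 8 heads gives 1/2, 1/4, ..., 1/256, and each head adds a linear penalty proportional to how far back a key is from the query.

    ```python
    import torch

    def alibi_slopes(n_heads: int) -> torch.Tensor:
        # Geometric series of slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8).
        # For 8 heads this gives 1/2, 1/4, ..., 1/256 (assumes n_heads is a power of two).
        start = 2.0 ** (-8.0 / n_heads)
        return torch.tensor([start ** (i + 1) for i in range(n_heads)])

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        # Linear penalty -m * (i - j) for query position i attending to key position j <= i.
        pos = torch.arange(seq_len)
        dist = (pos[:, None] - pos[None, :]).clamp(min=0)     # distance back; future positions get masked anyway
        return -alibi_slopes(n_heads)[:, None, None] * dist   # shape (n_heads, seq_len, seq_len)

    # Usage sketch: scores has shape (batch, n_heads, seq_len, seq_len), i.e. q @ k^T / sqrt(d_head);
    # add the bias before the causal mask and the softmax:
    # scores = scores + alibi_bias(n_heads, seq_len)
    ```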

    • @hope_pead
      @hope_pead 3 years ago

      Hi. Super impressed by ALiBi. But can you help me understand why softmax suppression should work? If you distort the softmax so that the past becomes more irrelevant the further back you go, as Yannic said in the video you basically tell the model "I don't care if K_1 is important to Q_6, pay less attention to it". This could maybe work in the lower layers, but in higher layers, where the model learns much more intricate representations (like whether a sentence is from someone else's POV, for example), they have to take into account a lot of information from the distant past. How do you explain the fact that suppressing the past doesn't hurt the representation ability when the context depends on the distant past?

    • @ofirpress
      @ofirpress 3 years ago +2

      @@hope_pead Thanks!
      Some of the heads are able to look far back, and for the rest of them, I would conclude that ALiBi works because even in the normal (sinusoidal) transformer, not much context is required for SoTA perplexity.

  • @SLAM2977
    @SLAM2977 3 years ago +19

    Great explanation of positional encoding Yannic! Best I have seen so far. All the way to 100k! :)

  • @adamrak7560
    @adamrak7560 3 years ago +4

    One thing you have not mentioned: the potential of the network to bring interesting info from the beginning of the sequence, not by defeating the biases, but by jumping forward in every layer.
    Even less important info can be propagated forward this way, if you have enough layers.

  • @anirbansen7132
    @anirbansen7132 4 months ago

    Very Helpful

  • @herp_derpingson
    @herp_derpingson 2 years ago

    I never fully understood why the original transformer used the weird sinusoid thing. In hindsight, making the position encodings linear makes much more sense.
    .
    28:35 The early token curse is just that at the beginning of evaluation, the transformer doesn't have enough context to make reliable predictions?

  • @MachineLearningStreetTalk
    @MachineLearningStreetTalk 3 years ago +4

    Nearly first 👌😀

  • @MrDREAMSTRING
    @MrDREAMSTRING 3 years ago

    Shouldn't they compare with the case where only the most recent N tokens are fed into the model for inference, since practically their approach just blindly makes the attention weights of early tokens close to zero?

    • @ofirpress
      @ofirpress 3 years ago

      Ya that's the 'sliding window' evaluation approach that we discuss in the analysis section and also in a previous paper ('Shortformer'). It works well with the sinusoidal model but it is *very* slow.
      So our advantage is that we achieve these great perplexity results *and* we're super fast!
      Tell me if you have more questions :)
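
      For readers unfamiliar with the two evaluation modes being compared, a rough sketch (my own illustration; `model` is a hypothetical callable that returns, for each position of its input, the loss of predicting that token from the tokens before it): nonoverlapping evaluation does one forward pass per chunk, while sliding-window evaluation re-encodes the preceding window for every single prediction, which is why it is so much slower.

      ```python
      def eval_nonoverlapping(model, tokens, window):
          # One forward pass per chunk of `window` tokens: fast, but the first tokens
          # of each chunk are predicted with very little context ("early token curse").
          losses = []
          for start in range(0, len(tokens), window):
              losses.extend(model(tokens[start:start + window]))
          return losses

      def eval_sliding_window(model, tokens, window):
          # Re-encode the preceding `window` tokens for every prediction: each token
          # gets (nearly) full context, but this costs roughly `window` times more compute.
          losses = []
          for i in range(1, len(tokens)):
              context = tokens[max(0, i - window):i + 1]
              losses.append(model(context)[-1])   # keep only the loss of the last position
          return losses
      ```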

  • @leotrisport
    @leotrisport 3 years ago

    Super cool!

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago

    BiT-related transformer with 1024 tokens? It's not going through with zero-sum method qualitatively. :D

  • @liptherapy
    @liptherapy 3 years ago

    why is this unlisted?

    • @YannicKilcher
      @YannicKilcher  3 years ago +3

      it took very long to process the HD version

  • @JamesAwokeKnowing
    @JamesAwokeKnowing 3 years ago

    Larger m equals more myelinated nerves.

  • @odysseus9672
    @odysseus9672 3 years ago +8

    Roughly speaking: Fourier transform is no good. Long live the Laplace transform!

  • @patf9770
    @patf9770 3 years ago +5

    Very curious how this generalizes to 2-d inputs (images) or n-d.
    Edit: seems like you'd have to indicate direction somehow

  • @eelcohoogendoorn8044
    @eelcohoogendoorn8044 3 years ago +11

    Don't want to be that guy claiming their work has no value because of prior work, and I'm not saying that... but this is literally the same attention bias as proposed in LeViT, just in 1D rather than 2D. Ctrl-F LeViT in their paper shows no hits. Useful insight as to how this enables generalization from short to long sequences... didn't realize sinusoidal sucks so much at that, though then again relative position encodings are not a new idea exactly and I'd wager someone must have made that observation before too... The geometric series per head seems like a useful and potentially novel idea though.

    • @ofirpress
      @ofirpress 3 years ago +1

      1. ALiBi is not "literally the same attention bias as proposed in LeViT".
      Here's the relevant quote from the LeViT paper: "Each head has H × W parameters corresponding to different pixel offsets". This is basically a 2D version of the T5 bias that we describe and cite heavily in our paper. Our main idea is to use a simple linear function of the relative distance instead of learning separate biases for each relative distance (sketched below, after this comment).
      2. "relative position encodings are not a new idea exactly" definitely! We cite many previous relative position methods, including Shaw et al. which IIRC was the first paper to talk about relative positioning.
      3. "id wager someone must have made that observation before too" where?
      4. "Useful insight as to how this enables generalization from short to long sequences" and "The geometric series per head seems like a useful and potentially novel idea though" thank you!

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 3 years ago

      @@ofirpress Fair enough; didn't realize T5 also shared that character. In that context the trainable weights of LeViT seem like a relevant difference. Would have been interesting to see their 1D equivalent also thrown into the comparison, I suppose. Note that I mean my comment about not 'not having' value non-sarcastically. (Wow, lots of negations.) I think novelty is way overrated compared to thorough investigation and understanding, and no doubt your paper contributes to that.

  • @BenZ-os4ms
    @BenZ-os4ms 3 years ago +12

    That came at just the right time, have been doing texture classification using transformer neural networks and have been looking for a way to generalise to longer sequence lengths - this might just be the thing I'm looking for. Thanks, love your videos!

    • @WatchAndGame
      @WatchAndGame 3 years ago

      What do you mean with textures exactly?

    • @BenZ-os4ms
      @BenZ-os4ms 3 years ago +1

      @@WatchAndGame Using sequential textural data captured using a robotic arm, sliding a sensor across varying materials :)

    • @WatchAndGame
      @WatchAndGame 3 years ago

      @@BenZ-os4ms oh interesting, so basically detecting the materials in front of the sensor?

    • @BenZ-os4ms
      @BenZ-os4ms 3 years ago

      @@WatchAndGame Yep! Pretty much 😀

    • @ofirpress
      @ofirpress 3 years ago

      Awesome!

  • @angry4rtichoke646
    @angry4rtichoke646 3 years ago +2

    I am looking for research to do with the UW CSE department, and this video could not have come at a more perfect time! I am excited to better understand the concepts in this video soon )

  • @neonardo77
    @neonardo77 3 years ago +1

    thanks for your articulate pronunciation haha. it really helps me, a non-native english speaker, understand the content way better.

  • @arijaa.9315
    @arijaa.9315 8 months ago

    11:46 If the positional encoding is injected into the query and key, then after multiplication the positional encoding already has its effect in the weight matrix that is multiplied by the values, right? Then this positional information is transferred to the next layer, and we again add the positional encoding to the keys and queries. What I did not get: if the value is position-free, why don't the hidden layers transfer the position information to the next layer?
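
    As I read the question, it is about the setup where position is added only on the query/key side. A minimal single-head sketch (my own toy code, not from the video or the paper): the values, and therefore the layer's output, are built purely from position-free vectors, so position only controls the mixing weights and is not written into the representation that gets passed to the next layer.

    ```python
    import math
    import torch

    def attention_position_in_scores_only(x, pe, W_q, W_k, W_v):
        # x: (seq_len, d) token representations, pe: (seq_len, d) positional encoding
        q = (x + pe) @ W_q                    # position enters the queries...
        k = (x + pe) @ W_k                    # ...and the keys
        v = x @ W_v                           # ...but not the values
        scores = q @ k.T / math.sqrt(q.shape[-1])
        weights = torch.softmax(scores, dim=-1)
        return weights @ v                    # a position-dependent mixture of position-free vectors
    ```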

  • @a3ytc
    @a3ytc 3 years ago +2

    Wonder how this will work with bidirectional input - if you just apply the distance penalty in both directions then it doesn't know e.g. if the word is n tokens before or n tokens after (in theory I guess this means you can reverse the sentence and have the same prediction for the missing token)
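
    A tiny illustration of that point (my own numbers): a penalty that depends only on |i - j| gives a key n tokens before the query exactly the same bias as a key n tokens after it, so the direction is indeed lost.

    ```python
    import torch

    pos = torch.arange(5)
    sym_bias = -0.5 * (pos[:, None] - pos[None, :]).abs().float()   # -m * |i - j| with m = 0.5
    assert sym_bias[2, 0] == sym_bias[2, 4]   # 2 tokens before vs. 2 tokens after: same penalty
    ```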

    • @YannicKilcher
      @YannicKilcher  3 years ago

      True. I would guess that's fine, though.

  • @draxd3045
    @draxd3045 3 years ago +1

    Many channels talked about the same paper, but Yannic's is always my favorite

  • @KristoferPettersson
    @KristoferPettersson 3 years ago +1

    Mr Kilcher, you are awesome! Thank you for these well made lessons!

  • @oneman7094
    @oneman7094 3 years ago +1

    Why not learn m? (The scalar that multiplies the matrix which is added to the attention matrix)

    • @ofirpress
      @ofirpress 3 years ago +3

      We do mention in the paper that if you'd like to make m learned it will slow down training by about 3%. In our experiments it was just really tricky to make trainable slopes work. Sometimes we'd learn negative slopes which would kill any extrapolation ability, and then when we tried using ReLU to make sure that the slopes are never negative that made the slopes we got perform badly.
      I'm definitely sure that with a bit more work we'll get trainable slopes to work, and I've also started hearing from other people that they've made it work.

    • @oneman7094
      @oneman7094 3 years ago

      @@ofirpress Thanks! That makes sense.

  • @yeaves
    @yeaves 2 years ago

    That subtracting something in log space means multiplication or division is true, but ALiBi subtracts it inside the softmax function. So I think your explanation in terms of log space isn't correct, is it?

    • @yeaves
      @yeaves 2 years ago

      Naah, you are right. I was confused.

  • @dragoninfire123
    @dragoninfire123 3 years ago +1

    Could someone explain how the transformers can extrapolate to longer sequence lengths?
    If a model is designed to handle an input sequence length of 1024 during training, how can it handle 2048 or longer sequences at inference?
    Thanks!

    • @ekstrapolatoraproksymujacy412
      @ekstrapolatoraproksymujacy412 3 years ago

      You should learn about the transformer architecture so that you really understand it, then it is self-explanatory: in one layer every input goes through the same shared weights of the key, query and value layers, and then they are summed (every output is a weighted sum of all inputs transformed by the value layer), so it doesn't matter how many inputs there are and where they are; they are treated the same and summed. This can be a problem in tasks where the position of an input is relevant, and then positional encodings are needed. You shouldn't try to understand papers like this one if you are lacking a basic understanding of the architecture, it's just a waste of your time
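
      A minimal sketch of that length-agnostic behaviour (single head, no mask, my own toy code): the projection weights are defined per feature, not per position, so the very same parameters accept 1024 or 2048 tokens; what ALiBi improves is how well the model copes with the longer input, not whether it can run on it at all.

      ```python
      import math
      import torch
      import torch.nn as nn

      d = 64
      W_q, W_k, W_v = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)

      def self_attention(x):                                   # x: (seq_len, d), for any seq_len
          q, k, v = W_q(x), W_k(x), W_v(x)                     # same shared weights at every position
          weights = torch.softmax(q @ k.T / math.sqrt(d), dim=-1)
          return weights @ v                                   # each output: weighted sum over all inputs

      out_1024 = self_attention(torch.randn(1024, d))          # the training length
      out_2048 = self_attention(torch.randn(2048, d))          # a longer input: no weights need resizing
      ```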

    • @ofirpress
      @ofirpress 3 years ago +1

      I think it's a great question! We talk about this a bit in the analysis section. Basically we think our ALiBi method forces the model to kind of make every prediction with a virtual sliding window of say 1024 and so while we feed it 2048 tokens at a time, for each of the predictions in there it's probably only using around 1024 tokens of context. A lot more research is needed here though to understand exactly what's going on internally and how we can make this whole thing even better!

    • @dragoninfire123
      @dragoninfire123 3 years ago

      @@ofirpress Thank you Ofir for your explanation and contribution to the paper!

  • @ashokkumarj594
    @ashokkumarj594 3 years ago

    Thank you so much for your great explanation😍

  • @sieyk
    @sieyk 3 years ago +3

    Did they state why m can't be a learned parameter? I get the feeling next paper will be "Wow, we can do amazing image synthesis now because we made m a learned parameter lol"

    • @ofirpress
      @ofirpress 3 years ago +1

      We do mention in the paper that if you'd like to make m learned it will slow down training by about 3%. In our experiments it was just really tricky to make trainable slopes work. Sometimes we'd learn negative slopes which would kill any extrapolation ability, and then when we tried using ReLU to make sure that the slopes are never negative that made the slopes we got perform badly.
      I'm definitely sure that with a bit more work we'll get trainable slopes to work, and I've also started hearing from other people that they've made it work.

    • @sieyk
      @sieyk 3 years ago

      @@ofirpress thanks for clearing that up! Admittedly I only skimmed through the paper looking for something along the lines of "m as a trainable parameter".
      How was the performance when bounding m using sigmoid with a trainable spacing d_m?

    • @ofirpress
      @ofirpress 3 years ago

      @@sieyk I'm not sure what you mean by "trainable spacing", but bounding m using a sigmoid is exactly what makes it possible to train these slopes! (A small sketch of this follows at the end of the thread.)

    • @sieyk
      @sieyk 3 years ago

      @@ofirpress I mixed up the function of m in my last question. I did not realise that you used a geometric series for m across the heads, which explains why a trainable m would not add much.
      Also I appreciate your patience, when I was reading through the paper I noticed that, indeed, you already stated that sigmoid was the best choice for trainable geometric functions. Sorry about that!
      What I meant to ask was, how well did it do if each head had its own trainable m, discarding the geometric series? I understand that the geometric series was a substitute for positional encoding, so perhaps that idea wouldn't work at all 😅.
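
    On the trainable-slope discussion in this thread, a hypothetical sketch (my own code, not the authors') of the sigmoid bounding Ofir mentions above: one raw parameter per head is squashed through a sigmoid so a learned slope can never turn negative, which was reported above to kill extrapolation.

    ```python
    import torch
    import torch.nn as nn

    class TrainableSlopes(nn.Module):
        def __init__(self, n_heads: int):
            super().__init__()
            self.raw = nn.Parameter(torch.zeros(n_heads))   # one unconstrained parameter per head

        def forward(self) -> torch.Tensor:
            # The sigmoid keeps every slope in (0, 1), so it can never go negative.
            return torch.sigmoid(self.raw)

    # slopes = TrainableSlopes(n_heads=8)()   # candidate replacement for the fixed geometric series
    ```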