Rethinking Attention with Performers (Paper Explained)

  • Published 27 May 2024
  • #ai #research #attention
    Transformers have huge memory and compute requirements because they construct an Attention matrix, which grows quadratically in the size of the input. The Performer is a model that uses random positive orthogonal features to construct an unbiased estimator to the Attention matrix and obtains an arbitrarily good approximation in linear time! The method generalizes beyond attention and opens the door to the next generation of deep learning architectures.
    OUTLINE:
    0:00 - Intro & Outline
    6:15 - Quadratic Bottleneck in Attention Mechanisms
    10:00 - Decomposing the Attention Matrix
    15:30 - Approximating the Softmax Kernel
    24:45 - Different Choices, Different Kernels
    28:00 - Why the Naive Approach does not work!
    31:30 - Better Approximation via Positive Features
    36:55 - Positive Features are Infinitely Better
    40:10 - Orthogonal Features are Even Better
    43:25 - Experiments
    49:20 - Broader Impact Statement
    50:00 - Causal Attention via Prefix Sums
    52:10 - Code
    53:50 - Final Remarks & Conclusion
    Paper: arxiv.org/abs/2009.14794
    Code: github.com/google-research/go...
    Blog: ai.googleblog.com/2020/10/ret...
    Kernels on ML Street Talk: • Kernels!
    My Video on Linformer: • Linformer: Self-Attent...
    My Video on Reformer: • Reformer: The Efficien...
    My Video on Attention: • Attention Is All You Need
    Abstract:
    We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
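    To make FAVOR+ concrete, here is a minimal numpy sketch of softmax attention approximated with positive random features. It follows the spirit of the paper but is not the official implementation: the function names are mine, the random directions are plain i.i.d. Gaussians (the real code in the linked repo additionally orthogonalises them), and only the bidirectional case is shown.

```python
import numpy as np

def positive_random_features(x, w):
    # x: (L, d) queries or keys, w: (m, d) rows drawn i.i.d. from N(0, I_d).
    # phi(x)_i = exp(w_i . x - ||x||^2 / 2) / sqrt(m), chosen so that
    # E[phi(q) . phi(k)] = exp(q . k), the (unnormalised) softmax kernel.
    m = w.shape[0]
    proj = x @ w.T                                   # (L, m)
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(proj - sq_norm) / np.sqrt(m)

def favor_attention(Q, K, V, m=256, seed=0):
    # Bidirectional attention in O(L * m * d) time and memory.
    L, d = Q.shape
    w = np.random.default_rng(seed).standard_normal((m, d))
    scale = d ** -0.25                               # folds 1/sqrt(d) into q and k
    Qp = positive_random_features(Q * scale, w)      # (L, m)
    Kp = positive_random_features(K * scale, w)      # (L, m)
    kv = Kp.T @ V                                    # (m, d_v), never an L x L matrix
    numerator = Qp @ kv                              # (L, d_v)
    denominator = Qp @ Kp.sum(axis=0)                # (L,) row normalisation
    return numerator / denominator[:, None]

# Sanity check against exact softmax attention on a short sequence.
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((32, 16)) for _ in range(3))
A = np.exp(Q @ K.T / np.sqrt(16))
A = A / A.sum(axis=-1, keepdims=True)
exact = A @ V
approx = favor_attention(Q, K, V, m=4096)
print(np.abs(exact - approx).max())   # shrinks as m grows (unbiased, variance ~ 1/m)
```

    Applying the feature map first is what lets the matrix products be re-associated, so the L x L attention matrix is never formed.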
    Authors: Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

Comments • 97

  • @imgtsmnlom
    @imgtsmnlom 3 years ago +40

    I find your level of detail absolutely spot on! None of my profs ever felt so well rehearsed while at the same time going just deep enough so that the audience (well, me) has a chance to actually follow along in real time. Big ups!

  • @SunilMeena-do7xn
    @SunilMeena-do7xn 3 years ago +53

    Please continue your classical paper series. That's very helpful for beginners like me.

  • @andrehoffmann2018
    @andrehoffmann2018 3 years ago +75

    Yannic made close to no sassy remarks. This paper must be huge.

  • @toom2141
    @toom2141 3 years ago +29

    I just recently discovered your channel. But I love your videos so much, Yannic.
    It is so fantastic having real experts posting stuff on YouTube. Your channel is without any doubt Triple-A Plus 👍
    Thank you so much for putting all that energy into your videos.

    • @klammer75
      @klammer75 3 years ago +4

      Agreed 100%...keep up the good work😎🎓

    • @martinpflaum882
      @martinpflaum882 3 years ago +3

      Yes, they are really great. Always really high quality. I just miss the days when he published one every day :D

    • @jonatan01i
      @jonatan01i 3 years ago +3

      @@martinpflaum882 I couldn't keep up back then.

  • @anthonyrepetto3474
    @anthonyrepetto3474 3 years ago +11

    Thank you for walking through the key concepts and confusions :) I got chills, thinking about how much of an accelerant this is, for rolling-out massive attention. Every time a researcher says "oh, look a lotto ticket!" we casually assume that the efficiencies will make it easier for lower-tier compute to compete... while Microsoft leans over and says "I'll give you *two* billion, kid..."
    Also, at 23:54 -> you draw two sad skulls looking out the window of a bus at night, with a third skull at the back of the bus, asleep.

  • @florianhonicke5448
    @florianhonicke5448 3 years ago +18

    I still watch each of your videos. There is no one on YouTube who goes as deep into the papers as you do. I also like the way you present your perception of the paper. #fanboy :D

  • @katiefaery
    @katiefaery 3 years ago

    Great video. I read the paper a few days ago but it's nice to have someone talk you through it as well. Nice clear explanations. Thanks 😊👍

  • @allessandroable
    @allessandroable 3 years ago +2

    You explain difficult things in a really enjoyable and easy way! Thanks for your work

  • @alessiaventani9504
    @alessiaventani9504 3 years ago +1

    Hi! I have to study this paper for my final project at university! Without your video I wouldn't have understood all these details! Thank you!

  • @Jason-zb8xq
    @Jason-zb8xq 3 years ago +1

    Very well presented! Definitely worth watching more of your videos to learn your presentation skills :)

  • @NilabhraRoyChowdhury
    @NilabhraRoyChowdhury 3 years ago

    The most interesting paper that has come out so far in 2020 IMO. Thanks for the detailed video!

  • @wentianbao5368
    @wentianbao5368 3 years ago +1

    Straightforward explanation. Pretty cool.

  • @clee5653
    @clee5653 3 years ago

    My head exploded. Thanks Yannic, no way I can understand this paper without your awesome explanation.

  • @chaoqiyang5219
    @chaoqiyang5219 3 years ago +1

    excellent video! Thanks, Yannic!

  • @yi-hsuanyang1518
    @yi-hsuanyang1518 3 years ago +1

    Very nice video. Many thanks!

  • @ProfessionalTycoons
    @ProfessionalTycoons 3 years ago

    Dope, great theoretical breakthroughs

  • @dik9091
    @dik9091 1 year ago

    The quadratic scaling problem was solved a long time ago, in 1952, with a network technology called Clos networking (what is it with the French and networks?). It was a solution for analog telephony switchboards, which had the same quadratic scaling problem. Funny how nobody thinks to buy an old analog telephone exchange and be done with it. The attention matrix can be built completely in hardware, as well as the deep learning network and everything involved. There is no need for a CPU or GPU anywhere, only analog calculations with op-amps and sequential circuitry. Our brain also has no CPU or GPU; there is only analog summing of voltages, which takes no time since it all happens at the speed of light, and at the speed of light there is no such thing as time. "Circuitry is all we need" would be the title of my paper once I can prove with an experiment that this is the way to go about it.
    What is also funny is how the nomenclature is the same: in telephony/pro audio they use symmetric/balanced signals for noise rejection with hardware transformers. Now in the AI context I hear talk about audio transformers, which for a pro audio guy is kind of weird. I make these hardware switches for pro audio with Clos hardware networking and solved this issue for the pro audio industry; check the current SOS edition for an article about it. This can be done because I know how to do it; it is a lack of funding and time that prevents me from working on it now, but I can already make a design and a simulation on Falstad. This is at least going to be fun.

  • @sarathks9911
    @sarathks9911 3 years ago

    Thanks for your neat explanation.
    I am curious how effective the Performer-based Transformer is on different NPUs. Are there any limitations?

  • @shivamraisharma1474
    @shivamraisharma1474 3 years ago +2

    Can you do a code-along video for some neural rendering repo on Colab?

  • @herp_derpingson
    @herp_derpingson 3 years ago +2

    I still think that there is some kind of No Free Lunch effect going on when it comes to attention. Sometimes you just need a large amount of compute. Regardless, this is the best tradeoff I have seen so far.

  • @konghong3885
    @konghong3885 3 years ago +34

    TL;DR:
    49:07 --of course they beat everything

    • @Neural_Causality
      @Neural_Causality 3 years ago +1

      Whether it is going to be the next thing that everyone uses, we don't know, but it seems fairly possible.

    • @norik1616
      @norik1616 3 years ago +2

      Google labs at this point have to release a slightly better transformer each time - the first 48 mins are what I came for. If not this one, there will hopefully eventually be a true linear attention (or something stronger, more general, and also linear in sequence length). And that will be a great deal for all of us "gaming hardware" DL engineers.

  • @ksy8585
    @ksy8585 3 years ago

    Your videos are really awesome

  • @ilyasaroui7745
    @ilyasaroui7745 3 years ago +2

    Hello Yannic, thanks for the great video. Can you please share with us which software you use to record your screen and to edit the PDFs?

  • @Myblogband
    @Myblogband 5 months ago

    This isn't mathematics, this is grunt work!

  • @francescocariaggi1145
    @francescocariaggi1145 3 years ago

    What's the purpose of the function "g" at 15:55? It looks like they introduce it, but then they don't include it in the definition of phi(x)

  • @hyunsunggo855
    @hyunsunggo855 3 years ago

    What a great paper. This is the kind that will change the future.

  • @NielsRogge
    @NielsRogge 3 years ago +6

    Great video! I think there's a typo in the description of the video, should be Performer rather than Reformer

  • @pravinnagar6111
    @pravinnagar6111 2 years ago

    I am adapting this work for very long videos (egocentric lifelogging videos). However, I am stuck on equation 5. It would be a great help if you could provide a proof of / resources for equation 5.
    I also read the related work titled 'Orthogonal Random Features.' In that work, I can follow the third equation, which seems to be a special case of equation 5. However, I still don't understand how h(x) is introduced in equation 5.

  • @Shivam98i
    @Shivam98i 3 years ago

    Great video! Covers every aspect of it... I have one doubt though: how do you perform masking in the bidirectional case?
    Will it be the same as in the Transformer?
    In the Transformer, QK^T was masked and then the softmax was applied, but how do you do it here?

    • @YannicKilcher
      @YannicKilcher 3 years ago

      I actually don't know exactly
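    The paper's answer for the unidirectional (causal) case is the prefix-sum mechanism in the outline above (50:00): instead of masking an L x L matrix, one carries running sums of the feature-mapped keys and values, while the bidirectional case needs no mask at all. A minimal numpy sketch of that idea under my reading of the paper; function and variable names are mine, not from the official code.

```python
import numpy as np

def causal_linear_attention(Qp, Kp, V):
    # Qp, Kp: (L, m) feature-mapped queries/keys, V: (L, d_v).
    # Output row i attends only to positions j <= i, without ever building
    # the L x L matrix: carry the prefix sums
    #   S_i = sum_{j<=i} outer(Kp_j, V_j)   and   z_i = sum_{j<=i} Kp_j.
    L, m = Qp.shape
    d_v = V.shape[1]
    S = np.zeros((m, d_v))
    z = np.zeros(m)
    out = np.empty((L, d_v))
    for i in range(L):
        S += np.outer(Kp[i], V[i])
        z += Kp[i]
        out[i] = (Qp[i] @ S) / (Qp[i] @ z)
    return out
```

    In practice the loop is vectorised and fused on the accelerator, but the arithmetic is exactly this.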

  • @justincho2043
    @justincho2043 3 years ago +1

    I think you meant to say "The Performer" instead of "The Reformer" in the video description. Thank you as always, keep up the good work!

  • @YashBhalgat
    @YashBhalgat 3 years ago

    Is there any mention of the actual on-target (e.g. TPU) latency comparisons between conventional Transformer and Performers? (I don't see it in the paper, unless I am missing something)

    • @YannicKilcher
      @YannicKilcher 3 years ago +1

      there is not, as far as I can tell

  • @granttao7504
    @granttao7504 3 years ago

    thank you

  • @asilserhan685
    @asilserhan685 3 years ago +1

    So can we train a model with GPT-3 performance and the same input sequence length faster using these, or does this only allow us to have longer input sequences?

    • @YannicKilcher
      @YannicKilcher 3 years ago

      technically yes, but whether it reaches gpt-3 is not clear

  • @felipemello1151
    @felipemello1151 3 years ago +4

    I was very very excited about it, but then I saw this paper comparing performers vs other attention mechanisms: openreview.net/pdf?id=qVyeW-grC2k
    It seems that the performer attention doesn't do as well as other attentions when there is some hierarchical structure (check listOps results). There are some interesting comments here: github.com/lucidrains/performer-pytorch/issues/2

  • @thomasevers1938
    @thomasevers1938 20 days ago

    Hello Yannic, you say "approximate the attention matrix", which implies there is some ground-truth attention matrix. Does this mean these methods are only applied at inference? That is, are the models still trained on the actual softmax attention, with the approximation made only during inference?
    If not, and this is actually used during training, meaning the model is trained to work as well as it can with this approximation of the softmax, why do we still talk about unbiasedness towards the actual attention matrix? We basically came up with a new type of model, so why compare it to the softmax version? Just because we know that works? Why do we want our model to approximate the original Transformer? Why can it not be its own thing?
    Thank you in advance :)

  • @JI77469
    @JI77469 3 years ago +1

    Is it correct to think that Random Fourier Features is "the" modern breakthrough that's preventing Kernel methods from being banished into relative obscurity (except for niche applications or when you have a small data set) ?

    • @YannicKilcher
      @YannicKilcher 3 years ago

      yes, one of the few last things that keeps kernels going
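    For readers who have not met them: random Fourier features (Rahimi & Recht) approximate a shift-invariant kernel by an explicit finite-dimensional feature map, so the n x n kernel matrix never has to be formed. A minimal numpy sketch for the RBF kernel, with illustrative names and parameters (not from any particular library):

```python
import numpy as np

def rff_features(X, m=1024, sigma=1.0, seed=0):
    # Random Fourier features for the RBF kernel
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    w = rng.standard_normal((m, d)) / sigma         # samples from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(X @ w.T + b)   # (n, m)

X = np.random.randn(500, 16)
phi = rff_features(X)
K_approx = phi @ phi.T                              # O(n * m) memory instead of O(n^2) kernel evaluations
K_exact = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
print(np.abs(K_approx - K_exact).max())             # error decreases like 1/sqrt(m)
```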

  • @ross825
    @ross825 3 years ago

    Just saw this and now I’m clearing my schedule lol

  • @weikailin4342
    @weikailin4342 2 years ago

    However, when we use a Transformer, we find that the MLP computation is the bottleneck too, because the latent size d is very big and the sequence size N is not that big. I wonder, is there an article rethinking the MLP layer?

  • @Zenol42
    @Zenol42 3 years ago +1

    I want causal performers in pytorch!!! 😍

  • @ivanvoid4910
    @ivanvoid4910 3 years ago

    Oh man this was cooler than Marvel, thank you!

  • @faizanahemad
    @faizanahemad 3 years ago

    At 8:38, is doing Q·(K^T·V) instead of (Q·K^T)·V the same as in the "Transformers are RNNs" paper?

    • @YannicKilcher
      @YannicKilcher 3 years ago

      good connection, I don't know exactly
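    For context, the linear-time trick is this re-association: computing Q'(K'^T V) instead of (Q'K'^T)V once a feature map has replaced the softmax. As far as I can tell, the main difference from the "Transformers are RNNs" paper is the feature map itself (a deterministic elu(x)+1 there versus random softmax-kernel features here). A quick numpy check of the associativity:

```python
import numpy as np

L, m, d_v = 2048, 256, 64
rng = np.random.default_rng(0)
Qp = rng.random((L, m))          # feature-mapped queries (non-negative)
Kp = rng.random((L, m))          # feature-mapped keys
V = rng.standard_normal((L, d_v))

# Left-to-right: materialises an L x L matrix -> O(L^2) time and memory.
quadratic = (Qp @ Kp.T) @ V
# Right-to-left: only an m x d_v intermediate -> O(L * m * d_v), linear in L.
linear = Qp @ (Kp.T @ V)
print(np.allclose(quadratic, linear))   # True, up to floating-point error
```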

  • @TheReferrer72
    @TheReferrer72 3 years ago

    You are a star! I was wondering how this architecture works, and I'm too lazy/dumb to read the paper.

  • @iuhh
    @iuhh 3 years ago +1

    The Street Talk on Kernel with Alex Stenlake: th-cam.com/video/y_RjsDHl5Y4/w-d-xo.html (mentioned at 12:43)

  • @scottmiller2591
    @scottmiller2591 3 years ago

    Bochner never got enough love.

  • @RohitKumarSingh25
    @RohitKumarSingh25 3 years ago +1

    I think there are many technical blog sites that also wait for your videos. Once you explain it here, they just summarise the video there. 😅

    • @matthewtang1489
      @matthewtang1489 3 years ago

      I totally agree (haha). Many Chinese tech blogs I follow post write-ups of the videos he makes.

    • @chaidaro
      @chaidaro 3 years ago

      Matthew Tang these people who summarize the vid have no shame.

  • @hannesstark5024
    @hannesstark5024 3 years ago

    37:00: PRF are infinitely better than the trigonometric approximation: Why are the ratios between the MSEs going down to 0 and not just 1 for length differences close to 0? Does that not mean that in that area the PRF is infinitely worse than the trigonometric approximation?

  • @andres_pq
    @andres_pq 3 years ago

    Beats me why I haven't heard of a SOTA model with Performers.

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago +1

    Laplacian differentials in the multi-layer 225-bytes kernel isn't really interpolate themselves in the distraction progress, it could be generating more costable errors in R²d (upper) and R²d (lower) maximal interpolation rate counting by unlinear / meta-linear differentation, if we comfortably using only one of kernelization estimating network in one layer by product.

  • @hannesstark5024
    @hannesstark5024 3 years ago +7

    You say "and of course they beat everything". What is your opinion of that after looking at the "long-range arena": openreview.net/forum?id=qVyeW-grC2k which compares many different efficient transformer ideas including the Performer?

    • @clee5653
      @clee5653 3 years ago

      Well, obviously it's from Google.

    • @DavenH
      @DavenH 3 years ago

      Thanks for the paper link -- interesting results.
      Cliffs: Performer is on the (current) Pareto-optimal curve with a great performance/accuracy tradeoff.
      Big Bird is also on the PO curve and slightly outdoes the vanilla Transformer's accuracy with less memory but similar (bad) performance.
      Reformer and Local Attention suck.
      Linformer and Linear Transformer are similar, but slightly dominated by Performer.

    • @hannesstark5024
      @hannesstark5024 3 years ago

      @@DavenH what does pareto-optimal curve mean? I only heard about pareto optimality from game theory. And why do you say Performer slightly dominates Linformer and Linear Transformer and BigBird has bad performance even though the Performer performs very much worse than the other models on, for instance, the list ops?

    • @DavenH
      @DavenH 3 years ago +2

      @@hannesstark5024 It's a term used in economics too. It means the curve on a multivariate function that expresses the best trade-offs possible. I'm using the term a bit flexibly because these are merely best / SOTA results, rather than known-optimal results.
      An example could be a measurement device that determines the momentum and position of a particle to the joint accuracy limit prescribed by the Planck constant -- you can make a tradeoff on either measurement, and so long as the product of errors of those quantities is the Planck constant, it will fall on the Pareto-optimal curve of greatest possible accuracy. In contrast, if you had a measurement device whose product of errors in each measurement was greater than PC, it would not be Pareto optimal. If I haven't described it well enough, search "wiki Pareto Frontier".
      The comments about dominating Linformer and LT are from the overall results on the Long Range Arena task plotted in their Figure 3.
      You can see Performer lies on the Pareto frontier, as do Big Bird and Synthesizer, meaning their particular combinations of accuracy and performance are not dominated.
      Performer is better in both accuracy and performance than LF, LT, LA, Reformer, and Sinkhorn, so those models are dominated and never the right choice (overall). But they could be the right choice for a particular task.

    • @hannesstark5024
      @hannesstark5024 3 years ago +1

      @@DavenH Ah, nice thanks for the explanation and pointer! Btw, do you know if the "size of the circle" representing the memory footprint is the radius or the area of the circles?

  • @shaneacton1627
    @shaneacton1627 3 years ago

    Does this mean sparse attention is dead?

  • @karteekmenda7621
    @karteekmenda7621 3 years ago

    Hey Yannic, can you make a video on PRADO?

    • @karteekmenda3282
      @karteekmenda3282 3 years ago

      Hi Yannic,
      Can you please make a video on PRADO? Attaching the link to the paper (aclweb.org/anthology/D19-1506.pdf) for your reference.

  • @krooqs
    @krooqs 3 years ago

    What tablet do you use?

    • @krooqs
      @krooqs 3 years ago +1

      Found the answer: it's a Surface, according to an older video.

  • @pi5549
    @pi5549 8 months ago

    7:00 It's nTokens * nTokens (or MAX_TOKENS * MAX_TOKENS if you're batch-training and using padding) not L*L

    • @pi5549
      @pi5549 8 months ago

      And wait, what -- the values aren't coming from layer L+1. They're coming from layer L, the same as Q and K. The inputs to layer L are matmul'd by W_Q and W_K and softmaxed, which generates the attention matrix, which is then applied to V (= matmul(inputs, W_V)).

  • @fast_harmonic_psychedelic
    @fast_harmonic_psychedelic 3 years ago

    This DALL-E + CLIP colab uses sigmoid and softmax. I thought that was modern...

  • @Mnnvint
    @Mnnvint 2 years ago

    "Believe it or not, you young kids" - don't make me feel even older than I am, you impudent zoomer! It's just... ten years ago or so :-|
    In Andrew Ng's first machine learning course (which had only a small chapter on neural networks, at the time they didn't impress me since they performed no better than SVMs and took ten times as long to train) I don't remember which activation function we used, but it was certainly not ReLU.

  • @machinelearningdojowithtim2898
    @machinelearningdojowithtim2898 3 years ago

    This is a seriously amazing video, make sure you all get over to Yannic's SubscribeStar and cough up! It's more cost-effective than going to university I promise! www.subscribestar.com/yannickilcher

  • @charilaosmylonas5046
    @charilaosmylonas5046 3 years ago

    Great explanation - random Fourier features are becoming quite trendy lately (demonstrations are on "coordinate-based MLPs": arxiv.org/abs/2006.10739). This random features idea works ridiculously well.

  • @shivani404sheth4
    @shivani404sheth4 3 years ago

    'what is this paper doing? it's exactly doing what I said was impossible' xD

  • @marijnstollenga1601
    @marijnstollenga1601 3 years ago +1

    Alright, so we're back to kernel methods. I'm sure most of this has been done before.

  • @444haluk
    @444haluk 2 years ago

    Of course orthogonal w's are better; random w's will put your original vector into a latent subspace of the new high-dimensional space. That is 40-year-old knowledge.

  • @JumpNationFilms
    @JumpNationFilms 3 years ago

    I don't quite get what an attention matrix is at 7:50. I thought we had a separate Q, K and V matrix, not one big attention matrix A
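    A is not a third learned matrix alongside Q, K and V; it is computed from Q and K and then applied to V. A compact numpy restatement of standard single-head softmax attention:

```python
import numpy as np

L, d = 8, 64                        # sequence length, head dimension
rng = np.random.default_rng(0)
X = rng.standard_normal((L, d))     # the layer's input tokens
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d)       # (L, L): one score per query/key pair
A = np.exp(scores)
A = A / A.sum(axis=-1, keepdims=True)   # row-wise softmax -> the attention matrix
out = A @ V                         # each output row is a weighted average of value rows
```

    It is this L x L matrix (and the row-wise softmax that produces it) that the Performer avoids ever materialising.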

  • @johnpope1473
    @johnpope1473 3 years ago

    13:00 - "What is Kernel?
    A kernel is a function used in SVM for helping to solve problems. They provide shortcuts to avoid complex calculations.
    The amazing thing about kernel is that we can go to higher dimensions and perform smooth calculations with the help of it
    We can go up to an infinite number of dimensions using kernels. Sometimes, we cannot have a hyperplane for certain problems. This problem arises when we go up to higher dimensions and try to form a hyperplane. A kernel helps to form the hyperplane in the higher dimension without raising the complexity." techvidvan.com/tutorials/svm-kernel-functions/
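    A tiny concrete instance of the idea in the quote, assuming nothing beyond numpy: the 2-D polynomial kernel k(x, y) = (x·y)^2 is exactly an inner product after an explicit map into three dimensions; the RBF kernel plays the same game with an implicit infinite-dimensional map.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the 2-D polynomial kernel k(x, y) = (x . y)^2
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ y) ** 2)        # kernel evaluated directly: 1.0
print(phi(x) @ phi(y))     # same value via the explicit higher-dimensional map
```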

  • @mmmooo...
    @mmmooo... 3 years ago +2

    How would I dare say anything?

  • @user-hc8oh1yg4z
    @user-hc8oh1yg4z 3 years ago +3

    Why are there no subtitles?