How the hell do you have so few views? This is one of the best and most factually correct animations out there.
He'll get there. Quality informational content always picks up slowly, but as long as the quality doesn't decline, its growth is exponential. Up to a limit, obviously, since this information is specialized, but that limit is high.
So as more teachers discover these illustrations and pass them on to their students, it will grow.
These are the best animations I've seen about neural nets. I hope we can get a video about attention that's as clear as the one on depthwise separable convolutions.
The other drawings and visuals can't keep up with this!
Great content! I love the visualizations!
straight heat i can't even lie good stuff bro
Saying that Multihead Attention has fewer parameters than a token-wise linear is true for NLP models but not for ViTs. Additionally, simply creating a mechanism that incorporates the entirety of the features does not explain away the success of attention mechanisms -- looking again at computer vision tasks, MLP-Mixer also incorporates the entirety of the features in its computations, yet is still less successful than attention-based ViTs. Part of the strength of the attention layer is its adaptability -- which you can see the value of in things like GATs. Otherwise, it could just be replaced with a generic low-rank linear layer.
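For anyone who wants to sanity-check the parameter counts being discussed here, a rough PyTorch sketch follows. The sizes are ViT-Base-style placeholders chosen for illustration, not numbers taken from the video or the comment above.

import torch.nn as nn

# Placeholder sizes: embed_dim = 768, 12 heads, 196 patch tokens (ViT-Base-like).
embed_dim, num_heads, seq_len = 768, 12, 196

attn = nn.MultiheadAttention(embed_dim, num_heads)   # Q, K, V and output projections
per_token_linear = nn.Linear(embed_dim, embed_dim)   # same weights reused for every token

count = lambda m: sum(p.numel() for p in m.parameters())
print("multihead attention:", count(attn))            # ~4 * embed_dim**2, about 2.4M
print("per-token linear:", count(per_token_linear))   # ~embed_dim**2, about 0.6M

# A single linear layer that mixed all tokens at once would act on the
# flattened sequence, so its weight matrix alone would need:
print("flattened-sequence linear:", (seq_len * embed_dim) ** 2)  # ~22.7 billion weights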
4:27 I'm definitely judging the animation of the recurrent layer...
I have been waiting for this video! Very much worth the wait!
This is so unfairly underrated, I have never seen such a good video about CNNs.
It would be good to know the rationale behind this way of calculating things, beyond just computational efficiency.
Thanks! great explanation :)
You are a legend keep it up!
I feel like a lot of work has been put into the animations for this series, and that I should come away having learned something, but somehow I am more confused after watching the entire series than before I started.
I'm not sure if this is due to the need to visualize something that can't be represented in 3D space, a knowledge gap created by assumptions made during the explanations, or me simply being too stupid.
I'm confused, because you make it look like an attention layer could be used as a drop-in replacement for a linear layer, but GPT-4o says: "No, an attention layer cannot be used as a direct drop-in replacement for a linear layer due to the fundamental differences in their functionalities and operations."?
You’re correct that an attention layer is not functionally equivalent to a linear layer; its efficiency comes with trade-offs of its own. But it’s going to make more sense to talk about those trade-offs a couple more videos down the line in this series, so I didn’t go over them in this video.
@animatedai Thanks for clearing that up. Also, I ran some quick tests comparing the performance of a PyTorch MultiheadAttention layer with a Linear layer, and the Linear layer is significantly faster on both CPU and GPU in every test I could run, so I hope that's something you could clarify in a future video. Looking forward to the next one!
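For reference, a quick timing test along the lines described above might look like the sketch below. This is a rough placeholder script with arbitrary shapes, not the commenter's actual code.

import time
import torch
import torch.nn as nn

# Arbitrary placeholder sizes; the default MultiheadAttention layout is (L, N, E).
batch, seq_len, embed_dim, num_heads = 8, 512, 768, 12
x = torch.randn(seq_len, batch, embed_dim)

attn = nn.MultiheadAttention(embed_dim, num_heads)
linear = nn.Linear(embed_dim, embed_dim)

def bench(fn, iters=50):
    with torch.no_grad():
        fn()  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - start) / iters

print("attention:", bench(lambda: attn(x, x, x)))  # self-attention: query = key = value = x
print("linear:   ", bench(lambda: linear(x)))      # plain per-token linear of the same width

The gap itself isn't surprising: at the same width, multihead attention runs four linear projections of that size plus the L-by-L attention scores, so it does strictly more work per token than a single Linear layer.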
Computational efficiency is also due to the higher dimensionality, right? You can represent data in a much richer space than with RNNs of similar parameter size and capture more complex features, thanks to the higher-dimensional space that each attention layer enables. That said, I might be being unfair to RNNs, since they have such poor long-range dependency handling and "physically" can't do the same stuff even if they wanted to.
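To put "similar parameter size" in concrete terms, here is a rough comparison of a single attention layer and a single recurrent layer at the same width; the sizes are arbitrary placeholders, not values from the video.

import torch.nn as nn

embed_dim, num_heads = 768, 12
count = lambda m: sum(p.numel() for p in m.parameters())

attn = nn.MultiheadAttention(embed_dim, num_heads)            # one attention layer
lstm = nn.LSTM(input_size=embed_dim, hidden_size=embed_dim)   # one recurrent (LSTM) layer

print("multihead attention:", count(attn))  # about 2.4M parameters
print("LSTM layer:", count(lstm))           # about 4.7M parameters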
FWIW, have you seen the recent business presentation given by Randell L. Mills, who claims to explain the reality of the N-electron atom and to have the solution to EVERYTHING?
😊
hi