Your explanation was very clear and useful. I strongly recommend this video if you want to understand the concept of the Attention mechanism in RNNs.
Shusen Wang, this was extremely beneficial, absolute masterpiece.
Thanks man, I've done my last minute prep for the exam through this video.
Amazing content, bro. Lots of hard work, thank you so much. Please make more AI playlists like NLP, RL, Deep RL, and Meta Learning with these amazing animations.
Extremely clear and easy-to-follow explanation.
You're my hero. Marry me!
Hahaha this is just a comment to let you know that your explanation can easily be the clearest one on YouTube to understand attention. Keep up the good work! Thanks a mil!
The best on YouTube, thank you very much.
Astonishing pedagogic effort, Shusen! That's a lot of work involved to share knowledge. Kudos!
Thank you for the fruitful lecture!
Instead of α_i, using α_{i,j}=align(h_i, s_j) makes the equation easier to see for me.
But it's super helpful for beginners like me, thanks again!
The same notation was already used in the lecture after next.
Sorry for the redundant comment.
So excited. Great supporting material to Goodfellow textbook. I am building my knowledge for the vision Transformer model.
Very beautifully and simply explained, GGs.
7:25 I think the concatenation before the linear layer is from the Luong et al. 2015 paper. In Bahdanau et al., the authors apply the linear layers first (to both h and s) and then concatenate.
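For anyone trying to line up the two orderings mentioned above, here is a minimal NumPy sketch; the names W, W_h, W_s, v and the toy sizes are my own illustration, not the lecture's notation. Splitting W into two blocks also shows why the two orderings give the same score.

```python
import numpy as np

def score_concat_first(h, s, W, v):
    # Concatenate [h; s] first, then one linear layer, tanh, and a dot with v.
    return v @ np.tanh(W @ np.concatenate([h, s]))

def score_linear_first(h, s, W_h, W_s, v):
    # Apply separate linear layers to h and s first, then add, tanh, dot with v.
    return v @ np.tanh(W_h @ h + W_s @ s)

# Toy example: one encoder state h_i and one decoder state s_j, both of size 4.
rng = np.random.default_rng(0)
h, s = rng.normal(size=4), rng.normal(size=4)
W = rng.normal(size=(6, 8))       # acts on the concatenation [h; s] (size 8)
W_h, W_s = W[:, :4], W[:, 4:]     # the same W, split into its two blocks
v = rng.normal(size=6)

# W @ [h; s] == W_h @ h + W_s @ s, so both scores agree here.
print(score_concat_first(h, s, W, v), score_linear_first(h, s, W_h, W_s, v))
```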
Best explanation I've ever seen
Amazing explanation 👏 Just one question: what are A and A' in this video? h corresponds to the hidden states of the encoder at different time steps.
Very good explanations, thank you very much.
I have a question: what is the output of attention and how do you measure the loss?
very well explained ... thank you very much
Explained nicely. Thank you.
Beautifully explained
It would be better if you could address QKV with your notation. I'm new to the attention mechanism and I'm getting confused by some of your notations. But the explanation itself is very clear.
Which slides or time step are you referring to?
This is really good ! Thanks !
In the decoder part of the Seq2Seq with attention model, the decoder uses three inputs. At first it uses c0, s0, and x'1 to predict s1; here s0 is the latent representation from the encoder and x'1 is the start sign, so s0 and x'1 are different. In the next step it uses c1, s1, and x'2 to predict s2. Aren't s1 and x'2 the same here? Because s1 is the previous hidden state and x'2 is the predicted word, which is like a result of a probability distribution based on s1. If I am not wrong, it is supposed to use only one of them, or always use s0. Can someone clarify this?
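To make the three decoder inputs concrete, here is a hedged sketch of one decoder update, assuming a simple tanh RNN cell; A_prime, b, and the toy sizes are illustrative assumptions, not the lecture's exact setup. The new state is computed from the current input x'_t, the previous state s_{t-1}, and the previous context vector c_{t-1} together.

```python
import numpy as np

def decoder_step(x_t, s_prev, c_prev, A_prime, b):
    # One update of the decoder state: all three inputs are concatenated
    # and pushed through a single tanh RNN cell with parameters A_prime, b.
    z = np.concatenate([x_t, s_prev, c_prev])
    return np.tanh(A_prime @ z + b)

# Toy sizes: word embedding 3, decoder state 4, context vector 4.
rng = np.random.default_rng(1)
x1_prime, s0, c0 = rng.normal(size=3), rng.normal(size=4), rng.normal(size=4)
A_prime = rng.normal(size=(4, 3 + 4 + 4))
b = np.zeros(4)

s1 = decoder_step(x1_prime, s0, c0, A_prime, b)  # first decoder state
```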
Best attention video!
You explained it very clearly! Thank you!
Very well explained!
Hi @ShusenWangEng, which template did you use to create these slides?
I searched but could not find any slide template like this on Overleaf. Thanks.
At 19:26, shouldn't the number of weights be m*t+1, or am I getting it wrong? Because we have c0 as well.
I was looking for a way to implement the encoder-decoder with attention model without using the for loop at the decoder stage. Is it possible?
So good! Thanks
Excellent video
You mention at 11:19 that x'1 is the start sign; later (15:22) you mention x'2 as obtained in the previous step, but how? You show clearly how to obtain s1 and c1, but not x'2.
I am also confused about that. Based on my intuition, using s0, c0, and x'1, the hidden state s1 generated at the decoder is used to produce a probability distribution over the vocabulary of possible output tokens, and the likely outcome is x'2. Then x'2 is used together with s1 and c1 to generate s2. My confusion is that x'2 and s1 carry the same information, since x'2 is generated from s1, so I don't see any reason to use both of them.
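One way to see why the two inputs are not interchangeable is to sketch how x'_2 would typically be produced from s_1; the output matrix W_out and embedding table E below are my own hypothetical names. s_1 is a full continuous state vector, while x'_2 is only the embedding of the single word picked from the distribution, so most of s_1's information is discarded before x'_2 is formed.

```python
import numpy as np

def next_input_token(s_t, W_out, E):
    # Project the decoder state to vocabulary scores, pick the most likely
    # word, and return that single word's embedding as the next input x'.
    logits = W_out @ s_t
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # probability over the vocabulary
    word_id = int(np.argmax(probs))           # a single discrete choice
    return E[word_id]

# Toy sizes: decoder state 4, vocabulary of 10 words, embedding size 3.
rng = np.random.default_rng(2)
s1 = rng.normal(size=4)
W_out = rng.normal(size=(10, 4))
E = rng.normal(size=(10, 3))

x2_prime = next_input_token(s1, W_out, E)     # fed into the next step alongside s1 and c1
```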
Very clear lecture. Thank you!
It is not clear to me what the vector v, used for the inner product with tanh of W applied to [h_i, s0], corresponds to.
You can review his previous slides about the basics of RNNs. I guess it is the learnable parameter matrix connecting the inputs to the hidden states.
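For reference, here is a minimal sketch of that align step as the comment above describes it, under the usual reading that both W and v are trainable parameters of the attention itself; the toy dimensions are mine. Each score is v · tanh(W [h_i; s0]), the scores are normalized with softmax, and the context vector is the resulting weighted average of the encoder states.

```python
import numpy as np

def attention_weights(H, s0, W, v):
    # score_i = v . tanh(W [h_i; s0]); alpha = softmax over the scores;
    # c0 = weighted average of the encoder states h_i.
    scores = np.array([v @ np.tanh(W @ np.concatenate([h_i, s0])) for h_i in H])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # the m weights sum to 1
    c0 = alpha @ H
    return alpha, c0

# Toy example: m = 5 encoder states of size 4, decoder state s0 of size 4.
rng = np.random.default_rng(3)
H = rng.normal(size=(5, 4))
s0 = rng.normal(size=4)
W = rng.normal(size=(6, 8))                   # maps [h_i; s0] (size 8) to size 6
v = rng.normal(size=6)

alpha, c0 = attention_weights(H, s0, W, v)
```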
What is x'? And where do you get x'1 from?
x'1 is the start sign (like an empty space), x'2 is the first word of the decoder, x'3 the second, and so on.
The video image quality is too poor; you need to fix it.
you need to check your internet quality
How come at about 7:27 s0 is suddenly a vector? In the previous slide you state that s0 = h_m. Useless video...
I am very enlightened. Thank you.