Superb explanation. This is the clearest explanation of the concept of weights in self-attention I have ever heard. Thank you so much.
A first-class explanation of self-attention, the best on YouTube.
The intuition buildup was amazing; you clearly explained why we need learnable parameters in the first place and how they can help relate similar words. Thanks for the explanation.
Best explanation of self-attention I've seen so far. This is gold.
This is a brilliant explanation of self-attention! Thank you.
Thank you for sharing.
Thanks for putting these videos together!
Best explanation ever :) Thank you
Grateful forever.
Thank you for the great explanation. I still don't understand how to obtain W_Q and W_K.
Thank you for your explanation. I just didn't understand: how do we choose W_K and W_Q?
These matrices contain learnable parameters: they are not chosen by hand but trained with standard deep-learning techniques (backpropagation and gradient descent), just like the other weights in the network.
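To make that concrete, here is a minimal sketch (PyTorch assumed; the sizes, dummy inputs, and placeholder loss are all illustrative, not taken from the video) of W_Q and W_K defined as trainable matrices that receive gradients through the attention scores and can therefore be updated by ordinary backpropagation:

```python
import torch
import torch.nn as nn

d_model, d_k = 8, 4                         # illustrative sizes
W_Q = nn.Linear(d_model, d_k, bias=False)   # trainable query projection
W_K = nn.Linear(d_model, d_k, bias=False)   # trainable key projection

x = torch.randn(5, d_model)                 # five dummy token embeddings
q, k = W_Q(x), W_K(x)                       # queries and keys
scores = q @ k.T / d_k ** 0.5               # scaled dot-product scores
attn = scores.softmax(dim=-1)               # attention weights

v = torch.randn(5, d_model)                 # dummy values (not trained here)
out = attn @ v                              # attention output
loss = out.pow(2).mean()                    # placeholder loss, only to show the update
loss.backward()                             # gradients flow back into W_Q and W_K
print(W_Q.weight.grad.shape)                # torch.Size([4, 8]); an optimizer step
                                            # (SGD/Adam) would now update the matrix
```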
Thank you
Thank you
Thank you !
Great and clear explanation. One question about W_Q and W_K. Since z1 = k1^T * q3 = x1^T * (W_K^T * W_Q) * x3, and W_K and W_Q are trainable matrices, could we just combine them into a single matrix W_KQ = W_K^T * W_Q to reduce the number of parameters?
What you are suggesting should be possible as long as the matrices are square.
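For anyone curious, here is a small NumPy check (toy sizes; the name W_KQ simply follows the question above) that folding W_K^T * W_Q into a single matrix gives identical scores, together with the parameter counts that determine whether the merge actually saves anything:

```python
import numpy as np

d_model, d_k = 8, 2                          # toy sizes with d_k < d_model
rng = np.random.default_rng(0)
W_Q = rng.standard_normal((d_k, d_model))    # query projection
W_K = rng.standard_normal((d_k, d_model))    # key projection
x1 = rng.standard_normal(d_model)            # embedding of word 1
x3 = rng.standard_normal(d_model)            # embedding of word 3

# Factored form: z1 = k1^T q3 = (W_K x1)^T (W_Q x3)
z_factored = (W_K @ x1) @ (W_Q @ x3)

# Combined form: z1 = x1^T (W_K^T W_Q) x3
W_KQ = W_K.T @ W_Q
z_combined = x1 @ W_KQ @ x3

print(np.isclose(z_factored, z_combined))    # True: both forms give the same score
print(2 * d_k * d_model, d_model * d_model)  # 32 vs 64 parameters in this toy case
# With square matrices (d_k == d_model) the merge would halve the parameter
# count, as the reply above suggests. With the rectangular shapes typically
# used (d_k much smaller than d_model), the two separate matrices form a
# low-rank factorization of W_KQ and already use fewer parameters, which is
# one reason they are kept separate in practice.
```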
Pretty bad example. Even if we have trainable W_Q and W_K, what if there were a new sentence where we had Tom and he? W_Q would still make word 9 point to Emma and she.