I've been closely following the Transformer playlist, which has greatly helped in my comprehension of the Transformer Architecture. Your excellent work is evident, and I can truly appreciate the dedication you've shown in simplifying complex concepts. Your approach of deconstructing intricate ideas into manageable steps is truly praiseworthy. I also find it highly valuable how you begin each video with an overview of the entire architecture and contextualize the current steps within it. Your efforts are genuinely commendable, and I'm sincerely grateful for your contributions. Thank you.
Your drawing skill is actually amazing!
mind BLOWING..lucky enough to find your lectures
Man you're a pure treasure! Keep up this outstanding work! 🙏🏼
Thanks so much!
You are really great at articulation, Thank you😇
Best drawing to explain this concept 👏🏼👏🏼👏🏼
Truly amazing video. I have read the original paper, but this video definitely helped me understand it better, especially the way you visualize the whole architecture.
Glad you saw it that way! More to come!
7:00 I feel as though the implementations that just repeat the Q K V matrices are making a mistake, mostly because the purpose of multihead attention is to learn different attentions right? In the attention blocks the linear layers / learnable parameters are at the beginning for each Q K V, then one big one after the heads are concatenated, so without the individual ones at the beginning (I’m assuming each initialized to random values) I believe the multiple heads would be useless. Thoughts or corrections?
Ohhh, I just continued on to them getting divided by the number of heads. I thought the heads each worked with the whole matrices.
I’m more confused now but I think in a good way because I’m a bit closer to understanding
I’m even more confused because I’m realizing that in encoder-decoder attention, Q comes from the decoder and K, V from the encoder, but I feel like it would make more sense for Q to come from the encoder and K and V to come from the decoder… because in the English-French example, it would be like asking, "What’s this English sentence in French?", then checking the compatibility of the English tokens with the French tokens, then multiplying these compatibilities with the French tokens for the output.
Also, I still feel like dividing the tokens into pieces would be an unnecessary setback.
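On the multi-head question above, here is a minimal NumPy sketch. It assumes the common implementation style where one learned projection of width d_model is applied first and the result is then split into heads. The sizes (512, 8 heads) and the random `W_q` are illustrative stand-ins, not the video's code. The point is that each head ends up with its own block of learned weights, so the heads are not copies of one another.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, num_heads = 4, 512, 8
head_dim = d_model // num_heads            # 64 features per head

x = rng.normal(size=(T, d_model))          # T token vectors
W_q = rng.normal(size=(d_model, d_model))  # stand-in for a learned projection

q = x @ W_q                                # (T, d_model)
# Split the feature axis into heads: (num_heads, T, head_dim)
q_heads = q.reshape(T, num_heads, head_dim).transpose(1, 0, 2)

# Head h is the input projected by its own column block of W_q.
h = 3
own_block = W_q[:, h * head_dim:(h + 1) * head_dim]
same = np.allclose(q_heads[h], x @ own_block)
```

So splitting after the projection is equivalent to giving every head a separate (d_model x head_dim) projection matrix, which is why the heads can learn different attention patterns.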
Can you explain, in another video, examples of the Q, K, V vectors? It is still confusing to me what they represent.
Excellent video @CodeEmporium 👏 One question: at minute 20:00 you say that we don’t need the look-ahead mask for the cross-attention layer at inference time, but this is valid during training too, right?
In this part, Q is generated only from the target-language words and K only from the source-language words, so we don't have to worry that the Q of word i in the target sentence will be multiplied with the K of word i+1 in the target sentence. Thus no mask is needed.
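The difference between the two decoder attention layers can be sketched as follows. This is a NumPy illustration with made-up sizes and random vectors, not the video's implementation. The look-ahead mask only makes sense for the square (target x target) score matrix. The cross-attention scores are (target x source), so there is no "future target word" in them to hide.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; subtracting the row max keeps exp() stable.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Td, Te, d_k = 3, 5, 64                 # target length, source length, head size
q = rng.normal(size=(Td, d_k))         # queries from the decoder
k_dec = rng.normal(size=(Td, d_k))     # keys from the decoder (self-attention)
k_enc = rng.normal(size=(Te, d_k))     # keys from the encoder (cross-attention)

# Decoder self-attention: -inf above the diagonal hides future target tokens.
look_ahead = np.triu(np.full((Td, Td), -np.inf), k=1)
self_weights = softmax(q @ k_dec.T / np.sqrt(d_k) + look_ahead)

# Cross-attention: every target position may look at every source position.
cross_weights = softmax(q @ k_enc.T / np.sqrt(d_k))
```

A padding mask may still be applied in cross-attention when source sentences are padded to a fixed length. That is a separate concern from the look-ahead mask.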
Will you make a video on transformers using a vision transformer + transformer decoder for image captioning?
I shall get to this at some point. For the next series of videos, I am going to go through the history of language models so we truly understand why transformers and ChatGPT have the architectures they do. Once this series is complete, I will take this on or some other topics. Whatever videos come out, I am sure they will be helpful and fun :)
Great video !!! Clear explanation about dimensions and the whole process.
Thanks so much!
Illustrating your explanations with code actually provides much deeper insights. Thanks, man! Quick note on this video: I was wondering why you haven't included the "output embeddings" in your sketch of the decoder?
Thanks for commenting and the kind words. For the sketch, the point was to focus on the architecture. But I do hope all of the later videos in this series clear up what the output of the decoder looks like. Currently releasing the training code next week too. So hope that helps even more
One thing I don't understand is that at 20:35, the matrix obtained by multiplying the cross-attention matrix, derived from the encoder, with the v matrix is said to represent one English word per row. But the q part of the cross-attention matrix comes from the Kannada sentences in the masked attention, so shouldn't each row of the resulting matrix correspond to a Kannada word?
The resulting matrix from (qT.k) is just the attention matrix, and when this matrix is multiplied with v we get the final representation, which is attentive to both the encoder output (English sentence) and the decoder output (Kannada sentence), hence the name cross-attention.
@@AshishBangwal I totally agree with your thought on where the name “cross attention” comes from. Yet my point here is that since q is derived from the decoder earlier on, its dimension should be (max Kannada token length) * 64. Then the number of rows of the resulting matrix, after multiplying the attention matrix with v derived from the encoder, ought to equal the max Kannada token length. Hence each row of this matrix should stand for a Kannada token instead of an English one.
@@jackwoo5288 Apologies for not getting your point the first time.😅
I am not sure if I'm correct, but I have written an explanation (what I understood) below. PS: I wrote it in matrix form, but I think it's similar for vectors.
If I understood correctly, let the output of the encoder be (512 x Te), so the K (key matrix) and V (value matrix) for the cross-attention step will be of the same dimension (512 x Te), and our decoder output after masked multi-head attention will be (512 x Td), so the Q (query matrix) will be (512 x Td).
So if we go with the formula (Q^T.K) we will get the attention matrix as (Td x Te), and when we then multiply it with V^T we will get (Td x 512).
It sounds more confusing when I read it again, lol. But I guess that's the fun part.
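The shape bookkeeping in this exchange can be checked with a small NumPy sketch. It uses the row-per-token convention (each row is one token), which is the transpose of the (512 x T) layout used in the comment above. The learned Q/K/V projections are left out for brevity, and the sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
Te, Td, d_k = 7, 4, 64                   # 7 source tokens, 4 target tokens

enc_states = rng.normal(size=(Te, d_k))  # one row per source (English) token
dec_states = rng.normal(size=(Td, d_k))  # one row per target (Kannada) token

Q = dec_states                           # queries come from the decoder
K = enc_states                           # keys come from the encoder
V = enc_states                           # values come from the encoder

scores = Q @ K.T / np.sqrt(d_k)          # (Td, Te)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                        # (Td, d_k)
```

Under these assumptions each row of `out` lines up with a target position and is a weighted mix of source value vectors. The source and target lengths (7 and 4 here) do not need to match.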
While we are yet to translate the sentence to Kannada, how can we pass it to the decoder?
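One common answer to this question is teacher forcing. During training the reference translation is already part of the dataset, so it is fed to the decoder shifted right by one position. A minimal sketch, assuming hypothetical token ids and made-up `START`/`END` markers:

```python
START, END = 0, 1   # hypothetical special-token ids

def teacher_forcing_pair(target_ids):
    """Build the decoder input and the labels from a known target sentence."""
    decoder_input = [START] + target_ids   # what the decoder sees
    labels = target_ids + [END]            # what it should predict at each position
    return decoder_input, labels

decoder_input, labels = teacher_forcing_pair([105, 106, 107])
# decoder_input is [0, 105, 106, 107]; labels is [105, 106, 107, 1]
```

At position i the decoder sees tokens up to `decoder_input[i]` (the look-ahead mask hides the rest) and is trained to predict `labels[i]`. At inference time there is no reference translation, and the decoder is fed its own previous outputs instead.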
Great work indeed. It helped clear up a lot of things, especially the part where softmax is used for the decoder output. So the first row will output the first word of the target language. But in scenarios where two source words resonate with one target-language word, how is softmax handled there? Can you please help me figure this out?
Ajay, can you provide the link to the architecture diagram that you are using for the explanation? It would be of great help.
At the end of the decoder block, isn't there supposed to be another "Add & Norm" operation as in the architecture? Did he miss it?
Thank you! Your video taught me a lot.
When the data is fed through the network N times (21:45), does each pass through the network use the same weights or is a different set of weights used for each pass?
Different set of weights. Stacking layers makes the model deeper and lets it track more useful features.
No, data is not fed in at each (vertical) layer. Each layer takes its input from the previous layer (except the first one), and each layer has its own weights.
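The two replies above can be sketched in a few lines of NumPy. A single weight matrix followed by `tanh` stands in for a full encoder or decoder layer here. That simplification and the small sizes are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d_model = 6, 4, 16   # number of stacked layers, tokens, feature size

# One independently initialized weight matrix per layer.
layer_weights = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(N)]

x = rng.normal(size=(T, d_model))   # input to the first layer
for W in layer_weights:
    x = np.tanh(x @ W)              # the output of one layer is the input to the next
```

The data enters once at the bottom of the stack, and the shape (T, d_model) is preserved from layer to layer, which is what allows N layers to be chained.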
Great work Ajay. Can you share the link to the diagram you showed in the video?
Thank you for all the videos about the transformer. Although I understood the architecture, I still don't know what to set for the input of the decoder (embedded target) and the mask in the TEST phase.
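For the test phase, a commonly used approach is autoregressive (greedy) decoding. The decoder input starts as just a start token and grows by one predicted token per step. The sketch below uses a hypothetical `toy_next_token` function as a stand-in for the real encoder, decoder, and argmax over the softmax. The token ids, `START`, `END`, and `MAX_LEN` are made up for illustration.

```python
START, END = 0, 1   # hypothetical special-token ids
MAX_LEN = 10        # safety limit on the generated length

def toy_next_token(source_ids, generated_ids):
    """Hypothetical stand-in for: run the model, take argmax of the last row."""
    position = len(generated_ids) - 1       # tokens produced so far, excluding START
    if position < len(source_ids):
        return source_ids[position] + 100   # pretend translation of the aligned token
    return END

def greedy_decode(source_ids):
    generated = [START]                     # decoder input at the first step
    while len(generated) < MAX_LEN:
        next_id = toy_next_token(source_ids, generated)
        generated.append(next_id)           # feed the prediction back in
        if next_id == END:
            break
    return generated

result = greedy_decode([5, 6, 7])
# result is [0, 105, 106, 107, 1]
```

As for the mask, many implementations simply keep the same look-ahead mask as in training (plus a padding mask if the input is padded to a fixed length), since only the tokens generated so far exist at each step.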
Dude, you resemble Ryan from The Office! Btw great explanation. Thanks for posting such wonderful content.
This is Awesome!!!!
thank you so much for the video!!!!!!
Thank you brother, good job
By the way, when you do the dot product between q and K^T, it won't directly refer to cosine similarity (closer vectors mean more related terms) unless the magnitudes are normalized. But clearly, the vectors are not of the same magnitude, so how is the dot product a metric of similarity?
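The distinction raised here can be made concrete with a tiny NumPy example. The vectors are made up. The sketch only shows that the dot product mixes direction and magnitude, while cosine similarity looks at direction alone.

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by both magnitudes.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0])
k_same_direction = np.array([3.0, 0.0])   # same direction as q, larger magnitude
k_other = np.array([0.6, 0.8])            # unit length, different direction

dot_same = float(q @ k_same_direction)    # grows with the magnitude of k
dot_other = float(q @ k_other)
cos_same = cosine(q, k_same_direction)    # unaffected by the magnitude of k
cos_other = cosine(q, k_other)
```

In attention the dot product is therefore best read as an unnormalized compatibility score rather than a cosine similarity. Since q and k come from learned projections, the model is free to use magnitude as part of that score, and the softmax then turns each row of scores into weights. Dividing by sqrt(d_k), as in the original paper, rescales the scores but does not normalize the individual vectors.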
Excellent again, thank you!!
Thanks that is very clear!
Great work! It is really great that you can draw such a complex diagram. Can you share which software you're using to draw it?
Multi cross attention will only work if the seq length for the encoder and decoder is the same. But what if it isn't?
Just project q, k, v into a multipliable space with some learnable matrix.
Bravo, the best explanation I've ever seen. Bro, could you explain the implementation of CNN + Swin Transformer?
Can you please share this image?
Is it possible to get a pdf of the big diagram?
I was trying to do exactly that. I might create a separate cleaner diagram and circulate that as a PDF with the complete transformer architecture
@@CodeEmporium May I know the software that you used to design the diagram?
Thanks in advance,
It’s a whiteboarding tool called “Explain Everything”
You are great
This is really a great video, thanks, man! Can you share the pdfs for the diagram also?
Thanks so much. Yea. I can’t seem to export this as a diagram from the whiteboard software. So I am planning to sketch both the combined encoder and decoder out together and have it accessible as like a PDF. It should be done in the coming weeks
@@CodeEmporium That's great, this would be immensely helpful. Thanks again
@@CodeEmporium I can’t wait for the diagrams to be uploaded. I really need them: I planned to draw them in PowerPoint, but I was limited by the slide size (it can’t fit in one place). Then I tried to copy them by drawing in Explain Everything, but it took up my time looking back and forth at the video to see the diagram. So I decided to check whether you had uploaded them. If you plan to do so, please do at your earliest convenience, with my appreciation in advance for your tremendous work.
You have missed the concept of teacher forcing during training
The only question I can ask you, brother, is this: beyond understanding these concepts, are we going to be the next Elon Musk and create a GPT-4 or GPT-infinity? We are normal people who are going to use this technology, maybe to earn money, or to earn a PhD and then earn money. It's good to understand these things, like we understand mechanics to understand the world around us, but space science and quantum mechanics are a different matter that only a few will venture into. I do not say this is not important. The only thing I say is that your content is very good and unique, but again it helps people who are working at the academic level, or maybe a PhD. Your way of deriving neural networks and machine learning algorithms mathematically is great, but the matrix calculus (I tried your stuff on matrix calculus org) is a bit hard for me. I don't say you're not good at that, but matrix calculus, which is more important, has hardly any computer algebra software code to help with it. My only advice: you are working too hard to make these videos, yet you have a class of tutoring that should be given at Oxford or Harvard. I am really proud to say you are an Indian, especially a South Indian. By this time you should have reached a million subscribers because of your genuine thoughts and quality content, but people are judging you differently. Please don't take this as advice; please run with the folks and you will be earning in millions. I have told everyone about you, but they still don't have the knowledge to admire you, for intellectual admiration requires not only a brain but an intellectual brain. The only thing I have to tell you is that you are the best book in the library, unnoticed. I know the quality of the best books; not all books reveal the big picture. You have master-class tutoring techniques which I have only seen in costly intellectual books. I don't know where you studied, or whether you are a teacher, lecturer or professor, but you will surely be rewarded for your selfless, pure-hearted efforts. I have told everyone about you, but in a world of fake gurus the true ones never blow their own trumpet. Your content is equivalent to a PhD ❤