First thanks for this amazing video, but I have one question regarding the implementation of Self Attention. To distribute values, keys and queries to heads you just did a reshape for the input, while the original paper suggested to do projection using trainable matrices. Am I right or I missed up something?
@@alhasanalkhaddour434 yes i think he did the projection already using self.values, self,keys, self.queries cause these are linear layers . the real inputs comes from the parameters passed to forward function see 14.43 for more details
@@AladdinPerssonHi! You missed one error in your video. In your GitHub code, you have `self.values = nn.Linear(embed_size, embed_size)`, but in your video, you used `self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)`. I couldn't reproduce your results until I noticed this discrepancy.
This is undoubtedly one of the best transformer implementation videos I have seen. Thanks for posting such good content. Looking forward to seeing some more paper implementation videos.
In the original paper each head should have seperate weights, but in your code all heads share the same weights. here are two steps to fix it: 1. in __init__: self.queries = nn.Linear(self.embed_size, self.embed_size, bias=False) (same for key and value weights) 2. in forward: put "queries = self.queries(queries)" above "queries = queries.reshape(...)" (also same for keys and values) Great video btw
Hey, thank you so much for bringing this to my attention ;) When reading through the paper I get the same idea that you do, namely that each head should have separate weights, and when reading blog posts like "The Annotated Transformer" he has done exactly what you describe. From the blog post www.peterbloem.nl/blog/transformers he explains narrow vs wide self attention and in his Github implementation he does similarly as I do, however I noticed now that an issue has been raised regarding the same issue you bring up: github.com/pbloem/former/issues/13. And I agree with the point brought up there also, if each head is using same weights it doesn't feel like you can say they are different. I'm having difficulty finding other implementations, but I will keep a close look at this and if I get some more time I will try to spend more time and investigate this. I'm also a bit surprised that when training on this implementation it provides good results if I remember correctly with only 3x32x32 vs 3x256x256 parameters.
@@AladdinPersson Yes both methods should work just fine, but I believe using seperate weights for each head would give better performance, without slowing down the model. it would use more memory of course, but it's almost nothing compared to number of parameters in the feedforward sublayers.
@@sehbanomer8151 I think your inplementation may still have some issues. Since each head shoud have seperate weights, shouldn't there be eight(number of heads) different head_dim*head_dim linear layers instead of one embed_size*embed_size linear layer. Additionally, these two implementations have different number of parameters.
@@66dp97 the key, query & value projection of each head will project an _embed_dim_ dimentional vector to _head_dim_ dimentional space, so for each attention head, the projection matrix will have shape (head_dim, embed_dim). Fusing _n_heads_ seperate linear layers into a single (embed_dim, head_dim * n_heads) linear layer is more GPU friendly, thus faster.
This is cool. It would be helpful to have a section highlighting what parts of the dimensions should be changed if you are using a dataset of a different size or you want to change the input length. ie: keeping the architecture constant but noting how it could be used flexibly
for the dropout in your codes, for example DecoderBlock forward, I think it should be: query = self.norm(x + self.dropout(attention)) instead: query = self.dropout(self.norm(attention + x)) Here is the paper quote: "We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized."
Many thanks to you for this impressive tutorial, amazing job and outstanding explanation, and also thanks for sharing all these resources in the description.
Hey Sahil, I definitely need a refresher and go through transformers again, so I'm not sure if I will be able to give you the best answer right now. So from what I recall the most important part of the masking with regards to padding is that we make sure these are not backpropagated through. We want the network weights and embeddings etc not to learn to be associated with the padded values, and that's what we are trying to accomplish with setting it to -infinity since gradient of softmax will then be 0.
@@AladdinPersson Yeah I get the reason why we do it and the -inf setting. I had doubts with the padding that we use, I feel we need more padding to take care of the cases where both sentences are padded and then we have attention over them. I feel I have made it pretty clear in the comment above.
@@SahilKhose it doesnt matter the padding parts got number in final output , beacuse all the paprameter used to caculate it won't have gradient from it becuse deravate of sofmax is 0 for them
Love your work!! I was very confused when dealing with other tutorials... but your work made me clear about Transformer. I wish only I know you and your work.
Very good tutorial! Just one thing though: this is not how multihead attention is implemented in the original attention is all you need paper. In the paper the input is not split into h smaller vectors, but linearly transformed h times. So their wouldn't be reshape and then linear(head_dim, head_dim) but rather linear(embed_size, head_dim) in each head. Also you can have more heads than heads*head_dim = embed_size. This is because in the paper you would transform your concatenated head-outputs again with a jointly trained matrix WO (concatenation size x embed_size)
Ya,.. agreed,.. this was an extremely difficult architecture to implement,. with .a LOT of moving parts,.. but this has to be the best walkthrough out there,.. sure, there are certain things like the src_mask unsqueeze that were a little tricky to visualize,.. but even barring that, you broke it down quite well! Thank you for this!. I'm so glad that we have all of this implemented in HF/PT hahah
I found this very helpful. I always used to get confused regarding the tensor sizes. Now it's all clear. Thank you very much. Also this is the first time I came across einsum. Thanks again for that too.
It's an extremely useful video for researches trying to implement paper codes. Do make a series implementing other Machine Learning codes described in other papers as well. Please make a video to use this model on an actual NLP task such as translation, etc.
Thank you for saying that I really appreciate it. I have made one other video on transformers for machine translation, and I will do my best to continue making videos and to cover more advanced topics! :)
@@flamingflamingo4021 Yeah for sure: th-cam.com/video/M6adRGJe5cQ/w-d-xo.html It's the last video of an inofficial serie of building Seq2Seq models for the task of machine translation. First video was normal seq2seq, second video was seq2seq+attention and the last video that I linked above is using transformers. These videos were inspired a lot by Bentrevett on Github and I recommend you check him out also if you're interested in NLP :)
excellent video and thank u for sharing this. I have one point about implementation, in "SelfAttention" class for query, value and key matrices (linear layer) you used (head_dim, head_dim) dimension. so these matrices will be shared in all heads. I think it's better to use (embed_dim, embed_dim) matrix to map input to q, k, v vectors and reshape it to have head dimension.
Hi Thanks for the superb video.. I have one doubt regarding selfAttention block: self.values = nn.Linear(self.head_dim, self.head_dim, bias=False) self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False) self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False) " Don't you think that V,K,Q should be computed on the whole embedding vectors and later we should divide based on heads.?
First of all, thank you for the video. The most valuable thing I learned from it is how to create a so complex model step by step from the flow chart. Next, I will find out weither this self-attention model can be used in environmental pollution problems.
Hi everyone! I finished following the this tutorial to the end... But now I am confused on how to "train" and "test/predict" this model? Any help is appreciated! Thanks!
Amazing tutorial! The concept of Transformers became much clearer after following this, although there is one thing I had doubt in. In the part where you implement positional embeddings, there is no implementation of sin and cos functions as given in the paper. Why is that? Thank you again for the tutorial!
@@junweilu4990 In the code, he did use nn.Embedding to replace the original method. But I do not know the logic behind this approach as this does not "truly" give positional features, does it?
@@junweilu4990 I also expected him to use word2vec or some pre-fixed embedding algorithm. I think by using nn.Embedding() he tries to learn word vectors and positional vectors, while training his attention algoritm, encoders and decoders. Very wasteful, I believe. I think the rest of his algorithm will converge only after his word and positional vectors converges, if they ever...
24:41 and other einsum parts... why do we want these particular shapes? dont rush, please take some time to explain all important details in the video :)
It is a great video. I have some confusions understanding it fully. In SelfAttention class you created Linear layers self.values self.keys and self.queries but you didn't use those layers.
This is the best tutorial on Transformers online. I was able to understand the nuts and bolts of it. Kudos to you!! It will be great if you can cover Graph Convolutional Networks from scratch
Excellent! Thank you so much for this! Had a small request, can you please come up with videos on BERT and controlled text generation models like PPLM? Thanks again!
I have viewed your code in github, and I',m wondering why you use the same weight matrix for different heads, isn't will this make the Q,V,K vector for different heads to be the same?
Damn, I'm suffering from the same question. I am confused as to why, here, the kqv projections for the different heads seem to be shared. It seems like we should use nn.Linear(embed_dim, embed_dim), and later divide it into different heads?
I want to use the encoder of transformer for video classification, where each frame of the video will be first passed through a pretrained cnn and the output of this would act as an embedding and then passed as an input tor encoder. Any suggestions on how to do that?
Thank you for your video. You did a great job! I was wondering how to train a transformer if the input form is (batch_size, sequence_length, number_of_features). Let's say number_of_features = 2 (it could be X and Y coordinates in time, for example). What impact does this type of input have on positional encoding, the masking strategy and the attention mechanism?
In your implementation you split the keys, queries and values into heads before passing them through the linear layers, in other implementations however, the keys, queries and values are split after being passed through the linear layers. The paper follows your implementation; according to what I can deduce from the picture of the transformer. Are these two approaches similar or is one better than the other?
That's a great observation and one that I've wondered about and confused over myself, from my understanding the way I did it is not entirely accurate and we should split after the linear layers. Otherwise it doesn't make intuitive sense to me that we have "different heads" if they all share parameters. With that said, the way I did it works and uses a lot fewer parameters but I would use the way you've seen others do it (I haven't done extensive tests to see how much better it performs, let me know if you ever test this).
Thank you the video was very helpful. In the end we got output of dim (2,7,10). So why did we got the probabilities of the next 7 words? And why is the output len dependent on the number of words we feed to the decoder?
i don't know if there are still people who are watching this but i have a question at code level, in the decoder method "forward" when i pass the parameters to layer i had as the fifth one "target_mask" but in the decoder block you decided to put the parameter "device". Did i miss something, is it just an error or there is an other explanation? Thanks a lot
Thank you for your toturial, I have a question : you said in encoder all value, key and query are the same. as the paper said value, query and key are just the same in size not in element. can u plz explain it a little more?
the last question please what is intuitive meaning for the source and target inputs of transformer why model takes x, trg[:, :-1] model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device) out = model(x, trg[:, :-1]) what we could get from out? I tried model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device) out = model(x, trg) and got torch.Size([2, 8, 10]) to be honest I could not interpret that :(
Okay so I understand your question correctly: Your doubt is why are we using trg[:, :-1] instead of trg First: trg[:, :-1] this means all the batches(sentences) and entire sentences except the last word in all the sentences. Second: We do this because of how the transformer model is developed to train. Unlike RNNs our transformer model does not predict the entire output sentence, instead it predicts one word at a time. So the decoder takes in (t-1) time step's output of the transformer and then predicts the t time step output word. Hence we provide the entire sentence but the last word so as to predict the last word. Refer to the beautiful video by Yannic Kelcher: th-cam.com/video/iDulhoQ2pro/w-d-xo.html Hope your doubt is solved. Let me know if it's still unclear.
Thanks for your video, it seems like there's an error in your multi-head attention implementation. You didn't use the fully-connected layers of query, key and value. Basically the model cannot be trained to learn the attention map.
if I'm using transformer for a speech recognition task (speech-to-text). after training the model, in prediction what should I place on the target parameter if I have only audio file (not transcibed)?
Hey, i'm a bit confused, in the self attention block (around 15:20) you define value, query and key heads, but they are unused in forward method. If I understand correctly, they should be used after each sequence's reshape?
In 52:37, you write out = self.decoder(trg, enc_src, src_mask, trg_mask) but i think it should be out = self.decoder(trg, enc_src, trg_key_padding_mask, trg_mask) beacuse you don't expilictly give source to decoder, instead you give trg with paddings. This is why i think it should be trg_key_padding_mask rather than src_mask. Am I right? Thanks a lot for such a great video :)
I am not understanding sir, does this input sequence is divided into a number of chunks like here you did 256/8 where 8 is the number of attention heads. I am thinking for the self-attention whole of the input embeddings need to transform into three parts namely Q, K and V. and then we need to divide this for 8 times in the case of 8 multi heads. that's why the name is the multi head. Please clear. Regards
This is the best way to learn, through hands on. Great video! Also may I know which font is used in this video? I noticed that your choice of font is very clean and easy to work with!
Is there a reason, why dropout is added at the input/encoder input? `out = self.dropout( (self.word_embedding(x) + self.position_embedding(positions)) )` Can you please discuss on the `why`?
i've got a question here. In order to generate a target sentence, there should be multiple time steps right? The first output word from Decoder will go through Decoder again to generate the secend output word. i cant find where you difine this in this video. Or maybe i understand it wrong.
During training everything is done in parallel (we have the entire translated target sentence) and we utilize these target masks that I talked about in the video. This is a major difference between transformer and normal Seq2Seq, where we actually send in the entire target sentence rather than word by word. When we evaluate the model you're completely right that we need to do multiple time steps (one word at a time) but this is not the case during training. In this video we kind of just do the transformer from scratch, the question you're asking is more related to actually training & evaluating transformer models. I'll try to see if I find code for what you're asking for. So here is a full code example of using transformers (also have a separate video on it): github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py When we actually evaluate the model we need to do it time step by time step and it would look like this (translate sentence function) and I believe THIS is what you're asking for: github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/utils.py
Thanks for your educational contribution! Just one question: what are the linear layers self.values, self.keys and self.queries for? These are not used inside the forward pass.
I have a basic conceptual knowledge of AI, without knowing how mathematics and all other AI topics work in it. Also i know a Python, but not a PyTorch. But, I looked at several analyzes of the article "Attention is all you need" and everything is clear there for me. But in this video - I can't figure out which block of code refers to the block in the diagram that he displayed on the right side of the screen I mean that, for example, he writes the Encoder section, but in the diagram on the right - this section consists of several blocks And I don't understand where which code belongs to the block in the diagram And also the rest of the details of code are also unclear to me Question: Should there be any required knowledge to understand this video ?
Here's the outline for the video:
0:00 - Introduction
0:54 - Paper Review
11:20 - Attention Mechanism
27:00 - TransformerBlock
32:18 - Encoder
38:20 - DecoderBlock
42:00 - Decoder
46:55 - Forming The Transformer
52:45 - A Small Example
54:25 - Fixing Errors
56:44 - Ending
First thanks for this amazing video, but I have one question regarding the implementation of Self Attention.
To distribute values, keys and queries to heads you just did a reshape for the input, while the original paper suggested to do projection using trainable matrices.
Am I right or I missed up something?
@@alhasanalkhaddour434 yes i think he did the projection already using self.values, self,keys, self.queries cause these are linear layers . the real inputs comes from the parameters passed to forward function see 14.43 for more details
Why did you use self. Values, self. Keys in the init method bcz they are not used at all in forward
Sorry can you share the github link of this special code?
@@riyajatar6859 Actually he fixed that at the end of the video🧐
Attention is not all we need, this video is all we need
You're too kind :)
haha, :)
best explaination ever
Attention to this video is all you need
Attention was never enough
Not found a tutorial so much detail oriented. Now I am completely able to understand the Transformer and Attention Mechanism.Great Work.Thank you😊
I really appreciate you saying that, thanks a lot :)
@@AladdinPerssonHi! You missed one error in your video. In your GitHub code, you have `self.values = nn.Linear(embed_size, embed_size)`, but in your video, you used `self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)`. I couldn't reproduce your results until I noticed this discrepancy.
I watched 3 Transformer videos before this one and thought I would never understand it. Love the way you explained such a complicated topic.
This is undoubtedly one of the best transformer implementation videos I have seen. Thanks for posting such good content. Looking forward to seeing some more paper implementation videos.
In the original paper each head should have seperate weights, but in your code all heads share the same weights. here are two steps to fix it:
1. in __init__: self.queries = nn.Linear(self.embed_size, self.embed_size, bias=False) (same for key and value weights)
2. in forward: put "queries = self.queries(queries)" above "queries = queries.reshape(...)" (also same for keys and values)
Great video btw
Hey, thank you so much for bringing this to my attention ;) When reading through the paper I get the same idea that you do, namely that each head should have separate weights, and when reading blog posts like "The Annotated Transformer" he has done exactly what you describe. From the blog post www.peterbloem.nl/blog/transformers he explains narrow vs wide self attention and in his Github implementation he does similarly as I do, however I noticed now that an issue has been raised regarding the same issue you bring up: github.com/pbloem/former/issues/13.
And I agree with the point brought up there also, if each head is using same weights it doesn't feel like you can say they are different. I'm having difficulty finding other implementations, but I will keep a close look at this and if I get some more time I will try to spend more time and investigate this. I'm also a bit surprised that when training on this implementation it provides good results if I remember correctly with only 3x32x32 vs 3x256x256 parameters.
@@AladdinPersson Yes both methods should work just fine, but I believe using seperate weights for each head would give better performance, without slowing down the model. it would use more memory of course, but it's almost nothing compared to number of parameters in the feedforward sublayers.
@@sehbanomer8151 I think your inplementation may still have some issues. Since each head shoud have seperate weights, shouldn't there be eight(number of heads) different head_dim*head_dim linear layers instead of one embed_size*embed_size linear layer. Additionally, these two implementations have different number of parameters.
@@66dp97 the key, query & value projection of each head will project an _embed_dim_ dimentional vector to _head_dim_ dimentional space, so for each attention head, the projection matrix will have shape (head_dim, embed_dim). Fusing _n_heads_ seperate linear layers into a single (embed_dim, head_dim * n_heads) linear layer is more GPU friendly, thus faster.
you're an absolute saint. idk if i can even put it into words the amount of respect and appreciation I have for you man! thank you!
Great Tutorial! Thanks Aladdin
for actually training it, what would we do?
@@jushkunjuret4386 Can you specify where you're exactly getting stuck?
I have been struggling to implement and understand custom transformer code from various sources. This was perhaps one of the best tutorials.
hey if you are still woring in NLP or transformers i may need your help please reply
This is cool. It would be helpful to have a section highlighting what parts of the dimensions should be changed if you are using a dataset of a different size or you want to change the input length. ie: keeping the architecture constant but noting how it could be used flexibly
for the dropout in your codes, for example DecoderBlock forward, I think it should be:
query = self.norm(x + self.dropout(attention))
instead:
query = self.dropout(self.norm(attention + x))
Here is the paper quote:
"We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized."
Thanks so much for the great work!
I think you're right, I'll look into this some more soon and update the Github code :)
Thank you very much!
making something sophisticated so easy and clear that's what I call magic. Aladdin, you are truly the magician.
This is one of the best explanation videos about a paper to code I've watched in a loong time! Congratz Aladdin dude!
Many thanks to you for this impressive tutorial, amazing job and outstanding explanation, and also thanks for sharing all these resources in the description.
Hey Aladdin,
Really amazing videos brother!
This was the first video of yours that I stumbled upon and I fell in love with your channel.
Hey Sahil, I definitely need a refresher and go through transformers again, so I'm not sure if I will be able to give you the best answer right now. So from what I recall the most important part of the masking with regards to padding is that we make sure these are not backpropagated through. We want the network weights and embeddings etc not to learn to be associated with the padded values, and that's what we are trying to accomplish with setting it to -infinity since gradient of softmax will then be 0.
@@AladdinPersson Yeah I get the reason why we do it and the -inf setting. I had doubts with the padding that we use, I feel we need more padding to take care of the cases where both sentences are padded and then we have attention over them. I feel I have made it pretty clear in the comment above.
@@SahilKhose it doesnt matter the padding parts got number in final output , beacuse all the paprameter used to caculate it won't have gradient from it becuse deravate of sofmax is 0 for them
45:59: WHOA! slow that down! Pause a sec, be emphatic if we're going to change something back up there
This video is all I needed
Love your work!! I was very confused when dealing with other tutorials... but your work made me clear about Transformer. I wish only I know you and your work.
I appreciate the kind words 🙏
wow, the best transformer tutorial I've seen
Very good tutorial!
Just one thing though: this is not how multihead attention is implemented in the original attention is all you need paper. In the paper the input is not split into h smaller vectors, but linearly transformed h times. So their wouldn't be reshape and then linear(head_dim, head_dim) but rather linear(embed_size, head_dim) in each head.
Also you can have more heads than heads*head_dim = embed_size. This is because in the paper you would transform your concatenated head-outputs again with a jointly trained matrix WO (concatenation size x embed_size)
Ya,.. agreed,.. this was an extremely difficult architecture to implement,. with .a LOT of moving parts,.. but this has to be the best walkthrough out there,.. sure, there are certain things like the src_mask unsqueeze that were a little tricky to visualize,.. but even barring that, you broke it down quite well! Thank you for this!. I'm so glad that we have all of this implemented in HF/PT hahah
It is the best description for transformer implementation.
thank you so much.
best regards.
Happiness is all you need, neither attention, nor transformer.
great explanation, much more helpful than the theoretical only explanations
Really thank you!!! This really helps me deeply understand Transformer!!!
I found this very helpful. I always used to get confused regarding the tensor sizes. Now it's all clear. Thank you very much. Also this is the first time I came across einsum. Thanks again for that too.
Appreciate the kind words 🙏
This was really helpful! Can you also do a tutorial for Decision Transformer for reinforcement learning?
Hi, I really like your channel. I have been learning from your tutorials for a while. Best wishes!
Thanks so much!🎉
It's an extremely useful video for researches trying to implement paper codes. Do make a series implementing other Machine Learning codes described in other papers as well.
Please make a video to use this model on an actual NLP task such as translation, etc.
Thank you for saying that I really appreciate it. I have made one other video on transformers for machine translation, and I will do my best to continue making videos and to cover more advanced topics! :)
@@AladdinPersson I can't seem to find it. Can you please paste the link here, please? I'd truly appreciate it. :)
@@flamingflamingo4021 Yeah for sure: th-cam.com/video/M6adRGJe5cQ/w-d-xo.html
It's the last video of an inofficial serie of building Seq2Seq models for the task of machine translation. First video was normal seq2seq, second video was seq2seq+attention and the last video that I linked above is using transformers. These videos were inspired a lot by Bentrevett on Github and I recommend you check him out also if you're interested in NLP :)
excellent video and thank u for sharing this. I have one point about implementation, in "SelfAttention" class for query, value and key matrices (linear layer) you used (head_dim, head_dim) dimension. so these matrices will be shared in all heads. I think it's better to use (embed_dim, embed_dim) matrix to map input to q, k, v vectors and reshape it to have head dimension.
At 54:02, why do we need to pass the truncated trg[:, :-1] ? I thought we already applied the mask to prevent the model from looking into the future?
Hi Thanks for the superb video.. I have one doubt regarding selfAttention block:
self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
" Don't you think that V,K,Q should be computed on the whole embedding vectors and later we should divide based on heads.?
In the Jay Alammar blog there is no split of the embeddings in order to compute attention for each head.
in SelfAttention, you have not used the linears self.keys, self.values, self.queries in forward method, whats the use of those layers?
Great Explanation: Loved it
First of all, thank you for the video. The most valuable thing I learned from it is how to create a so complex model step by step from the flow chart. Next, I will find out weither this self-attention model can be used in environmental pollution problems.
Hi everyone! I finished following the this tutorial to the end... But now I am confused on how to "train" and "test/predict" this model? Any help is appreciated! Thanks!
The best video I watched on youtube! Why I found you so late!!!
Dude, you rock! I bow to your expertise 🙏😊
One of the best resource on the internet!
You'll be happy to know chat-gpt recommended this video and gave a link to it when I asked for resources explaining transformers
Amazing tutorial! The concept of Transformers became much clearer after following this, although there is one thing I had doubt in. In the part where you implement positional embeddings, there is no implementation of sin and cos functions as given in the paper. Why is that?
Thank you again for the tutorial!
Having the same question too. Did he use "nn.Embedding" to replace the original method ?
@@junweilu4990 In the code, he did use nn.Embedding to replace the original method. But I do not know the logic behind this approach as this does not "truly" give positional features, does it?
@@junweilu4990 I also expected him to use word2vec or some pre-fixed embedding algorithm. I think by using nn.Embedding() he tries to learn word vectors and positional vectors, while training his attention algoritm, encoders and decoders. Very wasteful, I believe. I think the rest of his algorithm will converge only after his word and positional vectors converges, if they ever...
Gonna try this for my uni assignment! Thank you
Great work! Really helped me. Thanks.
Superb...Hats off. Thank you for explanation.
Compliments for the video, really gives better insight into a complex architecture. Thanks for sharing all this information.
24:41 and other einsum parts... why do we want these particular shapes? dont rush, please take some time to explain all important details in the video :)
Thank you very much for the info!
Very detailed and clear! Thank you very much!
Thanks a lot for the kind words🙏
It is a great video. I have some confusions understanding it fully.
In SelfAttention class you created Linear layers self.values self.keys and self.queries but you didn't use those layers.
This is the best tutorial on Transformers online. I was able to understand the nuts and bolts of it. Kudos to you!! It will be great if you can cover Graph Convolutional Networks from scratch
Great Job!! Thanks for the video!
Thank you! great explanation, I just wonder why in the attention mechanism you have to inizialize self.queries, self.keys ecc as Linear layers
From paper, the attention mechanism is fully connected, which means you should use linear layers.
Excellent! Thank you so much for this! Had a small request, can you please come up with videos on BERT and controlled text generation models like PPLM? Thanks again!
Thank you for the comment! I will look into it, got a few videos that I'm planning but will come back to this in the future for sure :)
I have viewed your code in github, and I',m wondering why you use the same weight matrix for different heads, isn't will this make the Q,V,K vector for different heads to be the same?
Damn, I'm suffering from the same question.
I am confused as to why, here, the kqv projections for the different heads seem to be shared. It seems like we should use nn.Linear(embed_dim, embed_dim), and later divide it into different heads?
World's best TH-cam channel evaaaa.
I want to use the encoder of transformer for video classification, where each frame of the video will be first passed through a pretrained cnn and the output of this would act as an embedding and then passed as an input tor encoder. Any suggestions on how to do that?
Thank you for your video. You did a great job! I was wondering how to train a transformer if the input form is (batch_size, sequence_length, number_of_features). Let's say number_of_features = 2 (it could be X and Y coordinates in time, for example). What impact does this type of input have on positional encoding, the masking strategy and the attention mechanism?
I can't seem to understand the necessity for self.keys, self.queries and self.values in the SelfAttention class. Am I missing something?
In your implementation you split the keys, queries and values into heads before passing them through the linear layers, in other implementations however, the keys, queries and values are split after being passed through the linear layers. The paper follows your implementation; according to what I can deduce from the picture of the transformer. Are these two approaches similar or is one better than the other?
That's a great observation and one that I've wondered about and confused over myself, from my understanding the way I did it is not entirely accurate and we should split after the linear layers. Otherwise it doesn't make intuitive sense to me that we have "different heads" if they all share parameters. With that said, the way I did it works and uses a lot fewer parameters but I would use the way you've seen others do it (I haven't done extensive tests to see how much better it performs, let me know if you ever test this).
Great video, advanced my understanding.
Also love your voice btw, feels calming 👍
At 17:57 you forgot to apply linear projections before reshaping.
excellent work mate cleared all my doubts
Thank you the video was very helpful. In the end we got output of dim (2,7,10). So why did we got the probabilities of the next 7 words? And why is the output len dependent on the number of words we feed to the decoder?
Thanks Aladdin. The video helped a lot.
Thx for this amazing tutorial. I think the "energy" (Q * K_transpose) should be divided by the square root of head_dim instead of embedding_size.
Very nice! Congratulations!!
i don't know if there are still people who are watching this but i have a question at code level, in the decoder method "forward" when i pass the parameters to layer i had as the fifth one "target_mask" but in the decoder block you decided to put the parameter "device". Did i miss something, is it just an error or there is an other explanation? Thanks a lot
Thank you for your toturial, I have a question : you said in encoder all value, key and query are the same. as the paper said value, query and key are just the same in size not in element. can u plz explain it a little more?
the last question please what is intuitive meaning for the source and target inputs of transformer why model takes x, trg[:, :-1]
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
out = model(x, trg[:, :-1])
what we could get from out?
I tried
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
out = model(x, trg)
and got
torch.Size([2, 8, 10])
to be honest I could not interpret that :(
Okay so I understand your question correctly:
Your doubt is why are we using trg[:, :-1] instead of trg
First:
trg[:, :-1] this means all the batches(sentences) and entire sentences except the last word in all the sentences.
Second:
We do this because of how the transformer model is developed to train. Unlike RNNs our transformer model does not predict the entire output sentence, instead it predicts one word at a time. So the decoder takes in (t-1) time step's output of the transformer and then predicts the t time step output word. Hence we provide the entire sentence but the last word so as to predict the last word.
Refer to the beautiful video by Yannic Kelcher:
th-cam.com/video/iDulhoQ2pro/w-d-xo.html
Hope your doubt is solved. Let me know if it's still unclear.
Please make a video on using the Temporal Fusion Transformers.
Why do you set bias=False for nn.Linear of keys, values and queries?
Thank you so much!
Thanks for your video, it seems like there's an error in your multi-head attention implementation. You didn't use the fully-connected layers of query, key and value. Basically the model cannot be trained to learn the attention map.
I got confused a bit. What you are sending from the encoder to the decoder, Do they represent queries and keys, or keys and values??
Hi, at 18:45, energy = queries* keys. Are you doing outer product?
if I'm using transformer for a speech recognition task (speech-to-text). after training the model, in prediction what should I place on the target parameter if I have only audio file (not transcibed)?
did you get ur answer?
Hey, i'm a bit confused, in the self attention block (around 15:20) you define value, query and key heads, but they are unused in forward method. If I understand correctly, they should be used after each sequence's reshape?
why do we need the src_mask, can't the model learn by itself that there is some padding here?
Shouldn't it be query_len instead of key_len 17:38 ?
In decoder block in forward function why src_mask passed in transformer?
In 52:37, you write out = self.decoder(trg, enc_src, src_mask, trg_mask) but i think it should be out = self.decoder(trg, enc_src, trg_key_padding_mask, trg_mask) beacuse you don't expilictly give source to decoder, instead you give trg with paddings. This is why i think it should be trg_key_padding_mask rather than src_mask. Am I right? Thanks a lot for such a great video :)
These transformers are even luckier than me. All they get is HEADS!
Exactly what I need
I am not understanding sir, does this input sequence is divided into a number of chunks like here you did 256/8 where 8 is the number of attention heads. I am thinking for the self-attention whole of the input embeddings need to transform into three parts namely Q, K and V. and then we need to divide this for 8 times in the case of 8 multi heads. that's why the name is the multi head. Please clear. Regards
This is the best way to learn, through hands on. Great video! Also may I know which font is used in this video? I noticed that your choice of font is very clean and easy to work with!
I love your toturials
Why is the final output shape [2,7,10]?
Is there a reason, why dropout is added at the input/encoder input?
`out = self.dropout(
(self.word_embedding(x) + self.position_embedding(positions))
)`
Can you please discuss on the `why`?
i've got a question here. In order to generate a target sentence, there should be multiple time steps right?
The first output word from Decoder will go through Decoder again to generate the secend output word.
i cant find where you difine this in this video. Or maybe i understand it wrong.
During training everything is done in parallel (we have the entire translated target sentence) and we utilize these target masks that I talked about in the video. This is a major difference between transformer and normal Seq2Seq, where we actually send in the entire target sentence rather than word by word. When we evaluate the model you're completely right that we need to do multiple time steps (one word at a time) but this is not the case during training. In this video we kind of just do the transformer from scratch, the question you're asking is more related to actually training & evaluating transformer models. I'll try to see if I find code for what you're asking for.
So here is a full code example of using transformers (also have a separate video on it): github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py
When we actually evaluate the model we need to do it time step by time step and it would look like this (translate sentence function) and I believe THIS is what you're asking for: github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/utils.py
@@AladdinPersson Thank you, the link explained it pretty well. Thank you a lot.
Thanks for your educational contribution! Just one question: what are the linear layers self.values, self.keys and self.queries for? These are not used inside the forward pass.
I have a basic conceptual knowledge of AI, without knowing how mathematics and all other AI topics work in it.
Also i know a Python, but not a PyTorch.
But, I looked at several analyzes of the article "Attention is all you need" and everything is clear there for me.
But in this video - I can't figure out which block of code refers to the block in the diagram that he displayed on the right side of the screen
I mean that, for example, he writes the Encoder section, but in the diagram on the right - this section consists of several blocks
And I don't understand where which code belongs to the block in the diagram
And also the rest of the details of code are also unclear to me
Question: Should there be any required knowledge to understand this video ?
Thank you so much!!!
Thanks a lot for the video, this was great an its helping me a lot.
Why are you not using `sin` or `cos` function for positional encoding?
fantastic, awesome videos as ever.
@Aladdin Persson. Thank you. Great lesson. Which IDE and theme you used ?
please make a video on implementation of Video Summarization With Frame Index Vision Transformer