Here's the outline for the video:
0:00 - Introduction
0:54 - Paper Review
11:20 - Attention Mechanism
27:00 - TransformerBlock
32:18 - Encoder
38:20 - DecoderBlock
42:00 - Decoder
46:55 - Forming The Transformer
52:45 - A Small Example
54:25 - Fixing Errors
56:44 - Ending
First thanks for this amazing video, but I have one question regarding the implementation of Self Attention.
To distribute values, keys and queries to heads you just did a reshape for the input, while the original paper suggested to do projection using trainable matrices.
Am I right, or did I miss something?
@@alhasanalkhaddour434 Yes, I think he did the projection already using self.values, self.keys, self.queries, since these are linear layers. The real inputs come from the parameters passed to the forward function; see 14:43 for more details.
Why did you use self.values and self.keys in the init method when they are not used at all in forward?
Sorry, can you share the GitHub link for this specific code?
@@riyajatar6859 Actually he fixed that at the end of the video🧐
Attention is not all we need, this video is all we need
You're too kind :)
haha, :)
best explanation ever
Attention to this video is all you need
Attention was never enough
Not found a tutorial so much detail oriented. Now I am completely able to understand the Transformer and Attention Mechanism.Great Work.Thank you😊
I really appreciate you saying that, thanks a lot :)
@@AladdinPersson Hi! You missed one error in your video. In your GitHub code, you have `self.values = nn.Linear(embed_size, embed_size)`, but in your video, you used `self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)`. I couldn't reproduce your results until I noticed this discrepancy.
I watched 3 Transformer videos before this one and thought I would never understand it. Love the way you explained such a complicated topic.
This is undoubtedly one of the best transformer implementation videos I have seen. Thanks for posting such good content. Looking forward to seeing some more paper implementation videos.
making something sophisticated so easy and clear that's what I call magic. Aladdin, you are truly the magician.
Great Tutorial! Thanks Aladdin
for actually training it, what would we do?
@@jushkunjuret4386 Can you specify where you're exactly getting stuck?
you're an absolute saint. idk if i can even put it into words the amount of respect and appreciation I have for you man! thank you!
In the original paper each head should have separate weights, but in your code all heads share the same weights. Here are two steps to fix it:
1. in __init__: self.queries = nn.Linear(self.embed_size, self.embed_size, bias=False) (same for key and value weights)
2. in forward: put "queries = self.queries(queries)" above "queries = queries.reshape(...)" (also same for keys and values)
Great video btw
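For anyone following along, here is a minimal sketch of the class with those two changes applied. It mirrors the video's SelfAttention structure, but it is an illustration of the suggested fix, not the exact repository code (the scaling by sqrt(head_dim) follows the paper's sqrt(d_k); the video divides by sqrt(embed_size), which another comment below also points out):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.heads = heads
        self.head_dim = embed_size // heads
        # Full-size projections: each head now effectively gets its own
        # slice of the weight matrix instead of sharing one small layer.
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, queries, mask=None):
        N = queries.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], queries.shape[1]

        # Project first, then split into heads.
        values = self.values(values).reshape(N, value_len, self.heads, self.head_dim)
        keys = self.keys(keys).reshape(N, key_len, self.heads, self.head_dim)
        queries = self.queries(queries).reshape(N, query_len, self.heads, self.head_dim)

        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))
        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)  # sqrt(d_k)

        out = torch.einsum("nhqk,nkhd->nqhd", [attention, values])
        out = out.reshape(N, query_len, self.heads * self.head_dim)
        return self.fc_out(out)
```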
Hey, thank you so much for bringing this to my attention ;) When reading through the paper I get the same idea that you do, namely that each head should have separate weights, and when reading blog posts like "The Annotated Transformer" he has done exactly what you describe. In the blog post www.peterbloem.nl/blog/transformers he explains narrow vs. wide self-attention, and in his GitHub implementation he does it similarly to how I do; however, I noticed now that an issue has been raised regarding the same point you bring up: github.com/pbloem/former/issues/13.
And I agree with the point brought up there as well: if each head is using the same weights, it doesn't feel like you can say they are different. I'm having difficulty finding other implementations, but I will keep a close look at this, and if I get some more time I will investigate it further. I'm also a bit surprised that this implementation gives good results when training, if I remember correctly with only 3x32x32 vs 3x256x256 parameters.
@@AladdinPersson Yes, both methods should work just fine, but I believe using separate weights for each head would give better performance without slowing down the model. It would use more memory of course, but it's almost nothing compared to the number of parameters in the feedforward sublayers.
@@sehbanomer8151 I think your implementation may still have some issues. Since each head should have separate weights, shouldn't there be eight (the number of heads) different head_dim*head_dim linear layers instead of one embed_size*embed_size linear layer? Additionally, these two implementations have different numbers of parameters.
@@66dp97 The key, query & value projection of each head projects an _embed_dim_ dimensional vector into a _head_dim_ dimensional space, so for each attention head the projection matrix has shape (head_dim, embed_dim). Fusing _n_heads_ separate linear layers into a single (embed_dim, head_dim * n_heads) linear layer is more GPU friendly, thus faster.
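A tiny self-contained check of that fusion claim (shapes here are illustrative, not taken from the video): stacking the per-head (head_dim, embed_dim) weight matrices gives exactly the single fused linear layer.

```python
import torch
import torch.nn as nn

embed_dim, n_heads = 256, 8
head_dim = embed_dim // n_heads
x = torch.randn(2, 10, embed_dim)  # (batch, seq_len, embed_dim)

# n_heads separate projections, each mapping embed_dim -> head_dim.
per_head = [nn.Linear(embed_dim, head_dim, bias=False) for _ in range(n_heads)]

# One fused projection embed_dim -> head_dim * n_heads, initialized with the
# same per-head weights stacked along the output dimension.
fused = nn.Linear(embed_dim, head_dim * n_heads, bias=False)
with torch.no_grad():
    fused.weight.copy_(torch.cat([h.weight for h in per_head], dim=0))

separate_out = torch.stack([h(x) for h in per_head], dim=2)   # (2, 10, 8, 32)
fused_out = fused(x).reshape(2, 10, n_heads, head_dim)        # (2, 10, 8, 32)
print(torch.allclose(separate_out, fused_out, atol=1e-5))     # True
```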
I have been struggling to implement and understand custom transformer code from various sources. This was perhaps one of the best tutorials.
Hey, if you are still working in NLP or transformers I may need your help, please reply.
This is cool. It would be helpful to have a section highlighting what parts of the dimensions should be changed if you are using a dataset of a different size or you want to change the input length. ie: keeping the architecture constant but noting how it could be used flexibly
Many thanks to you for this impressive tutorial, amazing job and outstanding explanation, and also thanks for sharing all these resources in the description.
This is one of the best explanation videos about a paper to code I've watched in a loong time! Congratz Aladdin dude!
Ya, agreed, this was an extremely difficult architecture to implement, with a LOT of moving parts, but this has to be the best walkthrough out there. Sure, there are certain things like the src_mask unsqueeze that were a little tricky to visualize, but even barring that, you broke it down quite well! Thank you for this! I'm so glad that we have all of this implemented in HF/PT hahah
Hey Aladdin,
Really amazing videos brother!
This was the first video of yours that I stumbled upon and I fell in love with your channel.
Hey Sahil, I definitely need a refresher and to go through transformers again, so I'm not sure if I will be able to give you the best answer right now. From what I recall, the most important part of the masking with regards to padding is making sure the padded positions are not backpropagated through. We want the network weights and embeddings etc. not to learn to be associated with the padded values, and that's what we are trying to accomplish by setting them to -infinity, since the gradient of the softmax will then be 0.
@@AladdinPersson Yeah I get the reason why we do it and the -inf setting. I had doubts with the padding that we use, I feel we need more padding to take care of the cases where both sentences are padded and then we have attention over them. I feel I have made it pretty clear in the comment above.
@@SahilKhose It doesn't matter that the padded positions get numbers in the final output, because the parameters used to calculate them won't receive any gradient from them, since the derivative of the softmax is 0 for those positions.
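A small standalone illustration of what both replies are saying (not the video's code): positions filled with a very negative number get ~0 attention weight after the softmax, and no gradient flows back to them.

```python
import torch

scores = torch.tensor([2.0, 1.0, 3.0, 0.5], requires_grad=True)
values = torch.tensor([10.0, 20.0, 30.0, 40.0])
pad_mask = torch.tensor([1, 1, 0, 0])  # 0 marks padded positions

masked = scores.masked_fill(pad_mask == 0, float("-1e20"))
weights = torch.softmax(masked, dim=0)
out = (weights * values).sum()
out.backward()

print(weights)      # padded positions get ~0 attention weight
print(scores.grad)  # ...and 0 gradient, since masked_fill overwrote their scores
```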
great explanation, much more helpful than the theoretical only explanations
The best video I watched on youtube! Why I found you so late!!!
It is the best description for transformer implementation.
thank you so much.
best regards.
World's best TH-cam channel evaaaa.
Love your work!! I was very confused when dealing with other tutorials... but your work made Transformers clear to me. I only wish I had known about you and your work sooner.
I appreciate the kind words 🙏
First of all, thank you for the video. The most valuable thing I learned from it is how to build such a complex model step by step from the flow chart. Next, I will find out whether this self-attention model can be used for environmental pollution problems.
wow, the best transformer tutorial I've seen
I found this very helpful. I always used to get confused regarding the tensor sizes. Now it's all clear. Thank you very much. Also this is the first time I came across einsum. Thanks again for that too.
Appreciate the kind words 🙏
One of the best resource on the internet!
Hi, I really like your channel. I have been learning from your tutorials for a while. Best wishes!
It's an extremely useful video for researchers trying to implement code from papers. Do make a series implementing the Machine Learning methods described in other papers as well.
Please make a video to use this model on an actual NLP task such as translation, etc.
Thank you for saying that I really appreciate it. I have made one other video on transformers for machine translation, and I will do my best to continue making videos and to cover more advanced topics! :)
@@AladdinPersson I can't seem to find it. Can you please paste the link here, please? I'd truly appreciate it. :)
@@flamingflamingo4021 Yeah for sure: th-cam.com/video/M6adRGJe5cQ/w-d-xo.html
It's the last video of an unofficial series on building Seq2Seq models for the task of machine translation. The first video was normal seq2seq, the second was seq2seq + attention, and the last video that I linked above uses transformers. These videos were inspired a lot by Bentrevett on GitHub and I recommend you check him out as well if you're interested in NLP :)
Really thank you!!! This really helps me deeply understand Transformer!!!
This video is all I needed
45:59: WHOA! slow that down! Pause a sec, be emphatic if we're going to change something back up there
Excellent video, and thank you for sharing this. I have one point about the implementation: in the "SelfAttention" class, for the query, value and key matrices (linear layers) you used (head_dim, head_dim) dimensions, so these matrices will be shared across all heads. I think it's better to use an (embed_dim, embed_dim) matrix to map the input to the q, k, v vectors and then reshape to get the head dimension.
Dude, you rock! I bow to your expertise 🙏😊
Very good tutorial!
Just one thing though: this is not how multi-head attention is implemented in the original Attention Is All You Need paper. In the paper the input is not split into h smaller vectors, but linearly transformed h times. So there wouldn't be a reshape and then linear(head_dim, head_dim), but rather linear(embed_size, head_dim) in each head.
Also, you aren't restricted to heads*head_dim = embed_size. This is because in the paper you transform your concatenated head outputs again with a jointly trained matrix WO (concatenation size x embed_size).
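A short sketch of the layer shapes that comment describes (the numbers are hypothetical and chosen so that heads * head_dim does not equal embed_size):

```python
import torch.nn as nn

embed_size, heads, head_dim = 256, 6, 48      # 6 * 48 = 288, not 256
# h per-head projections of shape (embed_size -> head_dim), written fused:
q_proj = nn.Linear(embed_size, heads * head_dim, bias=False)
k_proj = nn.Linear(embed_size, heads * head_dim, bias=False)
v_proj = nn.Linear(embed_size, heads * head_dim, bias=False)
# W_O maps the concatenated head outputs back to the model dimension:
w_o = nn.Linear(heads * head_dim, embed_size)
```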
For the dropout in your code, for example in the DecoderBlock forward, I think it should be:
query = self.norm(x + self.dropout(attention))
instead of:
query = self.dropout(self.norm(attention + x))
Here is the paper quote:
"We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized."
Thanks so much for the great work!
I think you're right, I'll look into this some more soon and update the Github code :)
Thank you very much!
Compliments for the video, really gives better insight into a complex architecture. Thanks for sharing all this information.
Great Explanation: Loved it
This is the best tutorial on Transformers online. I was able to understand the nuts and bolts of it. Kudos to you!! It will be great if you can cover Graph Convolutional Networks from scratch
In the Jay Alammar blog there is no split of the embeddings in order to compute attention for each head.
Great work! Really helped me. Thanks.
Superb...Hats off. Thank you for explanation.
Gonna try this for my uni assignment! Thank you
24:41 and the other einsum parts... why do we want these particular shapes? Don't rush, please take some time to explain all the important details in the video :)
excellent work mate cleared all my doubts
Great Job!! Thanks for the video!
Thanks Aladdin. The video helped a lot.
Also love your voice btw, feels calming 👍
Very detailed and clear! Thank you very much!
Thanks a lot for the kind words🙏
10:06, a little confused here. Is the embedding space cut for every token to create equal parts OR is the original input multiplied with weights such that they decrease the embedding space for every word?
Happiness is all you need, neither attention, nor transformer.
Great video, advanced my understanding.
Very nice! Congratulations!!
You'll be happy to know chat-gpt recommended this video and gave a link to it when I asked for resources explaining transformers
Thanks a lot for the video, this was great an its helping me a lot.
Thank you very much for the info!
Amazing tutorial! The concept of Transformers became much clearer after following this, although there is one thing I had doubt in. In the part where you implement positional embeddings, there is no implementation of sin and cos functions as given in the paper. Why is that?
Thank you again for the tutorial!
Having the same question too. Did he use "nn.Embedding" to replace the original method ?
@@junweilu4990 In the code, he did use nn.Embedding to replace the original method. But I do not know the logic behind this approach as this does not "truly" give positional features, does it?
@@junweilu4990 I also expected him to use word2vec or some fixed, pre-computed embedding. I think by using nn.Embedding() he tries to learn the word vectors and positional vectors while training his attention mechanism, encoders and decoders. Very wasteful, I believe. I think the rest of his model will converge only after his word and positional vectors converge, if they ever do...
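For anyone who wants the paper's fixed encoding instead of the learned nn.Embedding, here is a rough sketch (assuming the (batch, seq_len, embed_size) layout used in the video; not taken from the repository):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, embed_size):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / embed_size))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / embed_size))
    pe = torch.zeros(max_len, embed_size)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, embed_size, 2).float()
                         * (-math.log(10000.0) / embed_size))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (max_len, embed_size): add pe[:seq_len] to the word embeddings

x = torch.randn(2, 9, 256)  # (batch, seq_len, embed_size) word embeddings
x = x + sinusoidal_positional_encoding(100, 256)[: x.shape[1]]
```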
This was really helpful! Can you also do a tutorial for Decision Transformer for reinforcement learning?
fantastic, awesome videos as ever.
At 17:57 you forgot to apply linear projections before reshaping.
Hi, at 18:45, energy = queries * keys. Are you doing an outer product?
Hi Thanks for the superb video.. I have one doubt regarding selfAttention block:
self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
" Don't you think that V,K,Q should be computed on the whole embedding vectors and later we should divide based on heads.?
It is a great video. I have some confusion understanding it fully.
In the SelfAttention class you created the linear layers self.values, self.keys and self.queries, but you didn't use those layers.
Thank you! Great explanation. I just wonder why, in the attention mechanism, you have to initialize self.queries, self.keys etc. as Linear layers.
From the paper, the attention projections are fully connected, which means you should use linear layers.
Dude! you're amazing!
Thanks for this amazing tutorial. I think the "energy" (Q * K_transpose) should be divided by the square root of head_dim instead of embed_size.
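In code, the change that comment suggests is a one-line scaling difference (the shapes below are made up for illustration):

```python
import torch

heads, head_dim = 8, 32                # embed_size = heads * head_dim = 256
energy = torch.randn(2, heads, 9, 9)   # (batch, heads, query_len, key_len)

# Paper: softmax(Q K^T / sqrt(d_k)), where d_k is the per-head dimension,
# i.e. divide by sqrt(head_dim) rather than sqrt(embed_size).
attention = torch.softmax(energy / (head_dim ** 0.5), dim=3)
```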
Thanks so much!🎉
in SelfAttention, you have not used the linears self.keys, self.values, self.queries in forward method, whats the use of those layers?
please make a video on implementation of Video Summarization With Frame Index Vision Transformer
One last question please: what is the intuitive meaning of the source and target inputs of the transformer? Why does the model take x, trg[:, :-1]?
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
out = model(x, trg[:, :-1])
What could we get from out?
I tried
model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
out = model(x, trg)
and got
torch.Size([2, 8, 10])
to be honest I could not interpret that :(
Okay, so if I understand your question correctly:
Your doubt is why are we using trg[:, :-1] instead of trg
First:
trg[:, :-1] means all the batches (sentences) and the entire sentences except the last word in each sentence.
Second:
We do this because of how the transformer model is trained. Unlike RNNs, our transformer model does not predict the entire output sentence; instead it predicts one word at a time. So the decoder takes in the output up to time step (t-1) and then predicts the word at time step t. Hence we provide the entire sentence except the last word, so as to predict the last word.
Refer to the beautiful video by Yannic Kilcher:
th-cam.com/video/iDulhoQ2pro/w-d-xo.html
Hope your doubt is solved. Let me know if it's still unclear.
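As a rough sketch of how that looks in a training step, assuming the Transformer class and the small example tensors (model, x, trg, trg_pad_idx, trg_vocab_size) from the video are already defined; the loss pairing below is the standard teacher-forcing pattern and is my addition, not something shown in the video:

```python
import torch.nn as nn

# model, x, trg, trg_pad_idx, trg_vocab_size as in the video's small example
criterion = nn.CrossEntropyLoss(ignore_index=trg_pad_idx)

out = model(x, trg[:, :-1])              # (batch, trg_len - 1, trg_vocab_size)
loss = criterion(
    out.reshape(-1, trg_vocab_size),     # one prediction per position...
    trg[:, 1:].reshape(-1),              # ...compared to the *next* target token
)
loss.backward()
```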
Great! ❤ Thanks for this masterpiece. Hmm, I followed along and I didn't get any errors when I ran it, since I had already noted the error in your code and updated it 😊. Waiting for your new video. This is the first video I've gone through with you. Subscribed! Bell notification on.
Hey, I'm a bit confused: in the self-attention block (around 15:20) you define the value, query and key layers, but they are unused in the forward method. If I understand correctly, they should be used after each sequence's reshape?
Please make a video on using the Temporal Fusion Transformers.
At 52:37, you write out = self.decoder(trg, enc_src, src_mask, trg_mask), but I think it should be out = self.decoder(trg, enc_src, trg_key_padding_mask, trg_mask), because you don't explicitly give the source to the decoder; instead you give trg with padding. This is why I think it should be trg_key_padding_mask rather than src_mask. Am I right? Thanks a lot for such a great video :)
Excellent! Thank you so much for this! Had a small request, can you please come up with videos on BERT and controlled text generation models like PPLM? Thanks again!
Thank you for the comment! I will look into it, got a few videos that I'm planning but will come back to this in the future for sure :)
@Aladdin Persson. Thank you. Great lesson. Which IDE and theme you used ?
Thank you, the video was very helpful. In the end we got an output of dim (2, 7, 10). So why did we get the probabilities of the next 7 words? And why is the output length dependent on the number of words we feed to the decoder?
In your implementation you split the keys, queries and values into heads before passing them through the linear layers; in other implementations, however, the keys, queries and values are split after being passed through the linear layers. The paper follows your implementation, according to what I can deduce from the picture of the transformer. Are these two approaches similar, or is one better than the other?
That's a great observation and one that I've wondered about and confused over myself, from my understanding the way I did it is not entirely accurate and we should split after the linear layers. Otherwise it doesn't make intuitive sense to me that we have "different heads" if they all share parameters. With that said, the way I did it works and uses a lot fewer parameters but I would use the way you've seen others do it (I haven't done extensive tests to see how much better it performs, let me know if you ever test this).
I love your toturials
I've got a question here. In order to generate a target sentence, there should be multiple time steps, right?
The first output word from the Decoder will go through the Decoder again to generate the second output word.
I can't find where you define this in the video. Or maybe I'm understanding it wrong.
During training everything is done in parallel (we have the entire translated target sentence) and we utilize these target masks that I talked about in the video. This is a major difference between transformer and normal Seq2Seq, where we actually send in the entire target sentence rather than word by word. When we evaluate the model you're completely right that we need to do multiple time steps (one word at a time) but this is not the case during training. In this video we kind of just do the transformer from scratch, the question you're asking is more related to actually training & evaluating transformer models. I'll try to see if I find code for what you're asking for.
So here is a full code example of using transformers (also have a separate video on it): github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py
When we actually evaluate the model we need to do it time step by time step and it would look like this (translate sentence function) and I believe THIS is what you're asking for: github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/utils.py
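For reference, the step-by-step decoding at evaluation time boils down to a loop like this (a greedy-decoding sketch; `model`, `sos_idx` and `eos_idx` are stand-ins for whatever the linked translate_sentence function actually uses):

```python
import torch

@torch.no_grad()
def greedy_translate(model, src, sos_idx, eos_idx, max_len=50):
    # src: (1, src_len) tensor of source token indices
    trg_indices = [sos_idx]
    for _ in range(max_len):
        trg = torch.tensor(trg_indices, device=src.device).unsqueeze(0)
        out = model(src, trg)                    # (1, len(trg_indices), vocab_size)
        next_token = out[0, -1].argmax().item()  # most likely next word
        trg_indices.append(next_token)
        if next_token == eos_idx:                # stop once <eos> is produced
            break
    return trg_indices
```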
@@AladdinPersson Thank you, the link explained it pretty well. Thank you a lot.
Hi everyone! I finished following this tutorial to the end... But now I am confused about how to "train" and "test/predict" with this model. Any help is appreciated! Thanks!
I can't seem to understand the necessity for self.keys, self.queries and self.values in the SelfAttention class. Am I missing something?
14:05 is this even correct though? Shouldn't you project Q, K and V from the `embed_size` to `embed_size` and split the resulting tensors? Your implementation seems to treat parts of the embedding dimension separately (instead of jointly attending to the entire embedding space) and using the same weights (instead of having separate weights for each head).
hey, it was a while since I looked at this but I believe I might have made some mistake in this video as I remember going through the implementation a while after and modifying this line in the implementation if I recall correctly. Check the latest one on GitHub
Very nice video. I have a question.
The positional encoding you used is different from the one in the paper where they use sin/cos function of word position and vector index. It seems in your code, these positional embedding will be trained unlike the one in the paper. Do you have the code for how positional encoding is done in the paper?
Yes you're right about that, if I recall I did mention it in the video but I could have missed that. There have been other questions about this as well so I might try to implement positional encoding also but as of right now I have not
Shouldn't it be query_len instead of key_len at 17:38?
Aladdin, thank you so much for the video!
one question regarding the Example, why do you expect the output shape to be [2, 7, 10]?
2 Examples, 7 Tokens, 10 Probabilities
I don't understand myself how I would go about this.
Oh wait! This adds up for me.
They output all 7 words at the same time, but the target mask REMOVES the information as if it wasn't generated yet
When you apply a transformer live, you would have to iterate through it
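Concretely, that target mask is just a lower-triangular matrix, so position i can only attend to positions up to i (a standalone sketch):

```python
import torch

trg_len = 5
trg_mask = torch.tril(torch.ones(trg_len, trg_len))
print(trg_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
# Row i lists the positions token i may attend to; future positions are
# zeroed out, so the parallel training pass behaves as if they had not
# been generated yet.
```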
At 54:02, why do we need to pass the truncated trg[:, :-1] ? I thought we already applied the mask to prevent the model from looking into the future?
Why do you apply attention twice? Inside the transformer block and inside the decoder block?
Fantastic video, question, could you revisit this and add how to early stop and hypertune a "from scratch" transformer model?
I have viewed your code on GitHub, and I'm wondering why you use the same weight matrix for the different heads. Won't this make the Q, K, V vectors for the different heads the same?
Damn, I'm suffering from the same question.
I am confused as to why the K, Q, V projections for the different heads seem to be shared here. It seems like we should use nn.Linear(embed_dim, embed_dim) and later divide it into different heads?
Superb!
Thank you for your video. You did a great job! I was wondering how to train a transformer if the input form is (batch_size, sequence_length, number_of_features). Let's say number_of_features = 2 (it could be X and Y coordinates in time, for example). What impact does this type of input have on positional encoding, the masking strategy and the attention mechanism?
Why do you set bias=False for nn.Linear of keys, values and queries?
Love this content.
Thanks for your educational contribution! Just one question: what are the linear layers self.values, self.keys and self.queries for? These are not used inside the forward pass.
If I'm using a transformer for a speech recognition task (speech-to-text), after training the model, what should I pass as the target parameter at prediction time if I only have an audio file (not transcribed)?
did you get ur answer?
I don't know if there are still people watching this, but I have a question at the code level: in the decoder's "forward" method, when I pass the parameters to the layer I have "target_mask" as the fifth one, but in the decoder block you decided to put the parameter "device" there. Did I miss something, is it just an error, or is there another explanation? Thanks a lot
Thanks so much!
Appreciate it :)
I am not understanding, sir: is the input sequence divided into a number of chunks, like here where you did 256/8 with 8 being the number of attention heads? I am thinking that for self-attention the whole of the input embeddings needs to be transformed into three parts, namely Q, K and V, and then we need to divide these 8 ways in the case of 8 heads; that's why the name is multi-head. Please clarify. Regards
Hello, This is a great explanation of transformers. I have a question. How did you know that query.shape[0] would give you the number of training examples? Why is it later used in reshaping the keys, query, and values?
By convention the first dimension is the batch size in these tensor operations, since the model is trained on batches and the batch size is the number of samples.
Why does the input embedding have 256 dimensions? (At 10:02.)
Hi Aladdin. I have just finished watching "TransformerBlock" part but I've noticed something. In the first LayerNorm, I wonder why it would be x = self.norm1(attention + queries)? As far as I understand, it should be x = self.norm1(attention + input) where queries = Wq*input instead : ) (btw so far you haven't used the notation of input as input sequence vectors but I hope what I'm saying is clear enough tho..)
How he has it is correct: `queries = Wq*input` only gets calculated inside the attention module, so at that point queries is still just the normal input (values would make more sense; that's how they describe it in the paper).
Thank you so much!!!