Pytorch Transformers from Scratch (Attention is all you need)

  • Published 22 Sep 2024

Comments • 328

  • @AladdinPersson
    @AladdinPersson  4 years ago +67

    Here's the outline for the video:
    0:00 - Introduction
    0:54 - Paper Review
    11:20 - Attention Mechanism
    27:00 - TransformerBlock
    32:18 - Encoder
    38:20 - DecoderBlock
    42:00 - Decoder
    46:55 - Forming The Transformer
    52:45 - A Small Example
    54:25 - Fixing Errors
    56:44 - Ending

    • @alhasanalkhaddour434
      @alhasanalkhaddour434 3 years ago

      First, thanks for this amazing video, but I have one question regarding the implementation of Self Attention.
      To distribute the values, keys and queries to the heads you just did a reshape of the input, while the original paper suggests doing a projection using trainable matrices.
      Am I right, or did I mess something up?

    • @feravladimirovna1044
      @feravladimirovna1044 3 years ago

      @@alhasanalkhaddour434 Yes, I think he did the projection already using self.values, self.keys, self.queries, since these are linear layers. The real inputs come from the parameters passed to the forward function; see 14:43 for more details.

    • @riyajatar6859
      @riyajatar6859 2 years ago

      Why did you use self.values, self.keys in the __init__ method? They are not used at all in forward.

    • @somayehseifi8269
      @somayehseifi8269 2 years ago

      Sorry, can you share the GitHub link for this code?

    • @yufancao4369
      @yufancao4369 1 year ago

      @@riyajatar6859 Actually he fixed that at the end of the video🧐
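
  A minimal sketch contrasting the two approaches debated in this thread (reshape-only vs a trainable projection); embed_size, heads and the input x are made-up example values, not taken from the video:

    import torch
    import torch.nn as nn

    embed_size, heads = 256, 8
    head_dim = embed_size // heads  # assumes embed_size divides evenly
    x = torch.randn(2, 9, embed_size)  # (batch, seq_len, embed_size)

    # Video's approach: reshape into heads first, then one shared
    # (head_dim -> head_dim) projection applied to every head.
    shared = nn.Linear(head_dim, head_dim, bias=False)
    q_video = shared(x.reshape(2, 9, heads, head_dim))

    # Paper's approach: one trainable (embed_size -> embed_size) projection,
    # then split into heads, so each head effectively gets its own weights.
    full = nn.Linear(embed_size, embed_size, bias=False)
    q_paper = full(x).reshape(2, 9, heads, head_dim)

    print(q_video.shape, q_paper.shape)  # both torch.Size([2, 9, 8, 32])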

  • @rhronsky
    @rhronsky 4 years ago +265

    Attention is not all we need, this video is all we need

  • @pratikhmanas5486
    @pratikhmanas5486 4 years ago +155

    I haven't found a tutorial this detail-oriented anywhere else. Now I completely understand the Transformer and the attention mechanism. Great work, thank you 😊

    • @AladdinPersson
      @AladdinPersson  4 years ago +11

      I really appreciate you saying that, thanks a lot :)

    • @NICe-wm9xn
      @NICe-wm9xn 1 year ago +1

      @@AladdinPersson Hi! You missed one error in your video. In your GitHub code, you have `self.values = nn.Linear(embed_size, embed_size)`, but in your video, you used `self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)`. I couldn't reproduce your results until I noticed this discrepancy.

  • @bhargav7476
    @bhargav7476 3 years ago +33

    I watched 3 Transformer videos before this one and thought I would never understand it. Love the way you explained such a complicated topic.

  • @nishantraj95
    @nishantraj95 2 years ago +35

    This is undoubtedly one of the best transformer implementation videos I have seen. Thanks for posting such good content. Looking forward to seeing some more paper implementation videos.

  • @sehbanomer8151
    @sehbanomer8151 4 years ago +58

    In the original paper each head should have separate weights, but in your code all heads share the same weights. Here are two steps to fix it:
    1. in __init__: self.queries = nn.Linear(self.embed_size, self.embed_size, bias=False) (same for the key and value weights)
    2. in forward: put "queries = self.queries(queries)" above "queries = queries.reshape(...)" (also the same for keys and values)
    Great video btw

    • @AladdinPersson
      @AladdinPersson  4 years ago +12

      Hey, thank you so much for bringing this to my attention ;) When reading through the paper I get the same idea that you do, namely that each head should have separate weights, and blog posts like "The Annotated Transformer" do exactly what you describe. In the blog post www.peterbloem.nl/blog/transformers he explains narrow vs wide self-attention, and in his GitHub implementation he does it similarly to me; however, I noticed that an issue has now been raised about exactly this point: github.com/pbloem/former/issues/13.
      And I agree with the point brought up there: if each head uses the same weights, it doesn't feel like you can say the heads are different. I'm having difficulty finding other implementations, but I will keep a close look at this, and if I get some more time I will investigate further. I'm also a bit surprised that training this implementation gives good results, if I remember correctly, with only 3x32x32 vs 3x256x256 parameters.

    • @sehbanomer8151
      @sehbanomer8151 4 years ago +4

      @@AladdinPersson Yes, both methods should work just fine, but I believe using separate weights for each head would give better performance without slowing down the model. It would use more memory of course, but that's almost nothing compared to the number of parameters in the feedforward sublayers.

    • @66dp97
      @66dp97 2 years ago +2

      @@sehbanomer8151 I think your implementation may still have some issues. Since each head should have separate weights, shouldn't there be eight (the number of heads) different head_dim*head_dim linear layers instead of one embed_size*embed_size linear layer? Additionally, these two implementations have different numbers of parameters.

    • @sehbanomer8151
      @sehbanomer8151 2 years ago +6

      @@66dp97 The key, query & value projection of each head projects an _embed_dim_-dimensional vector into a _head_dim_-dimensional space, so for each attention head the projection matrix has shape (head_dim, embed_dim). Fusing _n_heads_ separate linear layers into a single (embed_dim, head_dim * n_heads) linear layer is more GPU-friendly, thus faster.
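
  Pulling this thread's fix together, a hedged sketch of a SelfAttention module with fused per-head projections (the layout follows the video's conventions; the sqrt(head_dim) scaling follows the paper and differs from the video's sqrt(embed_size)):

    import torch
    import torch.nn as nn

    class SelfAttention(nn.Module):
        def __init__(self, embed_size, heads):
            super().__init__()
            assert embed_size % heads == 0, "embed_size must divide evenly"
            self.heads = heads
            self.head_dim = embed_size // heads
            # Full (embed_size -> embed_size) projections, fused across heads.
            self.queries = nn.Linear(embed_size, embed_size, bias=False)
            self.keys = nn.Linear(embed_size, embed_size, bias=False)
            self.values = nn.Linear(embed_size, embed_size, bias=False)
            self.fc_out = nn.Linear(embed_size, embed_size)

        def forward(self, values, keys, queries, mask=None):
            N, q_len = queries.shape[0], queries.shape[1]
            k_len, v_len = keys.shape[1], values.shape[1]
            # Project first, then split into heads.
            queries = self.queries(queries).reshape(N, q_len, self.heads, self.head_dim)
            keys = self.keys(keys).reshape(N, k_len, self.heads, self.head_dim)
            values = self.values(values).reshape(N, v_len, self.heads, self.head_dim)
            energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
            if mask is not None:
                energy = energy.masked_fill(mask == 0, float("-1e20"))
            attention = torch.softmax(energy / self.head_dim ** 0.5, dim=3)
            out = torch.einsum("nhql,nlhd->nqhd", [attention, values])
            return self.fc_out(out.reshape(N, q_len, self.heads * self.head_dim))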

  • @cc-to2jn
    @cc-to2jn 2 years ago +9

    You're an absolute saint. I don't know if I can even put into words the amount of respect and appreciation I have for you, man! Thank you!

  • @gautamvashishtha3923
      @gautamvashishtha3923 1 year ago +4

    Great Tutorial! Thanks Aladdin

    • @jushkunjuret4386
      @jushkunjuret4386 1 year ago

      For actually training it, what would we do?

    • @gautamvashishtha3923
      @gautamvashishtha3923 1 year ago

      @@jushkunjuret4386 Can you specify where exactly you're getting stuck?

  • @niranjansitapure4740
    @niranjansitapure4740 1 year ago +3

    I have been struggling to implement and understand custom transformer code from various sources. This was perhaps one of the best tutorials.

    • @Priyanshuc2425
      @Priyanshuc2425 2 months ago

      Hey, if you are still working on NLP or transformers, I may need your help. Please reply.

  • @chefmemesupreme
    @chefmemesupreme 1 year ago +3

    This is cool. It would be helpful to have a section highlighting which parts of the dimensions should be changed if you are using a dataset of a different size or want to change the input length, i.e. keeping the architecture constant but noting how it could be used flexibly.

  • @nicknguyen690
    @nicknguyen690 4 years ago +3

    For the dropout in your code, for example in the DecoderBlock forward, I think it should be:
    query = self.norm(x + self.dropout(attention))
    instead of:
    query = self.dropout(self.norm(attention + x))
    Here is the paper quote:
    "We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized."

    • @nicknguyen690
      @nicknguyen690 4 years ago +1

      Thanks so much for the great work!

    • @AladdinPersson
      @AladdinPersson  4 years ago +3

      I think you're right, I'll look into this some more soon and update the GitHub code :)

    • @上上上上上昇気流_改
      @上上上上上昇気流_改 3 years ago

      Thank you very much!
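
  A short sketch of the sub-layer pattern the quoted sentence describes; SublayerConnection is a hypothetical helper name (echoing The Annotated Transformer), not code from the video:

    import torch.nn as nn

    class SublayerConnection(nn.Module):
        # Order per the paper quote: dropout the sub-layer output,
        # add it to the sub-layer input (residual), then normalize.
        def __init__(self, size, dropout):
            super().__init__()
            self.norm = nn.LayerNorm(size)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, sublayer_out):
            # x: sub-layer input; sublayer_out: e.g. the attention output
            return self.norm(x + self.dropout(sublayer_out))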

  • @mykytahordia
    @mykytahordia 1 year ago

    Making something sophisticated this easy and clear is what I call magic. Aladdin, you are truly a magician.

  • @TheClaxterix
    @TheClaxterix 2 years ago

    This is one of the best paper-to-code explanation videos I've watched in a long time! Congrats, Aladdin dude!

  • @FawzyBasily
    @FawzyBasily 1 year ago +2

    Many thanks to you for this impressive tutorial, amazing job and outstanding explanation, and also thanks for sharing all these resources in the description.

  • @SahilKhose
    @SahilKhose 3 years ago +2

    Hey Aladdin,
    Really amazing videos brother!
    This was the first video of yours that I stumbled upon and I fell in love with your channel.

    • @AladdinPersson
      @AladdinPersson  3 years ago +1

      Hey Sahil, I definitely need a refresher on transformers, so I'm not sure I can give you the best answer right now. From what I recall, the most important part of the masking with regard to padding is that we make sure the padded positions are not backpropagated through. We don't want the network weights, embeddings etc. to learn associations with the padded values, and that's what we are trying to accomplish by setting them to -infinity, since the gradient of the softmax will then be 0.

    • @SahilKhose
      @SahilKhose 3 years ago

      @@AladdinPersson Yeah, I get the reason why we do it and the -inf setting. My doubt was about the padding that we use; I feel we need more padding to handle the cases where both sentences are padded and we then have attention over them. I think I made that pretty clear in the comment above.

    • @RicardoMlu-tw2ig
      @RicardoMlu-tw2ig 13 days ago

      @@SahilKhose It doesn't matter that the padding positions get numbers in the final output, because the parameters used to calculate them won't receive any gradient from them: the derivative of the softmax is 0 for those positions.
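
  A toy demonstration of the -infinity masking discussed here (the scores and the tiny loss are made up purely for illustration):

    import torch

    scores = torch.tensor([[2.0, 1.0, 3.0, 0.5]], requires_grad=True)
    mask = torch.tensor([[1, 1, 1, 0]])  # last position is padding

    # Fill padded positions with a huge negative number before softmax,
    # so they receive ~0 attention weight.
    masked = scores.masked_fill(mask == 0, float("-1e20"))
    weights = torch.softmax(masked, dim=-1)
    print(weights)  # last entry is 0.0

    # Any loss built from the weights sends (near-)zero gradient back
    # through the padded position, since d(softmax)/d(score) is 0 there.
    loss = (weights * torch.arange(4.0)).sum()
    loss.backward()
    print(scores.grad)  # ~0.0 at the padded position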

  • @ScriptureFirst
    @ScriptureFirst 3 years ago +1

    45:59: WHOA! Slow that down! Pause a sec and be emphatic if we're going to change something back up there.

  • @F30-Jet
    @F30-Jet 11 months ago +1

    This video is all I needed

  • @쉔송이
    @쉔송이 4 years ago

    Love your work!! I was very confused by other tutorials... but your work made the Transformer clear to me. I only wish I had found you and your work sooner.

    • @AladdinPersson
      @AladdinPersson  4 years ago +1

      I appreciate the kind words 🙏

  • @marksaroufim
    @marksaroufim 2 years ago +1

    wow, the best transformer tutorial I've seen

  • @danielmaier6665
    @danielmaier6665 2 years ago

    Very good tutorial!
    Just one thing though: this is not how multi-head attention is implemented in the original Attention Is All You Need paper. In the paper the input is not split into h smaller vectors, but linearly transformed h times. So there wouldn't be a reshape and then linear(head_dim, head_dim), but rather linear(embed_size, head_dim) in each head.
    Also, you are not restricted to heads*head_dim = embed_size. This is because in the paper the concatenated head outputs are transformed again with a jointly trained matrix WO (concatenation size x embed_size).

  • @bingochipspass08
    @bingochipspass08 1 year ago

    Yeah, agreed. This was an extremely difficult architecture to implement, with a LOT of moving parts, but this has to be the best walkthrough out there. Sure, certain things like the src_mask unsqueeze were a little tricky to visualize, but even barring that, you broke it down quite well! Thank you for this! I'm so glad that we have all of this implemented in HF/PT haha

  • @MozhganRahmatinia
    @MozhganRahmatinia 1 year ago

    This is the best description of a transformer implementation.
    Thank you so much.
    Best regards.

  • @yashsvidixit7169
    @yashsvidixit7169 8 months ago

    Happiness is all you need, neither attention, nor transformer.

  • @张晨雨-q8j
    @张晨雨-q8j 2 years ago +1

    Great explanation, much more helpful than theory-only explanations.

  • @rayzhang2589
    @rayzhang2589 2 years ago

    Thank you so much!!! This really helped me deeply understand the Transformer!!!

  • @soorkie
    @soorkie 4 years ago

    I found this very helpful. I always used to get confused about the tensor sizes; now it's all clear. Thank you very much. This is also the first time I came across einsum, so thanks for that too.

    • @AladdinPersson
      @AladdinPersson  4 years ago +1

      Appreciate the kind words 🙏

  • @ShamailMulla-x5n
    @ShamailMulla-x5n 2 months ago

    This was really helpful! Can you also do a tutorial for Decision Transformer for reinforcement learning?

  • @xiangzhang7723
    @xiangzhang7723 3 years ago

    Hi, I really like your channel. I have been learning from your tutorials for a while. Best wishes!

  • @RicardoMlu-tw2ig
    @RicardoMlu-tw2ig 13 days ago

    Thanks so much!🎉

  • @flamingflamingo4021
    @flamingflamingo4021 4 years ago +1

    It's an extremely useful video for researchers trying to implement papers. Do make a series implementing the methods described in other machine learning papers as well.
    Please also make a video using this model on an actual NLP task such as translation.

    • @AladdinPersson
      @AladdinPersson  4 years ago +1

      Thank you for saying that, I really appreciate it. I have made one other video on transformers for machine translation, and I will do my best to continue making videos and to cover more advanced topics! :)

    • @flamingflamingo4021
      @flamingflamingo4021 4 years ago

      @@AladdinPersson I can't seem to find it. Can you please paste the link here? I'd truly appreciate it. :)

    • @AladdinPersson
      @AladdinPersson  4 years ago

      @@flamingflamingo4021 Yeah, for sure: th-cam.com/video/M6adRGJe5cQ/w-d-xo.html
      It's the last video of an unofficial series on building Seq2Seq models for machine translation. The first video was plain seq2seq, the second was seq2seq + attention, and the last one, linked above, uses transformers. These videos were inspired a lot by Bentrevett on GitHub, and I recommend you check him out too if you're interested in NLP :)

  • @alikhodabakhsh2653
    @alikhodabakhsh2653 1 year ago

    Excellent video, and thank you for sharing this. I have one point about the implementation: in the "SelfAttention" class, for the query, value and key matrices (linear layers) you used (head_dim, head_dim) dimensions, so these matrices are shared across all heads. I think it's better to use an (embed_dim, embed_dim) matrix to map the input to the q, k, v vectors and then reshape to get the head dimension.

  • @algorithmo134
    @algorithmo134 1 day ago

    At 54:02, why do we need to pass the truncated trg[:, :-1]? I thought we already applied the mask to prevent the model from looking into the future?

  • @anujsingh5961
    @anujsingh5961 2 years ago +1

    Hi, thanks for the superb video. I have one doubt regarding the SelfAttention block:
    self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
    self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
    self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
    Don't you think that V, K, Q should be computed on the whole embedding vectors, and that we should divide into heads later?

  • @anas.2k866
    @anas.2k866 1 year ago

    In Jay Alammar's blog there is no split of the embeddings in order to compute the attention for each head.

  • @avinashrai6725
    @avinashrai6725 1 year ago +2

    In SelfAttention, you have not used the linear layers self.keys, self.values, self.queries in the forward method. What's the use of those layers?

  • @KinjalMondal-yj9fx
    @KinjalMondal-yj9fx 2 months ago

    Great Explanation: Loved it

  • @qiguosun129
    @qiguosun129 3 years ago

    First of all, thank you for the video. The most valuable thing I learned from it is how to build such a complex model step by step from the flow chart. Next, I will find out whether this self-attention model can be used for environmental pollution problems.

  • @siennypoole4366
    @siennypoole4366 2 years ago +1

    Hi everyone! I finished following this tutorial to the end... But now I am confused about how to "train" and "test/predict" with this model? Any help is appreciated! Thanks!

  • @YL-nx3yk
    @YL-nx3yk 3 years ago +1

    The best video I've watched on YouTube! Why did I find you so late!!!

  • @srinivasvinnakota1747
    @srinivasvinnakota1747 2 years ago

    Dude, you rock! I bow to your expertise 🙏😊

  • @shahnawazrshaikh9108
    @shahnawazrshaikh9108 2 years ago

    One of the best resources on the internet!

  • @robertpaulson2052
    @robertpaulson2052 1 year ago

    You'll be happy to know ChatGPT recommended this video and gave a link to it when I asked for resources explaining transformers.

  • @kaustubhkulkarni
    @kaustubhkulkarni 3 years ago +2

    Amazing tutorial! The concept of Transformers became much clearer after following this, although there is one thing I had a doubt about. In the part where you implement positional embeddings, there is no implementation of the sin and cos functions given in the paper. Why is that?
    Thank you again for the tutorial!

    • @junweilu4990
      @junweilu4990 3 years ago

      Having the same question too. Did he use "nn.Embedding" to replace the original method?

    • @kaustubhkulkarni
      @kaustubhkulkarni 3 years ago

      @@junweilu4990 In the code, he did use nn.Embedding to replace the original method. But I don't know the logic behind this approach, as it doesn't "truly" give positional features, does it?

    • @sahhaf1234
      @sahhaf1234 2 years ago

      @@junweilu4990 I also expected him to use word2vec or some fixed embedding algorithm. I think by using nn.Embedding() he tries to learn the word vectors and positional vectors while training his attention algorithm, encoders and decoders. Very wasteful, I believe. I think the rest of his algorithm will converge only after his word and positional vectors converge, if they ever do...
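
  For reference, a sketch of the paper's fixed sinusoidal encoding this thread is asking about (assumes an even embed_size; the paper reports nearly identical results for learned and sinusoidal positions):

    import math
    import torch

    def sinusoidal_positions(max_len, embed_size):
        # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)
        pe = torch.zeros(max_len, embed_size)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2).float()
                             * (-math.log(10000.0) / embed_size))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe  # (max_len, embed_size), added to the word embeddings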

  • @TomatoBananaPotato
    @TomatoBananaPotato 3 years ago

    Gonna try this for my uni assignment! Thank you

  • @zhengyuancui7837
    @zhengyuancui7837 1 year ago

    Great work! Really helped me. Thanks.

  • @parthasarathyk5476
    @parthasarathyk5476 2 years ago

    Superb... hats off. Thank you for the explanation.

  • @geriskenderi
    @geriskenderi 3 years ago

    Compliments on the video, it really gives better insight into a complex architecture. Thanks for sharing all this information.

  • @olegborisov5091
    @olegborisov5091 3 years ago +2

    24:41 and the other einsum parts... why do we want these particular shapes? Don't rush, please take some time to explain all the important details in the video :)
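
  For anyone pausing at the same spot, a small shape-check of the einsum used in the video's attention (the sizes here are arbitrary examples):

    import torch

    N, heads, head_dim = 2, 8, 32
    query_len, key_len = 7, 9

    queries = torch.randn(N, query_len, heads, head_dim)
    keys = torch.randn(N, key_len, heads, head_dim)

    # For each batch n and head h, dot the head_dim axis of every query
    # against every key, leaving one score per (query, key) pair.
    energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
    print(energy.shape)  # torch.Size([2, 8, 7, 9]) = (N, heads, query_len, key_len)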

  • @matteofabro4486
    @matteofabro4486 4 years ago +1

    Thank you very much for the info!

  • @fuat7775
    @fuat7775 3 years ago

    Very detailed and clear! Thank you very much!

    • @AladdinPersson
      @AladdinPersson  3 years ago

      Thanks a lot for the kind words🙏

  • @panchajanya91
    @panchajanya91 2 years ago +2

    It is a great video. I have some confusion about understanding it fully.
    In the SelfAttention class you created the linear layers self.values, self.keys and self.queries, but you didn't use those layers.

  • @thatipelli1
    @thatipelli1 3 years ago +1

    This is the best tutorial on Transformers online. I was able to understand the nuts and bolts of it. Kudos to you!! It would be great if you could also cover Graph Convolutional Networks from scratch.

  • @joeyk2346
    @joeyk2346 3 years ago

    Great Job!! Thanks for the video!

  • @allessandroable
    @allessandroable 3 years ago +3

    Thank you! Great explanation. I just wonder why in the attention mechanism you have to initialize self.queries, self.keys etc. as linear layers.

    • @ayyythatguy
      @ayyythatguy 2 years ago

      From the paper, the attention projections are fully connected, which means you should use linear layers.

  • @czarhada
    @czarhada 4 years ago +3

    Excellent! Thank you so much for this! Had a small request, can you please come up with videos on BERT and controlled text generation models like PPLM? Thanks again!

    • @AladdinPersson
      @AladdinPersson  4 years ago +1

      Thank you for the comment! I will look into it, got a few videos that I'm planning but will come back to this in the future for sure :)

  • @sidneybassett2683
    @sidneybassett2683 2 years ago +1

    I have viewed your code on GitHub, and I'm wondering why you use the same weight matrix for different heads. Won't this make the Q, K, V vectors the same for the different heads?

    • @zhuchencao2527
      @zhuchencao2527 2 years ago

      Damn, I'm suffering from the same question.
      I am confused as to why the k, q, v projections for the different heads seem to be shared here. It seems like we should use nn.Linear(embed_dim, embed_dim) and later divide the result into different heads?

  • @teetanrobotics5363
    @teetanrobotics5363 3 years ago

    World's best YouTube channel evaaaa.

  • @arjunpukale3310
    @arjunpukale3310 4 years ago +1

    I want to use the encoder of the transformer for video classification, where each frame of the video is first passed through a pretrained CNN, and its output acts as an embedding that is then passed as input to the encoder. Any suggestions on how to do that?

  • @michaeldurand9309
    @michaeldurand9309 1 year ago +1

    Thank you for your video, you did a great job! I was wondering how to train a transformer if the input has shape (batch_size, sequence_length, number_of_features). Let's say number_of_features = 2 (it could be X and Y coordinates over time, for example). What impact does this type of input have on the positional encoding, the masking strategy and the attention mechanism?

  • @xanyula2738
    @xanyula2738 3 years ago +1

    I can't seem to understand the necessity of self.keys, self.queries and self.values in the SelfAttention class. Am I missing something?

  • @kyde8392
    @kyde8392 4 years ago +8

    In your implementation you split the keys, queries and values into heads before passing them through the linear layers; in other implementations, however, they are split after being passed through the linear layers. The paper follows your implementation, according to what I can deduce from the picture of the transformer. Are these two approaches equivalent, or is one better than the other?

    • @AladdinPersson
      @AladdinPersson  4 years ago +9

      That's a great observation and one that I've wondered about and been confused by myself. From my understanding, the way I did it is not entirely accurate, and we should split after the linear layers. Otherwise it doesn't make intuitive sense to me that we have "different heads" if they all share parameters. With that said, the way I did it works and uses a lot fewer parameters, but I would do it the way you've seen others do it (I haven't run extensive tests to see how much better it performs; let me know if you ever test this).

  • @garrettosborne4364
    @garrettosborne4364 3 years ago

    Great video, advanced my understanding.

  • @RicardoMlu-tw2ig
    @RicardoMlu-tw2ig 13 days ago

    Also love your voice btw, feels calming 👍

  • @roman-bushuiev
    @roman-bushuiev 2 years ago +1

    At 17:57 you forgot to apply the linear projections before reshaping.

  • @kolla_teja
    @kolla_teja 3 years ago

    Excellent work, mate. Cleared all my doubts.

  • @yashrathi6862
    @yashrathi6862 2 years ago +1

    Thank you, the video was very helpful. At the end we got an output of dim (2, 7, 10). Why did we get the probabilities for 7 words? And why is the output length dependent on the number of words we feed to the decoder?

  • @amankushwaha8927
    @amankushwaha8927 1 year ago

    Thanks Aladdin. The video helped a lot.

  • @davidray6126
    @davidray6126 1 year ago

    Thanks for this amazing tutorial. I think the "energy" (Q * K_transpose) should be divided by the square root of head_dim instead of embed_size.
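
  A small sketch of the scaling this comment refers to (the shapes are illustrative; the paper's d_k corresponds to head_dim once the input is split into heads):

    import torch

    embed_size, heads = 256, 8
    head_dim = embed_size // heads  # this is d_k in the paper

    q = torch.randn(1, heads, 5, head_dim)
    k = torch.randn(1, heads, 5, head_dim)
    energy = q @ k.transpose(-2, -1)  # (1, heads, 5, 5)

    # Scaled dot-product attention divides by sqrt(d_k); after the split
    # the dot products run over head_dim dimensions, so sqrt(head_dim) is
    # the paper's scale (the video divides by sqrt(embed_size) instead).
    attention = torch.softmax(energy / head_dim ** 0.5, dim=-1)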

  • @MrWilliamducfer
    @MrWilliamducfer 4 years ago

    Very nice! Congratulations!!

  • @N3xUss99
    @N3xUss99 9 months ago

    I don't know if there are still people watching this, but I have a question at the code level: in the decoder's "forward" method, when passing the parameters to layer, I had "target_mask" as the fifth one, but in the DecoderBlock you decided to put the parameter "device". Did I miss something, is it just an error, or is there another explanation? Thanks a lot

  • @somayehseifi8269
    @somayehseifi8269 2 years ago

    Thank you for your tutorial. I have a question: you said that in the encoder the value, key and query are all the same, but the paper says value, query and key are only the same in size, not element-wise. Can you please explain it a little more?

  • @feravladimirovna1044
    @feravladimirovna1044 4 years ago +1

    One last question, please: what is the intuitive meaning of the source and target inputs of the transformer? Why does the model take x, trg[:, :-1]?
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
    out = model(x, trg[:, :-1])
    What do we get from out?
    I tried
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
    out = model(x, trg)
    and got
    torch.Size([2, 8, 10])
    To be honest, I could not interpret that :(

    • @SahilKhose
      @SahilKhose 3 years ago

      Okay, so if I understand your question correctly, your doubt is why we are using trg[:, :-1] instead of trg.
      First:
      trg[:, :-1] means all the batches (sentences) and the entire sentences except the last word of each sentence.
      Second:
      We do this because of how the transformer model is trained. Unlike RNNs, our transformer model does not predict the entire output sentence at once; it predicts one word at a time. The decoder takes the output up to time step (t-1) and then predicts the word at time step t. Hence we provide the entire sentence except the last word, so as to predict that last word.
      Refer to the beautiful video by Yannic Kilcher:
      th-cam.com/video/iDulhoQ2pro/w-d-xo.html
      Hope your doubt is solved. Let me know if it's still unclear.

  • @sargamgupta7134
    @sargamgupta7134 1 year ago

    Please make a video on using Temporal Fusion Transformers.

  • @matejmnoucek2865
    @matejmnoucek2865 2 years ago

    Why do you set bias=False in the nn.Linear layers for keys, values and queries?

  • @shazm4020
    @shazm4020 2 years ago

    Thank you so much!

  • @shuyangli522
    @shuyangli522 3 years ago +1

    Thanks for your video. It seems like there's an error in your multi-head attention implementation: you didn't use the fully connected layers for query, key and value, so the model cannot be trained to learn the attention map.

  • @MohamedAli-dk6cb
    @MohamedAli-dk6cb 5 months ago

    I got a bit confused. The outputs you send from the encoder to the decoder: do they represent queries and keys, or keys and values?

  • @ocean6709
    @ocean6709 1 year ago

    Hi, at 18:45, energy = queries * keys. Are you doing an outer product?

  • @abdulrahmanadel8917
    @abdulrahmanadel8917 3 years ago +1

    If I'm using a transformer for a speech recognition task (speech-to-text), after training the model, what should I pass as the target parameter at prediction time if I only have an audio file (not transcribed)?

    • @popamaji
      @popamaji 2 years ago

      Did you get your answer?

  • @polimetakrylanmetylu2483
    @polimetakrylanmetylu2483 2 years ago +2

    Hey, I'm a bit confused: in the self-attention block (around 15:20) you define value, query and key layers, but they are unused in the forward method. If I understand correctly, they should be applied after each sequence's reshape?

  • @raphaels2103
    @raphaels2103 2 years ago

    Why do we need the src_mask? Can't the model learn by itself that there is some padding there?

  • @himanikumar7979
    @himanikumar7979 several months ago

    Shouldn't it be query_len instead of key_len at 17:38?

  • @mohdkashif7295
    @mohdkashif7295 3 years ago

    In the DecoderBlock's forward function, why is src_mask passed to the transformer block?

  • @toygunkarabas9853
    @toygunkarabas9853 9 months ago

    At 52:37 you write out = self.decoder(trg, enc_src, src_mask, trg_mask), but I think it should be out = self.decoder(trg, enc_src, trg_key_padding_mask, trg_mask), because you don't explicitly give the source to the decoder; instead you give trg with padding. That's why I think it should be trg_key_padding_mask rather than src_mask. Am I right? Thanks a lot for such a great video :)

  • @FLLCI
    @FLLCI 2 years ago

    These transformers are even luckier than me. All they get is HEADS!

  • @kaustubhshete6250
    @kaustubhshete6250 3 years ago

    Exactly what I need

  • @krishnachauhan2822
    @krishnachauhan2822 3 years ago

    I am not understanding, sir: is the input sequence divided into a number of chunks, as here where you did 256/8, where 8 is the number of attention heads? I thought that for self-attention the whole input embedding needs to be transformed into three parts, namely Q, K and V, and then we divide each of these 8 ways in the case of 8 heads; that's why the name is multi-head. Please clarify. Regards

  • @kelvink007
    @kelvink007 3 years ago

    This is the best way to learn, hands-on. Great video! Also, may I know which font is used in this video? I noticed that your choice of font is very clean and easy to work with!

  • @nasirop7551
    @nasirop7551 3 years ago

    I love your tutorials

  • @sidazhou
    @sidazhou 2 years ago

    Why is the final output shape [2,7,10]?

  • @suryaprakashsahu6142
    @suryaprakashsahu6142 2 years ago

    Is there a reason why dropout is added at the encoder input?
    `out = self.dropout(
    (self.word_embedding(x) + self.position_embedding(positions))
    )`
    Can you please discuss the `why`?

  • @jianweitang4790
    @jianweitang4790 4 years ago +2

    I've got a question here. In order to generate a target sentence, there should be multiple time steps, right?
    The first output word from the Decoder should go through the Decoder again to generate the second output word.
    I can't find where you define this in the video. Or maybe I understand it wrong.

    • @AladdinPersson
      @AladdinPersson  4 years ago +2

      During training everything is done in parallel (we have the entire translated target sentence) and we utilize the target masks I talked about in the video. This is a major difference between the transformer and a normal Seq2Seq model: we actually send in the entire target sentence rather than word by word. When we evaluate the model you're completely right that we need multiple time steps (one word at a time), but this is not the case during training. In this video we just build the transformer from scratch; your question is more about actually training & evaluating transformer models. I'll try to find code for what you're asking.
      Here is a full code example of using transformers (I also have a separate video on it): github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py
      When we actually evaluate the model we need to do it time step by time step, and it looks like the translate_sentence function here, which I believe is what you're asking for: github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/utils.py

    • @jianweitang4790
      @jianweitang4790 4 years ago

      @@AladdinPersson Thank you, the link explained it pretty well. Thank you a lot.
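
  To make the inference loop concrete, a minimal greedy-decoding sketch in the spirit of the translate_sentence utility linked above (model, sos_idx and eos_idx are assumed to come from your own setup, and model(src, trg) follows the video's Transformer signature):

    import torch

    def greedy_translate(model, src, sos_idx, eos_idx, max_len=50):
        # Grow the target one token at a time, feeding the partial
        # output back into the decoder at every step.
        model.eval()
        outputs = [sos_idx]
        with torch.no_grad():
            for _ in range(max_len):
                trg = torch.tensor(outputs).unsqueeze(0).to(src.device)
                logits = model(src, trg)           # (1, len(outputs), vocab)
                next_token = logits[0, -1].argmax().item()
                outputs.append(next_token)
                if next_token == eos_idx:
                    break
        return outputs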

  • @marcopleines112
    @marcopleines112 2 years ago

    Thanks for your educational contribution! Just one question: what are the linear layers self.values, self.keys and self.queries for? They are not used inside the forward pass.

  • @SigndogNilsen
    @SigndogNilsen 7 months ago

    I have a basic conceptual knowledge of AI, without knowing how the mathematics and the other AI topics work in it.
    I also know Python, but not PyTorch.
    I have looked at several analyses of the paper "Attention Is All You Need" and everything is clear to me there.
    But in this video I can't figure out which block of code corresponds to which block in the diagram he displays on the right side of the screen.
    I mean that, for example, he writes the Encoder section, but in the diagram on the right this section consists of several blocks,
    and I don't understand which code belongs to which block in the diagram.
    The rest of the details of the code are also unclear to me.
    Question: is there any required background knowledge to understand this video?

  • @Hannah13147
    @Hannah13147 3 years ago

    Thank you so much!!!

  • @alfonsocvu
    @alfonsocvu 1 year ago

    Thanks a lot for the video, this was great and it's helping me a lot.

  • @1potdish271
    @1potdish271 2 years ago

    Why are you not using the `sin` and `cos` functions for the positional encoding?

  • @deepshankarjha5344
    @deepshankarjha5344 4 years ago +1

    Fantastic, awesome videos as ever.

  • @lencazero4712
    @lencazero4712 1 year ago

    @Aladdin Persson Thank you, great lesson. Which IDE and theme did you use?

  • @NitishKumar-fl1bg
    @NitishKumar-fl1bg 2 years ago

    Please make a video on implementing Video Summarization with Frame Index Vision Transformer.