Pytorch Transformers from Scratch (Attention is all you need)

  • Published on 24 Dec 2024

Comments • 328

  • @AladdinPersson  4 years ago +67

    Here's the outline for the video:
    0:00 - Introduction
    0:54 - Paper Review
    11:20 - Attention Mechanism
    27:00 - TransformerBlock
    32:18 - Encoder
    38:20 - DecoderBlock
    42:00 - Decoder
    46:55 - Forming The Transformer
    52:45 - A Small Example
    54:25 - Fixing Errors
    56:44 - Ending

    • @alhasanalkhaddour434  4 years ago

      First, thanks for this amazing video, but I have one question regarding the implementation of self-attention.
      To distribute values, keys and queries to the heads you just reshape the input, while the original paper suggests projecting them with trainable matrices.
      Am I right, or did I miss something?

    • @feravladimirovna1044  3 years ago

      @@alhasanalkhaddour434 Yes, I think he did the projection already using self.values, self.keys, self.queries, since these are linear layers. The real inputs come from the parameters passed to the forward function; see 14:43 for more details.

    • @riyajatar6859  3 years ago

      Why did you define self.values, self.keys in the __init__ method? They are not used at all in forward.

    • @somayehseifi8269  2 years ago

      Sorry, can you share the GitHub link for this code?

    • @yufancao4369  1 year ago

      @@riyajatar6859 Actually he fixed that at the end of the video🧐

  • @rhronsky  4 years ago +281

    Attention is not all we need, this video is all we need

  • @pratikhmanas5486  4 years ago +159

    I haven't found a tutorial this detail-oriented anywhere else. Now I am completely able to understand the Transformer and the attention mechanism. Great work, thank you 😊

    • @AladdinPersson  4 years ago +11

      I really appreciate you saying that, thanks a lot :)

    • @NICe-wm9xn  1 year ago +1

      @@AladdinPersson Hi! You missed one error in your video. In your GitHub code you have `self.values = nn.Linear(embed_size, embed_size)`, but in the video you used `self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)`. I couldn't reproduce your results until I noticed this discrepancy.

  • @bhargav7476  4 years ago +34

    I watched 3 Transformer videos before this one and thought I would never understand it. Love the way you explained such a complicated topic.

  • @nishantraj95  3 years ago +36

    This is undoubtedly one of the best transformer implementation videos I have seen. Thanks for posting such good content. Looking forward to seeing some more paper implementation videos.

  • @mykytahordia  1 year ago +1

    Making something sophisticated this easy and clear is what I call magic. Aladdin, you are truly a magician.

  • @gautamvashishtha3923  1 year ago +4

    Great Tutorial! Thanks Aladdin

    • @jushkunjuret4386  1 year ago

      for actually training it, what would we do?

    • @gautamvashishtha3923  1 year ago

      @@jushkunjuret4386 Can you specify where you're exactly getting stuck?

  • @cc-to2jn  3 years ago +9

    you're an absolute saint. idk if i can even put it into words the amount of respect and appreciation I have for you man! thank you!

  • @sehbanomer8151  4 years ago +63

    In the original paper each head should have separate weights, but in your code all heads share the same weights. Here are two steps to fix it:
    1. In __init__: self.queries = nn.Linear(self.embed_size, self.embed_size, bias=False) (same for the key and value weights)
    2. In forward: put "queries = self.queries(queries)" above "queries = queries.reshape(...)" (same for keys and values)
    Great video btw
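
    A rough sketch of what those two changes could look like, keeping the class layout from the video (illustrative only, not the exact repo code; the sqrt(head_dim) scaling follows the paper rather than the video):

    import torch
    import torch.nn as nn

    class SelfAttention(nn.Module):
        # Project with full embed_size x embed_size matrices, then reshape into
        # heads, so each head effectively gets its own slice of the weights.
        def __init__(self, embed_size, heads):
            super().__init__()
            assert embed_size % heads == 0
            self.heads = heads
            self.head_dim = embed_size // heads
            self.values = nn.Linear(embed_size, embed_size, bias=False)
            self.keys = nn.Linear(embed_size, embed_size, bias=False)
            self.queries = nn.Linear(embed_size, embed_size, bias=False)
            self.fc_out = nn.Linear(embed_size, embed_size)

        def forward(self, values, keys, queries, mask=None):
            N, query_len = queries.shape[0], queries.shape[1]
            value_len, key_len = values.shape[1], keys.shape[1]
            # project first, then split into heads
            values = self.values(values).reshape(N, value_len, self.heads, self.head_dim)
            keys = self.keys(keys).reshape(N, key_len, self.heads, self.head_dim)
            queries = self.queries(queries).reshape(N, query_len, self.heads, self.head_dim)
            # attention scores per head: (N, heads, query_len, key_len)
            energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)
            if mask is not None:
                energy = energy.masked_fill(mask == 0, float("-1e20"))
            attention = torch.softmax(energy / self.head_dim ** 0.5, dim=-1)
            # weighted sum of values, then concatenate the heads again
            out = torch.einsum("nhqk,nkhd->nqhd", attention, values)
            return self.fc_out(out.reshape(N, query_len, self.heads * self.head_dim))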

    • @AladdinPersson  4 years ago +14

      Hey, thank you so much for bringing this to my attention ;) When reading through the paper I get the same idea that you do, namely that each head should have separate weights, and blog posts like "The Annotated Transformer" do exactly what you describe. The blog post at www.peterbloem.nl/blog/transformers explains narrow vs wide self-attention, and in his GitHub implementation he does it similarly to me; however, I noticed that an issue has been raised there about the same point you bring up: github.com/pbloem/former/issues/13.
      I agree with the point brought up there as well: if each head uses the same weights, it doesn't feel like you can say they are different. I'm having difficulty finding other implementations, but I will keep a close eye on this, and if I get some more time I will investigate it further. I'm also a bit surprised that training with this implementation gives good results, if I remember correctly, with only 3x32x32 vs 3x256x256 parameters.

    • @sehbanomer8151  4 years ago +5

      @@AladdinPersson Yes, both methods should work just fine, but I believe using separate weights for each head would give better performance without slowing down the model. It would use more memory, of course, but that's almost nothing compared to the number of parameters in the feedforward sublayers.

    • @66dp97  2 years ago +2

      @@sehbanomer8151 I think your implementation may still have some issues. Since each head should have separate weights, shouldn't there be eight (the number of heads) different head_dim*head_dim linear layers instead of one embed_size*embed_size linear layer? Additionally, these two implementations have different numbers of parameters.

    • @sehbanomer8151  2 years ago +6

      @@66dp97 The key, query & value projection of each head projects an _embed_dim_-dimensional vector into a _head_dim_-dimensional space, so for each attention head the projection matrix has shape (head_dim, embed_dim). Fusing _n_heads_ separate linear layers into a single (embed_dim, head_dim * n_heads) linear layer is more GPU-friendly, and thus faster.

  • @niranjansitapure4740  1 year ago +4

    I have been struggling to implement and understand custom transformer code from various sources. This was perhaps one of the best tutorials.

    • @Priyanshuc2425  5 months ago

      Hey, if you are still working in NLP or transformers I may need your help, please reply.

  • @chefmemesupreme  1 year ago +4

    This is cool. It would be helpful to have a section highlighting which dimensions should be changed if you are using a dataset of a different size or want to change the input length, i.e. keeping the architecture constant but noting how it could be used flexibly.

  • @FawzyBasily  1 year ago +2

    Many thanks to you for this impressive tutorial, amazing job and outstanding explanation, and also thanks for sharing all these resources in the description.

  • @TheClaxterix  2 years ago

    This is one of the best paper-to-code explanation videos I've watched in a long time! Congrats, Aladdin dude!

  • @bingochipspass08  1 year ago

    Yeah, agreed. This was an extremely difficult architecture to implement, with a LOT of moving parts, but this has to be the best walkthrough out there. Sure, there are certain things like the src_mask unsqueeze that were a little tricky to visualize, but even barring that, you broke it down quite well! Thank you for this! I'm so glad that we have all of this implemented in HF/PT, haha.

  • @SahilKhose  4 years ago +2

    Hey Aladdin,
    Really amazing videos brother!
    This was the first video of yours that I stumbled upon and I fell in love with your channel.

    • @AladdinPersson  4 years ago +1

      Hey Sahil, I definitely need a refresher and to go through transformers again, so I'm not sure I can give you the best answer right now. From what I recall, the most important part of the masking with regard to padding is that we make sure the padded positions are not backpropagated through. We don't want the network weights, embeddings, etc. to learn associations with the padded values, and that's what setting them to -infinity accomplishes, since the softmax output (and therefore its gradient) is then 0 for those positions.
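
      As a small standalone illustration of that point (toy numbers, not the code from the video): a position masked to a very large negative value gets softmax weight 0, so no gradient flows back through its score.

      import torch

      scores = torch.tensor([[2.0, 1.0, 3.0, 0.5]], requires_grad=True)
      pad_mask = torch.tensor([[1, 1, 1, 0]])              # last position is padding
      masked = scores.masked_fill(pad_mask == 0, float("-1e20"))
      weights = torch.softmax(masked, dim=-1)              # padded position gets weight 0
      values = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
      out = weights @ values                               # a toy attention output
      out.backward()
      print(weights)      # last entry is 0
      print(scores.grad)  # gradient w.r.t. the padded score is 0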

    • @SahilKhose  4 years ago

      @@AladdinPersson Yeah, I get the reason why we do it and the -inf setting. My doubt was about the padding that we use: I feel we need more masking to handle the cases where both sentences are padded and we then attend over those positions. I think I made that pretty clear in my comment above.

    • @RicardoMlu-tw2ig  3 months ago

      @@SahilKhose It doesn't matter that the padded positions end up with numbers in the final output, because the parameters used to compute them won't receive any gradient from those positions, since the derivative of the softmax is 0 for them.

  • @张晨雨-q8j  2 years ago +1

    great explanation, much more helpful than the theoretical only explanations

  • @YL-nx3yk  3 years ago +1

    The best video I've watched on YouTube! Why did I find you so late!!!

  • @MozhganRahmatinia  1 year ago

    It is the best description for transformer implementation.
    thank you so much.
    best regards.

  • @teetanrobotics5363  3 years ago

    World's best YouTube channel evaaaa.

  • @쉔송이  4 years ago

    Love your work!! I was very confused when dealing with other tutorials... but your work made the Transformer clear to me. I only wish I had found you and your work sooner.

    • @AladdinPersson  4 years ago +1

      I appreciate the kind words 🙏

  • @qiguosun129  3 years ago

    First of all, thank you for the video. The most valuable thing I learned from it is how to build such a complex model step by step from the flow chart. Next, I will find out whether this self-attention model can be used for environmental pollution problems.

  • @marksaroufim  2 years ago +1

    wow, the best transformer tutorial I've seen

  • @soorkie  4 years ago

    I found this very helpful. I always used to get confused regarding the tensor sizes. Now it's all clear. Thank you very much. Also this is the first time I came across einsum. Thanks again for that too.

    • @AladdinPersson  4 years ago +1

      Appreciate the kind words 🙏

  • @shahnawazrshaikh9108  3 years ago

    One of the best resources on the internet!

  • @xiangzhang7723  3 years ago

    Hi, I really like your channel. I have been learning from your tutorials for a while. Best wishes!

  • @flamingflamingo4021  4 years ago +1

    It's an extremely useful video for researchers trying to implement code from papers. Do make a series implementing other machine learning papers as well.
    Please make a video using this model on an actual NLP task such as translation, etc.

    • @AladdinPersson  4 years ago +1

      Thank you for saying that I really appreciate it. I have made one other video on transformers for machine translation, and I will do my best to continue making videos and to cover more advanced topics! :)

    • @flamingflamingo4021  4 years ago

      @@AladdinPersson I can't seem to find it. Can you paste the link here, please? I'd truly appreciate it. :)

    • @AladdinPersson  4 years ago

      @@flamingflamingo4021 Yeah for sure: th-cam.com/video/M6adRGJe5cQ/w-d-xo.html
      It's the last video of an unofficial series on building Seq2Seq models for machine translation. The first video was plain seq2seq, the second was seq2seq + attention, and the last video that I linked above uses transformers. These videos were inspired a lot by Bentrevett on GitHub, and I recommend you check him out as well if you're interested in NLP :)

  • @rayzhang2589  3 years ago

    Really thank you!!! This really helps me deeply understand Transformer!!!

  • @F30-Jet  1 year ago +1

    This video is all I needed

  • @ScriptureFirst  3 years ago +1

    45:59: WHOA! slow that down! Pause a sec, be emphatic if we're going to change something back up there

  • @alikhodabakhsh2653  1 year ago

    Excellent video, and thank you for sharing this. I have one point about the implementation: in the "SelfAttention" class, for the query, value and key matrices (linear layers) you used (head_dim, head_dim) dimensions, so these matrices are shared across all heads. I think it's better to use an (embed_dim, embed_dim) matrix to map the input to the q, k, v vectors and then reshape to add the head dimension.

  • @srinivasvinnakota1747  2 years ago

    Dude, you rock! I bow to your expertise 🙏😊

  • @danielmaier6665  3 years ago

    Very good tutorial!
    Just one thing though: this is not how multi-head attention is implemented in the original "Attention Is All You Need" paper. In the paper the input is not split into h smaller vectors, but linearly transformed h times. So there wouldn't be a reshape followed by linear(head_dim, head_dim), but rather linear(embed_size, head_dim) in each head.
    Also, you can have more heads than heads*head_dim = embed_size allows, because in the paper the concatenated head outputs are transformed again by a jointly trained matrix WO (of shape concatenation size x embed_size).

  • @nicknguyen690  4 years ago +2

    For the dropout in your code, for example in DecoderBlock.forward, I think it should be:
    query = self.norm(x + self.dropout(attention))
    instead of:
    query = self.dropout(self.norm(attention + x))
    Here is the paper quote:
    "We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized."

    • @nicknguyen690  4 years ago +1

      Thanks so much for the great work!

    • @AladdinPersson  4 years ago +3

      I think you're right, I'll look into this some more soon and update the Github code :)

    • @上上上上上昇気流_改  3 years ago

      Thank you very much!

  • @geriskenderi  4 years ago

    Compliments for the video, really gives better insight into a complex architecture. Thanks for sharing all this information.

  • @KinjalMondal-yj9fx  5 months ago

    Great Explanation: Loved it

  • @thatipelli1  3 years ago +1

    This is the best tutorial on Transformers online. I was able to understand the nuts and bolts of it. Kudos to you!! It would be great if you could cover Graph Convolutional Networks from scratch.

  • @anas.2k866  1 year ago

    In the Jay Alammar blog there is no split of the embeddings in order to compute attention for each head.

  • @zhengyuancui7837  1 year ago

    Great work! Really helped me. Thanks.

  • @parthasarathyk5476  3 years ago

    Superb...Hats off. Thank you for explanation.

  • @TomatoBananaPotato  4 years ago

    Gonna try this for my uni assignment! Thank you

  • @olegborisov5091  4 years ago +2

    24:41 and the other einsum parts... why do we want these particular shapes? Don't rush, please take some time to explain all the important details in the video :)
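
    For anyone else puzzled by the shapes, here is a small standalone example (dimensions chosen arbitrarily) of what the two einsum calls compute:

    import torch

    N, heads, head_dim = 2, 8, 32          # batch size, heads, per-head dimension
    query_len, key_len = 7, 9

    queries = torch.randn(N, query_len, heads, head_dim)
    keys = torch.randn(N, key_len, heads, head_dim)
    values = torch.randn(N, key_len, heads, head_dim)

    # dot product of every query with every key, separately for each head
    energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)
    print(energy.shape)                    # torch.Size([2, 8, 7, 9])

    attention = torch.softmax(energy / head_dim ** 0.5, dim=-1)

    # weighted sum of the values for every query position, per head
    out = torch.einsum("nhqk,nkhd->nqhd", attention, values)
    print(out.shape)                       # torch.Size([2, 7, 8, 32])
    print(out.reshape(N, query_len, heads * head_dim).shape)   # torch.Size([2, 7, 256])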

  • @kolla_teja  4 years ago

    excellent work mate cleared all my doubts

  • @joeyk2346  3 years ago

    Great Job!! Thanks for the video!

  • @amankushwaha8927  2 years ago

    Thanks Aladdin. The video helped a lot.

  • @RicardoMlu-tw2ig  3 months ago

    Also love your voice btw, feels calming 👍

  • @fuat7775  4 years ago

    Very detailed and clear! Thank you very much!

    • @AladdinPersson  4 years ago

      Thanks a lot for the kind words🙏

  • @slov1ker583  23 days ago

    10:06, a little confused here. Is the embedding cut into equal parts for every token, OR is the original input multiplied by weights that reduce the embedding dimension for every word?

  • @yashsvidixit7169  11 months ago

    Happiness is all you need, neither attention, nor transformer.

  • @garrettosborne4364  3 years ago

    Great video, advanced my understanding.

  • @MrWilliamducfer  4 years ago

    Very nice! Congratulations!!

  • @robertpaulson2052  1 year ago

    You'll be happy to know that ChatGPT recommended this video and gave a link to it when I asked for resources explaining transformers.

  • @alfonsocvu  1 year ago

    Thanks a lot for the video, this was great and it's helping me a lot.

  • @matteofabro4486  4 years ago +1

    Thank you very much for the info!

  • @kaustubhkulkarni  3 years ago +2

    Amazing tutorial! The concept of Transformers became much clearer after following this, although there is one thing I had a doubt about. In the part where you implement positional embeddings, there is no implementation of the sin and cos functions given in the paper. Why is that?
    Thank you again for the tutorial!

    • @junweilu4990  3 years ago

      Having the same question too. Did he use "nn.Embedding" to replace the original method ?

    • @kaustubhkulkarni  3 years ago

      @@junweilu4990 In the code, he did use nn.Embedding to replace the original method. But I do not know the logic behind this approach, as it does not "truly" give positional features, does it?

    • @sahhaf1234  2 years ago

      @@junweilu4990 I also expected him to use word2vec or some fixed embedding algorithm. I think by using nn.Embedding() he tries to learn the word vectors and positional vectors while training his attention algorithm, encoders and decoders. Very wasteful, I believe. I think the rest of his algorithm will converge only after his word and positional vectors converge, if they ever do...
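
      For reference, a minimal sketch of the fixed sinusoidal encoding from the paper (unlike the learned nn.Embedding in the video, it is not trained; assumes an even embed_size):

      import math
      import torch

      def sinusoidal_positional_encoding(max_len, embed_size):
          # PE[pos, 2i]   = sin(pos / 10000^(2i / embed_size))
          # PE[pos, 2i+1] = cos(pos / 10000^(2i / embed_size))
          pe = torch.zeros(max_len, embed_size)
          position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
          div_term = torch.exp(
              torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size)
          )
          pe[:, 0::2] = torch.sin(position * div_term)
          pe[:, 1::2] = torch.cos(position * div_term)
          return pe  # (max_len, embed_size), fixed rather than learned

      # usage (illustrative): x = word_embedding(tokens) + sinusoidal_positional_encoding(seq_len, embed_size)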

  • @ShamailMulla-x5n  5 months ago

    This was really helpful! Can you also do a tutorial for Decision Transformer for reinforcement learning?

  • @deepshankarjha5344  4 years ago +1

    fantastic, awesome videos as ever.

  • @roman-bushuiev  2 years ago +1

    At 17:57 you forgot to apply linear projections before reshaping.

  • @ocean6709  1 year ago

    Hi, at 18:45, energy = queries * keys. Are you doing an outer product?

  • @anujsingh5961  2 years ago +1

    Hi, thanks for the superb video. I have one doubt regarding the SelfAttention block:
    self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
    self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
    self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
    " Don't you think that V,K,Q should be computed on the whole embedding vectors and later we should divide based on heads.?

  • @panchajanya91  2 years ago +2

    It is a great video. I have some confusion about understanding it fully.
    In the SelfAttention class you created the linear layers self.values, self.keys and self.queries, but you didn't use those layers.

  • @allessandroable  4 years ago +3

    Thank you! Great explanation. I just wonder why, in the attention mechanism, you have to initialize self.queries, self.keys etc. as Linear layers.

    • @ayyythatguy  3 years ago

      From the paper, the attention projections are fully connected, which means you should use linear layers.

  • @obiohagwu788  2 years ago

    Dude! you're amazing!

  • @davidray6126  1 year ago

    Thanks for this amazing tutorial. I think the "energy" (Q * K_transpose) should be divided by the square root of head_dim instead of the square root of the embedding size.

  • @RicardoMlu-tw2ig  3 months ago

    Thanks so much!🎉

  • @coder-monk  2 years ago +2

    In SelfAttention, you have not used the linear layers self.keys, self.values, self.queries in the forward method; what's the use of those layers?

  • @NitishKumar-fl1bg  2 years ago

    Please make a video on implementing Video Summarization with Frame Index Vision Transformer.

  • @feravladimirovna1044  4 years ago +1

    One last question, please: what is the intuitive meaning of the source and target inputs of the transformer? Why does the model take x, trg[:, :-1]?
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
    out = model(x, trg[:, :-1])
    What can we get from out?
    I tried
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)
    out = model(x, trg)
    and got
    torch.Size([2, 8, 10])
    To be honest, I could not interpret that :(

    • @SahilKhose  4 years ago

      Okay, so if I understand your question correctly, your doubt is why we use trg[:, :-1] instead of trg.
      First:
      trg[:, :-1] means all the batches (sentences), and every sentence except its last word.
      Second:
      We do this because of how the transformer is trained. The decoder predicts each word from the words before it: given the target up to time step t-1, it predicts the word at time step t. Hence we provide the whole sentence except the last word as input, and the model learns to predict every next word, including the last one.
      Refer to the beautiful video by Yannic Kilcher:
      th-cam.com/video/iDulhoQ2pro/w-d-xo.html
      Hope your doubt is solved. Let me know if it's still unclear.
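
      A tiny sketch to make the indexing concrete (hypothetical token ids, not the dataset from the video): the decoder input is the target without its last token, and the labels are the target without its first token.

      import torch

      trg = torch.tensor([[1, 5, 6, 7, 2],
                          [1, 8, 9, 4, 2]])   # toy batch: 1 = <sos>, 2 = <eos>

      decoder_input = trg[:, :-1]   # feed everything except the last token
      labels = trg[:, 1:]           # the word to predict at every position

      # out = model(src, decoder_input)          -> (2, 4, vocab_size)
      # loss = criterion(out.reshape(-1, vocab_size), labels.reshape(-1))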

  • @learner3539  3 years ago

    Great! ❤ Thanks for this masterpiece. I followed along and didn't hit any errors when I ran it, since I had already noted the error in your code and updated it 😊. Waiting for your next video. This is the first video of yours I've followed along with. Subscribed! Bell notification on.

  • @polimetakrylanmetylu2483  3 years ago +2

    Hey, I'm a bit confused: in the self-attention block (around 15:20) you define value, query and key layers, but they are unused in the forward method. If I understand correctly, they should be used after each sequence's reshape?

  • @sargamgupta7134  1 year ago

    Please make a video on using the Temporal Fusion Transformers.

  • @toygunkarabas9853  1 year ago

    At 52:37 you write out = self.decoder(trg, enc_src, src_mask, trg_mask), but I think it should be out = self.decoder(trg, enc_src, trg_key_padding_mask, trg_mask), because you don't explicitly give the source to the decoder; instead you give trg with padding. That is why I think it should be trg_key_padding_mask rather than src_mask. Am I right? Thanks a lot for such a great video :)

  • @czarhada  4 years ago +3

    Excellent! Thank you so much for this! Had a small request, can you please come up with videos on BERT and controlled text generation models like PPLM? Thanks again!

    • @AladdinPersson  4 years ago +1

      Thank you for the comment! I will look into it, got a few videos that I'm planning but will come back to this in the future for sure :)

  • @lencazero4712  1 year ago

    @Aladdin Persson Thank you. Great lesson. Which IDE and theme did you use?

  • @yashrathi6862  2 years ago +1

    Thank you, the video was very helpful. In the end we got an output of dim (2, 7, 10). So why did we get the probabilities of the next 7 words? And why is the output length dependent on the number of words we feed to the decoder?

  • @kyde8392  4 years ago +8

    In your implementation you split the keys, queries and values into heads before passing them through the linear layers; in other implementations, however, the keys, queries and values are split after being passed through the linear layers. The paper seems to follow your implementation, from what I can deduce from the picture of the transformer. Are these two approaches equivalent, or is one better than the other?

    • @AladdinPersson  4 years ago +9

      That's a great observation and one that I've wondered about and been confused by myself. From my understanding, the way I did it is not entirely accurate and we should split after the linear layers; otherwise it doesn't make intuitive sense to me to say we have "different heads" if they all share parameters. With that said, the way I did it works and uses a lot fewer parameters, but I would go with the way you've seen others do it (I haven't done extensive tests to see how much better it performs, so let me know if you ever test this).

  • @nasirop7551  3 years ago

    I love your tutorials

  • @jianweitang4790  4 years ago +2

    I've got a question here. In order to generate a target sentence, there should be multiple time steps, right?
    The first output word from the decoder should go through the decoder again to generate the second output word.
    I can't find where you define this in the video. Or maybe I'm understanding it wrong.

    • @AladdinPersson  4 years ago +2

      During training everything is done in parallel (we have the entire translated target sentence) and we utilize the target masks that I talked about in the video. This is a major difference between the transformer and a normal Seq2Seq model: we actually send in the entire target sentence rather than word by word. When we evaluate the model, you're completely right that we need multiple time steps (one word at a time), but that is not the case during training. In this video we just build the transformer from scratch; your question is more about actually training and evaluating transformer models.
      Here is a full code example of using transformers (I also have a separate video on it): github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/seq2seq_transformer.py
      When we actually evaluate the model we need to do it time step by time step, and it looks like this (the translate sentence function), which I believe is what you're asking for: github.com/AladdinPerzon/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/seq2seq_transformer/utils.py
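
      Roughly, the step-by-step evaluation loop looks like this (a greedy-decoding sketch assuming a trained model and <sos>/<eos> indices, not the exact utils.py code):

      import torch

      @torch.no_grad()
      def greedy_translate(model, src, sos_idx, eos_idx, max_len=50):
          # src: (1, src_len) tensor of source token ids
          outputs = [sos_idx]
          for _ in range(max_len):
              trg = torch.tensor(outputs, device=src.device).unsqueeze(0)  # (1, cur_len)
              logits = model(src, trg)                    # (1, cur_len, vocab_size)
              next_token = logits[0, -1].argmax().item()  # best guess for the next word
              outputs.append(next_token)
              if next_token == eos_idx:                   # stop once <eos> is produced
                  break
          return outputs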

    • @jianweitang4790  4 years ago

      @@AladdinPersson Thank you, the link explained it pretty well. Thank you a lot.

  • @siennypoole4366  2 years ago +1

    Hi everyone! I finished following this tutorial to the end... But now I am confused about how to "train" and "test/predict" this model. Any help is appreciated! Thanks!

  • @xanyula2738  3 years ago +1

    I can't seem to understand the necessity for self.keys, self.queries and self.values in the SelfAttention class. Am I missing something?

  • @PrzemyslawDolata  4 months ago

    14:05 Is this even correct, though? Shouldn't you project Q, K and V from `embed_size` to `embed_size` and split the resulting tensors? Your implementation seems to treat parts of the embedding dimension separately (instead of jointly attending to the entire embedding space) and to use the same weights (instead of having separate weights for each head).

    • @AladdinPersson  4 months ago

      Hey, it's been a while since I looked at this, but I believe I might have made a mistake in this video; I remember going through the implementation a while later and modifying this line, if I recall correctly. Check the latest version on GitHub.

  • @anirband  4 years ago +1

    Very nice video. I have a question.
    The positional encoding you used is different from the one in the paper, where they use sin/cos functions of the word position and vector index. It seems that in your code these positional embeddings are trained, unlike in the paper. Do you have code for how positional encoding is done in the paper?

    • @AladdinPersson  4 years ago +1

      Yes, you're right about that; if I recall I did mention it in the video, but I could have missed it. There have been other questions about this as well, so I might implement the paper's positional encoding too, but as of right now I have not.

  • @himanikumar7979  4 months ago

    Shouldn't it be query_len instead of key_len at 17:38?

  • @hsoley  2 years ago +1

    Aladdin, thank you so much for the video!
    One question regarding the example: why do you expect the output shape to be [2, 7, 10]?

    • @fib4983  2 years ago

      2 examples, 7 tokens, 10 probabilities (the vocabulary size).
      I don't understand myself how I would go about this.

    • @fib4983  2 years ago

      Oh wait! This adds up for me.
      They output all 7 words at the same time, but the target mask REMOVES the information as if it hadn't been generated yet.
      When you apply a transformer live, you would have to iterate through it.

  • @algorithmo134  3 months ago

    At 54:02, why do we need to pass the truncated trg[:, :-1]? I thought we already applied the mask to prevent the model from looking into the future?

  • @allurbase  1 year ago

    Why do you apply attention twice? Inside the transformer block and inside the decoder block?

  • @TheUltimateBaccaratApp  1 year ago

    Fantastic video. Question: could you revisit this and add how to do early stopping and hyperparameter tuning for a "from scratch" transformer model?

  • @sidneybassett2683  2 years ago +1

    I have viewed your code on GitHub, and I'm wondering why you use the same weight matrix for different heads. Won't this make the Q, K, V vectors for the different heads the same?

    • @zhuchencao2527  2 years ago

      Damn, I'm suffering from the same question.
      I am confused as to why, here, the kqv projections for the different heads seem to be shared. It seems like we should use nn.Linear(embed_dim, embed_dim), and later divide it into different heads?

  • @chrisogonas  1 year ago

    Superb!

  • @michaeldurand9309  1 year ago

    Thank you for your video. You did a great job! I was wondering how to train a transformer if the input form is (batch_size, sequence_length, number_of_features). Let's say number_of_features = 2 (it could be X and Y coordinates in time, for example). What impact does this type of input have on positional encoding, the masking strategy and the attention mechanism?

  • @matejmnoucek2865  2 years ago

    Why do you set bias=False for nn.Linear of keys, values and queries?

  • @adamtran5747  2 years ago

    Love this content.

  • @marcopleines112  2 years ago

    Thanks for your educational contribution! Just one question: what are the linear layers self.values, self.keys and self.queries for? These are not used inside the forward pass.

  • @abdulrahmanadel8917  3 years ago +1

    If I'm using a transformer for a speech recognition task (speech-to-text), after training the model, what should I put in the target parameter at prediction time if I only have an audio file (not transcribed)?

    • @popamaji  3 years ago

      did you get ur answer?

  • @N3xUss99  1 year ago

    I don't know if anyone is still watching this, but I have a question at the code level: in the decoder's "forward" method, when I pass the parameters to the layer, I had "target_mask" as the fifth one, but in the decoder block you decided to put the parameter "device". Did I miss something, is it just an error, or is there another explanation? Thanks a lot.

  • @fq6475  4 years ago +1

    Thanks so much!

  • @krishnachauhan2822  3 years ago

    I am not understanding sir, does this input sequence is divided into a number of chunks like here you did 256/8 where 8 is the number of attention heads. I am thinking for the self-attention whole of the input embeddings need to transform into three parts namely Q, K and V. and then we need to divide this for 8 times in the case of 8 multi heads. that's why the name is the multi head. Please clear. Regards

  • @saadatkhan2791  3 years ago +1

    Hello, this is a great explanation of transformers. I have a question: how did you know that query.shape[0] would give you the number of training examples? And why is it later used in reshaping the keys, queries, and values?

    • @takihasan8310  1 year ago +1

      The first dimension is always the batch size in these tensor operations: the model is trained on batches, and the batch size is the number of samples.

  • @MrTennis666666  3 years ago

    Why does the input embedding have 256 dimensions at 10:02?

  • @onguyenthanh1137  2 years ago +1

    Hi Aladdin. I have just finished watching the "TransformerBlock" part, but I've noticed something. In the first LayerNorm, I wonder why it is x = self.norm1(attention + queries)? As far as I understand, it should be x = self.norm1(attention + input), where queries = Wq*input : ) (BTW, so far you haven't used the notation "input" for the input sequence vectors, but I hope what I'm saying is clear enough.)

    • @fib4983  2 years ago

      How he has it is correct: `queries = Wq*input` only gets calculated inside the attention module, so queries is still just the normal input (values makes more sense; that's how they describe it in the paper).

  • @Hannah13147  3 years ago

    Thank you so much!!!