Attention for Neural Networks, Clearly Explained!!!

  • Published on Jun 18, 2024
  • Attention is one of the most important concepts behind Transformers and Large Language Models, like ChatGPT. However, it's not that complicated. In this StatQuest, we add Attention to a basic Sequence-to-Sequence (Seq2Seq or Encoder-Decoder) model and walk through how it works and is calculated, one step at a time. BAM!!!
    NOTE: This StatQuest is based on two manuscripts. 1) The manuscript that originally introduced Attention to Encoder-Decoder Models: Neural Machine Translation by Jointly Learning to Align and Translate: arxiv.org/abs/1409.0473 and 2) The manuscript that first used the Dot-Product similarity for Attention in a similar context: Effective Approaches to Attention-based Neural Machine Translation arxiv.org/abs/1508.04025
    NOTE: This StatQuest assumes that you are already familiar with basic Encoder-Decoder neural networks. If not, check out the 'Quest: • Sequence-to-Sequence (...
    If you'd like to support StatQuest, please consider...
    Patreon: / statquest
    ...or...
    YouTube Membership: / @statquest
    ...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
    statquest.org/statquest-store/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    3:14 The Main Idea of Attention
    5:34 A worked out example of Attention
    10:18 The Dot Product Similarity
    11:52 Using similarity scores to calculate Attention values
    13:27 Using Attention values to predict an output word
    14:22 Summary of Attention
    #StatQuest #neuralnetwork #attention
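
    For anyone who wants to see the whole calculation at once, here is a minimal NumPy sketch of one decoder step with attention, following the steps in the chapter list above (the numbers are illustrative only, not the exact values worked out in the video):

```python
import numpy as np

# Outputs (short-term memories) from the encoder's LSTM units,
# one row per input word, two LSTM units per word (illustrative values).
encoder_outputs = np.array([[-0.76,  0.75],   # "Let's"
                            [ 0.01, -0.10]])  # "go"

# Outputs from the decoder's LSTM units for the current token.
decoder_output = np.array([0.91, 0.38])

# 1) Dot-product similarity between the decoder output and each encoder output.
scores = encoder_outputs @ decoder_output

# 2) A softmax turns the similarity scores into weights that sum to 1.
weights = np.exp(scores) / np.sum(np.exp(scores))

# 3) Attention values: the weighted sum of the encoder outputs.
attention_values = weights @ encoder_outputs

# 4) The attention values and the decoder outputs are then fed into a
#    fully connected layer + softmax to predict the next output word.
fc_input = np.concatenate([attention_values, decoder_output])
print(scores, weights, attention_values, fc_input)
```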

Comments • 392

  • @statquest
    @statquest  ปีที่แล้ว +8

    To learn more about Lightning: lightning.ai/
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @koofumkim4571
    @koofumkim4571 11 หลายเดือนก่อน +56

    “Statquest is all you need” - I really needed this video for my NLP course but glad it’s out now. I got an A+ for the course, your precious videos helped a lot!

    • @statquest
      @statquest  11 หลายเดือนก่อน +5

      BAM! :)

  • @atharva1509
    @atharva1509 ปีที่แล้ว +122

    Somehow Josh always figures out what video we are going to need!

    • @yashgb
      @yashgb ปีที่แล้ว +1

      Exactly, I was gonna say the same 😃

    • @statquest
      @statquest  ปีที่แล้ว +14

      BAM! :)

    • @yesmanic
      @yesmanic ปีที่แล้ว +2

      Same here 😂

  • @MelUgaddan
    @MelUgaddan 9 หลายเดือนก่อน +6

    The level of explainability from this video is top-notch. I always watch your video first to grasp the concept then do the implementation on my own. Thank you so much for this work !

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Glad it was helpful!

  • @Travel-Invest-Repeat
    @Travel-Invest-Repeat 11 หลายเดือนก่อน +8

    Great work, Josh! Listening to my deep learning lectures and reading papers become way easier after watching your videos, because you explain the big picture and the context so well!! Eagerly waiting for the transformers video!

    • @statquest
      @statquest  11 หลายเดือนก่อน +2

      Coming soon! :)

  • @clockent
    @clockent ปีที่แล้ว +19

    This is awesome mate, can't wait for the next installment! Your tutorials are indispensable!

  • @rutvikjere6392
    @rutvikjere6392 ปีที่แล้ว +9

    I was literally trying to understand attention a couple of days ago and Mr.BAM posts a video about it. Thanks 😊

  • @Murattheoz
    @Murattheoz 9 หลายเดือนก่อน +11

    I feel like I am watching a cartoon as a kid. :)

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      bam!

  • @dylancam812
    @dylancam812 ปีที่แล้ว +21

    Dang this came out just 2 days after my neural networks final. I’m still so happy to see this video in my feed. You do such great work Josh! Please keep it up for all the computer scientists and statisticians that love your videos and eagerly await each new post

    • @statquest
      @statquest  ปีที่แล้ว +1

      Thank you very much! :)

    • @Neiltxu
      @Neiltxu ปีที่แล้ว +1

      @@statquest it came out 3 days before my Deep Learning and NNs final. BAM!!!

    • @statquest
      @statquest  ปีที่แล้ว

      @@Neiltxu Awesome! I hope it helped!

    • @Neiltxu
      @Neiltxu ปีที่แล้ว

      @@statquest for sure! Your videos always help! Btw, do you ship to Spain? I like the hoodies in your shop

    • @statquest
      @statquest  ปีที่แล้ว

      @@Neiltxu I believe the hoodies ship to Spain. Thank you for supporting StatQuest! :)

  • @aayush1204
    @aayush1204 9 หลายเดือนก่อน +2

    1 million subscribers INCOMING!!!
    Also huge thanks to Josh for providing such insightful videos. These videos really make everything easy to understand, I was trying to understand Attention and BAM!! found this gem.

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Thank you very much!!! BAM! :)

  • @SharingFists
    @SharingFists 11 หลายเดือนก่อน +4

    This channel is pure gold. I'm a machine learning and deep learning student.

    • @statquest
      @statquest  11 หลายเดือนก่อน

      Thanks!

  • @aquater1120
    @aquater1120 ปีที่แล้ว +3

    I was just reading the original attention paper and then BAM! You uploaded the video. Thank you for creating the best content on AI on YouTube!

    • @statquest
      @statquest  ปีที่แล้ว

      Thank you very much! :)

  • @ArpitAnand-yd7tr
    @ArpitAnand-yd7tr 11 หลายเดือนก่อน +2

    The best explanation of Attention that I have come across so far ...
    Thanks a bunch❤

    • @statquest
      @statquest  11 หลายเดือนก่อน

      Thank you very much! :)

  • @sinamon6296
    @sinamon6296 5 หลายเดือนก่อน +3

    Hi mr josh, just wanna say that there is literally no one that makes it so easy for me to understand such complicated concepts. Thank you ! once I get a job I will make sure to give you guru dakshina! (meaning, an offering from students to their teachers)

    • @statquest
      @statquest  5 หลายเดือนก่อน

      Thank you very much! I'm glad my videos are helpful! :)

  • @linhdinh136
    @linhdinh136 ปีที่แล้ว +6

    Thanks for the wholesome content! Looking forward to a StatQuest video on the Transformer.

    • @statquest
      @statquest  ปีที่แล้ว +1

      Wow!!! Thank you so much for supporting StatQuest!!! I'm hoping the StatQuest on Transformers will be out by the end of the month.

  • @d_b_
    @d_b_ ปีที่แล้ว +1

    Thanks for this. The way you step through the logic is always very helpful

  • @brunocotrim2415
    @brunocotrim2415 หลายเดือนก่อน +1

    Hello StatQuest, I would like to say thank you for the amazing job. This content helped me understand a lot about how Attention works, especially because visuals help me understand better, and the way you join the visual explanation with the verbal one while keeping it interesting is on another level. Amazing work!

    • @statquest
      @statquest  หลายเดือนก่อน

      Thank you!

  • @KevinKansas1
    @KevinKansas1 ปีที่แล้ว +6

    The way you explain complex subjects in an easy-to-understand format is amazing! Do you have an idea of when you will release a video about transformers? Thank you Josh!

    • @statquest
      @statquest  ปีที่แล้ว +6

      I'm shooting for the end of the month.

    • @JeremyHalfon
      @JeremyHalfon 11 หลายเดือนก่อน

      Hi Josh @@statquest, any update on this? Would definitely need it for my final tomorrow :))

    • @statquest
      @statquest  11 หลายเดือนก่อน +1

      @@JeremyHalfon I'm finishing my first draft today. Hope to edit it this weekend and record next week.

  • @ncjanardhan
    @ncjanardhan 2 หลายเดือนก่อน +2

    The BEST explanation of Attention models!! Kudos & Thanks 😊

    • @statquest
      @statquest  2 หลายเดือนก่อน

      Thank you very much!

  • @saschahomeier3973
    @saschahomeier3973 11 หลายเดือนก่อน +1

    You have a talent for explaining these things in a straightforward way. Love your videos. You have no video about Transformers yet, right?

    • @statquest
      @statquest  11 หลายเดือนก่อน +1

      The transformers video is currently available to channel members and patreon supporters.

  • @weiyingwang2533
    @weiyingwang2533 11 หลายเดือนก่อน +1

    You are amazing! The best explanation I've ever found on YouTube.

    • @statquest
      @statquest  11 หลายเดือนก่อน

      Wow, thanks!

  • @familywu3869
    @familywu3869 ปีที่แล้ว +2

    Thank you for the excellent teaching, Josh. Looking forward to the Transformer tutorial. :)

  • @benmelis4117
    @benmelis4117 2 หลายเดือนก่อน +1

    I just wanna let you know that this series is absolutely amazing. So far, as you can see, I've made it to the 89th video, guess that's something. Now it's getting serious tho. Again, love what you're doing here man!!! Thanks!!

    • @statquest
      @statquest  2 หลายเดือนก่อน +1

      Thank you so much!

    • @benmelis4117
      @benmelis4117 หลายเดือนก่อน +1

      @@statquest Personally, since I'm a medical student, I really can't explain how valuable it is to me that you used so many medical examples in the videos. The moment you said in one of the first videos that you are a geneticist, I was sold on this series; it's one of my favorite subjects at uni, crazy interesting!

    • @statquest
      @statquest  หลายเดือนก่อน +1

      @@benmelis4117 BAM! :)

  • @won20529jun
    @won20529jun ปีที่แล้ว +1

    I was literally just thinking I'd love an explanation of attention by SQ!!! Thanks for all your work

  • @lunamita
    @lunamita 3 หลายเดือนก่อน +1

    Can't thank this guy enough; he helped me get my master's degree in AI back in 2022. Now I'm working as a data scientist and I still keep going back to your videos.

    • @statquest
      @statquest  3 หลายเดือนก่อน

      BAM!

  • @rafaeljuniorize
    @rafaeljuniorize 2 หลายเดือนก่อน +1

    This was the most beautiful explanation I've ever had in my entire life, thank you!

    • @statquest
      @statquest  2 หลายเดือนก่อน

      Wow, thank you!

  • @mehmeterenbulut6076
    @mehmeterenbulut6076 9 หลายเดือนก่อน +2

    I was stunned when you started the video with a catchy jingle, man, cheers :D

    • @statquest
      @statquest  9 หลายเดือนก่อน

      :)

  • @MartinGonzalez-wn4nr
    @MartinGonzalez-wn4nr ปีที่แล้ว +4

    Hi Josh, I just bought your books. It's amazing the way that you explain complex things; reading the papers after watching your videos is easier.
    NOTE: waiting for the video on transformers

    • @statquest
      @statquest  ปีที่แล้ว +2

      Glad you like them! I hope the video on Transformers is out soon.

  • @tupaiadhikari
    @tupaiadhikari 10 หลายเดือนก่อน +1

    Thanks Professor Josh for such a great tutorial ! It was very informative !

    • @statquest
      @statquest  10 หลายเดือนก่อน

      My pleasure!

  • @The-Martian73
    @The-Martian73 ปีที่แล้ว +1

    Great, that's really what I was looking for, thanks mr Starmer for the explanation ❤

  • @usser-505
    @usser-505 9 หลายเดือนก่อน +2

    The end is a classic cliffhanger for the series. You talk about how we don't need the LSTMs, and I waited an entire summer for transformers. Good job! :)

    • @statquest
      @statquest  9 หลายเดือนก่อน

      Ha! The good news is that you don't have to wait! You can binge! Here's the link to the transformers video: th-cam.com/video/zxQyTK8quyY/w-d-xo.html

    • @usser-505
      @usser-505 9 หลายเดือนก่อน +1

      @@statquest Yeah! I already watched it when you released it. I commented on how this deep learning playlist is becoming a series! :)

    • @statquest
      @statquest  9 หลายเดือนก่อน

      @@usser-505 bam!

  • @okay730
    @okay730 11 หลายเดือนก่อน +2

    I'm excited for the video about transformers. Thank you Josh, your videos are extremely helpful

    • @statquest
      @statquest  11 หลายเดือนก่อน +1

      Coming soon!

  • @capyk5455
    @capyk5455 11 หลายเดือนก่อน +1

    You're amazing Josh, thank you so much for all this content

    • @statquest
      @statquest  11 หลายเดือนก่อน

      Glad you enjoy it!

  • @ArpitAnand-yd7tr
    @ArpitAnand-yd7tr 11 หลายเดือนก่อน +1

    Really looking forward to your explanation of Transformers!!!

    • @statquest
      @statquest  11 หลายเดือนก่อน

      Thanks!

  • @abdullahhashmi654
    @abdullahhashmi654 ปีที่แล้ว +1

    Been wanting this video for so long, gonna watch it soon!

  • @jacobverrey4075
    @jacobverrey4075 ปีที่แล้ว +1

    Josh - I've read the original papers and countless online explanations, and this stuff never makes sense to me. You are the one and only reason as to why I understand machine learning. I wouldn't be able to make any progress on my PhD if it wasn't for your videos.

    • @statquest
      @statquest  ปีที่แล้ว

      Thanks! I'm glad my videos are helpful! :)

  • @chessplayer0106
    @chessplayer0106 ปีที่แล้ว +4

    Ah excellent this is exactly what I was looking for!

    • @statquest
      @statquest  ปีที่แล้ว +1

      Thank you!

    • @birdropping
      @birdropping ปีที่แล้ว +1

      @@statquest Can't wait for the next episode on Transformers!

  • @hasansayeed3309
    @hasansayeed3309 ปีที่แล้ว +1

    Amazing video Josh! Waiting for the transformer video. Hopefully it'll come out soon. Thanks for everything!

    • @statquest
      @statquest  ปีที่แล้ว

      Thanks! I'm working on it! :)

  • @rathinarajajeyaraj1502
    @rathinarajajeyaraj1502 ปีที่แล้ว +1

    Much awaited one .... Awesome as always ..

  • @abdullahbinkhaledshovo4969
    @abdullahbinkhaledshovo4969 11 หลายเดือนก่อน +1

    I have been waiting for this for a long time

    • @statquest
      @statquest  11 หลายเดือนก่อน

      Transformers comes out on monday...

  • @rrrprogram8667
    @rrrprogram8667 ปีที่แล้ว +1

    Excellent josh.... So finally MEGA Bammm is approaching.....
    Hope u r doing good...

    • @statquest
      @statquest  ปีที่แล้ว

      Yes! Thank you! I hope you are doing well too! :)

  • @envynoir
    @envynoir ปีที่แล้ว +1

    Godsent! Just what I needed! Thanks Josh.

  • @abrahammahanaim3859
    @abrahammahanaim3859 ปีที่แล้ว +1

    Hey Josh your explanation is easy to understand. Thanks

    • @statquest
      @statquest  ปีที่แล้ว +1

      Glad it was helpful!

  • @yizhou6877
    @yizhou6877 11 หลายเดือนก่อน +2

    I am always amazed by your tutorials! Thanks. And when can we expect the transformer tutorial to be uploaded?

    • @statquest
      @statquest  11 หลายเดือนก่อน

      Tonight!

  • @yoshidasan4780
    @yoshidasan4780 6 หลายเดือนก่อน +1

    First of all, thanks a lot Josh! You made it so easy to understand for us, and I will be forever grateful to you for this!! Have a nice time! And can you please upload videos on Bidirectional LSTMs and BERT?

    • @statquest
      @statquest  6 หลายเดือนก่อน

      I'll keep those topics in mind.

  • @AntiPolarity
    @AntiPolarity ปีที่แล้ว +2

    can't wait for the video about Transformers!

  • @souravdey1227
    @souravdey1227 ปีที่แล้ว +1

    Had been waiting for this for months.

    • @statquest
      @statquest  ปีที่แล้ว

      The wait is over! :)

  • @gordongoodwin6279
    @gordongoodwin6279 6 หลายเดือนก่อน +1

    Fun fact: if your vectors are scaled/mean-centered, cosine similarity is geometrically equivalent to the Pearson correlation, and the dot product is the covariance up to a constant factor (an un-scaled correlation).
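
    A quick NumPy sketch of that fact, using made-up random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=50), rng.normal(size=50)

# Mean-center both vectors.
a_c, b_c = a - a.mean(), b - b.mean()

# Cosine similarity of the centered vectors ...
cosine = np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))
# ... equals the Pearson correlation of the original vectors.
pearson = np.corrcoef(a, b)[0, 1]
print(np.isclose(cosine, pearson))  # True

# And the dot product of the centered vectors is the covariance,
# up to the 1/(n-1) factor.
print(np.isclose(np.dot(a_c, b_c) / (len(a) - 1), np.cov(a, b)[0, 1]))  # True
```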

    • @statquest
      @statquest  6 หลายเดือนก่อน

      nice.

  • @user-fj2qq7cp2n
    @user-fj2qq7cp2n ปีที่แล้ว +1

    Thank you very much for your explanation! You are always super clear. Will the transformer video be out soon? I have a natural language processing exam in a week and I just NEED your explanation to get through it 😂

    • @statquest
      @statquest  ปีที่แล้ว +1

      Unfortunately I still need a few weeks to work on the transformers video... :(

  • @rajatjain7894
    @rajatjain7894 ปีที่แล้ว +1

    Was eagerly waiting for this video

  • @owlrion
    @owlrion 11 หลายเดือนก่อน +1

    Hey! Great video, this is really helping me with neural networks at the university, do we have a date for when the transformer video comes out?

    • @statquest
      @statquest  11 หลายเดือนก่อน +1

      Soon....

  • @sabaaslam781
    @sabaaslam781 ปีที่แล้ว +1

    Hi Josh! No doubt, you teach in the best way. I have a request: I have enrolled in a PhD and am going to start my work on graphs. Can you please make a video about Graph Neural Networks and their variants? Thanks.

    • @statquest
      @statquest  ปีที่แล้ว

      I'll keep that in mind.

  • @sreerajnr689
    @sreerajnr689 หลายเดือนก่อน

    Your explanation is AMAZING AS ALWAYS!!
    I have one question. Do we do the attention calculation only on the final layer? For example, if there are 2 layers in the encoder and 2 layers in the decoder, we use only the outputs from the 2nd layer of the encoder and the 2nd layer of the decoder for the attention calculation, right?

    • @statquest
      @statquest  หลายเดือนก่อน +1

      I believe that is correct, but, to be honest, I don't think there is a hard rule.

  • @imkgb27
    @imkgb27 ปีที่แล้ว +2

    Many thanks for your great video!
    I have a question. You said that we calculate the similarity score between 'go' and EOS (11:30). But I think the vector (0.01,-0.10) is the context vector for "let's go" instead of "go" since the input includes the output for 'Let's' as well as the embedding vector for 'go'. It seems that the similarity score between 'go' and EOS is actually the similarity score between "let's go" and EOS. Please make it clear!

    • @statquest
      @statquest  ปีที่แล้ว +1

      You can talk about it either way. Yes, it is the context vector for "Let's go", but it's also the encoding of the word "go", given that we have already encoded "Let's".

  • @rikki146
    @rikki146 ปีที่แล้ว +1

    When I see new vid from Josh, I know today is a good day! BAM!

  • @rishabhsoni
    @rishabhsoni 7 หลายเดือนก่อน

    Superb videos. One question: is the fully connected layer just the softmax layer, with no hidden layer with weights (meaning no weights are learned)?

    • @statquest
      @statquest  7 หลายเดือนก่อน

      No, there are weights along the connections between the input and output of the fully connected layer, and those outputs are then pumped into the softmax. I apologize for not illustrating the weights in this video. However, I included them in my video on transformers, and it's the same here. Here's the link to the transformers video: th-cam.com/video/zxQyTK8quyY/w-d-xo.html
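
      For anyone curious, a minimal sketch of that fully connected layer plus softmax (the weight matrix here is a random stand-in for learned weights, and the input values are illustrative):

```python
import numpy as np

# Inputs to the fully connected layer: 2 attention values followed by
# the 2 decoder LSTM outputs (illustrative numbers).
fc_input = np.array([-0.3, 0.3, 0.9, 0.4])

# Weights (4 inputs x 4 candidate output tokens) and biases;
# random stand-ins here, learned by backpropagation in practice.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
b = np.zeros(4)

logits = fc_input @ W + b                        # fully connected layer
probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax over the output vocabulary
print(probs)  # probability for each candidate output word
```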

  • @markus_park
    @markus_park ปีที่แล้ว +1

    Thanks! This was a great video!

    • @statquest
      @statquest  ปีที่แล้ว

      Thank you very much! :)

  • @naomilago
    @naomilago ปีที่แล้ว +1

    The music sung before the videos is contagious ❤

  • @manuelcortes1835
    @manuelcortes1835 ปีที่แล้ว

    I have a question that could benefit from clarification: In the final FC layer for word predictions, it is claimed that the Attention Values and 'encodings' are used as input (13:38). By 'encodings', do we mean the short term memories from the top LSTM layer in the decoder?

    • @statquest
      @statquest  ปีที่แล้ว +2

      Yes. We use both the attention values and the LSTM outputs (short-term memories or hidden states) as inputs to the fully connected layer.

  • @luvxxb
    @luvxxb 7 หลายเดือนก่อน +1

    thank you so much for making these great materials

    • @statquest
      @statquest  7 หลายเดือนก่อน

      Thanks!

  • @akashat1836
    @akashat1836 2 หลายเดือนก่อน +1

    Hey Josh! Firstly, Thank you so much for this amazing content!! I can always count on your videos for a better explanation!
    I have one quick clarification to make about the inputs to the fully connected layer. The first two numbers we get are from [scaled(input1-cell1) + scaled(input2-cell1)] and [scaled(input1-cell2) + scaled(input2-cell2)], right?
    And the other two numbers are from the outputs of the decoder, right?

    • @statquest
      @statquest  2 หลายเดือนก่อน

      Yes.

    • @akashat1836
      @akashat1836 2 หลายเดือนก่อน +1

      @@statquest Thank you for the clarification!

  • @carloschau9310
    @carloschau9310 ปีที่แล้ว +1

    thank you sir for your brilliant work!

  • @theelysium1597
    @theelysium1597 11 หลายเดือนก่อน

    Since you asked for video suggestions in another video: A video about the EM and Mean Shift algorithm would be great!

    • @statquest
      @statquest  11 หลายเดือนก่อน +1

      I'll keep that in mind.

  • @madjohnshaft
    @madjohnshaft ปีที่แล้ว +1

    I am currently taking the AI cert program from MIT - I thank you for your channel

    • @statquest
      @statquest  ปีที่แล้ว

      Thanks and good luck!

  • @Rykurex
    @Rykurex 11 หลายเดือนก่อน

    Do you have any courses with start-to-finish projects for people who are only just getting interested in machine learning?
    Your explanations of the mathematical concepts have been great, and I'd be more than happy to pay for a course that applies some of these concepts to real-world examples

    • @statquest
      @statquest  11 หลายเดือนก่อน

      I don't have a course, but hope to have one one day. In the meantime, here's a list of all of my videos somewhat organized: statquest.org/video-index/ and I do have a book called The StatQuest Illustrated Guide to Machine Learning: statquest.org/statquest-store/

  • @arvinprince918
    @arvinprince918 11 หลายเดือนก่อน

    Hey there Josh @statquest, your videos are really awesome and super helpful, so I was wondering when your video on the transformer model will come out

    • @statquest
      @statquest  11 หลายเดือนก่อน

      All channel members and Patreon supporters have access to it right now. It will be available to everyone else in a few weeks.

  • @Xayuap
    @Xayuap ปีที่แล้ว +2

    weeeeee,
    video for tonite,
    thanks a lot

  • @SaumitraAgrawalB22AI054
    @SaumitraAgrawalB22AI054 3 วันที่ผ่านมา

    In the decoder part, the second time, when we are passing "vamos" as input, do we have to calculate the similarity scores again? Or should we be using the old ones only?

    • @statquest
      @statquest  3 วันที่ผ่านมา

      In the decoder we start by calculating the similarity between the EOS token and the input tokens. Then we calculate the similarity between "vamos" and the input tokens. So those are two different similarity calculations.

    • @saumitragrawal2279
      @saumitragrawal2279 2 วันที่ผ่านมา +1

      Thank you, got it

  • @handsomemehdi3445
    @handsomemehdi3445 9 หลายเดือนก่อน

    Hello, thank you for the video, but I am confused because some terms introduced in the original 'Attention is All You Need' paper, for example keys, values, and queries, were not mentioned in the video. Furthermore, in that paper the authors don't talk about cosine similarity or LSTMs. Can you please clarify this a little better?

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      The "Attention is all you need" manuscript did not introduce the concept of attention. That does done years earlier, and that is what this video describes. If you'd like to understand the "Attention is all you need" concept of transformers, check out my video on transformers here: th-cam.com/video/zxQyTK8quyY/w-d-xo.html

  • @lequanghai2k4
    @lequanghai2k4 ปีที่แล้ว +1

    I am still learning this, so I hope the next video comes out soon

    • @statquest
      @statquest  ปีที่แล้ว

      I'm working on it as fast as I can.

  • @jarsal_firahel
    @jarsal_firahel 8 หลายเดือนก่อน +1

    Before, I was dumb, "guitar"
    But now, people say I'm smart "guitar"
    What is changed ? "guitar"
    Now I watch.....
    StatQueeeeeest ! "guitar guitar"

    • @statquest
      @statquest  8 หลายเดือนก่อน

      bam!

  • @miladafrasiabi5499
    @miladafrasiabi5499 10 หลายเดือนก่อน

    Thank you for the awesome video. I have a question. What does the similarity score entail in reality? I assume that the Ws and Bs are being optimized by backpropagation in order to give larger positive values to synonyms, values close to 0 to unrelated words, and large negative values to antonyms. Is this the right assumption?

    • @statquest
      @statquest  10 หลายเดือนก่อน +1

      I believe that is correct. However, if there is one thing I've learned about neural networks, it's that the weights and biases are optimized only to fit the data and the actual values may or may not make any sense beyond that specific criteria.

  • @thanhtrungnguyen8387
    @thanhtrungnguyen8387 11 หลายเดือนก่อน +1

    can't wait for the next StatQuest

    • @statquest
      @statquest  11 หลายเดือนก่อน

      :)

    • @thanhtrungnguyen8387
      @thanhtrungnguyen8387 11 หลายเดือนก่อน

      @@statquest I'm currently trying to fine-tune Roberta so I'm really excited about the following video, hope the following videos will also talk about BERT and fine-tune BERT

    • @statquest
      @statquest  11 หลายเดือนก่อน +1

      @@thanhtrungnguyen8387 I'll keep that in mind.

  • @alexfeng75
    @alexfeng75 ปีที่แล้ว

    Fantastic video, indeed! Is the attention described in the video the same as in the attention paper? I didn't see the mention of QKV in the video and would like to know whether it was omitted to simplify or by mistake.

    • @statquest
      @statquest  ปีที่แล้ว +1

      Are you asking about the QKV notation that appears in the "Attention is all you need" paper? That manuscript arxiv.org/abs/1706.03762 , which came out in 2017, didn't introduce the concept of attention for neural networks. Instead it introduces a more advanced topic - Transformers. The original "how to add attention to neural networks" manuscript arxiv.org/pdf/1409.0473.pdf came out in 2015 and did not use the QKV notation that appeared later in the transformer manuscript. Anyway, my video follows the original, 2015, manuscript. However, I'm working on a video that covers the 2017 manuscript right now. And I've got a long section talking all about the QKV stuff in it.
      That said, in this video, you can think of the output from each LSTM in the decoder as a "Query", and the outputs from each LSTM in the Encoder as the "Keys" and "Values". The "Keys" are used, in conjunction with each "Query" to calculate the Similarity Scores and the "Values" are then scaled by those scores to create the attention values.
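
      In code, that Query/Key/Value view of this video's attention might look something like the sketch below (made-up numbers; here the Keys and Values are the same matrix of encoder outputs):

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention: scores -> softmax weights -> weighted sum."""
    scores = keys @ query                              # similarity scores
    weights = np.exp(scores) / np.sum(np.exp(scores))  # softmax
    return weights @ values                            # attention values

encoder_outputs = np.array([[-0.76,  0.75],   # "Let's"
                            [ 0.01, -0.10]])  # "go"
decoder_output = np.array([0.91, 0.38])       # current decoder token

# In this encoder-decoder setting, the Query comes from the decoder and
# the Keys and Values are both the encoder outputs.
print(attention(decoder_output, encoder_outputs, encoder_outputs))
```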

    • @alexfeng75
      @alexfeng75 ปีที่แล้ว

      @@statquest Thanks for the reply, Josh. Yes, I was referring to the 2017 paper. I look forward to your video covering it.

  • @Ghost-ip3bx
    @Ghost-ip3bx 9 วันที่ผ่านมา

    Hi StatQuest, I've been a long time fan, your videos have helped me TREMENDOUSLY. For this video I felt however if we could get a larger picture of how attention works first ( how different words can have different weights ( attending to them differently )) and then going through a run with actual values, it'd be great! :) I also felt that the arrows and diagrams got a bit confusing in this one. Again, this is only constructive criticism and maybe it works for others and just not for me ( this video I mean ). Nonetheless, thank you so much for all the time and effort you put into making your videos. You're helping millions of people out there clear their degrees and achieve life goals

    • @statquest
      @statquest  9 วันที่ผ่านมา +1

      Thanks for the feedback! I'm always trying to improve how I make videos. Anyway, I work through the concepts more in my videos on transformers: th-cam.com/video/zxQyTK8quyY/w-d-xo.html and if the diagrams are hard to follow, I also show how it works using matrix math: th-cam.com/video/KphmOJnLAdI/w-d-xo.html

  • @andresg3110
    @andresg3110 ปีที่แล้ว +1

    You are on Fire! Thank you so much

    • @statquest
      @statquest  ปีที่แล้ว +1

      Thank you! :)

  • @juliank7408
    @juliank7408 4 หลายเดือนก่อน +1

    Phew! Lots of things in this model, my brain feels a bit overloaded, haha
    But thanks! Might have to rewatch this

    • @statquest
      @statquest  4 หลายเดือนก่อน

      You can do it!

  • @orlandopalmeira623
    @orlandopalmeira623 2 หลายเดือนก่อน

    Hello, I have a question. Is the initialization of the decoder's cell state and hidden state a context vector that is the representation (generated by the encoder) of the entire input sentence? And what about each hidden state (from the encoder) used in the decoder? Are they stored somehow? Thanks!!!

    • @statquest
      @statquest  2 หลายเดือนก่อน +1

      1) Yes, the context vector is a representation of the entire input.
      2) The hidden states in the encoder are stored for attention.

    • @orlandopalmeira623
      @orlandopalmeira623 2 หลายเดือนก่อน +1

      @@statquest Thanks!!

  • @patrikszepesi2903
    @patrikszepesi2903 8 หลายเดือนก่อน

    Hi, great video. At 13:49 can you please explain how you get -0.3 and 0.3 for the input to the fully connected layer? Thank you

    • @statquest
      @statquest  8 หลายเดือนก่อน +1

      The outputs from the softmax function are multiplied by the short-term memories coming out of the encoder's LSTM units. We then add those products together to get -0.3 and 0.3.
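
      In other words, something like this sketch (the weights and memories below are hypothetical, chosen only so the weighted sums land near -0.3 and 0.3):

```python
import numpy as np

# Hypothetical softmax outputs (attention weights) for the two input words.
weights = np.array([0.4, 0.6])

# Hypothetical short-term memories from the encoder's two LSTM units,
# one row per input word.
encoder_memories = np.array([[-0.80,  0.80],
                             [ 0.03, -0.03]])

# Multiply each memory by its weight and add the products together.
attention_values = weights @ encoder_memories
print(attention_values)  # approximately [-0.3, 0.3]
```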

  • @tupaiadhikari
    @tupaiadhikari 10 หลายเดือนก่อน

    At 13:38, are we concatenating the attention values and the output of the decoder LSTM for the translated word (EOS in this case), and then using a weight matrix of dimensions (4×4) to convert it into a 4-dimensional pre-softmax output?

    • @statquest
      @statquest  10 หลายเดือนก่อน

      yep

    • @statquest
      @statquest  10 หลายเดือนก่อน +1

      If you want to see a more detailed view of what is going on at that stage, check out my video on Transformers: th-cam.com/video/zxQyTK8quyY/w-d-xo.html In that video, I go over every single mathematical operation, rather than gloss over them like I do here.

    • @tupaiadhikari
      @tupaiadhikari 10 หลายเดือนก่อน +1

      @@statquest Thank You Professor Josh for the clarifications !

  • @sciboy123
    @sciboy123 10 หลายเดือนก่อน

    I had a little confusion about the final fully connected layer. It takes in separate attention values for each input word. But doesn't this mean that the dimension of the input depends on how many input words there are (thus it would be difficult to generalize for arbitrarily long sentences)? Did I misunderstand something?

    • @statquest
      @statquest  10 หลายเดือนก่อน

      I can see why this might be confusing because we have 2 input words and two inputs for attention going into the final fully connected layer. However, the number of inputs for attention going into the final fully connected layer is not determined by the number of input words; instead, it is determined by the number of LSTM cells we have per layer (or, alternatively, the number of output values from the LSTMs per layer). In this case, we have 2 LSTM cells in a single layer. And thus, regardless of the number of input words, there will only be 2 attention values input into the final fully connected layer. If this is confusing, review how the attention values are created at 12:58 - regardless of the number of input words, we add the scaled values together to get one sum per LSTM.
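
      A small sketch of that property: no matter how many input words there are, the weighted sum collapses to one attention value per LSTM unit (two units here, random illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_values(decoder_out, encoder_outs):
    scores = encoder_outs @ decoder_out                # one score per input word
    weights = np.exp(scores) / np.sum(np.exp(scores))  # softmax
    return weights @ encoder_outs                      # one value per LSTM unit

decoder_out = rng.normal(size=2)  # outputs from the 2 decoder LSTM units

for n_words in (2, 5, 50):
    encoder_outs = rng.normal(size=(n_words, 2))
    print(n_words, attention_values(decoder_out, encoder_outs).shape)  # always (2,)
```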

  • @tangt304
    @tangt304 8 หลายเดือนก่อน

    Another awesome video! Josh, will you plan to talk about BERT? Thank you!

    • @statquest
      @statquest  8 หลายเดือนก่อน +1

      I'll keep that in mind.

  • @JL-vg5yj
    @JL-vg5yj ปีที่แล้ว +1

    super clutch my final is on thursday thanks a lot!

  • @shaktisd
    @shaktisd 6 หลายเดือนก่อน

    I have one fundamental question related to how the attention model learns. Basically, a higher attention score is given to those pairs of words which have a higher softmax(Q·K) similarity score. Now the question is how the relationship in the sentence "The cat didn't climb the tree as it was too tall" is calculated, so that the model knows that in this case "it" refers to the tree and not the cat. Is it the large amount of data that the model reads that helps it distinguish the difference?

    • @statquest
      @statquest  6 หลายเดือนก่อน +1

      Yes. The more data you have, the better attention is going to work.

  • @umutnacak
    @umutnacak 11 หลายเดือนก่อน

    Great videos! After watching the technical videos, I think complicating the math has no effect on removing bias from the model. In the future one might find a model with self-encoder-soft-attention-direct-decoder, you name it, but it's still garbage in, garbage out. Do you think there is a way to plug a fairness/bias filter into the layers, so that instead of trying to filter the output of the model you just don't produce unfair output? It's like preventing a disease instead of looking for a cure. Obviously I'm not an expert and am just trying to get a direction for my personal ethics research out of this naive question. Thanks!

    • @statquest
      @statquest  11 หลายเดือนก่อน +1

      To be honest, I'm not super sure I understand what you are asking about. However, I know that there is something called "constitutional AI" that you might be interested in.

    • @umutnacak
      @umutnacak 11 หลายเดือนก่อน +1

      @@statquest Thanks for the reply. OK, this looks promising. Actually they already have a model called Claude. Not sure whether this is the thing I'm looking for or not, but it's at least a direction for me to look further into. Thanks again!

  • @mrstriker1847
    @mrstriker1847 ปีที่แล้ว +3

    Please add this to the neural network playlist! Or don't, it's your video; I just want to be able to find it when I'm looking for it to study for class.

    • @statquest
      @statquest  ปีที่แล้ว

      I'll add it to the playlist, but the best place to find my stuff is here: statquest.org/video-index/

  • @elmehditalbi8972
    @elmehditalbi8972 11 หลายเดือนก่อน

    Could you do a video about BERT? Architectures like these can be very helpful in NLP and I think a lot of folks would benefit from that :)

    • @statquest
      @statquest  11 หลายเดือนก่อน

      I've got a video on transformers coming out soon.

  • @michaelbwin752
    @michaelbwin752 11 หลายเดือนก่อน

    Thank you for this explanation. But my question is how, with backpropagation, the weights and biases are adjusted in a model like this. If you could explain that I would deeply appreciate it.

    • @statquest
      @statquest  11 หลายเดือนก่อน

      Backpropagation works for models like this just like it works for simpler models. You just use a whole lot of Chain Rule to calculate the derivatives of the loss function (which is cross entropy in this case) with respect to each weight and bias. To learn more about backpropagation, see: th-cam.com/video/IN2XmBhILt4/w-d-xo.html th-cam.com/video/iyn2zdALii8/w-d-xo.html and th-cam.com/video/GKZoOHXGcLo/w-d-xo.html To learn more about cross entropy, see: th-cam.com/video/6ArSys5qHAU/w-d-xo.html and th-cam.com/video/xBEh66V9gZo/w-d-xo.html

  • @seifeddineidani3256
    @seifeddineidani3256 ปีที่แล้ว +1

    Thanks josh, great video! ❤I hope you upload the transformer video soon :)

  • @guillermosainzzarate5110
    @guillermosainzzarate5110 11 หลายเดือนก่อน +1

    And now in Spanish? I can hardly believe it, this channel is incredible 😭 thank you so much for your videos!!!

    • @statquest
      @statquest  11 หลายเดือนก่อน

      Thank you very much! :)

  • @RafaelRabinovich
    @RafaelRabinovich ปีที่แล้ว

    To really create a translator model, we would have to work through a lot of linguistics, since there are differences in word order, verb conjugation, idioms, etc. Going from one language to another is a big structural challenge for coders.

    • @statquest
      @statquest  ปีที่แล้ว

      That's the way they used to do it - by using linguistics. But very few people do it that way anymore. Now pretty much all translation is done with transformers (which are just encoder-decoder networks with attention, but not the LSTMs). Improvements in translation quality are gained simply by adding more layers of attention and using larger training datasets. For more details, see: en.wikipedia.org/wiki/Natural_language_processing

  • @frogloki882
    @frogloki882 ปีที่แล้ว +2

    Another BAM!

  • @automatescellulaires8543
    @automatescellulaires8543 ปีที่แล้ว +1

    wow, i didn't think i would see this kind of stuff on this channel.

  • @nogur9
    @nogur9 11 หลายเดือนก่อน +1

    Thank you very much!

    • @statquest
      @statquest  11 หลายเดือนก่อน

      You're welcome!

  • @Thepando20
    @Thepando20 10 หลายเดือนก่อน

    Hi, great video SQ as always!
    I had the same question as @manuelcortes1835 and I understand that the encodings are the LSTM outputs. However, at 9:02 the outputs are 0.91 and 0.38, so maybe I am missing something here?

    • @statquest
      @statquest  10 หลายเดือนก่อน +1

      Yes, and at 13:36 they are rounded to the nearest tenth so they can fit in the small boxes. Thus 0.91 is rounded to 0.9 and 0.38 is rounded to 0.4.

    • @Thepando20
      @Thepando20 10 หลายเดือนก่อน +1

      Thank you, all clear!

  • @kaixuan5236
    @kaixuan5236 ปีที่แล้ว

    Can you do videos on transformer network and multi-head attention? Love your vids!

    • @statquest
      @statquest  ปีที่แล้ว

      I'm working on it.

  • @Sarifmen
    @Sarifmen 11 หลายเดือนก่อน

    13:15 so the attention for EOS is just 1 number (per LSTM cell) which combines references to all the input words?

    • @statquest
      @statquest  11 หลายเดือนก่อน

      Yep.

  • @Fahhne
    @Fahhne ปีที่แล้ว +1

    Nice video, can't wait for the video about transformers
    (I imagine it will be the next one?)

  • @sagardesai1253
    @sagardesai1253 ปีที่แล้ว +1

    great video thanks

  • @lakshaydulani
    @lakshaydulani ปีที่แล้ว +1

    Now that's what I was looking for!