
The matrix math behind transformer neural networks, one step at a time!!!

  • Published Aug 19, 2024

Comments • 108

  • @samglick8479
    @samglick8479 4 months ago +14

    Josh Starmer is the GOAT. Literally every morning I wake up with some statquest, and it really helps me get ready for my statistics classes for the day. Thank you Josh!

    • @statquest
      @statquest  4 months ago +1

      Bam!

    • @leonardfei8154
      @leonardfei8154 4 months ago +1

      definitely the goat🐐

  • @navneettiwari9861
    @navneettiwari9861 25 days ago +2

    So happy I reached here. After passing through all the complicated topics, I'm now just a few topics away from completion. It's all thanks to your dedication to teaching. Thank you!

    • @statquest
      @statquest  25 days ago

      Thanks!

  • @NJCLM
    @NJCLM 3 months ago +3

    Very educational, and also innovative in how it's done. I have never seen teaching like this anywhere else. You are the BEST!

    • @statquest
      @statquest  3 months ago

      Thank you! :)

  • @jpfdjsldfji
    @jpfdjsldfji 4 months ago +5

    You weren't kidding, it's here! You're a man of your word and a man of the people.

    • @statquest
      @statquest  4 months ago

      Thanks!

  • @colekeircom
    @colekeircom 4 months ago +1

    As an electronics hobbyist/student from way back in the 70s, I like to keep up as best I can with technology. I'm really glad I don't have to remember all the details in this series. There are so many layers upon layers that at times I do "just keep going to the end" of the videos. Nevertheless, I still manage to learn key aspects and new terms from your excellent teaching. There must be an incredible amount of work involved in creating these lessons.
    I will purchase your book because you deserve some form of appreciation, and it'll serve as a great reference resource. Much respect, Josh, and thanks. Kieron.

    • @statquest
      @statquest  4 months ago +1

      Thank you very much!

  • @BaronSpartan
    @BaronSpartan 4 months ago +3

    Josh! Thanks for this video. It has been easier for me to see the matrix representation of the computation than with the arrows used previously. I really appreciate your explanation using matrices!

    • @statquest
      @statquest  4 months ago

      Glad it was helpful!

  • @TheCJD89
    @TheCJD89 4 months ago +1

    This is really good. The simple example you used was very effective for demonstrating the inner workings of the transformer.

    • @statquest
      @statquest  4 months ago

      Thank you very much!

  • @roro5179
    @roro5179 4 months ago +1

    Always been a huge fan of the channel, and at this point in my life this video really couldn't have come at a better time. Thanks for helping us viewers with some of the best content on the planet (I said what I said)!

    • @statquest
      @statquest  4 months ago

      Thanks!

  • @mraarone
    @mraarone 4 months ago +5

    DUDE JOSH, FINALLY! I have been waiting for this episode for a year or more. I’m so proud of you bro. You got there!

    • @statquest
      @statquest  4 months ago +1

      Thanks a ton!

  • @MakeDataUseful
    @MakeDataUseful 4 months ago +1

    Amazing, thank you Josh. You deserve millions more subscribers

    • @statquest
      @statquest  4 months ago

      Thank you!

  • @jamesmina7258
    @jamesmina7258 2 months ago +1

    Josh Starmer is the GOAT, thank you, dear Josh.

    • @statquest
      @statquest  2 months ago

      Thanks!

    • @jamesmina7258
      @jamesmina7258 2 months ago

      @@statquest Could you make a video about U-Net?

    • @statquest
      @statquest  2 months ago +1

      @@jamesmina7258 I'll keep that in mind.

  • @liuwingki413
    @liuwingki413 3 months ago +1

    Thanks for introducing the concepts behind transformers.

    • @statquest
      @statquest  3 months ago

      My pleasure!

  • @NewsLetter-sq1eh
    @NewsLetter-sq1eh 4 months ago +2

    Your videos are a didactic stroke of genius! 👍

    • @statquest
      @statquest  4 months ago

      Glad you like them!

  • @sachinmohanty4577
    @sachinmohanty4577 1 month ago +1

    I will recommend this video to my friends who want to study transformers ❤❤

    • @statquest
      @statquest  1 month ago

      Thanks!

  • @itsawonderfullife4802
    @itsawonderfullife4802 4 months ago +1

    Wow, 'Squatch! Long time no see, my friend! Good to see you.
    Your videos are so much fun that it doesn't feel like we're actually in class. Thank you Josh.

    • @statquest
      @statquest  4 months ago

      Thanks!

  • @Aa-fk8jg
    @Aa-fk8jg 3 months ago +1

    StatQuest is the best thing I ever found on the internet

    • @statquest
      @statquest  3 months ago

      Thank you!

  • @adityabhosale7838
    @adityabhosale7838 4 months ago +2

    Please add this video to your Neural Networks playlist. I recently started watching that playlist.

    • @statquest
      @statquest  4 months ago +1

      Done!

  • @Hakilia
    @Hakilia 3 months ago +1

    following you from 🇨🇩

    • @statquest
      @statquest  3 months ago

      bam!

  • @mortezamahdavi2129
    @mortezamahdavi2129 7 days ago +1

    Thanks a lot, please keep going!

    • @statquest
      @statquest  6 days ago

      Thank you, I will!

  • @pulse6982
    @pulse6982 4 months ago +1

    Doing god’s work, Josh!

    • @statquest
      @statquest  4 months ago

      Thank you!

  • @Er1kth3b00s
    @Er1kth3b00s 4 months ago

    Amazing video! Can't wait for the next one. By the way, I think there's a small typo at 5:15 where the first query weight in the matrix notation should be 2.22 instead of 0.22

    • @statquest
      @statquest  4 months ago

      Oops! Thanks for catching that!

  • @kavinvignesh2832
    @kavinvignesh2832 4 months ago +2

    TRIPLE BAM!!!!!!!!

    • @statquest
      @statquest  4 months ago +1

      :)

  • @Keshi-lz3ef
    @Keshi-lz3ef 3 months ago +1

    Thanks for the great content! One minor thing: at 5:24, the first element of the query weight matrix should be 2.22, not 0.22.

    • @statquest
      @statquest  3 months ago

      Yep. That's a typo.

  • @DmitryPesegov
    @DmitryPesegov 4 months ago

    Great details. But please: in teaching, it really helps to use concepts one can visualize as a framework. For me it's hard to connect all this arithmetic to the goal and to why it works. Start with the concept of an n-sphere (just a 2D circle here, since we use 2 values per token) and explain that we are effectively rotating the whole n-spheres (circles) with Q and K packed in them, encoding the different degrees of co-directionality of the vectors as measured by cosine similarity [-1..1] (except divided not by the product of the 2-norms but by the square root of the dimensionality, just for computational performance, which you mentioned). And when we multiply by V, we are effectively "mixing" the values in each dimension according to the Q/K co-directionality. We rotate the n-spheres by multiplying Q and K by the matrices Wq and Wk; a linear transformation can do more than rotate, but we then use cosine similarity to measure the alignment of the Q and K vectors. Rotations, encoding co-directionality cases, mixing. Repeat.

    • @statquest
      @statquest  4 months ago

      Noted!
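
A minimal NumPy sketch of the two similarity measures contrasted in the comment above (the vectors are made up for illustration; cosine similarity divides the dot product by the product of the 2-norms, while transformers divide by the square root of the dimensionality):

```python
import numpy as np

# Hypothetical 2-dimensional query and key vectors (d_k = 2).
q = np.array([1.0, 2.0])
k = np.array([2.0, 1.0])

# Cosine similarity: dot product divided by the product of the 2-norms.
cosine = (q @ k) / (np.linalg.norm(q) * np.linalg.norm(k))

# Scaled dot product, as used in attention: divide by sqrt(d_k) instead,
# which is cheaper to compute than the two norms.
scaled_dot = (q @ k) / np.sqrt(len(q))

# Both scores grow as q and k become more aligned; the scaled dot product
# also depends on the lengths of the vectors, not just their directions.
print(cosine, scaled_dot)
```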

  • @farazsyed.2898
    @farazsyed.2898 3 months ago

    need a video on degrees of freedom!!!

    • @statquest
      @statquest  3 months ago

      Noted!

  • @kartikchaturvedi7868
    @kartikchaturvedi7868 4 months ago +1

    Superrrb Awesome Fantastic video

    • @statquest
      @statquest  4 months ago

      Thanks 🤗!

  • @theneumann7
    @theneumann7 4 months ago +1

    perfection

    • @statquest
      @statquest  4 months ago

      Thank you!

  • @ChenLiangrui
    @ChenLiangrui 2 months ago

    Do an OPTICS clustering video!

    • @statquest
      @statquest  2 months ago +1

      I'll keep that in mind.

  • @nivcohen961
    @nivcohen961 4 months ago +2

    Goat

    • @statquest
      @statquest  4 months ago

      :)

  • @yosimadsu2189
    @yosimadsu2189 1 month ago

    🙏🏻🙏🏻🙏🏻🙏🏻🙏🏻 Please show us how to train the Q, K, and V weights 🙏🏻🙏🏻🙏🏻🙏🏻🙏🏻

    • @statquest
      @statquest  1 month ago

      Sure, see: th-cam.com/video/C9QSpl5nmrY/w-d-xo.html

  • @EzraSchroeder
    @EzraSchroeder 4 months ago

    The A B C thing... I think it is inspired by Sesame Street LoL!!!!!!! 🙂

    • @statquest
      @statquest  4 months ago

      :)

  • @wilfredomartel7781
    @wilfredomartel7781 4 months ago +1

    🎉

    • @statquest
      @statquest  4 months ago

      :)

  • @yuvalalmog6000
    @yuvalalmog6000 2 months ago

    Will you ever make videos on the subjects of Reinforcement learning, NLP or generative models?

    • @statquest
      @statquest  2 months ago

      I think you could argue that this video is about NLP and is also a generative model, and I'll keep the other topic in mind.

    • @yuvalalmog6000
      @yuvalalmog6000 2 months ago +1

      @@statquest I'll explain myself better, as I admit I phrased it poorly. For deep learning and machine learning you made amazing videos that covered the subjects from basic aspects to advanced ones - essentially teaching the whole subject in a fun, creative & enjoyable sequence of videos that can help beginners learn it from top to bottom.
      However, for NLP, for example, you did talk about specific subjects like word embedding or auto-translation, but there are other topics (mostly older things) in that field that are important to learn, such as n-grams & HMMs.
      So my question was not only about specific advanced topics that connect to others, but rather about a full course that covers the basics of the subject as well.
      Sorry for my bad phrasing, and thank you both for your quick answer and amazing videos! 😄

    • @statquest
      @statquest  2 months ago +1

      @@yuvalalmog6000 I hope to one day cover HMMs.

  • @faisalsheikh7846
    @faisalsheikh7846 4 months ago +1

    Cody finished his story😅

    • @statquest
      @statquest  4 months ago

      One more to go - in the next video we'll code this thing up in PyTorch.

  • @juansilva-fy6cw
    @juansilva-fy6cw 3 months ago

    Kolmogorov-Arnold Networks videoooooo mr bam

    • @statquest
      @statquest  3 months ago +1

      I'll keep that in mind.

  • @felipela2227
    @felipela2227 3 months ago

    It would be nice if you developed courses on object detection, mainly YOLO.

    • @statquest
      @statquest  3 months ago +1

      I'll keep that in mind.

  • @denisquant
    @denisquant 2 months ago

    @statquest In your explanation, which is wonderful, the embedding matrix seems to have dimension (length of the specific sequence, embedding dimension), but shouldn't it be (size of the vocabulary, embedding dimension)?

    • @statquest
      @statquest  2 months ago

      I believe the matrices used in the videos are correct and are what you expect them to be. At 2:38 the embedding matrix has 4 rows, one for each word in the input vocabulary, and 2 columns, one for each embedding value. Thus, this matrix is "size of the vocabulary by embedding dimension" - just like you expected it to be. Likewise, for the output/decoder, the embedding has 5 rows for the output vocabulary and 2 columns for the embedding values.

    • @denisquant
      @denisquant 2 months ago +1

      @@statquest Oh, I think I see. The sample considered to train the transformer (on the English side) is a sequence of 3 tokens: "", "Let's", "go". But the vocabulary is made up of 4 tokens, not 3. Specifically, the vocabulary you considered is: "", "Let's", "go" and "to"! I was not paying attention to that "to" token. That was what confused me. I realised it because on the Spanish side it is clearer that we are working with a sample/sequence of 2 tokens while the vocabulary is made up of 5 tokens.
      Am I right? :)
      Thank you!!

    • @statquest
      @statquest  2 months ago

      @@denisquant yep! The input vocabulary allows us to say "Let's go!" and "To go!"
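
To make the shapes discussed above concrete, here is a minimal PyTorch sketch. The 4-token input vocabulary and 2-value embeddings follow the video's setup, but the token order and (randomly initialized) weights are made up, and the thread's blank "" token is assumed to be an end-of-sequence marker:

```python
import torch
import torch.nn as nn

# Embedding matrix: (size of the vocabulary, embedding dimension) = (4, 2),
# one row per token in the input vocabulary (hypothetical token order).
vocab = {"lets": 0, "to": 1, "go": 2, "<EOS>": 3}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=2)

# A specific training sequence only looks up rows of that matrix, so the
# embedded sequence has shape (sequence length, embedding dimension).
token_ids = torch.tensor([vocab["lets"], vocab["go"], vocab["<EOS>"]])
print(embedding(token_ids).shape)  # torch.Size([3, 2])
```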

  • @swarnavasarkar8106
    @swarnavasarkar8106 4 months ago

    Hey... did you cover the training steps in this video? Sorry if I missed it.

    • @statquest
      @statquest  4 months ago +2

      No, just how the math is done when training. We'll cover more details of training in my next video when we learn how to code transformers in PyTorch.

  • @junjiewang5633
    @junjiewang5633 1 month ago

    Is that a typo at 5:35? The query weight at (1,1) should be 2.22 rather than 0.22.

    • @statquest
      @statquest  1 month ago

      yep, that's a typo.

  • @BlayneOliver
    @BlayneOliver 4 months ago

    Josh, do you know how to use embedding layers to add context to a regression model?
    And do you offer 1-on-1 guidance? I’m stuck on a problem regarding this video’s topic.

    • @statquest
      @statquest  4 months ago

      Hmmm...I'm not sure about the first question and, unfortunately, I don't offer one-on-one guidance.

  • @gui-zx3di
    @gui-zx3di 4 months ago

    Usually "vamos" will not be one token but two. How can the algorithm handle this division?

    • @statquest
      @statquest  4 months ago +1

      You could split "vamos" into two tokens, "va" and "mos", then the output from the decoder would be "va", "mos", "".
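
A minimal sketch of that subword idea; the vocabulary, the token ids, and the use of <EOS> for the reply's bare "" token are assumptions for illustration, not the video's actual tokenizer:

```python
# Hypothetical subword vocabulary in which "vamos" is two tokens.
vocab = {"va": 0, "mos": 1, "<EOS>": 2}
id_to_token = {i: t for t, i in vocab.items()}

# The decoder would emit one id at a time: "va", then "mos", then "<EOS>".
generated_ids = [0, 1, 2]

# Detokenizing joins the subwords and drops the end-of-sequence marker.
tokens = [id_to_token[i] for i in generated_ids]
print("".join(t for t in tokens if t != "<EOS>"))  # vamos
```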

  • @nivcohen961
    @nivcohen961 4 months ago +2

    You made me love data science. If not for you, I would be learning like a zombie.

    • @statquest
      @statquest  4 months ago

      bam!

  • @loflog
    @loflog 4 months ago

    Question: If all tokens can be calculated in parallel, then why is time-to-first-token such an important metric for model performance?

    • @statquest
      @statquest  4 months ago

      That might be related to decoding, which, when done during inference, is sequential.

    • @kamiltylus
      @kamiltylus 4 months ago +1

      The time to first token may refer to the decoder producing the first token in the autoregressive setting, where (for example in sentence translation) the model produces one token at a time, then feeds it back into itself to generate the next one, and so on. That process is sequential, while the computation of all the matrices (over the already existing embeddings) is parallel.
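
A minimal sketch of that sequential loop (greedy decoding; `model`, the token ids, and the function signature are hypothetical, not any specific library's API):

```python
import torch

def greedy_decode(model, src_ids, sos_id, eos_id, max_len=20):
    # Hypothetical model: maps (source ids, output-so-far ids) to logits
    # of shape (batch, output length, vocabulary size).
    out_ids = [sos_id]
    for _ in range(max_len):
        # Inside the model, all positions are processed in parallel, but we
        # still have to wait for each new token before generating the next,
        # which is why time-to-first-token is a separate metric.
        logits = model(src_ids, torch.tensor([out_ids]))
        next_id = int(logits[0, -1].argmax())  # most likely next token
        out_ids.append(next_id)
        if next_id == eos_id:
            break
    return out_ids
```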

  • @I.II..III...IIIII.....
    @I.II..III...IIIII..... 4 months ago

    10:51 How come each token's maximum similarity isn't with itself?

    • @statquest
      @statquest  4 months ago

      This example, trained on just 2 phrases ("what is statquest?" and "statquest is what"), is too simple to really show off the nuance in how these things work.

    • @I.II..III...IIIII.....
      @I.II..III...IIIII..... 4 months ago

      @@statquest Ah, so with more training and a bigger dataset we can expect the weights to give values closer to what we intuitively expect, like, as I said, each word having the biggest similarity with itself? Great video for seeing the matrices in action, and I like the content and don't want to be rude, but I think touching on such details a bit would've been nice. Also, maybe something on multi-head attention?

    • @statquest
      @statquest  4 months ago

      @@I.II..III...IIIII..... I believe that is correct. And I'll talk about multi-head attention more in my video on how to code transformers.

  • @user-yc9do4mb5i
    @user-yc9do4mb5i 4 months ago

    Why did they use the square root of d_k? Why not just d_k? If anyone knows the answer, please give a good explanation.

    • @statquest
      @statquest  4 months ago

      To quote from the original manuscript, "if q and k are independent random variables with mean 0 and variance 1. Then their dot product has mean 0 and variance d_k". Thus, dividing the dot products by the square root of d_k results in variance = 1. That said, unfortunately, as you can see in this illustration, the variance for q and k is much higher than 1, so the theory doesn't actually hold.
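
A minimal NumPy check of that variance argument (the dimension d_k and the random vectors are made up; the exact numbers vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512       # hypothetical query/key dimension
n = 10_000      # number of random query/key pairs to sample

# Components of q and k: independent, mean 0, variance 1.
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

dots = (q * k).sum(axis=1)     # raw dot products
print(dots.var())              # ~512, i.e. roughly d_k

scaled = dots / np.sqrt(d_k)   # divide by sqrt(d_k)
print(scaled.var())            # ~1, which keeps the softmax inputs tame
```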

  • @statquest
    @statquest  4 months ago +3

    The full Neural Networks playlist, from the basics to AI, is here: th-cam.com/video/CqOfi41LfDw/w-d-xo.html
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @rickymort135
      @rickymort135 4 months ago

      Just ordered your book 😊 Thanks for the love and care you put into this

  • @DarkNight0411
    @DarkNight0411 4 months ago

    With all due respect, please stop singing at the beginning of your videos. Having that at the beginning of every video is very irritating.

    • @statquest
      @statquest  4 months ago

      Noted