Decoder-Only Transformers, ChatGPT's specific Transformer, Clearly Explained!!!

  • Published on Jun 9, 2024
  • Transformers are taking over AI right now, and quite possibly their most famous use is in ChatGPT. ChatGPT uses a specific type of Transformer called a Decoder-Only Transformer, and this StatQuest shows you how they work, one step at a time. And at the end (at 32:14), we talk about the differences between a Normal Transformer and a Decoder-Only Transformer. BAM!
    NOTE: If you're interested in learning more about Backpropagation, check out these 'Quests:
    The Chain Rule: • The Chain Rule
    Gradient Descent: • Gradient Descent, Step...
    Backpropagation Main Ideas: • Neural Networks Pt. 2:...
    Backpropagation Details Part 1: • Backpropagation Detail...
    Backpropagation Details Part 2: • Backpropagation Detail...
    If you're interested in learning more about the SoftMax function, check out:
    • Neural Networks Part 5...
    If you're interested in learning more about Word Embedding, check out: • Word Embedding and Wor...
    If you'd like to learn more about calculating similarities in the context of neural networks and the Dot Product, check out:
    Cosine Similarity: • Cosine Similarity, Cle...
    Attention: • Attention for Neural N...
    If you'd like to learn more about Normal Transformers, see: • Transformer Neural Net...
    If you'd like to support StatQuest, please consider...
    Patreon: / statquest
    ...or...
    TH-cam Membership: / @statquest
    ...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
    statquest.org/statquest-store/
    ...or just donating to StatQuest!
    paypal: www.paypal.me/statquest
    venmo: @JoshStarmer
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    1:34 Word Embedding
    7:26 Position Encoding
    10:10 Masked Self-Attention, an Autoregressive method
    22:35 Residual Connections
    23:00 Generating the next word in the prompt
    26:23 Review of encoding and generating the prompt
    27:20 Generating the output, Part 1
    28:46 Masked Self-Attention while generating the output
    30:40 Generating the output, Part 2
    32:14 Normal Transformers vs Decoder-Only Transformers
    #StatQuest
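
    As a rough illustration of how the steps in the chapter list above fit together, here is a minimal decoder-only sketch in PyTorch (this is not code from the video; the tiny 2-value embeddings, the single attention head, and all of the names are assumptions made to keep it short):

        import torch
        import torch.nn as nn

        class TinyDecoderOnly(nn.Module):
            """Word embedding + position encoding + masked self-attention + residual connection + vocabulary scores."""
            def __init__(self, vocab_size, d_model=2, max_len=6):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, d_model)                    # Word Embedding (1:34)
                self.register_buffer("pos", self._positional(max_len, d_model))   # Position Encoding (7:26)
                self.W_q = nn.Linear(d_model, d_model, bias=False)                # query weights
                self.W_k = nn.Linear(d_model, d_model, bias=False)                # key weights
                self.W_v = nn.Linear(d_model, d_model, bias=False)                # value weights
                self.out = nn.Linear(d_model, vocab_size)                         # one score per word in the vocabulary

            @staticmethod
            def _positional(max_len, d_model):
                pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
                i = torch.arange(0, d_model, 2, dtype=torch.float)
                angles = pos / (10000 ** (i / d_model))
                pe = torch.zeros(max_len, d_model)
                pe[:, 0::2] = torch.sin(angles)   # alternating sine and cosine waves
                pe[:, 1::2] = torch.cos(angles)
                return pe

            def forward(self, token_ids):                          # token_ids: (seq_len,)
                x = self.embed(token_ids) + self.pos[: token_ids.size(0)]
                q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
                scores = q @ k.T / (x.size(1) ** 0.5)              # query-key similarities (10:10)
                mask = torch.tril(torch.ones_like(scores)).bool()  # Masked Self-Attention: no peeking at later tokens
                attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v
                x = x + attn                                       # Residual Connection (22:35)
                return self.out(x)                                 # run these through softmax to pick the next token

        # e.g. a 4-token prompt such as "what is statquest <EOS>" encoded as ids:
        logits = TinyDecoderOnly(vocab_size=5)(torch.tensor([0, 1, 2, 3]))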

Comments • 316

  • @statquest
    @statquest  9 หลายเดือนก่อน +5

    To learn more about Lightning: lightning.ai/
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @razodactyl
    @razodactyl 9 หลายเดือนก่อน +47

    Bruh. This channel is criminally underrated.

    • @statquest
      @statquest  9 หลายเดือนก่อน +3

      Thanks!

    • @razodactyl
      @razodactyl 6 หลายเดือนก่อน

      🎉🎉🎉
      Love your work!

    • @EobardUchihaThawne
      @EobardUchihaThawne 4 หลายเดือนก่อน

      Bam!

    • @rickymort135
      @rickymort135 2 หลายเดือนก่อน +1

      Well then criminals should rate it more highly

    • @razodactyl
      @razodactyl 2 หลายเดือนก่อน +1

      @@rickymort135 I laughed. ⭐️

  • @ayush_stha
    @ayush_stha 2 หลายเดือนก่อน +6

    This explanation is essential for anyone looking to understand how ChatGPT works. While more in-depth exploration is necessary to grasp all the intricacies fully, I believe this explanation couldn't be better. It's exactly what I needed.

    • @statquest
      @statquest  2 หลายเดือนก่อน +2

      Thanks! I have a video that shows how all of these calculations are done using matrix algebra coming out soon.

  • @peerbr7849
    @peerbr7849 9 หลายเดือนก่อน +15

    And I thought you'd stop at ChatGPT. Thanks for never stopping to learn and teach!

    • @statquest
      @statquest  9 หลายเดือนก่อน

      Thank you!

    • @xspydazx
      @xspydazx 9 หลายเดือนก่อน

      Yes it's a good series

  • @NTesla00
    @NTesla00 9 หลายเดือนก่อน +8

    Haven't had a single stats course in over 3 years but I still keep up with this channel from time to time! Neural networks are way more complex than what I've ever had to deal with, but you manage to break down even these topics into bite size pieces...Bam!!

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Thank you so much!!!

  • @gvlokeshkumar
    @gvlokeshkumar 8 หลายเดือนก่อน +4

    Quests on attention, transformer and decoder only transformer are of immeasurable value! Thank you so much! Keep the quests coming!

    • @statquest
      @statquest  8 หลายเดือนก่อน

      Thanks, will do!

  • @sidereal6296
    @sidereal6296 7 หลายเดือนก่อน +2

    I just want to say you are AMAZING. Thank you so much. I would personally love to see a video on backprop to train this, or even just training an RNN since we saw multi dim training, but not training once we get the state machine / unrolling involved. Loved the whole series 🎉

    • @statquest
      @statquest  7 หลายเดือนก่อน

      Thanks! I have notes for training an RNN, but the equations get big really fast. That said, it really is the exact same techniques presented in other videos, just a lot more of them.

  • @cheolyeonbyun9640
    @cheolyeonbyun9640 8 หลายเดือนก่อน +3

    Congrats on 1 million subs statquest!! All the Love from Korea!!

    • @statquest
      @statquest  8 หลายเดือนก่อน +1

      Thank you very much!!! :)

  • @karlnikolasalcala8208
    @karlnikolasalcala8208 7 หลายเดือนก่อน +2

    YOU ARE THE BEST TEACHER EVER JOSHH!! I wish you can feel the raw feeling we feel when we watch your videos

    • @statquest
      @statquest  7 หลายเดือนก่อน

      bam! :)

  • @spartan9729
    @spartan9729 9 หลายเดือนก่อน +2

    Oh my. Thanks for the recap, it was so necessary for this video. It made the concept extremely clear.

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Glad it was helpful!

  • @ID10T_6B
    @ID10T_6B 15 วันที่ผ่านมา +1

    This is the only video on youtube that explains how such a complicated thing works so simply.

    • @statquest
      @statquest  15 วันที่ผ่านมา

      bam! :)

  • @dineth9d
    @dineth9d 7 หลายเดือนก่อน +5

    Hey Josh, I’ve been really digging your videos! They’re not only informative and helpful for my studies, but they’re also super entertaining. In fact, you’ve played a big part in my decision to continue pursuing AI Engineering. Could you please do a video about low-rank adaptation(LoRA). I am not good with that.

    • @statquest
      @statquest  7 หลายเดือนก่อน +3

      Thanks! I'll keep that in mind.

  • @konstantinlevin8651
    @konstantinlevin8651 9 หลายเดือนก่อน +1

    Woahh, this is actually cool. We appreciate it a lot Josh!

    • @statquest
      @statquest  9 หลายเดือนก่อน

      Thanks!

  • @namunamu5258
    @namunamu5258 7 หลายเดือนก่อน +3

    Thank you so much! It is an amazing video and I haven't seen a video teaching AI/ML techniques like this anywhere! You're talented. And my research areas span efficient LLMs (LoRA, quantization, etc.). It couldn't be better if I could see those concepts covered too.

    • @statquest
      @statquest  7 หลายเดือนก่อน +1

      Glad it was helpful!

  • @antonindusek3725
    @antonindusek3725 9 หลายเดือนก่อน +9

    Hello Josh, I am enjoying your videos as they are helping me so much with my studies as well as entertaining me. You are kind of the reason I decided to continue studying bioinformatics. Since you are covering ChatGPT and stuff now, could you maybe make a video about the AlphaFold architecture in the future? I understand it might not be your topic of interest, but I would love to learn it more deeply (pun intended). Thanks either way!

    • @statquest
      @statquest  9 หลายเดือนก่อน +2

      I'll keep that in mind.

  • @aseemlimbu7672
    @aseemlimbu7672 9 หลายเดือนก่อน +17

    Triple BAM ❤❤👌👌

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Hooray! :)

  • @ciciparsons3651
    @ciciparsons3651 9 หลายเดือนก่อน +1

    awesome, really helpful. Can't wait for another exciting episode!!

    • @statquest
      @statquest  9 หลายเดือนก่อน

      More to come!

  • @bhaskersuri1541
    @bhaskersuri1541 6 หลายเดือนก่อน +1

    This the most brilliant explanation that I have seen!!!!!! You are just awesome!!!!

    • @statquest
      @statquest  6 หลายเดือนก่อน

      Wow, thanks!

  • @al8-.W
    @al8-.W 9 หลายเดือนก่อน +9

    This video is proof that repetition is key when teaching advanced concepts. I've watched many similar videos in the past and could never get all of these numbers to finally make sense in my mind. With your previous transformer video, I was getting closer but somewhat got lost again with the Q, K and V values. Having this second video to watch right after made it clearer for me what all these numbers do and why we need them.

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      BAM! :)

  • @danberm1755
    @danberm1755 9 หลายเดือนก่อน +1

    Thanks again Josh! I noticed that many GPTs are decoder only. Thanks for clarifying!
    BTW saw that Yannic had a video on history rewrites. Probably not a topic for this channel, but still pretty cool 😁

    • @statquest
      @statquest  9 หลายเดือนก่อน

      Interesting!

  • @gabip265
    @gabip265 9 หลายเดือนก่อน +2

    Another great video as always! Would be amazing if you could continue with Masked Language Models such as BERT in the future!

    • @statquest
      @statquest  9 หลายเดือนก่อน +2

      I'll keep that in mind.

  • @asheeshmathur
    @asheeshmathur 9 หลายเดือนก่อน +1

    Delighted to watch one of the most brilliant videos. Hats off. Will join the channel tomorrow, first thing. Meanwhile, do you have one on the Probability Density Function?

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      All of my videos are organized on this page: statquest.org/video-index/

    • @asheeshmathur
      @asheeshmathur 9 หลายเดือนก่อน

      Thanks, all are good, but maybe I could not find one on the Probability Density Function. Could you please point me to that specific video?

  • @josephsueke
    @josephsueke 3 หลายเดือนก่อน +1

    incredible! this is such a clear explanation. thank you!

    • @statquest
      @statquest  3 หลายเดือนก่อน

      Thank you!

  • @ayseguldalgic
    @ayseguldalgic 4 หลายเดือนก่อน +1

    Hey Josh! You're a gift to this planet 😍 so thanks for these awesome explanations..

    • @statquest
      @statquest  4 หลายเดือนก่อน

      Wow, thank you!

  • @brucewayne6744
    @brucewayne6744 9 หลายเดือนก่อน +1

    Perfect video! Quick question, how are you drawing your lines? This line style is awesome!

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      I do everything in "keynote".

  • @gustavsnolle8424
    @gustavsnolle8424 8 หลายเดือนก่อน +2

    What an awesome video and channel😁👍. Would you consider doing a video on deep q learning models? I believe everyone would benefit from a video on such a fundamental topic. Thank you for your invaluable work🤩

    • @statquest
      @statquest  8 หลายเดือนก่อน

      I'll keep that in mind.

  • @colmon46
    @colmon46 6 หลายเดือนก่อน +1

    Your videos are awesome! I've never thought I could learn machine learning in such an easy way. Love from china

    • @statquest
      @statquest  6 หลายเดือนก่อน

      Thank you!

  • @chihhaohuang9858
    @chihhaohuang9858 4 หลายเดือนก่อน +1

    BAM... You really killed it. Thanks for your explanation.

    • @statquest
      @statquest  4 หลายเดือนก่อน

      Thank you!

  • @tzhynt
    @tzhynt 2 หลายเดือนก่อน +1

    Great explanation. It helped me a lot. A million hearts for you!!

    • @statquest
      @statquest  2 หลายเดือนก่อน

      Thank you!

  • @YumanKumar
    @YumanKumar 3 หลายเดือนก่อน +1

    Amazing Explanation! Double Bam 😊👍

    • @statquest
      @statquest  3 หลายเดือนก่อน

      Thank you! 😃

  • @mitch7w
    @mitch7w 9 หลายเดือนก่อน +1

    Thanks for the excellent explanation!

    • @statquest
      @statquest  9 หลายเดือนก่อน

      You are welcome!

  • @adam.phelps
    @adam.phelps 2 หลายเดือนก่อน +1

    I really enjoyed this video!

    • @statquest
      @statquest  2 หลายเดือนก่อน

      Thank you!

  • @user-se8ld5nn7o
    @user-se8ld5nn7o หลายเดือนก่อน +1

    Hey, fantastic video as usual! Getting hard to find new ways to compliment, haha.
    Just one quick question since you mentioned positional encoding. When generating embeddings from GPT embedding models (e.g., text-embedding-3-large), do the embeddings contain both positional encoding layer and masked-self-attention info in the numbers?

    • @statquest
      @statquest  หลายเดือนก่อน

      I believe it's just the word embeddings.

  • @yuanyuan524
    @yuanyuan524 9 หลายเดือนก่อน +1

    Thanks for clear explanation

    • @statquest
      @statquest  9 หลายเดือนก่อน

      Glad it was helpful!

  • @juaneshberger9567
    @juaneshberger9567 8 หลายเดือนก่อน +1

    great vids, any chance you could make videos on Q-Learning, Deep Q-Learning, and other RL Topics! Keep up the good work!

    • @statquest
      @statquest  8 หลายเดือนก่อน

      I hope to.

  • @Lzyue0092youtube
    @Lzyue0092youtube 3 หลายเดือนก่อน +1

    Your series almost saved me... love from China 💥

    • @statquest
      @statquest  3 หลายเดือนก่อน

      Happy to help!

  • @jugsma6676
    @jugsma6676 15 วันที่ผ่านมา +1

    This is a god-level YouTube channel

    • @statquest
      @statquest  15 วันที่ผ่านมา

      :)

  • @tamoghnamaitra9901
    @tamoghnamaitra9901 7 หลายเดือนก่อน +1

    Great Video. If possible, please do a video on model fine-tuning techniques like PEFT/LoRA

    • @statquest
      @statquest  7 หลายเดือนก่อน

      I'll definitely keep that in mind.

  • @vuhuynh8740
    @vuhuynh8740 2 หลายเดือนก่อน +1

    StatQuest is awesome!!

    • @statquest
      @statquest  2 หลายเดือนก่อน +1

      double bam!!! :)

  • @garychow7719
    @garychow7719 9 หลายเดือนก่อน +1

    thank you! the video is really nice

    • @statquest
      @statquest  9 หลายเดือนก่อน

      Glad you liked it!

  • @julianh7305
    @julianh7305 6 หลายเดือนก่อน +1

    Hi Josh, great video, as always. I was wondering if you would also make a video about Encoder-only Transformers, like Google's BERT for instance, which can also be used for a great variety of tasks.

    • @statquest
      @statquest  6 หลายเดือนก่อน

      I'll keep that in mind.

  • @RayGuo-bo6nr
    @RayGuo-bo6nr 8 หลายเดือนก่อน +1

    What a wonderful video!!! BTW, When will you publish your CD? I will buy it too😄Thanks!

    • @statquest
      @statquest  8 หลายเดือนก่อน

      BAM! Thank you!

  • @hoangminhan460
    @hoangminhan460 8 หลายเดือนก่อน +1

    that's perfect. Can you do more lectures on LLMs? Thanks a lot.

    • @statquest
      @statquest  8 หลายเดือนก่อน +1

      I'll keep that in mind.

  • @101alexmartin
    @101alexmartin 5 หลายเดือนก่อน +1

    Thanks for the great video, Josh. I got a question for you. What should drive my decision on which model to choose when facing a problem? In other words, how do I choose between an Encoder-Decoder transformer, a Decoder-only transformer, or an Encoder-only transformer? For instance, why was ChatGPT based on a Decoder-only model, and not on an Encoder-Decoder model or an Encoder-only model (like BERT, which has a similar application)?

    • @statquest
      @statquest  5 หลายเดือนก่อน +1

      Well, the reason ChatGPT chose Decoder-Only instead of Encoder-Decoder was that it was shown to work with half as many parameters. As for why they didn't use an Encoder-Only model, let me quote my friend and colleague, Sebastian Raschka: "In brief, encoder-style models are popular for learning embeddings used in classification tasks, encoder-decoder-style models are used in generative tasks where the output heavily relies on the input (for example, translation and summarization), and decoder-only models are used for other types of generative tasks including Q&A." magazine.sebastianraschka.com/p/understanding-encoder-and-decoder#:~:text=In%20brief%2C%20encoder%2Dstyle%20models,other%20types%20of%20generative%20tasks

  • @victorluo1049
    @victorluo1049 2 หลายเดือนก่อน +1

    Hello Josh, thank you again for your video !
    I had one question concerning training the model on next token prediction:
    As training data, would you use "What is statquest <EOS>" or "What is statquest <EOS> awesome"?
    What I mean by that is, when training the model by feeding it an input prompt such as "What is statquest <EOS>", do you also feed the model the word that comes after it (for calculating the loss), here "awesome"?

    • @statquest
      @statquest  2 หลายเดือนก่อน +1

      The training inputs were "What is statquest <EOS> awesome", and the labels were "is statquest <EOS> awesome <EOS>". I'm working on a video that goes through how to code a transformer and how to prepare the training data. Hopefully it will be out soon.
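
      (A minimal sketch of that shift-by-one setup, with the token list written out; illustrative only, not the video's code:)

        tokens = ["what", "is", "statquest", "<EOS>", "awesome", "<EOS>"]
        inputs = tokens[:-1]   # ["what", "is", "statquest", "<EOS>", "awesome"]
        labels = tokens[1:]    # ["is", "statquest", "<EOS>", "awesome", "<EOS>"]
        for x, y in zip(inputs, labels):
            print(f"after seeing ... {x}, the model should predict: {y}")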

    • @victorluo1049
      @victorluo1049 2 หลายเดือนก่อน +1

      @@statquest Thank you for your answer. I see that the decoder also learns to embed the input then (here, on the input <EOS>, the label is "awesome").
      I'm looking forward to your next video!

  • @exoticcoder5365
    @exoticcoder5365 9 หลายเดือนก่อน +3

    Hey Josh ! Would you mind making videos about graph neural networks ( GNN ) or graph convolutional network ( GCN ), and most importantly, the graph Attention Network ( GAT ) ? I have briefly gone over the maths these days, I already knew the matrix manipulation stuff but I think with your help, it would be much clear like your Transformer series, especially on the attention mechanism in the graph attention network ( GAT ), many Thanks 🙏🏻🙏🏻🙏🏻🙏🏻 appreciated !

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      I'll keep that in mind.

    • @nahiyan8
      @nahiyan8 9 หลายเดือนก่อน +1

      GNNs really only have two main elements to them, the aggregate function and the update function. The different choices of these two functions give rise to the different variants, GCN, GAT, etc.

  • @random-ds
    @random-ds 3 หลายเดือนก่อน

    Hello Josh! First of all thank you for this great video, as usual it's very simplified and straightforward.
    However, I have a little question. I saw your videos on transformers and this one, but every time I feel like the output is already there waiting to be embedded and then predicted. I mean, why can't the answer be "great" instead of "awesome"? What were the probabilities given by the model for "great" and for "awesome" to make the final prediction? Here I gave the example of one extra word (great), but in real life it's the whole dictionary of words that can be predicted. So when generating the output, does it compute the "query" and "key" of the whole dictionary of words and then hopefully the right word has the best softmax probability?
    Thanks in advance for the clarification.

    • @statquest
      @statquest  3 หลายเดือนก่อน

      No, you only calculate the queries, keys and values for the input tokens and the output as it is generated. However, in practice, instead of training on just a few phrases, we train on all of Wikipedia. As a result, the transformer can be much more expressive.
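
      (In other words, the choice between words like "great" and "awesome" is made by the final fully connected layer plus softmax, which scores every token in the vocabulary; a tiny sketch with made-up numbers:)

        import torch

        vocab = ["what", "is", "statquest", "<EOS>", "awesome", "great"]
        logits = torch.tensor([0.1, 0.2, 0.3, 0.5, 4.0, 2.0])  # made-up scores from the final fully connected layer
        probs = torch.softmax(logits, dim=0)                   # one probability per word in the vocabulary
        print(vocab[int(torch.argmax(probs))])                 # "awesome" wins, but "great" still gets some probability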

  • @ruksharalam173
    @ruksharalam173 9 หลายเดือนก่อน +1

    A thorough explanation 😀

    • @statquest
      @statquest  9 หลายเดือนก่อน

      Thanks!

    • @ruksharalam173
      @ruksharalam173 9 หลายเดือนก่อน

      @@statquest if possible, could you please do a video on structural differences between llama and GPT?

    • @statquest
      @statquest  9 หลายเดือนก่อน

      @@ruksharalam173 I'll keep that in mind.

  • @terryliu3635
    @terryliu3635 2 หลายเดือนก่อน

    Another great session, thank you!!! Quick question, how do we decide what numbers to use for the Keys and Values?

    • @statquest
      @statquest  2 หลายเดือนก่อน +1

      For the weights? Those are determined with backpropagation: th-cam.com/video/IN2XmBhILt4/w-d-xo.html

  • @jossevandekerchove1020
    @jossevandekerchove1020 6 หลายเดือนก่อน +1

    Can you please make a video about GNN? You are reaaallyy good at explaining

    • @statquest
      @statquest  6 หลายเดือนก่อน

      I'll keep that in mind.

  • @nobiaaaa
    @nobiaaaa 7 หลายเดือนก่อน

    Great explanation! Btw, what is the manuscript that first described the original GPT?

    • @statquest
      @statquest  7 หลายเดือนก่อน

      I believe it is called "Improving Language Understanding by Generative Pre-Training"

  • @NJCLM
    @NJCLM 9 หลายเดือนก่อน +1

    Awesome as always from you!! Now we only need a real tutorial with Python to create a mini transformer model. Hope it is in the making, as it's on my wish list.

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Working on it!

  • @nivethanyogarajah1493
    @nivethanyogarajah1493 3 หลายเดือนก่อน +1

    Incredible!

    • @statquest
      @statquest  3 หลายเดือนก่อน

      Thank you!

  • @iProFIFA
    @iProFIFA 9 หลายเดือนก่อน +2

    would love to learn about bidirectional transformers next ;-)

    • @statquest
      @statquest  9 หลายเดือนก่อน

      I'll keep that in mind.

    • @cristinaprecioso
      @cristinaprecioso 9 หลายเดือนก่อน +1

      @@statquest Pleeeeease, Josh!

  • @linhdinh136
    @linhdinh136 9 หลายเดือนก่อน

    Thank you, Josh, for yet another excellent video on GPT. I find myself slightly puzzled regarding the input and output used to train the Decoder-only transformer in your example. In the normal Transformer model, the training input would be "what is statquest <EOS>," and the output would be "awesome <EOS>."
    However, in the case of the Decoder-only model, as far as I understand, the training input remains "what is statquest <EOS>," but the output becomes "what is statquest <EOS> awesome <EOS>." Could you help to clarify this? If my understanding is correct, I'm wondering how the Decoder-only transformer knows when to stop during inference, considering that there are two <EOS> tokens within the generated response.

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Because the first <EOS> is technically part of the input, we just ignore it during inference. Alternatively, you could use a different token to indicate the end of the input.
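
      (A sketch of that stopping rule during inference; model.predict_next is a placeholder for "run the decoder and take the most probable token for the last position", not a real API:)

        def generate(model, prompt_ids, eos_id, max_new_tokens=20):
            ids = list(prompt_ids)                   # the prompt already ends with <EOS>
            for _ in range(max_new_tokens):
                next_id = model.predict_next(ids)    # placeholder call, see note above
                ids.append(next_id)
                if next_id == eos_id:                # the second <EOS> ends the response
                    break
            return ids[len(prompt_ids):]             # keep only the generated part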

  • @xspydazx
    @xspydazx 9 หลายเดือนก่อน

    ChatGPT is still a dialog system at its heart and has many different models which it gets results from. It softmaxes the outputs according to the intent, ... So intent detection plays a large role in the ChatGPT response... the transformers are doing major work... it's super interesting despite battling away with VB.NET!

  • @cosmicfluke3718
    @cosmicfluke3718 6 หลายเดือนก่อน +1

    We don't have to ask GPT to know StatQuest is awesome. Reply from GPT: BAM!!! BAM!! BAM!!

    • @statquest
      @statquest  6 หลายเดือนก่อน

      BAM! :)

  • @Nana-wu6fb
    @Nana-wu6fb 5 หลายเดือนก่อน +1

    Thanks!

    • @statquest
      @statquest  5 หลายเดือนก่อน

      Thank you so much for supporting StatQuest!!! BAM! :)

  • @jacksonrudd3886
    @jacksonrudd3886 9 หลายเดือนก่อน

    Thank you for the incredible content. Josh, quick question for you. I didn't see you mention vertically stacking the decoders in a way where the output of one decoder is the input for the next. From the 'Illustrated Transformer' page (I can't link b.c. youtube won't let me) it seems to be a core aspect of transformers. Thanks again.

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Personally, I wouldn't call that a core aspect. Unlike the ability to layer attention units, feeding the output of one decoder into the input of another in a stack did not influence how the decoder-only transformer (or even an encoder-decoder transformer) was designed. In contrast, the ability to layer attention had a big influence on how the transformer was designed.

    • @jacksonrudd3886
      @jacksonrudd3886 9 หลายเดือนก่อน

      That makes sense. I just took another look at the Attention is All You Need paper, and it corroborates your explanation. Thank you.
      Request: a video on the practicalities of creating and training a production LLM. The data volume, the number of parameters and how the production architecture differs from the simplified educational model provided in this video. I think this would allow the audience to better understand what simplifications were (rightly) made for the purpose of explication.
      Also thank you so much for what you do. You are creating some of the best educational content on the internet. I am so jealous of this upcoming generation for having teachers like you :)

    • @statquest
      @statquest  9 หลายเดือนก่อน

      @@jacksonrudd3886 That's right - in the Attention is all you need paper, they just mention the stacking (N=6) in passing and don't spend any time on it. And I'm planning on making the exact video that you want me to make. It may take some time, but it's in the works.
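
      (For readers curious what that stacking looks like in code, a minimal sketch, assuming make_block() builds one decoder block that maps a sequence of embeddings to a sequence of the same shape; the names are illustrative:)

        import torch.nn as nn

        class StackedDecoder(nn.Module):
            def __init__(self, make_block, n_layers=6):   # N=6 in "Attention Is All You Need"
                super().__init__()
                self.blocks = nn.ModuleList([make_block() for _ in range(n_layers)])

            def forward(self, x):
                for block in self.blocks:   # the output of one block is the input to the next
                    x = block(x)
                return x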

  • @MrHummerle
    @MrHummerle 9 หลายเดือนก่อน +1

    Hi there!
    Came to YT in hope you had a nice video of Rank Robustness. Would be amazing, if you wanted to make a video about it!
    Keep it up!
    Also: nice Dinosaurs!

    • @statquest
      @statquest  9 หลายเดือนก่อน

      Thanks!

  • @shamshersingh9680
    @shamshersingh9680 หลายเดือนก่อน

    Hi Josh, thanks a ton for making such a simple video on such a complex topic. Can you please explain what do you mean when you say "

    • @statquest
      @statquest  หลายเดือนก่อน

      Your comment is missing the quote that you have from the video. Could you retype it in?

    • @shamshersingh9680
      @shamshersingh9680 หลายเดือนก่อน

      @@statquest Yeah. Can you please explain - Note :- If we were training the Decoder-only transformer, then we would use the fact that we made a mistake to modify weights and biases. In contrast when we are just using the model to generate the responses, then it doesn't really matter what words come out right now.

  • @aryamansinha2932
    @aryamansinha2932 3 หลายเดือนก่อน

    Hello! First off, thank you for this great content.
    I had a question (or a few):
    Could you give an example of how the embedding neural network is trained? I.e., what are the input and output of the embedding neural network during training? The neural networks I have worked with have problem statements that go along the lines of "given a set of pixels, determine whether the picture is a cat or not". I do not know what the equivalent is with embedding neural networks.
    And a follow-up question: can the embedding neural network be the same for an encoder-decoder model and a decoder-only model?

    • @statquest
      @statquest  3 หลายเดือนก่อน +1

      1) We don't train the embedding layer separately from the rest of the transformer. So the inputs are what you see here as well as the ideal outputs that we use for training.
      2) Once trained, yes.

  • @kartikchaturvedi7868
    @kartikchaturvedi7868 9 หลายเดือนก่อน +1

    Superrrb Awesome Fantastic video

    • @statquest
      @statquest  9 หลายเดือนก่อน

      Thank you!

  • @AndreasAlexandrou-to5pw
    @AndreasAlexandrou-to5pw 8 หลายเดือนก่อน

    Had a couple of questions regarding word embedding:
    - Why do we represent each word using two values? Couldn't we just use a single one?
    - What is the purpose of the linear activation function, can't we just pass the summation straight to the embedder output?
    Thanks for the video!

    • @statquest
      @statquest  8 หลายเดือนก่อน

      1) Yes. In these examples I use 2 because that's the minimum required for the math to be interesting enough to highlight what's really going on. However, usually people use 512 or more embedding values.
      2) Yes. The activation functions serve only to be a point where we do summations.

  • @TheTimtimtimtam
    @TheTimtimtimtam 9 หลายเดือนก่อน +2

    Sir Josh, Thank you for making this public. May God Bless you.

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Thank you!

  • @yasboyy
    @yasboyy 9 หลายเดือนก่อน

    I just have a question. The Word Embedding network contains weights that were obtained with backpropagation. But on which data was it trained? Is it like a huge superset of our current "what is Statquest awesome EOS" vocabulary?

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      In this case, the word embedding network was trained with the same input/output sequences I used for the entire decoder-only transformer. In other words, I trained all of the weights at the same time, rather than training the word embeddings separately.

  • @OliviaB-xu1vc
    @OliviaB-xu1vc 2 หลายเดือนก่อน

    Thank you so much for another great video! I did have a question -- I'm confused about why you can train word embeddings with only linear activation functions because I thought that linear activation functions wouldn't allow you to learn non-linear patterns in the data, so why wouldn't you just not use an activation function at all in that case or use only one?

    • @statquest
      @statquest  2 หลายเดือนก่อน

      For word embeddings specifically, we want to learn linear relationships among the words. This is illustrated in my video on word embeddings: th-cam.com/video/viZrOnJclY0/w-d-xo.html And, technically, when coding a linear activation function, you just omit the activation function.
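
      (A tiny sketch of that last point in PyTorch; illustrative, not the video's code. A "linear activation" is just the identity, so in code the embedding layer is a lookup/weighted sum with no activation function applied afterwards:)

        import torch.nn as nn

        vocab_size, d_model = 4, 2
        embed = nn.Embedding(vocab_size, d_model)  # equivalent to a one-hot input times a weight matrix
        # note: no nn.ReLU() or nn.Sigmoid() here; "linear activation" means we keep the weighted sums as-is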

  • @Modern_Nandi
    @Modern_Nandi 7 หลายเดือนก่อน +1

    Brilliant

    • @statquest
      @statquest  7 หลายเดือนก่อน

      Thanks!

  • @jazzeuphoria
    @jazzeuphoria 9 หลายเดือนก่อน +2

    Thanks!

    • @statquest
      @statquest  9 หลายเดือนก่อน

      HOORAY!!!! Thank you so much for supporting StatQuest! BAM! :)

  • @AbuDurum
    @AbuDurum 9 หลายเดือนก่อน

    Hey Josh. I just want to ask what software you use to make the diagrams?

    • @statquest
      @statquest  9 หลายเดือนก่อน

      I use keynote and show some of my tricks here: th-cam.com/video/crLXJG-EAhk/w-d-xo.html

  • @enchanted_swiftie
    @enchanted_swiftie 9 หลายเดือนก่อน +2

    You didn't use the innocent, cozy, soft bear for softmax 🧸😢 _(in most of the parts)_

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Good point!!! I think I need a smaller bear. :)

  • @zhangeluo3947
    @zhangeluo3947 8 หลายเดือนก่อน

    Another question: just for training the encoder-decoder transformer, we can do Masked Self-Attention on all the ground-truth (known) decoded tokens at the same time? Is that right?

    • @statquest
      @statquest  8 หลายเดือนก่อน

      I believe that is correct

  • @ytpah9823
    @ytpah9823 8 หลายเดือนก่อน

    🎯 Key Takeaways for quick navigation:
    00:00 🤖 Decoder-only Transformers are used in ChatGPT to generate responses to input prompts.
    01:48 📊 Word embedding is a common method to convert words into numbers for neural networks like Transformers.
    08:09 🌐 Positional encoding is used in Transformers to maintain word order information in input data.
    10:53 🧩 Masked self-attention in Transformers helps associate words in a sentence by calculating similarities between words.
    16:28 🧮 Softmax function is used to determine the percentage of each word's influence on encoding a given word in self-attention.
    19:56 🧠 Reusing sets of weights for queries, keys, and values allows Transformers to handle prompts of different lengths.
    23:52 🤖 Decoder-only Transformers both encode input prompts and generate responses, enabling training and evaluation.
    25:58 🧠 The decoder-only Transformer process involves several steps, including word embedding, positional encoding, masked self-attention, residual connections, and softmax for generating responses.
    29:09 🤖 Masked self-attention in a decoder-only Transformer ensures it keeps track of significant words in the input when generating the output.
    32:23 🔄 Key differences between a decoder-only Transformer and a regular Transformer include using the same components for encoding and decoding in the decoder-only Transformer, using masked self-attention all the time, and including input and output in the attention mechanism.
    34:15 📚 During training, a regular Transformer uses masked self-attention on known output tokens to learn correct generation without cheating, while a decoder-only Transformer uses masked self-attention throughout the process.
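
    A small numeric sketch of the masked self-attention summarized at 10:53 and 16:28 above (the similarity scores are made up; PyTorch is used only for tril/softmax):

        import torch

        scores = torch.tensor([[1.0, 2.0, 3.0],
                               [2.0, 0.5, 1.0],
                               [0.1, 0.2, 0.3]])             # made-up query-key similarities for 3 tokens
        mask = torch.tril(torch.ones(3, 3)).bool()           # each token may only look at itself and earlier tokens
        weights = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
        print(weights)  # the first row is [1, 0, 0]: the first token uses 100% of its own Value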

    • @statquest
      @statquest  8 หลายเดือนก่อน

      bam!

  • @aryamansinha2932
    @aryamansinha2932 หลายเดือนก่อน

    one more question... why is there one common FC layer used in the decoder bit (predict statquest given "what is") vs (predicting awesome when given EOS token and "what is statquest")...i would think they would be separate FC layers for both of them since one is predicting the next word..the other is predicting the word in the middle?

    • @statquest
      @statquest  หลายเดือนก่อน

      If you use an encoder-decoder design, you can have different fully connected layers for the different parts of the input and output. However, they decided that this simpler model, with fewer parameters, worked better.

  • @JanKowalski-dm5vr
    @JanKowalski-dm5vr 2 หลายเดือนก่อน

    Great video. Do I understand correctly that the DNN responsible for word embedding not only converts the token to its representation as a numeric vector, but also already predicts which word should be returned next?

    • @statquest
      @statquest  2 หลายเดือนก่อน

      In a transformer the embedding layer alone does not predict the next word because it wasn't specifically trained to do that the way a stand alone word embedding layer (like word2vec) would.

    • @JanKowalski-dm5vr
      @JanKowalski-dm5vr 2 หลายเดือนก่อน

      @@statquest But if we train the whole model at the same time, doesn't backpropagation change the weights of the network responsible for word embedding in such a way that they learn to predict the next word? Or do we not train this first network while learning?

    • @statquest
      @statquest  2 หลายเดือนก่อน +1

      @@JanKowalski-dm5vr It might. But the whole model, word embeddings and attention and everything, is trained to predict the next word, or translate, or whatever it's trained to do. So it's hard to say exactly what the word embedding layer will learn.

  • @txxie
    @txxie 6 หลายเดือนก่อน

    Thank you, your video is great! But I'm really confused about the EOS token. Why does the model keep generating new words after generating the EOS token in the prompt? Should it just stop? What is the difference between the EOS tokens in the prompt and the output?

    • @statquest
      @statquest  6 หลายเดือนก่อน

      I'm not sure I understand your question. After the input prompt, we insert an EOS token so that the decoder will be correctly initialized and then we generate output tokens until a second EOS is generated.

    • @txxie
      @txxie 6 หลายเดือนก่อน

      Thank you for your reply, but most LLMs such as LLaMA and GPT do not use an EOS token to initialize the generation of the output.@@statquest

    • @statquest
      @statquest  6 หลายเดือนก่อน

      @@txxie The versions I've seen do. And if they don't, then they presumably use some other token that fills the same role. So, you can use one special token for both, or you can use two. Either way works.

  • @BooleanDisorder
    @BooleanDisorder 4 หลายเดือนก่อน

    Dude, can you make a video on state space models like Mamba? It's super interesting!

    • @statquest
      @statquest  3 หลายเดือนก่อน +1

      I'll keep that in mind.

    • @BooleanDisorder
      @BooleanDisorder 3 หลายเดือนก่อน +1

      Bam! @@statquest

  • @nmfhlbj
    @nmfhlbj 2 หลายเดือนก่อน

    Hi! Thank you so much for your great video! I have some questions that I haven't understood..
    1. How should I interpret query, key, and value, as in their definition? I've been watching a lot of attention videos but I still don't get what they actually are and how the input is converted into the Q, K, V.
    2. I'm doing research on forecasting stock prices using a transformer, but I still don't get how to do embedding with numerical values as the input (every other video explained it with words as the inputs).. do you know how?
    3. What is the attention output's shape? Is it a matrix or just a regular vector?
    Thank you!

    • @statquest
      @statquest  2 หลายเดือนก่อน

      1) Query, Key and Value are terms that come from databases. What they represent is in the video. Is there a time point (minutes and seconds) that is confusing?
      2) You don't need embedding if you start with numbers. The only reason we do embedding with words is to convert them to numbers.
      3) A matrix.

    • @nmfhlbj
      @nmfhlbj 2 หลายเดือนก่อน

      Thank you for your answer; here are some follow-up answers and questions:
      1) Yes there is, around minute 14, when you explain the query and key calculations.. I still don't get how we can multiply by one set of weights and get queries, then multiply by another set of weights and get keys.. what's the difference in what their weights represent?
      2) Oh okay, but in some research they did an embedding to make the numbers smaller.. is that possible?

    • @statquest
      @statquest  2 หลายเดือนก่อน

      1) The different sets of weights allow for the queries to be different from the keys. (if we used the same set of weights for both, they'd be the same).
      2) To be honest, I'm not sure I understand what it means to use embedding to make numbers smaller, but you could tokenize the numbers and use those as input to a word embedding layer.
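
      (A short sketch of point 1, PyTorch-style and purely illustrative:)

        import torch
        import torch.nn as nn

        d_model = 2
        W_q = nn.Linear(d_model, d_model, bias=False)  # one set of weights makes the queries
        W_k = nn.Linear(d_model, d_model, bias=False)  # a different set of weights makes the keys
        x = torch.randn(3, d_model)                    # 3 encoded tokens
        q, k = W_q(x), W_k(x)                          # same input, different projections, so queries differ from keys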

    • @nmfhlbj
      @nmfhlbj 2 หลายเดือนก่อน +1

      @@statquest okay, i'll search it more later.. thankyou for your time to answering my questions ! 🫶

  • @Primes357
    @Primes357 4 หลายเดือนก่อน

    I didn't understand just one part: how are the weights to calculate Q, K and V for each word in the sentence calculated? Is it also an optimization process? If so, how is the loss function calculated?

    • @statquest
      @statquest  4 หลายเดือนก่อน +1

      At 5:08 I say that all of the Weights in the entire transformer are determined using backpropagation. Specifically, we use cross entropy as the loss function. For more details about cross entropy, see: th-cam.com/video/6ArSys5qHAU/w-d-xo.html and th-cam.com/video/xBEh66V9gZo/w-d-xo.html
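
      (A minimal sketch of that loss; the shapes and numbers are illustrative assumptions, not the video's code:)

        import torch
        import torch.nn as nn

        vocab_size, seq_len = 6, 5
        logits = torch.randn(seq_len, vocab_size, requires_grad=True)  # one row of scores per position
        targets = torch.tensor([1, 2, 3, 4, 3])                        # the known next-token ids
        loss = nn.CrossEntropyLoss()(logits, targets)                  # cross entropy across every position
        loss.backward()  # backpropagation then nudges the query, key, value (and all other) weights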

  • @sreerajnr689
    @sreerajnr689 หลายเดือนก่อน

    In an encoder-decoder transformer, the encoder was trained on English and the decoder was trained on Spanish, which made it possible to do translations. But here, only English is used for both encoding and decoding, which makes it impossible to convert the English encoding to Spanish output. So here, would we use both language datasets combined to train the model to enable it to do translations as well?

    • @statquest
      @statquest  หลายเดือนก่อน +1

      Usually the tokens are just fragments of words, instead of entire words. This gives the decoder-only transformer more flexibility in terms of the vocabulary, since it can form new words it was never even trained on by combining the tokens in new ways. In this way, you can train a decoder-only transformer to translate English to Spanish.
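
      (A toy illustration of recombining word fragments; the fragment vocabulary and the greedy longest-match rule here are made up for illustration:)

        fragments = ["quest", "stat", "some", "aw", "ly", "e"]   # listed longest-first

        def tokenize(word):
            pieces, i = [], 0
            while i < len(word):
                match = next(f for f in fragments if word.startswith(f, i))
                pieces.append(match)
                i += len(match)
            return pieces

        print(tokenize("awesomely"))  # ['aw', 'e', 'some', 'ly'] -> a word the model may never have seen whole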

  • @cromi4194
    @cromi4194 6 หลายเดือนก่อน +3

    Wow, this series culminating in a perfect explanation of GPT is the most magnificent piece of education in the history of mankind. Explaining the very climax of data science in this understandable step-by-step way, so I can say that I understood it, should earn you the Nobel Prize in education! I am so grateful that you never used linear algebra in any of your videos. Professors at university don't understand that using linear algebra prevents everyone from actually understanding what is going on and leaves them only learning the formula.
    I have an exam in Data Science a week from Friday. Can you make a quick video about spectral clustering by Wednesday evening? I will pay you $250! :)

    • @statquest
      @statquest  6 หลายเดือนก่อน +2

      Thanks! If I could make a video on anything in a week, that would be a miracle. Unfortunately, all of my videos take forever to make.

  • @shinoo5004
    @shinoo5004 8 หลายเดือนก่อน

    Hi josh. Would you mind making a video for retention network?

    • @statquest
      @statquest  8 หลายเดือนก่อน

      I'll keep that in mind.

  • @zhangeluo3947
    @zhangeluo3947 8 หลายเดือนก่อน

    The last question is: how does stacking all the different cells of K, Q and V work? By just averaging their different outputs, or are they linearly transformed by a certain matrix W0?

    • @statquest
      @statquest  8 หลายเดือนก่อน

      They concatenate the outputs (the attention values) into a vector, and then run that vector through a neural network that has the same number of outputs as the word embeddings.
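
      (A sketch of that concatenate-then-combine step; the head count and sizes are illustrative assumptions:)

        import torch
        import torch.nn as nn

        n_heads, d_head, d_model, seq_len = 2, 2, 4, 3
        head_outputs = [torch.randn(seq_len, d_head) for _ in range(n_heads)]  # attention values from each head
        concat = torch.cat(head_outputs, dim=-1)             # (seq_len, n_heads * d_head)
        combine = nn.Linear(n_heads * d_head, d_model)       # network with as many outputs as the word embeddings
        attention_values = combine(concat)                   # (seq_len, d_model)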

  • @thanhtrungnguyen8387
    @thanhtrungnguyen8387 8 หลายเดือนก่อน

    In 25:42, when the model generates the wrong word, it will be fixed by backpropagation if this is the training process and it will be ignored if this is the generation process, right?

    • @statquest
      @statquest  8 หลายเดือนก่อน +1

      Yes.

  • @huiwencheng4585
    @huiwencheng4585 4 หลายเดือนก่อน +1

    Thank you!

    • @statquest
      @statquest  4 หลายเดือนก่อน

      Thank you so much for supporting StatQuest!!! TRIPLE BAM! :)

  • @Max-ry9wl
    @Max-ry9wl 6 หลายเดือนก่อน

    Hey Josh! I need to solve a generation task using a decoder-only model.
    How should I preprocess the corpus for this? I think that splitting it into 2 parts and separating the parts with a special token is a good solution.
    But I don't understand how to train this model and calculate the loss. The input for the model is tokens_first_part + tokens_second, and output[index of sep:] from the model is compared with input[index of sep:]?

    • @statquest
      @statquest  6 หลายเดือนก่อน +1

      I'll create videos on how to code transformers and decoder-only transformers soon.

  • @pfever
    @pfever หลายเดือนก่อน +1

    I don't understand why we need the residual connections.... =''( Isn't the word and position encoded values' information already included in the masked self-attention values? Or is most of that information lost, so we need to directly add the word and position encoded values back in?

    • @statquest
      @statquest  หลายเดือนก่อน

      In theory you do not need them, but in practice they make it much easier to train large neural networks since each component can focus on its own thing without having to maintain the information that came before it.
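
      (In code, the residual connection is just one addition; the tensors below are stand-ins:)

        import torch

        x = torch.randn(4, 2)     # word embeddings + position encodings for 4 tokens
        attn = torch.randn(4, 2)  # masked self-attention output for the same 4 tokens (stand-in values)
        out = x + attn            # residual connection: attention only has to learn the adjustment, not re-encode everything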

    • @pfever
      @pfever หลายเดือนก่อน +1

      @@statquest Bam! Thank you!

  • @ishaansehgal2570
    @ishaansehgal2570 4 หลายเดือนก่อน

    I am a bit confused about why we are encoding the input prompt and generating the next predicted word for each word in the input prompt. We don't use this information at all when generating the output part, right? For generating the output part we just use the K, Q, V from the input prompt and continue from there? How are the two parts connected?

    • @statquest
      @statquest  4 หลายเดือนก่อน +1

      That is correct - we don't use the output until we get to new stuff. However, if we wanted to, we could use the early output for training (since we know what the input is, we can compare it to what the decoder generates).

  • @edphi
    @edphi 6 หลายเดือนก่อน

    Damn, I have learned the whole of decoder and encoder models from start to finish, including training and deploying, but I did not understand the math the way I do now that you have opened the Pandora's box. Now the sine and cosine and query, key, value and everything is flying around in my head

    • @statquest
      @statquest  6 หลายเดือนก่อน

      bam?

  • @PetrBorkovec-wk1ux
    @PetrBorkovec-wk1ux 2 หลายเดือนก่อน

    I don't understand how it is possible to add the numbers for word embedding to the positional encoding and then to the self-attention. I think it is the same as adding together, for instance, length, weight and temperature? Could anybody help me, please??!

  • @adityarajora7219
    @adityarajora7219 6 หลายเดือนก่อน

    Begging, Please teach us the BERT model, BAM!!

    • @statquest
      @statquest  6 หลายเดือนก่อน

      I'll keep that in mind.

  • @AndrewChico
    @AndrewChico 8 วันที่ผ่านมา +1

    Hi Josh, excellent video. I only recently found you but your channel is amazing.
    It looks like you inadvertently copied over the example value for "is" from the value for "what" starting around 19:25, and this continues forward through the video.
    I mostly say this in case you ever adopt these notes directly into a book.

    • @statquest
      @statquest  8 วันที่ผ่านมา

      Thank you! I've corrected my notes and do, in fact, plan on including it in a book soon!

  • @laurentlusinchi519
    @laurentlusinchi519 3 หลายเดือนก่อน

    The embedding values for "what" and "Statquest" are identical before the positional encoding. Is that not a typo ?

    • @statquest
      @statquest  3 หลายเดือนก่อน

      That is correct. In order to illustrate how a decoder-only transformer worked, I had to make the model as simple as possible, and, as a result, some of the nuance in the values for the weights was lost.

  • @raminziaei6411
    @raminziaei6411 8 หลายเดือนก่อน

    Hi Josh! I'm a little bit confused about the whole idea of generating the input first, comparing it to the actual input, and using that to modify weights and biases in the training phase. I cannot find it mentioned anywhere on the internet. All I see is that the masked self-attention is used on the input sequence to make a contextualized version of each word, and then they are used to generate the target tokens. Nowhere can I find that generating the input sequence and comparing it to the actual input is part of the process. Can you please clarify?

    • @statquest
      @statquest  8 หลายเดือนก่อน

      What time point, minutes and seconds, are you asking about?

    • @raminziaei6411
      @raminziaei6411 8 หลายเดือนก่อน

      @@statquest The section "generating the next word in the prompt" in this video. mins 23-27

    • @statquest
      @statquest  8 หลายเดือนก่อน

      @@raminziaei6411 The idea of comparing the predicted input sequence to the known input sequence comes from the original manuscript that describes decoder only transformers, GENERATING WIKIPEDIA BY SUMMARIZING LONG SEQUENCES. They say: "Since the model is forced to predict the next token in the input as well as [the output] error signals are propagated from both input and output time-steps during training."

    • @raminziaei6411
      @raminziaei6411 7 หลายเดือนก่อน

      @@statquest Thanks Josh. A related question would be "Does this only hold true if we are considering causal decoder transformers where we have masked self-attention for both input and output sequences? For prefix decoder transformers, where the input has bidirectional self-attention (full self-attention) and the output has masked self-attention, it should not hold true. Is that correct? I mean if the input has bidirectional self-attention, there is no point in predicting the next token in the input, since it has already seen the whole input sequence.

    • @statquest
      @statquest  7 หลายเดือนก่อน

      @@raminziaei6411 I think "Encoder-Only Transformers", like BERT, use full self-attention on the input, even though they still can predict the input. However, I don't know for sure.

  • @shixiancui6870
    @shixiancui6870 8 หลายเดือนก่อน

    Looks like when we encode the prompts, we only need to compute K and V for each input word, then generate outputs token by token starting with EOS. Is this true? I'm a bit confused here because the previous half of your video shows that we also need to compute the whole self-attention values for the prompts, i.e. Q*K*V.
    Edit: maybe it's because we need to reuse the same masked self-attention cell for encoding, and so cannot avoid computing Q*K*V for prompts.

    • @statquest
      @statquest  8 หลายเดือนก่อน

      I'm not sure I understand your comment. We calculate the "queries" for every token in the input and the output, except for the final output <EOS>.

    • @shixiancui6870
      @shixiancui6870 8 หลายเดือนก่อน

      @@statquest Yes, what I meant was that, although we calculate "queries" for the input tokens, we don't use them to generate the first output token (we only use the K, V vectors in this case), am I right?

    • @statquest
      @statquest  8 หลายเดือนก่อน

      ​@@shixiancui6870 For the very first input token, "masked attention" only the Value numbers play a significant role. The Query and Key numbers are still used, but they always result in using 100% of the Value numbers for the first token.

  • @LifeObserver-007
    @LifeObserver-007 9 หลายเดือนก่อน +1

    Big thanks! Appreciate your efforts. I enjoyed your great book as well.
    Can you make a video explaining how ChatGPT and other LLMs do logical reasoning?
    The differences you presented between the normal transformer and the decoder-only transformer do not explain which is better for what application.

    • @iProFIFA
      @iProFIFA 9 หลายเดือนก่อน +1

      from what i could gather over these videos: encoder-decoder transformers work when you have an encoding sequence and a decoding sequence. you want to transform some input sequence into some other sequence in a different domain. for example when you want to train your model to translate from english to french.
      decoder-only transformers on the other hand have no target sequence, they just keep generating one token at a time. you don't want to translate or somehow transform the input sequence but rather keep generating text. e.g. you start a sentence and the transformer completes it.
      LLMs essentially manipulate symbols, such as words and phrases, based on patterns it has learned during its training. it doesn't have a true understanding of the meaning behind these symbols like humans do, so there isn't really any logical reasoning in them.

    • @statquest
      @statquest  9 หลายเดือนก่อน

      I'll keep those topics in mind. To be honest, I think the motive for using a Decoder-Only model was, because it had fewer overall weights (approximately half the number of weights in an encoder-decoder model), it scaled better. To quote from the original manuscript: "we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder- decoder architectures used in sequence transduction."

  • @zhangeluo3947
    @zhangeluo3947 8 หลายเดือนก่อน

    Hey sir, in terms of training that decoder-only generative transformer, each time we train on an input prompt, we just need x_1, ..., x_(n-1) (all tokens besides the last x_n), i.e. all those tokens' Masked Self-Attention vectors, to feed into the softmax and compare their outputs (which are the tokens generated immediately after them) against the real (ground-truth) prompt tokens? Is that true for training only?

    • @statquest
      @statquest  8 หลายเดือนก่อน

      To be honest, I'm not sure I understand your question, but for training, we compare the known tokens to the predicted tokens.

    • @zhangeluo3947
      @zhangeluo3947 8 หลายเดือนก่อน

      Okay, I get that@@statquest

    • @zhangeluo3947
      @zhangeluo3947 8 หลายเดือนก่อน

      By the way, my stupid question is: for decoder-only, is the training just focused on the input prompt? @@statquest

    • @statquest
      @statquest  8 หลายเดือนก่อน

      @@zhangeluo3947 For decoder only, we can use the input and the output for training.

  • @SakvaUA
    @SakvaUA 7 หลายเดือนก่อน

    So, when does one pick the encoder-decoder architecture, and when is decoder-only sufficient?

    • @statquest
      @statquest  7 หลายเดือนก่อน

      It might depend on the problem, but I think the real question might be when encoder-only is best vs when decoder-only is best. And that definitely depends on the problem. Encoder-only models use unmasked attention all the time, so they are best for problems where looking ahead really is needed.

  • @by301892
    @by301892 หลายเดือนก่อน

    I feel it's a bit misleading that it seems the tokens of the input sequence are fed in one by one, and that when you put in the first token, it predicts the second token but just ignores it, whereas in reality it feeds in the entire sequence to predict the next target token, and on the next iteration, you append the target token to the input sequence and it predicts the second target token, and so on. Right?

    • @statquest
      @statquest  หลายเดือนก่อน

      At 26:28 I state that each token in the prompt is processed simultaneously.

    • @by301892
      @by301892 หลายเดือนก่อน +1

      @@statquest gotcha. Thanks for the clarification, sensei.

  • @sreerajnr689
    @sreerajnr689 หลายเดือนก่อน

    Is it the same network that is being used in BERT and GPT? What makes them different?

    • @statquest
      @statquest  หลายเดือนก่อน

      BERT is an encoder-only transformer. The major difference is that in BERT, attention can look at stuff that comes before and after, instead of just before.