At 25:42, when the model generates the wrong word, it will be fixed by backpropagation if this is the training process, and it will be ignored if this is the generation process, right?
20:56 I think the value of the word "is" is miswritten: it should be 1.1, 0.9, not 2.9, 1.3; it should not be the same as the value of the word "what". Thank you for your videos btw, your explanation is awesome.
ChatGPT is still a dialog system at its heart and has many different models it gets results from. It softmaxes the outputs according to the intent, so intent detection plays a large role in the ChatGPT response. The transformers are doing major work. It's super interesting, despite battling away with VB.NET!
Hi! Thank you so much for your great video! I have some questions that I haven't understood.. 1. How should I interpret query, key, and value, as in their definition? I've been watching a lot of attention videos but I still don't get what they actually are and how to convert the input into the Q, K, V. 2. I'm doing research on forecasting stock prices using a transformer, but I still don't get how to do the embedding with numerical values as the input (every other video explains it with words as the inputs).. do you know how? 3. What is the attention's output shape? Is it a matrix or just a regular vector? Thank you!
1) Query, Key and Value are terms that come from databases. What they represent is in the video. Is there a time point (minutes and seconds) that is confusing? 2) You don't need embedding if you start with numbers. The only reason we do embedding with words is to convert them to numbers. 3) A matrix.
Thank you for your answer, here are some follow-up answers and questions: 1) Yes there is, around minute 14 when you explain the query and key calculations.. I still don't get how we can multiply by one set of weights and get the queries, then multiply by another set of weights and get the keys.. what is the difference in what their weights represent? 2) Oh okay, but in some research they did an embedding to make the numbers smaller.. is that possible?
1) The different sets of weights allow for the queries to be different from the keys. (if we used the same set of weights for both, they'd be the same). 2) To be honest, I'm not sure I understand what it means to use embedding to make numbers smaller, but you could tokenize the numbers and use those as input to a word embedding layer.
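To make the reply above concrete, here is a minimal PyTorch sketch of queries, keys and values coming from three separate weight matrices that are reused for every token, which is also why the same weights handle prompts of any length. The sizes and names are made up for illustration; this is not the code behind the video.

import torch
import torch.nn.functional as F

d_model = 2                           # 2 embedding values per token, as in the video
prompt = torch.randn(3, d_model)      # encodings for a 3-token prompt, one row per token

W_q = torch.randn(d_model, d_model)   # three separate weight matrices
W_k = torch.randn(d_model, d_model)   # (random placeholders here; normally learned by backpropagation)
W_v = torch.randn(d_model, d_model)

Q = prompt @ W_q                      # one query per token
K = prompt @ W_k                      # one key per token
V = prompt @ W_v                      # one value per token

weights = F.softmax(Q @ K.T, dim=-1)  # similarities -> percentages (masking omitted in this sketch)
attention_out = weights @ V           # weighted sums of the values

longer_prompt = torch.randn(10, d_model)
print((longer_prompt @ W_q).shape)    # torch.Size([10, 2]) -- the same weights work for any prompt length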
I feel it's a bit misleading that the tokens of the input sequence seem to be fed in one by one, and that when you put in the first token, it predicts the second token but just ignores it, whereas in reality it feeds the entire sequence to predict the next target token, and on the next iteration you append the target token to the input sequence and predict the second target token, and so on. Right?
Hello Josh, thank you very much for your videos, they are by far the most informative I have seen! I had a question regarding the training of generative transformers: Can a generative encoder-decoder transformer (we expect it to behave like GPT-3 or LLaMA) be trained with next token prediction? Because from what I understand, for inference, to generate the output, we encode the input sentence, then we feed <EOS> to the decoder (embedding layer), then we get the prediction of the first token, which we re-feed to the decoder to generate the next token, until we get an <EOS>. So we get a sentence as output. However, if the training was done with next token prediction, it means that given an input (sentence), we only try to predict the very next token, which means that we encode the input, we feed <EOS> to the decoder, we get the token prediction and that's it. In that case, the decoder's embedding layer never sees tokens other than <EOS> in the training. So during inference, how could it comprehend tokens other than <EOS>? Maybe my assumption about the decoder only receiving <EOS> during next token prediction pre-training is false.
To train an encoder-decoder we do something called "teacher forcing" which allows the network to predict the next token and then continue to predict all of the other tokens that are in the desired output, one token at a time. For details on how teacher forcing works, see: th-cam.com/video/L8HKweZIOmg/w-d-xo.html
@@statquest thank you for your answer! If I understand correctly, it is thus impossible to train an encoder-decoder on next token prediction, as we need longer outputs. In your videos, we see an encoder-decoder which is trained seq2seq for a specific task, like translation. Is it possible to build a task agnostic (like GPT-3) encoder-decoder by pre-training it seq2seq, with next sentence prediction for example? And concerning task agnostic decoder-only models (like GPT), is it because the encoder and decoder share the same structure and weights that it is possible to pre-train it with next token prediction? Because even if the decoder's embedding layer only sees <EOS> during training, the encoder's embedding layer sees many different tokens, and since they share weights, the decoder's embedding layer also learns.
When I say "task agnostic model", I mean a generative model to which you can feed any prompt as input, and it will generate text as an answer, so not specific to any task. So my question is about which tasks we can train these models on (like next token prediction, masked language modelling), so that they can be task agnostic. Sorry if I'm not clear enough!
@@victorluo1049 I think we might be using different definitions for "next token prediction". To me, "next token prediction" can be applied to long outputs because we predict the output one token at a time given the preceding input and predictions. So whatever we predict, we feed it back into the model and then predict the next token. Thus, encoder-decoder and decoder or encoder only transformers all do "next token prediction". If you are using a different definition for "next token prediction", then you might come to a different conclusion.
@@statquest my definition of next token prediction is predicting the token n+1 with the tokens 1 to n as input For example let's say we have tokens 1 to 10 and I use an input window of 3 tokens, then during training : Sample 1. Input : tokens 1 to 3. Target : token 4. Details : we encode tokens 1 to 3, and feed it to the decoder, we also feed to the decoder's embedding layer. Then the decoder outputs a token prediction, which we hope to be equal to token 4. (The loss is probably calculated by comparing the probability distributions ?) Sample 2. Input : tokens 2 to 4. Target : token 5 Sample 3. Input : tokens 3 to 5. Target : token 6 Etc So the tokens 4,5,6 are predicted, but separately. There is no mechanism of feeding back an output (or true value if we use teacher forcing) to the decoder to predict the next output. So here, during each training step, the decoder only receives as input. Which would be problematic during inference, as we are supposed to feed each predicted token (which are different from ) back to the decoder to predict the next one. I may have a misunderstanding about this, but after reading GPT first paper, it feels like this is basically how the training works, they wrote about maximizing the likelihood L(token n+1 | token 1,...,token n) I would have understood if for inference we feed the initial prompt to the encoder, get a token prediction, add it to the initial prompt then re-feed it to the encoder, to get the next token, etc... But after seeing your video I saw that it is not done by iteratively feeding the encoder, but rather by iteratively feeding the decoder, so I am a bit confused (maybe this is actually only true for models that are trained seq2seq ?)
Great video. Do I understand correctly that the DNN responsible for word embedding not only converts the token to its representation as a numeric vector, but already predicts what the next word should be?
In a transformer the embedding layer alone does not predict the next word because it wasn't specifically trained to do that the way a stand alone word embedding layer (like word2vec) would.
@@statquest But if we train the whole model at the same time, then backpropagation does not change the weights of the network responsible for word embedding in such a way that they learn to predict the next word? Or don't we train this first network while learning ?
@@JanKowalski-dm5vr It might. But the whole model, word embeddings and attention and everything, is trained to predict the next word, or translate, or whatever it's trained to do. So it's hard to say exactly what the word embedding layer will learn.
I'll try to do that. However, the only significant difference with encoder-only transformers is that they don't use masking when calculating attention.
In an encoder-decoder transformer the encoder was trained on English and the decoder was trained on Spanish, which made it possible to do translations. But here, only English is used for both encoding and decoding, which makes it impossible to convert the English encoding to Spanish output. So here, would we use both language datasets combined to train the model to enable it to do translations as well?
Usually the tokens are just fragments of words, instead of entire words. This gives the decoder-only transformer more flexibility in terms of the vocabulary, since it can form new words it was never even trained on by combining the tokens in new ways. In this way, you can train a decoder-only transformer to translate English to Spanish.
one more question... why is there one common FC layer used in the decoder bit (predict statquest given "what is") vs (predicting awesome when given EOS token and "what is statquest")...i would think they would be separate FC layers for both of them since one is predicting the next word..the other is predicting the word in the middle?
If you use an encoder-decoder design, you can have different fully connected layers for the different parts of the input and output. However, they decided that this simpler model, with fewer parameters, worked better.
I didn't understand just one part: how are the weights to calculate Q, K and V for each word in the sentence calculated? Is it also an optimization process? If so, how is the loss function calculated?
At 5:08 I say that all of the Weights in the entire transformer are determined using backpropagation. Specifically, we use cross entropy as the loss function. For more details about cross entropy, see: th-cam.com/video/6ArSys5qHAU/w-d-xo.html and th-cam.com/video/xBEh66V9gZo/w-d-xo.html
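To illustrate that reply, here is a rough PyTorch sketch (not the video's code; the names and sizes are invented and positional encoding is left out) showing that a single cross-entropy loss plus backpropagation updates every weight at once, including the query, key and value weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size=5, d_model=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)                                 # (positional encoding omitted for brevity)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        mask = torch.tril(torch.ones(tokens.size(1), tokens.size(1))).bool()
        scores = (q @ k.transpose(1, 2)).masked_fill(~mask, float("-inf"))
        x = x + F.softmax(scores, dim=-1) @ v                  # masked self-attention + residual connection
        return self.out(x)                                     # one logit per vocabulary token, per position

model = TinyDecoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)        # covers all of the weights at once

tokens = torch.tensor([[0, 1, 2, 3, 4]])                       # e.g. what, is, statquest, <EOS>, awesome
logits = model(tokens[:, :-1])                                 # predict the next token at each position
loss = F.cross_entropy(logits.reshape(-1, 5), tokens[:, 1:].reshape(-1))
loss.backward()                                                # backpropagation computes all the gradients
optimizer.step()                                               # ...and every weight gets updated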
I am a bit confused about why we are encoding the input prompt and generating the next predicted word for each word in the input prompt. We don't use this information at all when generating the output part, right? For generating the output part we just use the K, Q, V from the input prompt and continue from there? How are the two parts connected?
That is correct - we don't use the output until we get to new stuff. However, if we wanted to, we could use the early output for training (since we know what the input is, we can compare it to what the decoder generates).
Thank you, your video is great! But I'm really confused about the EOS token. Why does the model keep generating new words after generating the EOS token in the prompt? Should it just stop? What is the difference between the EOS tokens in the prompt and the output?
I'm not sure I understand your question. After the input prompt, we insert an EOS token so that the decoder will be correctly initialized and then we generate output tokens until a second EOS is generated.
@@txxie The versions I've seen do. And if they don't, then they presumably use some other token that fills the same role. So, you can use one special token for both, or you can use two. Either way works.
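A sketch of the generation loop described above: the prompt ends with an <EOS>-style token, and generation stops when the model produces that token again. The model here is assumed to be any decoder that returns one row of logits per position (for example the TinyDecoder sketch earlier in this thread); all names are illustrative.

import torch

def generate(model, prompt_ids, eos_id, max_new_tokens=20):
    """Greedy generation: the prompt already ends with eos_id, and we stop
    when the model generates eos_id again (or we hit the length limit)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))        # (1, len(ids), vocab_size)
        next_id = int(logits[0, -1].argmax())      # most likely next token
        ids.append(next_id)
        if next_id == eos_id:                      # the second <EOS> ends the response
            break
    return ids[len(prompt_ids):]                   # return only the newly generated part

# e.g. with the toy vocabulary what=0, is=1, statquest=2, <EOS>=3, awesome=4:
# generate(model, prompt_ids=[0, 1, 2, 3], eos_id=3) would ideally return [4, 3]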
Damn, I have learned the whole of decoder and encoder models from start to finish, including training and deploying, but never understood the math until the way you opened the Pandora's box. Now the sine and cosine and query, key, value and everything is flying around in my head.
Hey Josh! I need to solve a generation task using a decoder-only model. How should I preprocess the corpus for this? I think that splitting it into 2 parts and separating the parts with a <SEP> token is a good solution. But I don't understand how to train this model and calculate the loss. The input for the model is tokens_first_part + tokens_second_part, and the model's output[index of sep:] is compared with input[index of sep:]?
To learn more about Lightning: lightning.ai/
Support StatQuest by buying my books The StatQuest Illustrated Guide to Machine Learning, The StatQuest Illustrated Guide to Neural Networks and AI, or a Study Guide or Merch!!! statquest.org/statquest-store/
Bruh. This channel is criminally underrated.
Thanks!
🎉🎉🎉
Love your work!
Bam!
Well then criminals should rate it more highly
@@rickymort135 I laughed. ⭐️
This explanation is essential for anyone looking to understand how ChatGPT works. While more in-depth exploration is necessary to grasp all the intricacies fully, I believe this explanation couldn't be better. It's exactly what I needed.
Thanks! I have a video that shows how all of these calculations are done using matrix algebra coming out soon.
And I thought you'd stop at ChatGPT. Thanks for never stopping to learn and teach!
Thank you!
Yes it's a good series
Haven't had a single stats course in over 3 years but I still keep up with this channel from time to time! Neural networks are way more complex than what I've ever had to deal with, but you manage to break down even these topics into bite size pieces...Bam!!
Thank you so much!!!
Was stuck on a stupid detail about the architecture for a couple of hours. Your elaborate illustrations helped me make it clear in my mind right away! Thanks, keep going :)
Glad it helped!
Quests on attention, transformer and decoder only transformer are of immeasurable value! Thank you so much! Keep the quests coming!
Thanks, will do!
Congrats on 1 million subs statquest!! All the Love from Korea!!
Thank you very much!!! :)
Oh my. Thanks for the recap, it was so necessary for this video. It made the concept extremely clear.
Glad it was helpful!
YOU ARE THE BEST TEACHER EVER JOSHH!! I wish you can feel the raw feeling we feel when we watch your videos
bam! :)
This video is proof that repetition is prime when teaching advanced concepts. I've watched many similar videos in the past and could never get all of these numbers to finally make sense in my mind. With your previous transformer video, I was getting closer but somewhat got lost again with the QVK values. Having this second video to watch in a row made it clearer for me what all these numbers do and why we need them.
BAM! :)
🎯 Key Takeaways for quick navigation:
00:00 🤖 Decoder-only Transformers are used in ChatGPT to generate responses to input prompts.
01:48 📊 Word embedding is a common method to convert words into numbers for neural networks like Transformers.
08:09 🌐 Positional encoding is used in Transformers to maintain word order information in input data.
10:53 🧩 Masked self-attention in Transformers helps associate words in a sentence by calculating similarities between words.
16:28 🧮 Softmax function is used to determine the percentage of each word's influence on encoding a given word in self-attention.
19:56 🧠 Reusing sets of weights for queries, keys, and values allows Transformers to handle prompts of different lengths.
23:52 🤖 Decoder-only Transformers both encode input prompts and generate responses, enabling training and evaluation.
25:58 🧠 The decoder-only Transformer process involves several steps, including word embedding, positional encoding, masked self-attention, residual connections, and softmax for generating responses.
29:09 🤖 Masked self-attention in a decoder-only Transformer ensures it keeps track of significant words in the input when generating the output.
32:23 🔄 Key differences between a decoder-only Transformer and a regular Transformer include using the same components for encoding and decoding in the decoder-only Transformer, using masked self-attention all the time, and including input and output in the attention mechanism.
34:15 📚 During training, a regular Transformer uses masked self-attention on known output tokens to learn correct generation without cheating, while a decoder-only Transformer uses masked self-attention throughout the process.
bam!
Thank you so much! It is an amazing video and I haven't seen a video teaching AI/ML techniques like this anywhere! You're talented. My research areas span efficient LLMs (LoRA, quantization, etc.). It couldn't be better if I could also see those concepts covered.
Glad it was helpful!
I just want to say you are AMAZING. Thank you so much. I would personally love to see a video on backprop to train this, or even just training an RNN since we saw multi dim training, but not training once we get the state machine / unrolling involved. Loved the whole series 🎉
Thanks! I have notes for training an RNN, but the equations get big really fast. That said, it really is the exact same techniques presented in other videos, just a lot more of them.
This is the only video on youtube that explains how such a complicated thing works so simply.
bam! :)
It summarized how GPT style of transformer architecture works and also helps us to understand how ChatGPT generates the text. Simply beautiful
Thank you!
Liking this video before i even start watching it as i know the content is simply brilliant!🎉
bam!
Hey Josh, I’ve been really digging your videos! They’re not only informative and helpful for my studies, but they’re also super entertaining. In fact, you’ve played a big part in my decision to continue pursuing AI Engineering. Could you please do a video about low-rank adaptation(LoRA). I am not good with that.
Thanks! I'll keep that in mind.
Woahh, this is actually cool. We appreciate it a lot Josh!
Thanks!
Your videos are awesome! I've never thought I could learn machine learning in such an easy way. Love from china
Thank you!
This the most brilliant explanation that I have seen!!!!!! You are just awesome!!!!
Wow, thanks!
I immediately rushed to amazon and purchased your book. Will get it in few days.
Hooray!!! I also have a new book coming out in early January. It's all about neural networks.
Thanks!
Thank you so much for supporting StatQuest!!! BAM! :)
Delighted to watch one of the most brilliant videos. Hats off. Will join the channel tomorrow, first thing. Meanwhile, do you have one on Probability Density Functions?
All of my videos are organized on this page: statquest.org/video-index/
Thanks, all are good, but maybe I could not find the one on Probability Density Functions. Could you please point me to that specific video?
awesome, really helpful. Can't wait for another exciting episode!!
More to come!
Hey Josh! You're a gift to this planet 😍 so thanks for these awesome explanations..
Wow, thank you!
Hello Josh, I am enjoying your videos as they are helping me so much with my studies as well as entertaining me. You are kind of a reason I decided to continue studying bioinformatics. Since you are covering ChatGPT and stuff now, could you maybe make a video about the AlphaFold architecture in the future? I understand it might not be your topic of interest, but I would love to learn it more deeply (pun intended). Thanks either way!
I'll keep that in mind.
BAM... You really killed it. Thanks for your explanation.
Thank you!
Another great video as always! Would be amazing if you could continue with Masked Language Models such as BERT in the future!
I'll keep that in mind.
incredible! this is such a clear explanation. thank you!
Thank you!
Wonderful Explanation! With great Visualisations!!!
Thank you!
Triple BAM ❤❤👌👌
Hooray! :)
Thanks again Josh! I noticed that many GPTs are decoder only. Thanks for clarifying!
BTW saw that Yannic had a video on history rewrites. Probably not a topic for this channel, but still pretty cool 😁
Interesting!
Thanks!
HOORAY!!!! Thank you so much for supporting StatQuest! BAM! :)
What an awesome video and channel😁👍. Would you consider doing a video on deep q learning models? I believe everyone would benefit from a video on such a fundamental topic. Thank you for your invaluable work🤩
I'll keep that in mind.
Amazing Explanation! Double Bam 😊👍
Thank you! 😃
Thank you!
Thank you so much for supporting StatQuest!!! TRIPLE BAM! :)
Great explanation. It helped me a lot. A million hearts for you!!
Thank you!
Sir Josh, Thank you for making this public. May God Bless you.
Thank you!
Your series almost saved me... love from China 💥
Happy to help!
Hello Josh, thank you again for your video !
I had one question concerning training the model on next token prediction:
As training data, would you use "What is statquest <EOS>" or "What is statquest awesome"?
What I mean by that is, when training the model by feeding it an input prompt such as "What is statquest <EOS>", do you also feed the model the word that comes after it (for calculating the loss), here "awesome"?
The training inputs were "What is statquest <EOS> awesome", and the labels were "is statquest <EOS> awesome <EOS>". I'm working on a video that goes through how to code a transformer and how to prepare the training data. Hopefully it will be out soon.
@@statquest Thank you for your answer. I see that the decoder also learns to embed the input then (here, on the input <EOS>, the label is "awesome").
I'm looking forward to your next video!
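In code, the pairing Josh describes (inputs and labels shifted by one position) might look like the sketch below; the token ids are invented and this is not the actual training script.

# toy vocabulary: what=0, is=1, statquest=2, <EOS>=3, awesome=4
phrase = ["what", "is", "statquest", "<EOS>", "awesome", "<EOS>"]
ids = [0, 1, 2, 3, 4, 3]

inputs = ids[:-1]   # what, is, statquest, <EOS>, awesome
labels = ids[1:]    # is, statquest, <EOS>, awesome, <EOS>

# each position is trained to predict its label from the tokens up to and
# including that position (masked self-attention hides everything later):
for t in range(len(inputs)):
    print(phrase[:t + 1], "->", phrase[t + 1])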
Great Video. If possible, please do a video on model fine-tuning techniques like PEFT/LoRA
I'll definitely keep that in mind.
would love to learn about bidirectional transformers next ;-)
I'll keep that in mind.
@@statquest Pleeeeease, Josh!
Thanks for the excellent explanation!
You are welcome!
Endless thanks to your awesome explanation. Would you mind clarifying the next confusion to me? You've mentioned that the normal transformer uses masked self-attention only during training and normal self-attention during inference, while according to other resources, including your iterations through the examples (I think you followed the masked self-attention mechanism), and in 34:20, how are we supposed to know the next tokens during training (parallel training) while we are restricted by the mask?
and thank you in advance.
We only apply masking to the attention mechanism. For more details, see: th-cam.com/video/KphmOJnLAdI/w-d-xo.html
@@statquest Understood, thank you very much for your quick answer. bam! :)
Thanks for the great video, Josh. I got a question for you. What should drive my decision on which model to choose when facing a problem? In other words, how to choose between an Encoder-Decoder transformer, Decoder-only transformer or Encoder-only transformer? For instance, why ChatGPT was based on a Decoder-only model, and not on a Encoder-Decoder model or an Encoder-only model (like BERT, which has a similar application)
Well, the reason ChatGPT choose Decoder-Only instead of Encoder-Decoder was that it was shown to work with half as many parameters. As for why they didn't use an Encoder-Only model, let me quote my friend and colleague, Sebastian Raschka: "In brief, encoder-style models are popular for learning embeddings used in classification tasks, encoder-decoder-style models are used in generative tasks where the output heavily relies on the input (for example, translation and summarization), and decoder-only models are used for other types of generative tasks including Q&A." magazine.sebastianraschka.com/p/understanding-encoder-and-decoder#:~:text=In%20brief%2C%20encoder%2Dstyle%20models,other%20types%20of%20generative%20tasks
Had a couple of questions regarding word embedding:
- Why do we represent each word using two values? Couldn't we just use a single one?
- What is the purpose of the linear activation function, can't we just pass the summation straight to the embedder output?
Thanks for the video!
1) Yes. In these examples I use 2 because that's the minimum required for the math to be interesting enough to highlight what's really going on. However, usually people use 512 or more embedding values.
2) Yes. The activation functions serve only to be a point where we do summations.
A thorough explanation 😀
Thanks!
@@statquest if possible, could you please do a video on structural differences between llama and GPT?
@@ruksharalam173 I'll keep that in mind.
great vids, any chance you could make videos on Q-Learning, Deep Q-Learning, and other RL Topics! Keep up the good work!
I hope to.
Very good video, thank you!
Thank you!
Hey, fantastic video as usual! Getting hard to find new ways to compliment, haha.
Just one quick question since you mentioned positional encoding. When generating embeddings from GPT embedding models (e.g., text-embedding-3-large), do the embeddings contain both positional encoding layer and masked-self-attention info in the numbers?
I believe it's just the word embeddings.
Can you please make a video about GNN? You are reaaallyy good at explaining
I'll keep that in mind.
I really enjoyed this video!
Thank you!
Hi Josh, great video, as always. I was wondering if you would also make a video about Encoder-only Transformers, like Google's BERT for instance, which can also be used for a great variety of tasks.
I'll keep that in mind.
This is a god level YoutTube channel
:)
Awesome as always from you!! Now we only need a real tutorial with Python to create a mini transformer model. Hope it is in the making, as it's on my wish list.
Working on it!
that's perfect. Can you do more lectures on LLMs? Thanks a lot.
I'll keep that in mind.
We don't have to ask GPT to know StatQuest is awesome. Reply from GPT: BAM!!! BAM!! BAM!!
BAM! :)
Thank you for a deep explanation. I have a question: does the size of the output layer equal the size of the vocabulary?
yes
If the size of the vocabulary is 5 million, then I need 5 million neurons at the output layer ?
@@laythherzallah3493 Yes. But usually vocabulary sizes are much smaller (in the range of tens of thousands of tokens, rather than millions).
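As a small illustration of that answer (the sizes below are arbitrary): the final fully connected layer has one output per token in the vocabulary, and softmax turns those outputs into probabilities.

import torch
import torch.nn as nn

vocab_size = 30000                       # one output neuron per token in the vocabulary
d_model = 512                            # size of the representation coming out of the decoder
to_vocab = nn.Linear(d_model, vocab_size)

x = torch.randn(1, d_model)              # representation of the current position
probs = torch.softmax(to_vocab(x), dim=-1)
print(probs.shape)                       # torch.Size([1, 30000]): one probability per token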
Perfect video! Quick question, how are you drawing your lines? This line style is awesome!
I do everything in "keynote".
Hi there! Thank you for the awesome video. It is very helpful and clear. I have a question about the part around 34:10, where you are talking about how normal transformer uses Masked Self-Attention in training. I did not quite understand how that is Masked Self-Attention (i.e. looking at the tokens before). Is the Decoder still looking at the input tokens, or is it looking at the output tokens that come before it? Thank you very much!
During training we know what the output should be, but we don't let the decoder look at tokens that come after the one it is currently processing. So if the output is "my name is fred", then, when we calculate attention for "name", the decoder can't use "is" and "fred". What's going on here might be more obvious if you look at how the math is done with matrix algebra in this video: th-cam.com/video/KphmOJnLAdI/w-d-xo.html (and if you are not already familiar with matrix algebra, see: th-cam.com/video/ZTt9gsGcdDo/w-d-xo.html )
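The "don't look at later tokens" rule is typically implemented as a triangular mask applied to the attention scores before the softmax, so later tokens end up with zero weight. A minimal sketch, not the video's code:

import torch
import torch.nn.functional as F

scores = torch.randn(4, 4)                 # similarity scores for the 4 tokens of "my name is fred"
mask = torch.tril(torch.ones(4, 4)).bool() # True where a token may look: itself and earlier tokens
masked = scores.masked_fill(~mask, float("-inf"))
weights = F.softmax(masked, dim=-1)        # row 1 ("name") has zero weight on "is" and "fred"
print(weights)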
Hey Josh ! Would you mind making videos about graph neural networks ( GNN ) or graph convolutional network ( GCN ), and most importantly, the graph Attention Network ( GAT ) ? I have briefly gone over the maths these days, I already knew the matrix manipulation stuff but I think with your help, it would be much clear like your Transformer series, especially on the attention mechanism in the graph attention network ( GAT ), many Thanks 🙏🏻🙏🏻🙏🏻🙏🏻 appreciated !
I'll keep that in mind.
GNNs really only have two main elements to them, the aggregate function and the update function. The different choices of these two functions give rise to the different variants, GCN, GAT, etc.
Hi Josh! Thank you for your clearly explained work! You did a great job. I'm a big fan of you! I have a question about the embedding position. Here, you use two activation functions to encode the word to a vector, so that is why for the position embedding, you only use two sine and cosine to encode each word. Am I right? I found from the previous video: Transformer the foundation of chatGPT, you use four activation functions to encode each word, and later, you use four sine and cosine squiggles for the position embedding.
Each word embedding value needs a positional encoding value. So the number of sine and cosine squiggles that you use depends on the number of embedding values that you create for each token.
@@statquest Thank you for your explanation ! Have a nice day!
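A sketch of that pairing: for d embedding values per token you generate d positional-encoding values, alternating sine and cosine squiggles of different frequencies. The function below follows the standard "Attention Is All You Need" recipe; the sizes are arbitrary.

import torch

def positional_encoding(seq_len, d_model):
    """One sine/cosine value for every one of the d_model embedding values."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # token positions 0, 1, 2, ...
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even embedding indices
    freq = 1.0 / (10000 ** (i / d_model))                           # each pair gets its own squiggle
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

print(positional_encoding(seq_len=3, d_model=2).shape)    # 2 embedding values -> 2 positional values per token
print(positional_encoding(seq_len=3, d_model=512).shape)  # 512 embedding values -> 512 positional values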
Hi Josh, thanks a ton for making such a simple video on such a complex topic. Can you please explain what do you mean when you say "
Your comment is missing the quote that you have from the video. Could you retype it in?
@@statquest Yeah. Can you please explain - Note :- If we were training the Decoder-only transformer, then we would use the fact that we made a mistake to modify weights and biases. In contrast when we are just using the model to generate the responses, then it doesn't really matter what words come out right now.
Thanks for clear explanation
Glad it was helpful!
Another great session, thank you!!! Quick question, how do we decide what numbers to use for the Keys and Values?
For the weights? Those are determined with backpropagation: th-cam.com/video/IN2XmBhILt4/w-d-xo.html
Thank you, Josh, for yet another excellent video on GPT. I find myself slightly puzzled regarding the input and output used to train the Decoder-only transformer in your example. In the normal Transformer model, the training input would be "what is statquest <EOS>", and the output would be "awesome <EOS>".
However, in the case of the Decoder-only model, as far as I understand, the training input remains "what is statquest <EOS>", but the output becomes "what is statquest <EOS> awesome <EOS>". Could you help to clarify this? If my understanding is correct, I'm wondering how Decoder-only transformers know when to stop during inference, considering that there are two <EOS> tokens within the generated response.
Because the first <EOS> is technically part of the input, we just ignore it during inference. Alternatively, you could use a different token to indicate the end of the input.
What a wonderful video!!! BTW, When will you publish your CD? I will buy it too😄Thanks!
BAM! Thank you!
Hi there!
Came to YT in hope you had a nice video of Rank Robustness. Would be amazing, if you wanted to make a video about it!
Keep it up!
Also: nice Dinosaurs!
Thanks!
Another amazing video! So the fully connected layer basically maps two-dimensional vectors to 5-dimensional vectors, which, in this case, is the size of the vocab (collection of tokens). Is that correct?
Yes, exactly!
@@statquestGreat! I have another question: How does a fully connected neural network map vectors representing different features (dimensions) to vectors that represent the indices (dimensions) of tokens in the vocabulary?
@@harryliu1005 I'm not sure I understand your question since the video shows exactly how the fully connected layer works. If you're not already familiar with the basics of neural networks, check out these videos: th-cam.com/video/CqOfi41LfDw/w-d-xo.html and th-cam.com/video/83LYR-1IcjA/w-d-xo.html
Wow, this series culminating in a perfect explanation of GPT is the most magnificent piece of education in the history of mankind. Explaining the very climax of data science in this understandable step-by-step way so I can say that I understood it should earn you the Nobel Prize in education! I am so grateful that you never used linear algebra in any of your videos. Professors at university don't understand that using linear algebra prevents everyone from actually understanding what is going on but only learning the formula.
I have an exam in Data Science on Friday in a week. Can you make a quick video about spectral clustering by Wednesday evening? I will pay you 250$! :)
Thanks! If I could make a video on anything in a week, that would be a miracle. Unfortunately, all of my videos take forever to make.
thank you! the video is really nice
Glad you liked it!
Hello Josh! First of all thank you for this great video, as usual it's very simplified and straightforward.
However, I have a little question. I saw your videos on transformers and this one, but every time I feel like the output is already there waiting to be embedded and then predicted. I mean, why can't the answer be "great" instead of "awesome"? What were the probabilities given by the model for "great" and for "awesome" to make the final prediction? Here I gave the example of one extra word (great), but in real life it's the whole dictionary of words that can be predicted. So when generating the output, does it compute the "query" and "key" of the whole dictionary of words, and then hopefully the right word has the best softmax probability?
Thanks in advance for the clarification.
No, you only calculate the queries, keys and values for the input tokens and for the output as it is generated. However, in practice, instead of training on just a few phrases, we train on all of Wikipedia. As a result, the transformer can be much more expressive.
Hi Josh, excellent video. I only recently found you but your channel is amazing.
It looks like you inadvertently copied over the example value for "is" from the value for "what" starting around 19:25, and this continues forward through the video.
I mostly say this in case you ever adopt these notes directly into a book.
Thank you! I've corrected my notes and do, in fact, plan on including it in a book soon!
Thank you sir.
thanks!
I don't understand why we need the residual connections.... =''( isn't the word and position encoded values information already included in the masked self-attention values? or is most of the information lost so we need to directly add the word and position encoded values?
In theory you do not need them, but in practice they make it much easier to train large neural networks, since each component can focus on its own thing without having to maintain the information that came before it.
@@statquest Bam! Thank you!
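For anyone who wants to see the residual connection in code: it is just element-wise addition of the values that went into masked self-attention and the values that came out. The shapes below are made up.

import torch

x = torch.randn(3, 2)              # word embeddings + positional encoding for 3 tokens
attention_out = torch.randn(3, 2)  # what masked self-attention produced for those tokens

residual_out = x + attention_out   # the residual connection: plain addition
# the attention layer only has to learn what to ADD to each token's representation,
# instead of also having to re-encode the word and position information it received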
Thank you so much for another great video! I did have a question -- I'm confused about why you can train word embeddings with only linear activation functions because I thought that linear activation functions wouldn't allow you to learn non-linear patterns in the data, so why wouldn't you just not use an activation function at all in that case or use only one?
For word embeddings specifically, we want to learn linear relationships among the words. This is illustrated in my video on word embeddings: th-cam.com/video/viZrOnJclY0/w-d-xo.html And, technically, when coding a linear activation function, you just omit the activation function.
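A tiny sketch of why the "linear activation" amounts to a pass-through: a word-embedding network with one-hot inputs, no bias, and a linear (i.e., omitted) activation is exactly a table lookup, which is what an embedding layer does. The sizes and token ids below are arbitrary.

import torch
import torch.nn as nn

vocab_size, d_model = 5, 2
linear = nn.Linear(vocab_size, d_model, bias=False)   # weights = the embedding values being learned

one_hot = torch.zeros(1, vocab_size)
one_hot[0, 2] = 1.0                                    # the token with id 2, e.g. "statquest"
via_network = linear(one_hot)                          # weighted sum, no activation function applied

embedding = nn.Embedding(vocab_size, d_model)          # the usual way to write the same thing
embedding.weight.data = linear.weight.data.T.clone()   # same weights, stored transposed
via_lookup = embedding(torch.tensor([2]))

print(torch.allclose(via_network, via_lookup))         # True: identical outputs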
Josh, why at 19:34 is the similarity value for "Statquest" lower for itself than for the word "What"? Shouldn't it be larger for itself, since it is the self-similarity?
Unfortunately, this example is too simple to really show off the nuance of what the actual numbers represent.
@@statquest:( wak wak. Thanks though :)
Hello! First off, thank you for this great content.
I had a question (or two).
Could you give an example of how the embedding neural network is trained? I.e., what are the input and output of the embedding neural network during training? The neural networks I have worked with had problem statements along the lines of "given a set of pixels, determine whether the picture is a cat or not".. I do not know what the equivalent is with embedding neural networks.
And a follow-up question: can the embedding neural network be the same for an encoder-decoder model and a decoder-only model?
1) We don't train the embedding layer separately from the rest of the transformer. So the inputs are what you see here as well as the ideal outputs that we use for training.
2) Once trained, yes.
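As a rough PyTorch sketch of point 1) (toy vocabulary, with a single Linear layer standing in for everything that comes after the embedding), the embedding weights get their gradients from the same next-token loss as the rest of the model, so there is no separate embedding training step.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 6, 2

# the embedding layer is just the FIRST layer of the whole model;
# here a single Linear layer stands in for everything that comes after it
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)

tokens = torch.tensor([0, 1, 2, 3])    # made-up ids for "what is statquest <EOS>"
targets = torch.tensor([1, 2, 3, 4])   # the ideal output: the next token at each position

loss = nn.functional.cross_entropy(model(tokens), targets)
loss.backward()                        # the same gradients also update the embedding weights
print(model[0].weight.grad.shape)      # torch.Size([6, 2])
```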
I didn't really understand where the fully connected layer came from and how it is connected to the input values. It seems independent of the previous layers and somehow outputs the input words
The illustration at 26:53 may make things easier to grasp. The residual connections are used as inputs to the fully connected layer.
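A minimal sketch of that wiring (all sizes made up): the fully connected layer takes the residual-connection output, one row per token, and produces one value per word in the vocabulary, which the softmax then turns into next-word probabilities.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 6, 2, 3

encoded = torch.randn(seq_len, d_model)            # word + position encoded values (stand-ins)
attention_values = torch.randn(seq_len, d_model)   # masked self-attention output (stand-ins)

# the residual connection adds the two together...
residual_output = encoded + attention_values

# ...and THAT is the input to the fully connected layer,
# which has one output per token in the vocabulary
fully_connected = nn.Linear(d_model, vocab_size)
logits = fully_connected(residual_output)

next_token_probs = torch.softmax(logits, dim=-1)
print(next_token_probs.shape)  # torch.Size([3, 6]): one distribution per position
```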
After painstaking research on this topic, I have realized that all the information is not in the transformer itself. The transformer is just the neural network (i.e., the processor or function). The information is actually in the embedding model. The sequences are all stored there.
When training the transformer, the word embedding model needs to be fully trained or trained with the same data. This allows the learned embeddings to fit the transformer model. Here, overfitting is beneficial as the model will be a closed model after training. The transformer can be retrained (tuned) to your specific task. The main task should be text generation corpus/tuned for question and answer.
The learned embeddings model can actually generate data as well as the transformer neural network, making it a rich model. I also noticed that to get an effective word embeddings model, the optimization or "fine-tuning" should involve classified and labeled data. This allows clusters of similar terms to appear, as well as emotionally aware embeddings.
The transformer is essentially performing an "anything in, anything out" operation, forcing the input and output to converge regardless of whether there is any logical connection between them. Given an input, it produces the desired outcome. As many similar patterns are entered, it will fit this data using the masking strategy. Transformers need to be trained as masked language models to be effective, highlighting the drawbacks of LSTM and RNN models, which cannot produce the same results.
The more distant the output is from the input, the more layers and heads you need to deploy.
... wut?
@@iProFIFA I mean the most important area of focus should be the embeddings model as this is the part which stores the data after training ... The transformer is just the neural network function. ..
Also.... ChatGPT is primarily a "chat bot" so all inputs are fed to its intent detection module , i.e. classify the user intention ... to send the query to the correct model for a response ...
Right now the transformer is seen to be an "anything to anything" model .... As long as you give an input and target to the model and enough samples the "function (transformer)" can fit the data to the model ... Hence the first model to train is the text generation model to give it many possibilities to generate some form of text given a seed .... Then tune the model to your specific task , i.e. question and answer or the new "instruct" style model ...
i.e. : for a true code generation model , the first corpus should be text books , tutorials , forum posts , mass source repos .... This will give it a massive base to generate some form of code given a seed ... Then to pass in the instruct code models , i.e. build a binary tree , and provide the specified code , ... It can be ANY code language .... Because the model is a transformer , later we can add another fine tune , for specific code translation ... i.e. C# examples with their equivalent Python and Rust and JavaScript etc ... .. now this model can be retrained until it "overfits its domain model" .... By adding general language corpus and basic question answer modelling "after" , you can now ask questions and pursue project objectives ... Because this was originally trained with a code model its preferred output will be code ....
The sealed language model is what gets outputted .. this is also the issue , as all the data is in the embedding model you train it with ... So ... .
This embedding model is the most important , as neural networks can fit any numerical data ..
Which is the best model for word embeddings ... FastText , GloVe , skip-gram ? Are there others ? What's the best optimiser for these models ? Are they interchangeable ? ..
Finally , the transformer's successes of late are due to the deployment of more layers and heads .. this increase gives the perception of adding dimensionality (randomness / tweakable settings , i.e. all the weights and biases) , but this is due to attempting to use a single model to fit all tasks ... If you have models for code gen , and others for picture gen , and others for chat gen , then others for business and local domain knowledge , then it's better to have an intent model in front of all of these dedicated models , and direct the query to the appropriate models and take a softmax of the output to choose the correct output ! ... Again optimising the intent detector with the input and expected output and correct classification of the requested task ....
Hopefully I explained it a bit better this time ! Lol .
Weights and biases don't take up space (they are just a few numbers in a table) , it's the embedding model ..
@@xspydazx yo, your way of writing is really confusing and it seems like you went on a tangent so it’s hard to follow. Also you shouldn’t put spaces before commas and periods.
@@xspydazx are you talking about a mixture of experts model in your last comment bro??
Superrrb Awesome Fantastic video
Thank you!
At 25:42, when the model generates the wrong word, it will be fixed by backpropagation if this is the training process, and it will be ignored if this is the generation process, right?
Yes.
Dude, can you make a video on state space models like Mamba? It's super interesting!
I'll keep that in mind.
Bam! @@statquest
20:56 I think the value for the word "is" is miswritten; it should be 1.1, 0.9, not 2.9, 1.3, since it should not be the same as the value for the word "what". Thank you for your videos btw, your explanation is awesome.
That is correct. Sorry for the typo! :)
ChatGPT is still a dialog system at its heart and has many different models which it gets results from . It softmaxes the outputs according to the intent , ... So intent detection plays a large role in the ChatGPT response .. the transformers are doing major work .. it's super interesting despite battling away with VB.NET !
Hi! Thank you so much for your great video! I have some questions that I haven't understood..
1. How should I interpret query, key, and value, as in their definitions? I've been watching a lot of attention videos but I still don't get what they actually are and how to convert the input into the Q, K, V.
2. I'm doing research on forecasting stock prices using a transformer, but I still don't get how to do the embedding with numerical values as the input (every other video explains it with words as the inputs).. do you know how?
3. What is the attention's output shape? Is it a matrix or just a regular vector?
Thank you!
1) Query, Key and Value are terms that come from databases. What they represent is in the video. Is there a time point (minutes and seconds) that is confusing?
2) You don't need embedding if you start with numbers. The only reason we do embedding with words is to convert them to numbers.
3) A matrix.
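Putting 1) and 3) into a short sketch (single attention head, made-up sizes): the same word-plus-position encodings are multiplied by three different sets of weights to get the queries, keys and values, and the result of masked self-attention is a matrix with one row per input token.

```python
import torch
import torch.nn as nn

d_model, seq_len = 2, 3
encodings = torch.randn(seq_len, d_model)   # word + position encodings for 3 tokens

# three DIFFERENT sets of weights, all applied to the SAME encodings
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

queries, keys, values = W_q(encodings), W_k(encodings), W_v(encodings)

# similarities between every query and every key (full-size models also scale these)
scores = queries @ keys.T

# masked self-attention: each token may only look at itself and earlier tokens
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

attention_output = torch.softmax(scores, dim=-1) @ values
print(attention_output.shape)   # torch.Size([3, 2]): a matrix, one row per token
```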
Thank you for your answer! Here are some follow-up questions:
1) Yes there is, around minute 14, when you explain the query and key calculations.. I still don't get how we can multiply by one set of weights and get the queries, then multiply by another set of weights and get the keys.. what's the difference in what their weights represent?
2) Oh okay, but in some research they used an embedding to make the numbers smaller.. is that possible?
1) The different sets of weights allow for the queries to be different from the keys. (if we used the same set of weights for both, they'd be the same).
2) To be honest, I'm not sure I understand what it means to use embedding to make numbers smaller, but you could tokenize the numbers and use those as input to a word embedding layer.
@@statquest Okay, I'll look into it more later.. thank you for taking the time to answer my questions! 🫶
Great explanation! Btw, what is the manuscript that first described the original GPT?
I believe it is called "Improving Language Understanding by Generative Pre-Training"
I feel it's a bit misleading that it seems the tokens of the input sequence are fed in one by one, and that when you put in the first token, it predicts the second token but just ignores it, whereas in reality the entire sequence is fed in to predict the next target token, and on the next iteration you append the target token to the input sequence and predict the second target token, and so on. Right?
At 26:28 I state that each token in the prompt is processed simultaneously.
@@statquest gotcha. Thanks for the clarification, sensei.
Hello Josh, thank you very much for your videos, they are by far the most informative I have seen !
I had a question regarding the training of generative transformers:
Can a generative encoder-decoder transformer (one we expect to behave like GPT-3 or LLaMA) be trained with next token prediction?
Because from what I understand, for inference, to generate the output, we encode the input sentence, then we feed <EOS> to the decoder (embedding layer), then we get the prediction of the first token, which we re-feed to the decoder to generate the next token, until we get an <EOS>. So we get a sentence as output.
However, if the training was done with next token prediction, it means that given an input (sentence), we only try to predict the very next token, which means that we encode the input, we feed <EOS> to the decoder, we get the token prediction, and that's it. In that case, the decoder's embedding layer never sees tokens other than <EOS> in training.
So during inference, how could it comprehend tokens other than <EOS>? Maybe my assumption that the decoder only receives <EOS> during next token prediction pre-training is false.
To train an encoder-decoder we do something called "teacher forcing" which allows the network to predict the next token and then continue to predict all of the other tokens that are in the desired output, one token at a time. For details on how teacher forcing works, see: th-cam.com/video/L8HKweZIOmg/w-d-xo.html
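If it helps, here is a rough PyTorch sketch of teacher forcing with a toy stand-in model (the real decoder would of course include attention, residuals and the encoded prompt, and the token ids here are made up): the decoder is fed the known, correct output tokens, shifted by one position, so it learns to predict every token of the output in one pass rather than just the first one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 12, 4

# a toy stand-in "decoder": embedding + linear, just enough to show the shape
# of the training step (a real decoder has attention, residuals, etc.)
class ToyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.out(self.embed(tokens))

model = ToyDecoder()

# hypothetical token ids for a known, correct output, starting and ending
# with the special token used to initialize/terminate decoding
target = torch.tensor([1, 7, 8, 9, 10, 1])

# teacher forcing: the decoder is fed the KNOWN output tokens (not its own
# guesses), shifted so that position i is used to predict the token at i+1
decoder_input = target[:-1]
labels = target[1:]

logits = model(decoder_input)           # (5, vocab_size): one prediction per position
loss = F.cross_entropy(logits, labels)
loss.backward()
print(loss.item())
```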
@@statquest Thank you for your answer! If I understand correctly, it is thus impossible to train an encoder-decoder on next token prediction, as we need longer outputs. In your videos, we see an encoder-decoder which is trained seq2seq for a specific task, like translation. Is it possible to build a task-agnostic (like GPT-3) encoder-decoder by pre-training it seq2seq, with next sentence prediction for example?
And concerning task-agnostic decoder-only models (like GPT), is it because the encoder and decoder share the same structure and weights that it is possible to pre-train them with next token prediction? Because even if the decoder's embedding layer only sees <EOS> during training, the encoder's embedding layer sees many different tokens, and since they share weights, the decoder's embedding layer also learns.
When I say "task-agnostic model", I mean a generative model to which you can feed any prompt as input, and it will generate text as an answer, so it is not specific to any task.
So my question is about which tasks we can train these models on (like next token prediction or masked language modelling) so that they can be task agnostic.
Sorry if I'm not clear enough !
@@victorluo1049 I think we might be using different definitions for "next token prediction". To me, "next token prediction" can be applied to long outputs because we predict the output one token at a time given the preceding input and predictions. So whatever we predict, we feed it back into the model and then predict the next token. Thus, encoder-decoder and decoder or encoder only transformers all do "next token prediction". If you are using a different definition for "next token prediction", then you might come to a different conclusion.
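In loop form (with a toy stand-in model and made-up token ids), that definition of next token prediction looks like this during generation: whatever gets predicted is appended to the input and fed straight back in, one token at a time.

```python
import torch
import torch.nn as nn

vocab_size, d_model, EOS = 12, 4, 1

# toy stand-in model (a real decoder-only transformer has attention, etc.)
class ToyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.out(self.embed(tokens))

model = ToyDecoder()
tokens = [3, 4, 5, EOS]   # hypothetical prompt token ids, ending in <EOS>

# each new token is predicted from the prompt PLUS everything predicted so
# far, and the prediction is fed straight back in as part of the input
for _ in range(10):
    logits = model(torch.tensor(tokens))
    next_token = int(logits[-1].argmax())   # prediction for the NEXT token
    tokens.append(next_token)
    if next_token == EOS:                   # stop when a second <EOS> is generated
        break

print(tokens)
```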
@@statquest My definition of next token prediction is predicting token n+1 with tokens 1 to n as input.
For example, let's say we have tokens 1 to 10 and I use an input window of 3 tokens; then during training:
Sample 1. Input: tokens 1 to 3. Target: token 4. Details: we encode tokens 1 to 3 and feed them to the decoder; we also feed <EOS> to the decoder's embedding layer. Then the decoder outputs a token prediction, which we hope is equal to token 4. (The loss is probably calculated by comparing the probability distributions?)
Sample 2. Input: tokens 2 to 4. Target: token 5
Sample 3. Input: tokens 3 to 5. Target: token 6
Etc
So tokens 4, 5, and 6 are predicted, but separately. There is no mechanism for feeding an output (or the true value, if we use teacher forcing) back to the decoder to predict the next output. So here, during each training step, the decoder only receives <EOS> as input. That would be problematic during inference, as we are supposed to feed each predicted token (which is different from <EOS>) back to the decoder to predict the next one.
I may have a misunderstanding about this, but after reading the first GPT paper, it feels like this is basically how the training works; they wrote about maximizing the likelihood L(token n+1 | token 1, ..., token n).
I would have understood if, for inference, we fed the initial prompt to the encoder, got a token prediction, added it to the initial prompt, then re-fed it to the encoder to get the next token, etc... But after seeing your video I saw that it is not done by iteratively feeding the encoder, but rather by iteratively feeding the decoder, so I am a bit confused (maybe this is actually only true for models that are trained seq2seq?)
Great video. Do I understand correctly that the DNN responsible for word embedding not only converts the token into its representation as a numeric vector, but already predicts what the next word should be?
In a transformer, the embedding layer alone does not predict the next word, because it wasn't specifically trained to do that the way a stand-alone word embedding layer (like word2vec) would be.
@@statquest But if we train the whole model at the same time, then doesn't backpropagation change the weights of the network responsible for word embedding in such a way that they learn to predict the next word? Or do we not train this first network while learning?
@@JanKowalski-dm5vr It might. But the whole model, word embeddings and attention and everything, is trained to predict the next word, or translate, or whatever it's trained to do. So it's hard to say exactly what the word embedding layer will learn.
Please make a video on encoder-only transformers like BERT, onegaishimasu!
I'll try to do that. However, the only significant difference with encoder-only transformers is that they don't use masking when calculating attention.
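That single difference fits in a few lines (the similarity scores here are random stand-ins): the decoder-only model applies a causal mask before the softmax, while an encoder-only model just skips it.

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)   # stand-in query/key similarity scores

# decoder-only (GPT-style): causal mask, so each token only attends to
# itself and the tokens that came before it
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
decoder_attention = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

# encoder-only (BERT-style): no mask, every token attends to every other token
encoder_attention = torch.softmax(scores, dim=-1)

print(decoder_attention)
print(encoder_attention)
```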
In an encoder-decoder transformer, the encoder was trained in English and the decoder was trained in Spanish, which made it possible to do translations. But here, only English is used for both encoding and decoding, which makes it impossible to convert the English encoding into Spanish output. So here, would we use both language datasets combined to train the model to enable it to do translations as well?
Usually the tokens are just fragments of words, instead of entire words. This gives the decoder-only transformer more flexibility in terms of the vocabulary, since it can form new words it was never even trained on by combining tokens in new ways. In this way, you can train a decoder-only transformer to translate English to Spanish.
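As a toy illustration (this vocabulary is completely made up; real tokenizers such as byte-pair encoding learn their fragments from data), word fragments can be recombined into words the model never saw whole during training.

```python
# completely made-up subword vocabulary, just to illustrate the idea;
# real tokenizers (e.g. byte-pair encoding) learn their fragments from data
vocab = {"what": 0, "is": 1, "stat": 2, "quest": 3, "awe": 4, "some": 5, "ly": 6}

# "awesomely" was never seen as a whole word during training, but it can
# still be represented by combining fragments the model WAS trained on
tokens = ["awe", "some", "ly"]
ids = [vocab[t] for t in tokens]
print(ids)   # [4, 5, 6]
```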
One more question... why is there one common FC layer used for both decoding the prompt (predicting "statquest" given "what is") and generating the response (predicting "awesome" when given the EOS token and "what is statquest")? I would think there would be separate FC layers for the two, since one is predicting the next word.. the other is predicting the word in the middle?
If you use an encoder-decoder design, you can have different fully connected layers for the different parts of the input and output. However, they decided that this simpler model, with fewer parameters, worked better.
I didn't understand just one part: how are the weights to calculate Q, K and V for each word in the sentence calculated? Is it also an optimization process? If so, how is the loss function calculated?
At 5:08 I say that all of the Weights in the entire transformer are determined using backpropagation. Specifically, we use cross entropy as the loss function. For more details about cross entropy, see: th-cam.com/video/6ArSys5qHAU/w-d-xo.html and th-cam.com/video/xBEh66V9gZo/w-d-xo.html
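As a small sketch (the logits here are made-up numbers for a single position in a tiny vocabulary), cross entropy compares the fully connected layer's output to the known correct next token, and backpropagation then pushes gradients through everything upstream, including the query, key and value weights.

```python
import torch
import torch.nn.functional as F

vocab = ["what", "is", "statquest", "awesome", "<EOS>"]

# stand-in output from the fully connected layer for ONE position in the prompt
logits = torch.tensor([[1.0, 0.2, 3.1, 0.5, -0.3]], requires_grad=True)

# suppose the known, correct next token is "statquest" (index 2)
target = torch.tensor([2])

# cross entropy = softmax + negative log-likelihood in one step
loss = F.cross_entropy(logits, target)
loss.backward()   # gradients flow back through the whole transformer,
                  # including the query, key and value weights
print(loss.item())
```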
I am a bit confused about why we are encoding the input prompt and generating the next predicted word for each word in the input prompt. We don't use this information at all when generating the output part, right? For generating the output part, we just use the K, Q and V from the input prompt and continue from there? How are the two parts connected?
That is correct - we don't use the output until we get to new stuff. However, if we wanted to, we could use the early output for training (since we know what the input is, we can compare it to what the decoder generates).
Thank you, your video is great! But I'm really confused about the EOS token. Why does the model keep generating new words after generating the EOS token in the prompt? Shouldn't it just stop? What is the difference between the EOS token in the prompt and the one in the output?
I'm not sure I understand your question. After the input prompt, we insert an EOS token so that the decoder will be correctly initialized and then we generate output tokens until a second EOS is generated.
@@statquest Thank you for your reply, but most LLMs, such as LLaMA and GPT, do not use an EOS token to initialize the generation of the output.
@@txxie The versions I've seen do. And if they don't, then they presumably use some other token that fills the same role. So, you can use one special token for both, or you can use two. Either way works.
Incredible!
Thank you!
Damn, I have learned decoder and encoder models from start to finish, including training and deploying, but I never understood the math until you opened this Pandora's box. Now the sines and cosines and the queries, keys and values and everything else are flying around in my head.
bam?
Hey Josh! I need to solve a generation task using a decoder-only model.
How should I preprocess the corpus for this? I think that splitting each example into 2 parts and separating the parts with a <SEP> token is a good solution.
But I don't understand how to train this model and calculate the loss. The input for the model is tokens_first_part + tokens_second_part, and output[index of sep:] from the model is compared with input[index of sep:]?
I'll create videos on how to code transformers and decoder-only transformers soon.