- You can get the code here: github.com/StatQuest/decoder_transformer_from_scratch
- Learn more about GiveInternet.org: giveinternet.org/StatQuest NOTE: Donations up to $30 will be matched by an Angel Investor - so a $30 donation would give $60 to the organization. DOUBLE BAM!!!
- The full Neural Networks playlist, from the basics to AI, is here: th-cam.com/video/CqOfi41LfDw/w-d-xo.html
- Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
Can't imagine the work that goes into this: writing the code, making diagrams, recording, editing, and voice-over. You're the GOAT, big J.
Thanks!
he is well compensated
@@thomasalderson368 am I? Maybe it's relative, but hour for hour I'm making significantly less than I did doing data analysis in a lab.
@@statquest Sir, we love you and your work, please don't take such comments to heart! You may never meet us, but there is a generation of statisticians and data scientists who owe a lot to you, maybe all of it!
@@FindEdge Thanks!
You will be remembered for the next 1000 years in the history of statistics and data science. You should be named the "Father of Applied Statistics & Machine Learning". Please thumbs up if you are with me.
BAM! :)
Hey Josh, you know what? I used to watch your videos explaining the key ingredients of statistics EVERY DAY in 2020~2021 when I was a freshman. Whichever of your videos I clicked on, it was always the first time for me to learn that topic. I knew nothing. But I still remember what concepts you dealt with in the videos and how you explained them.
Fortunately, I now work as an AI researcher - it's been a year already - although I am only a third-year student. You suddenly came to my mind, so I've just taken a look at your channel for the first time in a long time. This time I already knew all of what you explain in the videos. It feels really weird. Everything is all thanks to you, and your explanations are still clear, well-visualized, and awesome. You are such a big help to newbies in statistics and machine/deep learning. Always love your work. Please keep it going!!! 🔥
Thank you very much! I'm so happy that my videos were helpful for you. BAM! :)
HUGE RESPECT for all the work you put into your videos
Thank you!
Josh, I want to express my sincerest gratitude. I have been following your videos for years and they have been becoming increasingly more important for my study and career path. You are a hero.
Thank you! :)
Sir, you deserve millions of views on your TH-cam ❤❤🎉
Thanks!
You said this was going to come out at the end of May. And I’ve been waiting for this for 2 months. Finally, it’s out 😂
I guess better late than never?
I’ve been trying to make a neural network in C++ for like a month now. I was trying to just use 3b1b’s videos, but they weren’t good enough. But then I found your videos, and I’m getting really close to being able to finish the backpropagation algorithm.
When I started, I thought it would look good on my resume, but now I’m thinking nobody will care. Still, I’m in too deep to quit.
good luck!
Sir, first of all, huge respect for your content. One more request: can you make a video on how to apply transformers to image datasets for different image-processing tasks, like object detection and segmentation?
The only thing is, teachers like you make this world more beautiful...
Thanks! I'll keep those topics in mind.
100/100 🔥 When I search for an explanation video on YouTube, this is what I expect 🔥
Thanks!
Wow - have been waiting for this one! Now that I've wrapped my head around word embeddings, time to code this one up! Thank you @statquest!
Bam! :)
Cool, I learn a lot from all of your videos, Josh! 🤯
Thanks!
This video's amazing, man. Not just this one, but every video of yours. Before I began actually learning machine learning, I used to watch your videos just for fun, and trust me, they taught me a lot. Thanks for your amazing teaching :) with love from India ❤
Great to hear!
@@statquest :)
It had been some time since I watched your videos. Very good video as always 🎉🎉
Thanks! 😃
I could watch your videos just to get cheered up by your intro song.
bam! :)
It is party time! Thanks for uploading!
You bet!
AMAZING VIDEOS. Watched your whole NN playlist in 3 days, and now, reaching the end, I have some questions. One: what are the future planned videos? And two: how do you select activation functions? In fact, a video where you create custom models for different problems and explain "why to use this" would be great. No need to explain the math or programming needed for that.
Thank you for all of these videos!
Thanks! I'm glad you like the videos. My guess is the next one will be about encoder-only transformers. I'm also working on a book about neural networks that includes all the content from the videos plus a few bonus things. I've finished the first draft and will start editing it soon.
Incredible video, Josh! Love your content. Can you please make a video on diffusion models?
I'll keep that in mind.
Thank you very much Josh! Bam @statquest
Finally, the greatly awaited video has arrived. Thank you.
Bam! :)
This will be awesome. I am trying to learn the math behind transformers and PyTorch so hopefully this helps give me some intuition
I've got a video all about the math behind transformers here: th-cam.com/video/KphmOJnLAdI/w-d-xo.html
Thanks for making it so easy to understand
You're welcome!
I'm gonna enjoy this one!
bam! :)
Great video! I like the way you teach!
Thanks!
Great and very didactic as usual, Josh!! Definitely going to wrap my head around this for a while and try a few tweaks! Do you plan on eventually also discussing other non-NLP topics like GANs and Diffusion Models?
One day I hope to.
So brilliant! Please create more from-scratch videos; I like them so much. Thank you!
Thanks! Will do!
Amazing explanation 🎉❤ you are the best 😊
Thank you! 😃
Hey... Josh, can you please make a Playlist on all the videos on probability that you've posted so far??? Please ❤❤
I'll keep that in mind. In the meantime, you can go through the Statistics Fundamentals in this list: statquest.org/video-index/
Today we learned that statquest is awesome. triple BAM!
Thanks!
Great job! Thanks a million!
Thanks!
I saw the title and right away knew that it is BAM. Can we expect some data analysis, ML projects from scratch?
I hope so.
Awesome video!
Maybe we can have a part 2 where we incorporate multi-head attention? 👌🏽
And then could make this a series on different decoder models and how they differ e.g., mistral uses RoPE and sliding window attention etc…
If you look at the code you'll see how to create multi-headed attention: github.com/StatQuest/decoder_transformer_from_scratch
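For anyone who wants the gist without opening the repo, here is a minimal sketch of multi-headed self-attention. It is not the exact code from the repository; the sizes, names, and masking are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=2, num_heads=1):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One set of Query, Key, and Value weights, split across the heads...
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, token_encodings, mask=None):
        n_tokens = token_encodings.size(0)
        # ...so each head gets its own (n_tokens, d_head) slice of q, k, and v.
        q = self.W_q(token_encodings).view(n_tokens, self.num_heads, self.d_head).transpose(0, 1)
        k = self.W_k(token_encodings).view(n_tokens, self.num_heads, self.d_head).transpose(0, 1)
        v = self.W_v(token_encodings).view(n_tokens, self.num_heads, self.d_head).transpose(0, 1)

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask, float("-inf"))    # e.g., for masked self-attention
        attention = F.softmax(scores, dim=-1) @ v                # (num_heads, n_tokens, d_head)
        # Concatenate the heads back into a single (n_tokens, d_model) matrix.
        return attention.transpose(0, 1).reshape(n_tokens, -1)
```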
God Bless You for the great work you do! Thank you so much
Thank you very much! :)
Thank you
I was in need of this 😊
Glad it was helpful!
Loved it!
Thank you very much
Thank you!
Hi Josh, this video really helped. Can you do one on diffusion models?
I'll keep that in mind.
Really helpful! Thanks
Glad it was helpful!
I really like your teaching
Thank you!
@@statquest I should thank you sir! I love watching your videos!
as always, wonderful content.
Thanks :)
Thanks again!
Thank you! You're the best!!!
You're welcome!
Thank you very much sir...💚
Thanks!
Thank you. You're a lifesaver when I need this to finish my school project. However, if the inputs contain varying numbers of tokens, do I add padding after the <EOS> token?
Yes, you do that when training a batch of inputs with different lengths.
@@statquest Thank you for your help. However, if I use zero padding and include zero as a valid token in the vocabulary, won't the model end up predicting zero - which is meant to represent padding - thereby making the output meaningless?
@@旭哥-r5b You create a special token for padding.
@@statquest And that token will still be used as the label for training?
@@旭哥-r5b I believe that is correct.
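For concreteness, here is a minimal sketch of that idea, assuming a made-up vocabulary with a dedicated "<PAD>" token; with ignore_index, the loss simply skips the padded positions, so the model is never trained to predict padding.

```python
import torch
import torch.nn as nn

token_to_id = {"what": 0, "is": 1, "statquest": 2, "awesome": 3, "<EOS>": 4, "<PAD>": 5}
pad_id = token_to_id["<PAD>"]

# Padded positions in the labels are ignored when the loss is calculated...
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)

labels = torch.tensor([2, 4, pad_id, pad_id])   # a padded target sequence
logits = torch.randn(4, len(token_to_id))       # one row of predicted logits per position
loss = loss_fn(logits, labels)                  # ...so "<PAD>" never contributes to training
```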
Please include this in your happy halloween playlist
Thanks! Will do! :)
@@statquest triple bam :)
🎉🎉🎉thank you😊
bam! :)
Thank you! Can you please explain how we can use transformers for time series?
I'll keep that in mind. But in the meantime, you can think of an input prompt (like "what is statquest?") as a time series dataset, because the words are ordered and occur sequentially. So, based on an ordered sequence of tokens, the transformer generates a prediction about what happens next.
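In other words, a tokenized time series can be framed exactly like the prompt. A tiny sketch with made-up token ids, where the labels are just the inputs shifted by one position:

```python
import torch

sequence = torch.tensor([3, 1, 4, 1, 5, 9, 2, 6])  # an ordered sequence of token ids
inputs = sequence[:-1]                              # everything up to the last value
labels = sequence[1:]                               # "what happens next" at each position
```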
Great video. Thanks
Glad you liked it!
never stop making videos, or else i'll track you down and make you eat very spicy chillies
bam! :)
Thank you very much!
TRIPLE BAM!!! Thank you so much for supporting StatQuest!!!
Thanks a lot for this free, wonderful content. ❤😊
Thank you!
Thanks!
TRIPLE BAM!!! Thank you for supporting StatQuest!
Finally completed. Took 1.5 months. God, I am so slow.
BAM! :) It took me over 4 years to make the videos, so 1.5 months isn't bad.
How about an encoder-only classifier to round off the series? Thanks!
I'll keep that in mind.
I want to make a sequence prediction model. How should I test the model? What can I use for inference/testing? (Not for natural language.)
I'm pretty sure you can do it just like shown in this video, just swap out the words for the tokens in your sequence.
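A minimal sketch of that kind of test loop, assuming a trained decoder-only model whose forward pass takes a 1-D tensor of token ids and returns one row of logits per position (names like `model` and `eos_id` are placeholders for whatever you trained):

```python
import torch

def generate(model, prompt_ids, max_new_tokens=10, eos_id=None):
    model.eval()                                # switch off dropout, etc.
    ids = prompt_ids.clone()
    with torch.no_grad():                       # no gradients needed at inference
        for _ in range(max_new_tokens):
            logits = model(ids)                 # (n_tokens, vocab_size)
            next_id = logits[-1].argmax()       # most likely next token
            ids = torch.cat([ids, next_id.unsqueeze(0)])
            if eos_id is not None and next_id.item() == eos_id:
                break                           # stop once the end-of-sequence token appears
    return ids
```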
you are the best!!! hooray!!!! 😊
Thanks!
Legend.
:)
🙏🏻🙏🏻🙏🏻🙏🏻🙏🏻 Please, please, please show us how to train the Q, K, and V weights in detail 🙏🏻🙏🏻🙏🏻🙏🏻🙏🏻
You showed us just a simple function call. But we are curious how it does the math, what gets trained, and how it changes the values of the weights.
Every single weight and bias in a neural network is trained with backpropagation. To learn more about how this process works, see: th-cam.com/video/IN2XmBhILt4/w-d-xo.html th-cam.com/video/iyn2zdALii8/w-d-xo.html and th-cam.com/video/GKZoOHXGcLo/w-d-xo.html
@@statquest Since the Q, K, and V weights are split and the calculations pass through operations that aren't ordinary neural-network layers, IMHO the backpropagation process is quite tricky. On the other hand, the fit function doesn't show the order of calculations at each node.
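To make the backpropagation reply concrete, here is a minimal sketch (tiny made-up sizes and a stand-in loss) showing that the Q, K, and V weights are ordinary nn.Linear layers: autograd tracks the matrix multiplications and the softmax, so a single backward()/step() updates them without any hand-written per-node order of calculations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 2
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
params = list(W_q.parameters()) + list(W_k.parameters()) + list(W_v.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

token_encodings = torch.randn(3, d_model)                  # 3 made-up token encodings
q, k, v = W_q(token_encodings), W_k(token_encodings), W_v(token_encodings)
percentages = F.softmax(q @ k.T / d_model ** 0.5, dim=-1)  # attention percentages
attention = percentages @ v

loss = attention.sum()   # stand-in for the real cross-entropy loss
loss.backward()          # gradients flow back through the matmuls and the softmax...
optimizer.step()         # ...and the Q, K, and V weights are updated
```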
I see that the two inputs have the same length... what would change if I wanted to train with another phrase, for instance "What awesome statquest" (4 tokens instead of 5)? How can I generate an input with torch.tensor when the inputs no longer have the same dimension?
It depends. If you want to train everything in a batch, all at once, you can add a "<PAD>" token and mask it out when calculating attention.
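A minimal sketch of building such a batch, assuming made-up token ids and a "<PAD>" id of 5: pad_sequence makes the shorter prompt the same length, and the mask records which positions are only padding.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

prompt_1 = torch.tensor([0, 1, 2, 3, 4])   # 5 token ids
prompt_2 = torch.tensor([0, 3, 2, 4])      # 4 token ids

batch = pad_sequence([prompt_1, prompt_2], batch_first=True, padding_value=5)
# tensor([[0, 1, 2, 3, 4],
#         [0, 3, 2, 4, 5]])
padding_mask = batch.eq(5)                 # True where a position is just padding
```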
thank you sm fr bro
Any time!
Optimus prime has been real quiet since this one dropped😬😬😬😬😬
:)
Respect
Thanks!
gold
Thanks!
Triple Bam!!!
:)
🎉
:)
love youuu
:)
Hi Josh,
Should the embedding weights be updated during training? For example, nn.Embedding(vocab_size, d_model) produces random numbers, and each token is looked up in the related row of the embedding matrix - should we update these weights during training? The positional encoding weights stay constant during training, so the only weights (besides the other parameters, of course, like q, k, v) that can change are the nn.Embedding weights.
I wrote code for translating amino acids to sequences.
Everything in training works well, with accuracy of 95-98%,
but in the inference stage I get bad results. I reload my model with
loading_path = os.path.join(checkpoint_dir, config['model_name'])
model.load_checkpoint(loading_path, model, optimizer)
but after the inference loop my result is like:
'tcc tcc tcc tcc tcc tcc tcc tcc tcc tcc tcc tcc tcc tcc tcc tcc tcc tcc ' :(
Even if we assume my algorithm has overfitted, we shouldn't get this result!
Also, I think other parameters like the dropout factor should not be applied in the inference stage (p=0 for dropout).
I mean, we shouldn't just reload the best parameters, we should also change some parameters (sorry, I spoke a lot :) )
The word embedding weights are updated during training.
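A minimal sketch of both points in this thread: nn.Embedding starts out random, but its weight is a regular trainable parameter, and calling model.eval() at inference switches dropout off (the equivalent of p=0), so you shouldn't have to change it by hand. The sizes below are made up.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=100, embedding_dim=32)
print(embedding.weight.requires_grad)   # True -> updated by the optimizer, just like q, k, v

model = nn.Sequential(embedding, nn.Dropout(p=0.1))
model.eval()                            # dropout now passes values through unchanged
with torch.no_grad():
    out = model(torch.tensor([1, 2, 3]))
```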
Bam!
Peanut Butter and Jaaam ;)
:)
BAM!!
Thanks Nosson!
Sir, can you include how to make the chatbot hold a conversation with the user?
I'll keep that in mind.
Let's start from the basics. ChatGPT is not a transformer. It's an application.
Yep, that's correct.
I'm confused as to why the values would come from the ENCODER when computing the cross attention between the Encoder and Decoder. Shouldn't the values come from the decoder itself?
So if I trained a model to translate from English to German, then wanted to switch out the German for Spanish, I'd expect the new decoder to know what to do with the output of the Encoder. But if the values are coming from the Encoder, then this wouldn't work.
The idea is that the query in the decoder is used to determine how a potential word in the output is related to the words in the input. This is done by using a query from the decoder and keys for all of the input words in the encoder. Then, once we have established how much (what percentages) a potential word in the output is related to all of the input words, we have to determine what those percentages are percentages of. They are percentages of the values. And thus, the values have to come from the encoder. For more details, see: th-cam.com/video/zxQyTK8quyY/w-d-xo.html
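A minimal sketch of that encoder-decoder (cross) attention, with made-up shapes: the queries come from the decoder, while the keys and values both come from the encoder's output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 2
W_q = nn.Linear(d_model, d_model, bias=False)   # applied to the decoder's tokens
W_k = nn.Linear(d_model, d_model, bias=False)   # applied to the encoder's output
W_v = nn.Linear(d_model, d_model, bias=False)   # applied to the encoder's output

encoder_out = torch.randn(4, d_model)           # encodings for the 4 input words
decoder_x = torch.randn(3, d_model)             # encodings for the output so far

q = W_q(decoder_x)
k, v = W_k(encoder_out), W_v(encoder_out)
percentages = F.softmax(q @ k.T / d_model ** 0.5, dim=-1)  # how much each output token relates to each input word
cross_attention = percentages @ v                          # ...and those percentages are percentages of the encoder's VALUES
```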
🎉🎉🎉
Triple 🎉!
22:50 Hey Josh, you assigned 4 as the number of tokens, but we have 5 tokens (including <EOS>); even in the diagram, as you are pointing, there are 5 boxes (representing 5 outputs)... I got confused.
And you know what? Words fail me when I try to say how much you have affected my life... so I won't say anything 😂
See 26:46. At 22:50 we just assign a default value for that parameter; however, we don't use that default value when we create the transformer object at 26:46. Instead, we set it to the number of tokens in the vocabulary.
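A minimal sketch of that distinction, using a simplified stand-in for the class in the video (only the relevant parameter is shown): a default value in __init__ is only used when nothing else is passed at construction time.

```python
import torch.nn as nn

class DecoderOnlyTransformer(nn.Module):
    def __init__(self, num_tokens=4, d_model=2, max_len=6):   # 4 is just a default value
        super().__init__()
        self.num_tokens = num_tokens

model = DecoderOnlyTransformer(num_tokens=5)   # the default is overridden with the real vocabulary size
print(model.num_tokens)                        # 5
```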
amazinggg
Thanks!
Reply with :) if you think StatQuest is fully hydrated while recording these.
really excited for the book btw
bam! :)
Bam!
:)
Horray!
:)
I have imported a torch. Do I light it now?
:)
GTP :)
Corrected! ;)
What is that extra "import" at line 2, @1:37?
That's called a typo.
Triple Bam :)
:)
Baaaam!❤
:)
Wish you could be Prime Minister of the United Kingdom!
Ha! :)
In the very first slide the imports are broken, at th-cam.com/video/C9QSpl5nmrY/w-d-xo.html
`import torch.nn as nn import` # there's an extra trailing import here.
Yep, that's a typo. That's why it's best to download the code. Here's the link: github.com/StatQuest/decoder_transformer_from_scratch
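For anyone retyping from the slide instead of downloading the repo, the intended line (minus the stray trailing word) is presumably just the standard import:

```python
import torch.nn as nn
```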
Damnn bro 😮😮😮😮
:)
GTP
Corrected! :)
Ya misspelled ChatGPT - Generative Pre-trained Transformer
Corrected! :)
From scratch in PyTorch, huh.
I decided to skip doing it in assembly. ;)
ARTIFICIAL NEURAL NETWORKS ARE AWESOMEEEEEEEEEE🔥🔥🔥🔥🦾🦾🦾🗣🗣🗣🗣💯💯💯💯
bam! :)