Luis Serrano, this set of 3 videos explaining how LLMs and transformers work is truly the best explanation available. Appreciate your contribution to the literature.
Best playlist on transformers and attention. Period. There is nothing better on TH-cam. Goated playlist. Thank you so much!
This is what you get when the teacher is concerned with actually imparting knowledge and learning as opposed to showing off how sophisticated their own knowledge is. I am incredibly appreciative of his humility and passion. An example to follow in my own life.
What an awesome series of lectures. I spent 10 years teaching undergraduate-level Artificial Intelligence and NLP courses, so I can really appreciate the skill in breaking down and demystifying these concepts. Great job! I would say the only thing missing from these videos is that you don't really cover how the learning/training process works in detail, but presumably that would detract from the focus of these videos, and you cover it elsewhere.
Thanks! Never came across anyone explaining anything in such a great detail, you are amazing !!!
@anupamjain345, thank you so much for your really kind contribution, and for your nice words!
Thank you Luis! This series is THE BEST explanation I've seen on attention and transformers. Your gift of clear explanation is truly a gift to those of us seeking knowledge.
This video is incredible. I've been looking for material to help me understand this mess for quite some time now, but everything about this video is perfect: the tone, the speed of speech, the explanations, the hierarchy of knowledge... I'm screaming with joy!
I've seen many videos on transformers, but this series is the first where I understood the topic at a deep enough level to appreciate it.
This is the BEST explanation ever you can find on the internet. I'm serious
Agree, definitely the BEST, as always ;) Luis Serrano
In the previous video you said you would explain how to compute the Q, K, and V matrices in this one. But I don't see it covered here.
Luis has a talent for breaking down complex problems into simple steps and then building the whole thing back up so that ordinary people can understand.
Thank you for this fantastic video series on Transformers! The first two videos were particularly enlightening. I'm fascinated by how the query, key, and value vectors evolve before each attention module. It would be wonderful to gain a deeper understanding of the encoder-decoder architecture, particularly why the first attention belongs to the encoder while the subsequent ones are part of the decoder. Also, I'm intrigued by the visualization of linear transformations at each step during training, especially when outputs are recycled back into the decoder. Eagerly awaiting more insights!
25:17 Hello Luis, *Positional Encoding* enables a Transformer to handle multiple words in parallel, since the ordering of the words is built in through PE. Before transformers, recurrent models such as Long Short-Term Memory (LSTM) were slow because words had to be processed serially, since order and sequence matter. In addition to Attention, PE was a big innovation in Transformer models, making them highly performant and so successful today.
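For readers who want to see this concretely, here is a minimal sketch of the sinusoidal positional encoding from the original Transformer paper; the array names and sizes are just illustrative, not taken from the video.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as in 'Attention Is All You Need'.

    Returns an array of shape (seq_len, d_model) that gets added to the
    word embeddings, so every position carries a unique, order-aware signal.
    """
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions use cosine
    return pe

# Example: 6 tokens with 16-dimensional embeddings, all processed in parallel.
embeddings = np.random.randn(6, 16)
encoded = embeddings + positional_encoding(6, 16)
```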
Very clear. You are a natural teacher
Such a wonderful video! It lays quite a good foundation, and while watching the 3 videos a lot of questions occurred to me, e.g.: 1) number of heads vs. accuracy/computation/space, 2) what's the trainable size of Q, K, V, 3) for the whole framework, what are the data shapes of all the intermediate steps, and how do we perform back-propagation, and so on.
Then, after watching the video, you can proactively go search and find those answers, and your level of understanding gets boosted even further!
That's the good thing about this video: it takes a decent amount of effort to understand, but it doesn't act as a show-stopper (which I'd say the paper "Attention Is All You Need" does). It starts you off, and you can pick up more along the learning journey. A big thanks to Serrano!
Absolutely worth the watch! The clarity in Luis's explanation truly reflects his solid grasp on the content.👍
The set of 3 videos is excellent. I could understand the transformer architecture very well with visuals. Kudos to Luis Serrano !!
All my friends and I have watched every single one of your videos
Great content, appropriate sentences and...
really great
Keep making great videos 🙌
I feel super lucky to have come across your videos. Normally I don't comment, but I saw the 3 videos of the series and I'm amazed at how you explain complicated topics. Your efforts are highly appreciated.
Thanks!
Thank you so much for your kindness! Very appreciated. :)
Hard to believe that someone came up with better content on transformers than 3B1B
A clear, concise and conclusive way of explaining Transformers! Congrats and Thank you so much for sharing it!
What an excellent series on Transformers, really did the trick!!! The penny has finally dropped. Thanks very much for posting; this is very useful content. I wish I had come across this channel before spending 8 hours doing a course and still not understanding what happens under the hood.
That's fire! One quick thought I had about fine-tuning: a good model should be able to respond to human behavior really well, and in the way the business wants! For example, the model should not simply respond to someone sad with "oh that's great, keep crying!", but you never know what the majority of the web content may be like! At the same time, this is something to be very cautious about from a user perspective, because businesses can easily manipulate the model to push national, business, or other motives that may be unclear to us!
Amazing series on Transformers. I had never imagined the true rationale behind Q, K, V... it actually became clear after watching your video. Thanks a lot.
This is a fantastic explanation of transformers!!!
Finally, the 3rd video of the series, and as usual with the same clarity of concepts as expected from Serrano. The way you have presented these esoteric concepts has produced pure gold. I have been following you and Jay from Udacity, and you guys have made a real contribution in explaining a lot of this black magic. Any plans to update the Grokking series...?
Amazingly clear and encouraging to learn more. Thanks, maestro!
Liked the video while watching, at 7:15. Crystal-clear explanation. Good job, thank you Serrano, and I appreciate your work.
This is the best video I have watched about the subject; it explains the concepts in such a way everyone can understand them! Thumbs up and subscribing right now!
This was such a thorough and precise explanation. And the visualizations! Great job. We would like to see more videos like this, particularly about other generative models such as those used for image and video generation, e.g. Canva.
The best explanation available
Delighted to learn Transformer from my mentor today! Thanks for a very clear overview!!!
Thank you Mr. Serrano, it was very educational and lectured in a very good way. In relation to positional encoding, the example with the arrows gave me the idea that the purpose of this stage is to make sure that only the correct positions of words in the sentence cluster together, while the incorrect ones diverge, so the neural network can distinguish their positions in the sentence during training.
The best explanation I found
Dr. Luis - Thank you for taking all the effort to create these 3 videos. Explaining complex things in the simplest way is an art! And you have that knack! Great job! I have been following your ML videos for years now and I always enjoy them.
PS - Funny enough, while typing this comment, I'm being prompted to select the next predicted word! 🙂
The best video that I have watched on Transformers. Very clear explanation.
Seriously some of the best videos on the topic. Thank you!
You have a skill for teaching! Thanks so much for this series.
Your explanations are truly great! You have even understood that you sometimes have to ‘lie’ first to be able to explain things better. My sincere compliments! 👊
Thanks for such a wonderful explanation of a complex model. If possible, please make videos on Vision Transformers as well.
Thanks... for the first time (not that I've gone through a lot of them :)), I was able to appreciate how the different layers of a neural network fit together with their weights. Thanks for making this video with the example used.
Best video I've come across that explains concepts simply. Helped tremendously in my learning endeavor to create a mental model for neural networks (there's a joke there somewhere).
Thanks! Lol, I see what you did there! :)
Beautifully explained. Loved how you went ahead to also teach a bit of the pre-requisites!
As always amazing content. We need a book from Luis on GenAI!
Very nice stuff! This is the first time somebody has explained clearly what large language models are. Especially the second video was very valuable for me!
this is amazing, it deserves 10M views.
great explanation. perhaps one of the best videos
Incredible - just found this channel and I am about to pore over all the videos. Thank you so much for your effort.
For word2vec, my understanding is that the embeddings are the weights from the input layer to the hidden layer, not the hidden layer itself as you mentioned at 24:57?
Excellent presentation. Waiting to see more videos like this. I would request you to make a series about aspect based sentiment analysis. Best wishes...
You're the greatest teacher who ever lived in the history of mankind. Can you please do more videos regularly?
Thank you so much! Yes I'm definitely working hard at it. Some videos take me quite a while to make, but I really enjoy the process. :)
If you have suggestions for topics, please let me know!
This is VERY GOOD. I also watched several other videos and this one helped the most. The one thing I don't understand is how the system decides that there are two apples. I felt that one 'apple' would be pulled back and forth between 'phone' and 'orange'.
Finally! I have been waiting for it for a month. Thank you a lot.
The great Luis! I am recommending you in my job posts; your content is a prerequisite before working for us.
Wow thanks Samir, what an honor! And great to hear from you! I hope all is well on your end!
@SerranoAcademy The honor is mine! You are the artist who goes inside it, sees the wiring and connections, and delivers them as seen to all people. That's the job of prophets and saints. Bless you.
I am doing great, plenty of text and image processing currently :) digitizing the undigitized!
Thank you for these really high quality videos and explanations.
Deep respect, Luis Serrano! Thank you so much!
Hi Luis, excellent material, and you know how to deliver it to perfection. Thanks a lot. Could you please explain a bit more about positional encoding, and how the residual connections, layer normalization, and encoder-decoder components fit into this very same example?
thank you for this series - wonderfully explained. 💯
Finally, I managed to understand the concept clearly. Thanks!
excellent -- clear and concise explanations
Your videos are gold.
Well illustrated! Thanks for sharing.
Thanks a lot man!!! You did a fantastic job explaining these concepts 🙂
Finally, I waited until this video was released.
8:53 Do you think it's a plain feed-forward neural network, or something like an RNN, an LSTM to be specific? Just a thought.
Emmy Noether did not "invent" abstract algebra. It is very disconcerting that when I googled that question as a test I came up with, in great big capital letters, "NOETHER..." etc. Once these models are relied upon, we are in big trouble, because such errors will be embedded in the embeddings, and so on. Of course this is really peripheral to the topic at hand, and I very much appreciate your excellent job at teaching the basic ideas of transformers. Thanks very much!
Great video, speaking from Abuja, the capital of Nigeria.
ohhhh greetings to Abuja!!! Nigerians are the kindest people, I hope to visit sometime!
Finally, the 3rd video is here 😮😅 Thank you, sir.
I have been waiting for this video since last month; every day I checked your channel for the 3rd video, sir. Thank you so much, sir. You're doing great work 👍
This series was great! Appreciate all the time and effort you've put into them, and laid out the concepts so clearly 🙏🙏
Thank you for explaining this so well.
Very good playlist
Great video, hugely appreciated, thank you Luis! 🙏
24:59 - Again... who defines these layers and the network that sets the word vectors, and how? That's what it all comes down to. How do we know that cherry and apple have similar 'properties'?
Excellent explanation. Thank you.
This is a great video, clarifying a number of concepts. However, I still can't find answers to some of my questions. E.g., in this video, when the user enters "Write a story.", these are 4 tokens. But the "model" spits out a NEW word, "Once". Where is this NEW word coming from? How does the "model" even "KNOW" about such a word? Is it saved in some database/file? Is there a dictionary of ALL the words (or tokens) that the "model" has access to? And I guess the other question is: what does "training a model" actually mean - on the ground, not just conceptually? After training, is the end result some data/words/tokens/embeddings that are saved in some file that the "model" "reads/processes" when it is used later on? What are parameters? I have watched several hours of videos, but have not found answers to these questions! Thanks for any help from experts!
Thanks, great questions!
Yes, there is a database of tokens, and what the model does is output a list of probabilities, one for each token. The ones with high probability are the ones that are very likely to be the next in the sentence. Then one can pick a token at random based on these probabilities, and very likely you'll pick one that has a high probability (that way, the model will not always answer the question in the exact same way, but will have variety).
The training part is very similar to a neural network. It consists of updating the weights so that the model does a better job. So for example, if the next word in a sentence should be "apple", and the model gives "apple" a very low probability, then the backpropagation process updates the weights so that the probability of "apple" increases and all the other ones decrease.
The parameters are the parameters of the neural network + the parameters of the attention matrices.
If you'd like to learn more about neural networks and the training process, check out this video: th-cam.com/video/BR9h47Jtqyw/w-d-xo.html
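In case a concrete picture helps, here is a tiny sketch of the "pick a token at random based on these probabilities" step described above; the vocabulary and scores below are made up purely for illustration.

```python
import numpy as np

# Hypothetical scores (logits) a model might assign to each token in its
# vocabulary after reading "Write a story."
vocab = ["Once", "There", "The", "banana"]
logits = np.array([3.1, 2.4, 1.9, -2.0])

# Softmax turns the scores into probabilities that sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sample the next token: high-probability tokens are chosen most often,
# but not always, which is why answers vary from run to run.
next_token = np.random.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```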
These videos are great!! I would love to see one about the intuition of cross attention in, for example, the context of translation between two languages.
Thanks, great suggestion!
Great work as always, thank you keep them coming
You're an absolute legend.
This is a fantastic video, by far the best on YouTube. My only feedback would be that the guitar music you use between chapters is a little abrasive and can take you out of the learning process. Maybe some calmer, more thought-provoking music, along with more interesting title cards, would be better.
Thanks for the video! I particularly liked the previous video about attention, super nice explanation!
However, I thought most transformers simply use a linear layer that is also trained to create the embedding instead of using a pre-trained network like word2vec.
Clearly the best & clearest explanations. One question: after the "write a story." prompt, when the Attention block selects possible words such as Once or There, does it go through the ENTIRE vocabulary to do so? Doesn't that take a very long time? Are there shortcuts to make it more efficient? Thx
Great question! Yes, transformers do go through all the words, but in a really efficient way. First of all, matrix multiplications are very optimized and parallelized, so these are done very fast. Also, sometimes shortcuts are applied, like taking the top-k maximum, in order not to look at the whole vector of all words each time. You may lose a word here and there, but it goes much faster.
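A rough sketch of that kind of top-k shortcut, under the assumption that only the k highest-scoring tokens are kept before sampling; the function name and sizes are illustrative, not from any particular library.

```python
import numpy as np

def top_k_sample(logits, k, rng=None):
    """Keep only the k highest-scoring tokens, renormalize, and sample one.

    This avoids building a full probability distribution over the whole
    vocabulary, at the cost of occasionally dropping a plausible word.
    """
    if rng is None:
        rng = np.random.default_rng()
    top = np.argpartition(logits, -k)[-k:]     # indices of the k best scores
    scores = logits[top]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return top[rng.choice(k, p=probs)]         # index into the full vocabulary

logits = np.random.randn(50_000)               # scores over a large vocabulary
token_id = top_k_sample(logits, k=40)
```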
I like the attention explanation
Where does context length come in? Why can some models have a longer context than others?
Also, the third video shows the attention blocks in the neural network, but I don't see how that layer implements K, Q, V.
I am happy like a zygote about this video!
Great work, thanks a lot!
Thank you very much for all your videos! What software do you use for your presentations? It all looks really nice, all the pictures…
Great video, thanks
Your explanation of word2vec was not fully clear to me. It seems like a catch-22 scenario: you use a trained network that guesses the next word to determine which embeddings to use at the beginning of the process, but to know which embeddings to use you need a trained network that can guess the next word at the end of the process. What did I misunderstand?
Your videos are rocking as always. Hey, do you have any remote internship opportunities in your team or in your organisation? I would love to learn and work with you guys.
Thank you so much! Yes we have internships, check them out here! jobs.lever.co/cohere
Great great video! Thank you
Could you describe the feedforward step more, with respect to the variable input length? Is it an RNN?
fantastic videos !!
Man, you are the best!
Sir, thank you for a very clear and informative series of presentations. Excellent job!
May I ask something about embeddings, or word2vec? How is a NN trained on words in order to cluster them into some kind of similarity groups in a multidimensional vector space? Is this training process guided, or is it like a self-organizing map process?
Nice, wow! But please, I still have a question: you didn't mention how words with similarities are placed close together in the embedding. I know that afterwards we assign the attention-mechanism scores, but I don't get whether the embedding is a separate neural network, as in the video.
That closeness is achieved automatically in the end result because it’s more efficient. It isn’t something that the human designer plans for.
Yes, great question! The idea is to train a neural network to learn the neighboring words to a particular word. So in principle, words with similar neighbors will be close in the embedding, because the neural network sees them similarly. Then the embedding comes from looking at the penultimate layer in the neural network, which has a pretty good description of the words. So for example, the word 'apple' and the word 'pear' have similar neighboring words, so the neural network would output similar things. Therefore, at the penultimate layer, we'd imagine that the neural network must be carrying similar numbers for each of the words. The embeddings come out of here, so that's why the embeddings for 'apple' and 'pear' would be similar.
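To make this concrete, here is a toy sketch of training a tiny network to predict a word's neighbor and then reading the embeddings off the learned weights; the corpus, sizes, and learning rate are made up, and real word2vec training uses larger context windows plus tricks like negative sampling.

```python
import numpy as np

# Tiny made-up corpus; "apple" and "pear" show up in similar contexts.
corpus = "i eat an apple i eat a pear i use a phone".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                        # vocabulary size, embedding size

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input -> hidden weights (the embeddings)
W_out = rng.normal(scale=0.1, size=(D, V))  # hidden -> output weights (next-word scores)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Train the network to predict the next word from the current word.
for _ in range(500):
    for w, nxt in zip(corpus[:-1], corpus[1:]):
        h = W_in[idx[w]]                     # hidden layer = this word's embedding row
        p = softmax(h @ W_out)               # predicted distribution over next words
        grad = p.copy()
        grad[idx[nxt]] -= 1.0                # gradient of cross-entropy at the output
        grad_h = W_out @ grad                # gradient flowing back into the hidden layer
        W_out -= 0.1 * np.outer(h, grad)     # backpropagate into both weight matrices
        W_in[idx[w]] -= 0.1 * grad_h

# Words with similar neighbors end up with similar rows, so their embeddings are close.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("apple vs pear:", cos(W_in[idx["apple"]], W_in[idx["pear"]]))
print("apple vs phone:", cos(W_in[idx["apple"]], W_in[idx["phone"]]))
```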
@SerranoAcademy Thanks for clarifying. I got confused because I had pictured this as one huge neural network composed of multiple layers of smaller neural networks, where the first one is the embedding layer rather than a separate one. But generally everything makes sense now, no matter the design.
Does the whole "Once upon a time" already built gets fed again into the whole process again as a 'input' in order to get attached its next word/token?
In order words, is it like a cycling again and again untill a "seemengly complete" answer is generated?
If this is the case, it would be a whole lot of inefficiency and explains why so much electricity is consumed!!
Please answer this crucial detail.
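For what it's worth, here is a minimal sketch of the loop this question describes, with a random stand-in for the real model; in practice systems also cache the computation for earlier tokens, so each new token is much cheaper than redoing everything from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Once", "upon", "a", "time", "<end>"]

def toy_model(tokens):
    """Stand-in for the real transformer: given everything generated so far,
    return a probability for each word in the (tiny, made-up) vocabulary."""
    logits = rng.normal(size=len(vocab))
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)                     # re-run the model on the whole text so far
        nxt = vocab[rng.choice(len(vocab), p=probs)]  # pick the next token from the distribution
        if nxt == "<end>":                            # stop once an end-of-text token is produced
            break
        tokens.append(nxt)                            # ...otherwise append it and loop again
    return " ".join(tokens)

print(generate(["Write", "a", "story", "."]))
```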
Ooh, thanks for mentioning Emmy Noether
Yayy!!! Huge fan of Emmy! :)
Excellent video. It would be good to indicate what the encoder-decoder model is in transformers; I couldn't figure that out here.
Thanks! Yes that's something I'm trying to make sense of, perhaps in a future video. In the meantime, this blog post is the best place to go for that:
jalammar.github.io/illustrated-transformer/
While training the model, if it gives a wrong answer, how is it corrected?
I like your videos. Can you post a quiz after each of your videos?