This appears to be a distillation of the most important concepts in large language models today. Thanks for the exposition.
Extremely high entropy video. Amazing clarity, delivery, content, and follow. Pure genius!
I found this to be an incredibly unique and interesting approach to explaining LLMs, an excellent introduction, thank you so much for the video!
This is a great modern supplement to Karpathy's guide to language models! Thanks Sasha! Just subbed
Knowledge/sec in this video is off the chart, and the info is cutting edge!
Excellent presentation! Easy to follow and tons of great material including the links to the slides
Thank you for making this video so interesting with those nice graphics and examples. I need to sit down and watch it attentively.
For someone like me who is new to this field and wants to understand the nitty-gritty of language models, it's necessary to see each part separately, understand it first, and then move on to the next part. Still, I can sense how fantastic this explanation is for those who already have a basic understanding of deep learning.
Amazing content, thanks for putting this together!
Thanks a lot Prof. Rush for this material.
Thanks for the video, a good high-level overview. I also like the Excalidraw slides.
This is very insightful. Thanks for posting!
This was a wonderful video, thanks so much for it!
Great complement to Karpathy's video
amazing video!
Great video!
Very nice talk
Excellent talk!! Will recommend to all my coworkers.
amazing video!
Awesome! 🙌
So good!
I'm perplexed.
Hey Sasha, what tools do you use to make your presentations? They are so different from typical academic presentations :)
Thanks for this awesome explanation! Can someone explain one point to me? The issue with argmax at 22:15 is that it has no derivative, so the neural network's parameters cannot be trained through it. If I understand correctly, the argmax picks the word that should be "attended to" when predicting the next word (park). Why is argmax the desired function here? What if the prediction of the next word depends not on the single most important word, but on the two most important words in the context? In that case, doesn't softmax have an additional benefit over the "naive" argmax, in that it can also represent distributions with more than one mode?
This is a good point. One detail I didn't mention is that at each layer there are multiple "heads", each with a different query, so even with an argmax you would still get to select multiple words per layer. But even so, your point is fair that there may be other advantages to softmax besides easier learning (a small sketch of the difference is below this thread).
That makes sense. Thanks for your helpful reply!
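For anyone following this thread, here is a minimal numpy sketch of the difference being discussed. It is not from the video; the scores and values are toy numbers chosen for illustration. It contrasts hard (argmax) attention, which commits to exactly one word, with softmax attention, which produces a smooth, differentiable distribution that can put substantial weight on two words whose scores are close.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: one query attending over four context words.
# "scores" stand in for query-key dot products (made-up numbers);
# one-hot "values" make the attention output easy to read.
scores = np.array([2.0, 1.9, -1.0, -2.0])   # two words are almost equally relevant
values = np.eye(4)

# Hard (argmax) attention: pick exactly one word, no gradient through the choice.
hard_weights = np.zeros_like(scores)
hard_weights[np.argmax(scores)] = 1.0
hard_output = hard_weights @ values          # only word 0 contributes

# Soft (softmax) attention: a distribution over all words, differentiable in the scores.
soft_weights = softmax(scores)
soft_output = soft_weights @ values          # words 0 and 1 both contribute heavily

print("hard weights:", hard_weights)                 # [1. 0. 0. 0.]
print("soft weights:", np.round(soft_weights, 3))    # approx. [0.507 0.459 0.025 0.009]
```

The soft weights show the point from the question: when two context words matter, softmax can split attention between them, whereas argmax is forced to choose one (per head).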
At 32:41, isn't each element of AB a row of A multiplied with a column of B? Waiting for your answer.
Yes, this is a bug, sorry about that! (The correct rule is sketched below.)
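To make the corrected rule concrete, here is a tiny numpy check using made-up 2x2 matrices (not the ones from the slide): element (i, j) of AB is the dot product of row i of A with column j of B.

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[5., 6.],
              [7., 8.]])

AB = A @ B   # [[19., 22.], [43., 50.]]

# Each element (i, j) of AB equals row i of A dotted with column j of B,
# e.g. AB[0, 1] = 1*6 + 2*8 = 22.
for i in range(2):
    for j in range(2):
        assert np.isclose(AB[i, j], A[i, :] @ B[:, j])

print(AB)
```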
Was the narration generated? I would love to use the same technique for narrating text.
Well, every output must be mathematically provable, so can we not build a formula for every pattern of output? Say it outputs the human sense and the grammar sense of each word it constructs. While it constructs the output, can it not also show how it did it?
WOOHOO! just found this channel. it is almost better than porn. how do we give you our money so you keep making videos? pls tell us :o