I didn't expect to finally understand transformers in this generative music course. I had watched lots of other videos about transformers but still found them really confusing. I started this course because I'm interested in generative music, so understanding transformers is just a bonus. I will definitely recommend this series to my classmates. Thank you!
This is such a generous and empowering resource. Massive thanks!
I can probably say that this video is the best on all of YouTube on this topic. I searched a lot and all I found were very superficial courses.
Great job.
Thank you :)
Good video, and good explanations of query, key and value matrices with analogies!
This video just saved my ass as I was having a hard time understanding transformers for my work assignment to train a transformer model for audio classification. Thank you!!
Amazing!
Excellent explanation in a very lucid fashion. It was really helpful!
Mad value in this video. You are such a good expositor.
Thank you!
Amazing video, I would like it a thousand times if I could!
Superbly presented!!
Great work you're doing here, Valerio. Really appreciated!
Thanks!
Thank you, kind Valerio!
59:41 The denominator values in the second column of this matrix seem to be different from the formula. Shouldn't it be 10000^(2*0/3)?
You're right and wrong at the same time. There's a mistake in the video -> dimension_model = 2 instead of 3 (I messed this one up in LaTeX!). There's also a mistake in your formula: "2*0" should be "2*1", as is correctly shown in the video. We're at embedding position 2, that is i = 1, given 0-indexing.
In any case, thank you for pointing this out :)
I believe @user-yf6yf6ki6f has a valid point. The denominator in the second column should be 10000^(2*0/3), and I also noticed a mistake in the third column - it should be 10000^(2*1/3). I think this is how it is implemented in the upcoming video within the _get_angles method.
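To make the indexing concrete, here is a minimal NumPy sketch of the angle term pos / 10000^(2i/d_model) from the paper, using d_model = 3 as in the three-column matrix discussed above. The helper name get_angles only mirrors the _get_angles method mentioned here; the actual implementation in the follow-up video may differ.

import numpy as np

def get_angles(pos, i, d_model):
    # Dimension pairs (0, 1), (2, 3), ... share a frequency, hence 2 * (i // 2).
    angle_rates = 1.0 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

# Positions 0..3, embedding dimensions 0..2, d_model = 3:
angles = get_angles(np.arange(4)[:, np.newaxis], np.arange(3)[np.newaxis, :], 3)
print(angles)  # columns 0 and 1 use 10000^(2*0/3); column 2 uses 10000^(2*1/3)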
Thank you very much! It will significantly help me with my university project!
Valerio, I'm midway through writing my PhD thesis on music generation, and this video is incredibly useful for checking that my explanations make sense; it's also a great source to cite. Thanks for making it! Also, at 1:00:38, why is your dimension model 2 for the cos(pos / 10000^(2i/dimension_model)) examples? Just want to make sure I'm not misunderstanding something :)
Thanks again!
Best explanation I found so far. Keep it up!
Thanks a lot, that's pure gold content!
Thank you!
Thank you so much, Valerio!
Excellent , thanks !
Thanks a lot! You did great work!
Thanks!
Awesome explanation.
I have a doubt: the embedding matrix is such that the first row corresponds to the first word in the sequence, and so on. Since we already have this positional arrangement of each word in the sequence, isn't that enough for the transformer model to understand position-related info for all the words in the input?
The self-attention process is inherently position-agnostic: it doesn't consider the order of words at all. The attention mechanism would work the same way regardless of word order if not for positional encodings. That's why we can't rely on the order of the rows in the input matrix. The model needs an explicit, numerical way to understand word order, and that is the job of the sinusoidal function.
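To see this concretely, here is a tiny NumPy sketch (not code from the course) showing that plain self-attention is permutation-equivariant: shuffling the input rows only shuffles the output rows the same way, so without positional encodings no information about word order reaches the model.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention, no positional encoding.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                        # 5 tokens, d_model = 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
perm = [2, 0, 4, 1, 3]                             # shuffle the word order

out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)
print(np.allclose(out_shuffled, out[perm]))        # True: attention alone sees no order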
@@ValerioVelardoTheSoundofAI Like a blind mouse that can sense the gradient in the smell of cheese in its environment.
@@hariduraibaskar9056 I love the metaphor :D Quite appropriate!
Thank you, sir!
Please call me Valerio :)
Thank you, Valerio! :)
Lovely explanation as always. @@ValerioVelardoTheSoundofAI
Is there a part II of the video?
It'll come out tomorrow - stay tuned ;)
@@ValerioVelardoTheSoundofAI Great, thanks for sharing your knowledge.
🤟
The positional encoding matrix is either a 'clever math trick' or a sign that all of this is a kludgy hack and that we're still very far off from actually understanding this crap lol.
Like, we're still messing with brimstone and vitriol, and haven't been able to describe 'sulfur' yet.
You say "easily", but your part 1 video is over 1 hour 😅
I considered various methods to convey this topic:
1. Release a concise 15-minute video, giving viewers a feeling of understanding about transformers, yet only skimming the surface;
2. Publish a denser 30-minute video, heavy on mathematics and light on explanations, assuming a substantial level of pre-knowledge and making the material challenging;
3. Provide an in-depth, 2+ hour explanation filled with details, offering sufficient time to demystify the more intricate concepts in a user-friendly way.
My choice was the third option. Though it is lengthy, I believe the thorough coverage its length allows actually makes the material simpler to comprehend.