🛑🪧Our remastered version of this video: th-cam.com/video/ec9IQMiJBhs/w-d-xo.html 🪧 containing more about attention keys, queries and values!
Residual connection.
I just understood why they sum those layers, thanks! So the main information doesn't get lost through the process.
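Exactly! For anyone else wondering, here is a minimal PyTorch-style sketch of that residual (skip) connection idea; the class and variable names are illustrative and not code from the video:

import torch
import torch.nn as nn

class ResidualSublayer(nn.Module):
    # Wraps any sublayer (self-attention or feed-forward) with a skip connection.
    def __init__(self, sublayer, d_model):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Adding x back in means the original information cannot get lost:
        # the sublayer only has to learn a *change* to the representation.
        return self.norm(x + self.sublayer(x))

block = ResidualSublayer(nn.Linear(512, 512), d_model=512)
out = block(torch.randn(2, 10, 512))  # shape stays (2, 10, 512)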
This is my favourite video on transformers. Every time I want to revise the architecture quickly, I come here!
Wow, this is huge and means a lot to us. Thanks, happy this helps.
This is my favorite video on transformers
Glad you like it this much!
I agree with the general consensus in the comments: your explanation of transformers has been the best I've seen. Subbed!!
Reading this comment makes Ms. Coffee Bean so happy!
I've been trying to wrap my head around this architecture for a loooong time. Thanks, Coffee Bean!
So happy I made a difference! -- Coffee Bean
Finally got what is so special about this architecture. Thank you!
Our pleasure! :)
This is the best-explained video on transformers I have ever watched!!! Amazing work. Please keep posting videos on such topics.
Thank you! Ms. Coffee Bean will try her best!
Thanks for your contribution! I think I just have to binge watch all your videos before I can do anything else today.
Haha, so happy to hear this! Luckily, the videos are short and not that many (yet).
Very nice explanation, easy to follow even for someone with almost no background in NLP
Thanks, that was exactly the point! 😃
By far the best explanation on transformers. I was actually on a coffee break O_o.
Watching AI Coffee Break while working. 😀
Watching AI Coffee Break while on a coffee break: 🤯😱
Thanks for leaving this comment! Happy you found this video useful!
Great video. I really like how you put the aspects of transformers in practical context.
Thank you very much!
Thank you very much for this easy explanation of the complex topic! :)
Your explanation is very clear and precise. It helps me a lot. Thank you so much.
Excellent explanations! Please keep the transformer and other state-of-the-art NLP videos coming ! Great quality.
What OTHER state of the art is there in NLP these days? :D Just kidding, but there is some truth in my question, since the Transformer has taken all the spotlight nowadays.
@@AICoffeeBreak Agreed! I meant multimodal models, Google BiT, etc. Also, the high-level differences between T5, BART, and GPT-2, and which kinds of tasks each excels in, would be very helpful to know! Thanks for making these videos. I just watched the multimodal video and it was extremely useful.
@@vaishali_ramsri Thanks, these ideas are very interesting for me too! I will add these to my (already very long) list of ideas for videos.
SUBBED! I honestly enjoy this more than my favourite video game YouTube videos.
Haha, thanks! Imagine what would happen if Ms. Coffee Bean started explaining ML concepts while gaming! 😱
@@AICoffeeBreak that would be the only live stream that I would pay money to watch haha
Easy to understand and good for beginners. Thanks for your high-quality video :P
Thanks for passing by and leaving this nice comment! ☺️
Awesome! Thanks for this great video
Our pleasure! Thanks for watching. 👍
Best explanation everrrr.. Thank you soo much.
You're very welcome! Glad it helped!
Very good explanation. Thanks!
Thanks! Best explanation on transformers I've seen so far, but I think I would have liked it if you dwelled more on the generation of the output and encoder-decoder attention. Cheers!
Great suggestion! I am considering making a remastered "Transformer Explained" version. But I have not found the time yet...
@@AICoffeeBreak Thank you for the response! Best of luck
Very good explanation
Thanks for appreciating. It is one of my earliest videos. 😅 I should do a remastered version of this.
@@AICoffeeBreak Great, keep going! I am a newbie to the AI world and not a computer science person. Your videos are really awesome and easy to understand 😊
Best explanation I've seen, thank you. Subbed.
Subscribed because this channel keeps giving top quality explanations.
Also I just realised the eyes on the coffee bean are shaped like a coffee bean 🤣
What? I did not realize the CB eyes are shaped like beans. They are just circles. 🤣
Holy smokes... your videos are clear and entertaining. Wow, others could learn from you. The "pseudo" bossa nova at the end helps too, me having a Brasil connection ;))
Please keep up this amazing work, clearly explained. And a request: if possible, please add all your relevant videos to a playlist so that they can be followed easily (I found a playlist, though). Thanks a ton.
Hey, thanks for the suggestion! Do I understand correctly that the existing playlists are not enough? :)
@@AICoffeeBreak No, I just wanted to say: whenever you add a new video, please try to add it to the playlist. Your playlists have got great content. Thanks again.
Thanks for clarifying! I usually add things to the playlists if they belong together.
Excellent explanation! Thank you!
Ms. Coffee Bean is so glad it was helpful!
Thank you mam for this beautiful explanation!
Subscribed
Excellent video!
I am not sure if you have already made a video on this, but I would love to see you take two architecture paradigms, like CNNs and Transformers, and explain the differences, the reasons why one works better than the other, in which situations, etc.
Is this what you mean? th-cam.com/video/aH7s6qXEUcc/w-d-xo.html
Or this? th-cam.com/video/DVoHvmww2lQ/w-d-xo.html
Such an incredible explanation.
Are you using Manim? It's so smooth.
Thanks! No Manim, just PowerPoint's morph functionality. 😅
Glad to see you around; hope you find something worth wasting your time on. 👍
The high-level view is really helpful, but I think a video that explains the details with some intuition could be really useful as well.
Noted. 😀
I wonder what the next big neural network architecture is going to be. What comes next after Transformers?
Maybe RNNs such as LSTMs and State Space Models (we did an explainer here: th-cam.com/video/vrF3MtGwD0Y/w-d-xo.html )
Very nice explanation :p
Glad you think so!
Thanks soooooooooooo much
Welcome 😊
I finally understand this architecture too :D. Thank you.
I'm glad this video helped someone!
This was very well explained. Thank you so much for this. I couldn't help but wonder: since these models are trained on data, how much data do you need for a model to be reasonably accurate? A quick Google search told me that GPT-4 is trained on approximately 500,000,000,000 pages of text, which is absolutely insane to me!
I want to know if there are models we can develop that train on less data but still provide accurate results, and what those models look like.
Thanks a lot, especially since this is a very old video. We have made a new transformer explainer: th-cam.com/video/ec9IQMiJBhs/w-d-xo.html
About your question: unfortunately, in deep learning, great performance comes with big data, because the models only work well in the domains and kinds of data they have seen during training (in distribution). And the motto is: nothing will be out of distribution if we include the entire world in the training data, no? (This is a tongue-in-cheek comment, just flagging it.) 😅
So, if you are willing to sacrifice a lot of performance, there are models that can work with less data, going back to older NLP based on word embeddings and tf-idf representations. But I cannot say more until I know your specific use case.
But if you want a chatbot that can talk about almost anything, then you need trillions of tokens of text, at least this is what we learned from ChatGPT et al.
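If it helps to make the small-data option concrete, here is a minimal sketch of such a tf-idf baseline (assuming scikit-learn; the toy corpus and labels are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tf-idf + linear classifier can be fit on a handful of examples,
# at the cost of far weaker generalization than a large transformer.
texts = ["the coffee was great", "terrible service", "loved the espresso", "cold and awful"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["great espresso"]))  # expected: [1]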
Oh wow, I didn't even realize I was on the older video. I will definitely check out the new one, and thanks for your answer!
The motivation for my question was: since we typically don't have a lot of data on endangered languages, could there be language models that produce helpful results in these languages despite the lack of data? I guess the broader question would be: what kinds of language models could we apply to endangered languages, for things such as documenting them or aiding that kind of research?
@@ChengyiLi-t1k I'm not an expert in multilingual AI, but I have heard from experts there. Your question reminds me of two points.
* In multilingual AI, people still try to scrape all the monolingual data they have, automatically produce back-translations, and then train a multilingual model that can hopefully transfer its knowledge from high-resource languages to the low-resource one. But you need a decent amount of data from every language you aim to learn. We've made a video on this approach; find the link in the description. th-cam.com/video/1gHUiNLYa20/w-d-xo.html
* If you have a very powerful model, of the class of GPT-4, Gemini, et al., then you can hope that its representations are strong enough to be elicited with few-shot prompting. So, if you have the context length of Gemini, of multiple million input tokens, then you can many-shot a language from scratch by feeding in its dictionary and a grammar book. This is what Gemini 1.5 did for Kalamang: www.reddit.com/r/singularity/comments/1arla9z/gemini_15_pro_can_learn_to_zero_shot_translate_of/
It was meant as an out-of-distribution test, because the authors were sure that there is no trace of Kalamang on the internet that Gemini was trained on.
@@AICoffeeBreak Thank you very much! I appreciate the response.
Hi Letitia, thanks for such great content. One question I have: when we use a Transformer encoder to encode a sequence and generate embeddings, what loss function does the Transformer use? For example, I am using a Transformer encoder to encode a sequence of user actions in a user session to generate embeddings for my recommender system. Kindly answer.
Hi! To predict the correct word from all possible words in the (English) vocabulary, the model uses the cross-entropy loss. But you can change the loss to adapt it to your problem.
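For concreteness, here is a minimal PyTorch sketch of that loss; the vocabulary size, logits, and target index are invented for illustration:

import torch
import torch.nn.functional as F

vocab_size = 30000                   # e.g. the size of a subword vocabulary
logits = torch.randn(1, vocab_size)  # model scores for every word in the vocabulary
target = torch.tensor([421])         # index of the correct word

# Cross-entropy compares the predicted distribution over all words
# against the single correct word; training minimizes this quantity.
loss = F.cross_entropy(logits, target)
print(loss.item())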
Neat explanation; however, after watching the video I still don't understand how multi-head attention and encoder-decoder attention work, which are two important concepts I need to know.
You're right. We are planning a "Transformer remastered" video because, admittedly, we did not get to the most technical points and the math formulas. You might want to check out Yannic's video on it; he stays very close to the paper. th-cam.com/video/iDulhoQ2pro/w-d-xo.html
What is the song at the end? I like your end music...
This end music is one of my favorites too! It is called "Malandragem" and you can find it in the YouTube Audio Library (under the YouTube Audio Library license).
Thanks Letitia! Very nice animations. I wonder which software you use to draw them?
For the graphics and content animations I use PowerPoint; for animating Ms. Coffee Bean and for video editing I use Kdenlive; for drawing her I use Adobe Photoshop.
@@AICoffeeBreak Impressive! Do you record the PowerPoint presentation with OBS?
@@GradientDude yes, OBS it is! Forgot to mention.
niiiiiiiiice love you
How can all the word embeddings be processed in parallel? Each sentence is a different length.
Why are word vectors of length 512?
That's their length in the original Transformer (BERT base actually uses 768). They can be longer, depending on how much memory one has available and how long one is willing to wait for a pass through the network. The longer the vectors, the slower the Transformer.
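To make the speed/size trade-off concrete, here is a minimal PyTorch sketch; the sizes are illustrative and not tied to any particular model:

import torch
import torch.nn as nn

# The attention projection matrices are d_model x d_model, so their parameter
# count (and compute) grows roughly quadratically with the vector length.
for d_model in (256, 512, 1024):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    x = torch.randn(1, 16, d_model)  # (batch, sequence length, vector length)
    out = layer(x)
    n_params = sum(p.numel() for p in layer.parameters())
    print(d_model, tuple(out.shape), f"{n_params:,} parameters")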
Hi Letitia, great explanation, thank you. At 08:14 you mentioned the encoder-decoder attention. That's one of the key locations where the magic happens. Could you elaborate on it? Or maybe you have done so already in one of your other videos?
Best video I found on transformers.
I still don't know what the queries, keys, and values are in the decoder part.
Maybe our remastered version of this video might help th-cam.com/video/ec9IQMiJBhs/w-d-xo.html . We did it mainly because of your feedback, thanks a lot!
Hi, I still don't understand what exactly the queries, keys, and values are in the paper "Attention Is All You Need". Can you help?
Ms. Coffee Bean is planning a follow-up video on the Transformer going into more detail. Right now, the concepts you ask about were out of the scope of this explanation. Stay tuned!
What is the meaning of fine-tuning and pre-training in Transformers?
Hi, thanks for the video! There are several things that are still unclear to me. First, I do not understand well how the architecture is dynamic with respect to the size of the input. I mean, what changes structurally when we change the size of the input? Are there some inner parts that should be repeated in parallel? Or does this architecture fix a maximum window size that we hope will be larger than any input sequence?
The other question is the most important one: it seems every explanation of the Transformer architecture I have found so far focuses on what we WANT a self-attention or attention layer to do, but never says a word about WHY, after training, those attention layers will do, by emergence, what we expect them to do. I guess it has something to do with the chosen structure of the data at the input and output of those layers, as well as the data flow that is forced, but I have not had the revelation yet.
If you could help me with those, that would be great!
Your video is very good and carries a broad message. Thank you.
Thank you, it was very good. I hope you can improve your accent and speak a bit faster. Thanks again.
Amazing, love it 💗. Very intuitive explanation. Please speak slowly :)
Thank you, I will try my best concerning the verbal pace! 😊 Does turning the captions on and setting the speed of the video to 0.75x maybe help you?
The captions could help a lot; I have uploaded them myself, so they are not automatically generated. But even if they were, the Algorithm has gotten pretty good at speech-to-text.