Transformer Neural Networks Derived from Scratch

  • Published on 23 Nov 2024

Comments • 248

  • @algorithmicsimplicity
    @algorithmicsimplicity  1 year ago +139

    Video about Diffusion/Generative models coming next, stay tuned!

    • @mahmirr
      @mahmirr 1 year ago +5

      Was coming to comment this, thanks

    • @arslanjutt4282
      @arslanjutt4282 1 year ago +1

      Please make video

    • @micmac8171
      @micmac8171 9 months ago

      Please!

  • @ullibowyer
    @ullibowyer 6 months ago +58

    I now realise that the key to understanding transformers is to ask why they work, not how. Thanks!

    • @algorithmicsimplicity
      @algorithmicsimplicity  6 months ago +6

      Thank you so much!

    • @robertputneydrake
      @robertputneydrake 4 months ago +5

      @@algorithmicsimplicity this is indeed quite eye-opening! Thanks for your video!

  • @patriziap4316
    @patriziap4316 4 months ago +18

    I'm a mathematician working at the university, this is a wonderful explanation of Transformers, explaining the meaning and not just the algorithm, very good

  • @rah-66comanche94
    @rah-66comanche94 1 year ago +147

    Amazing video ! I really appreciate that you explained the Transformer model *from scratch*, and didn't just give a simplistic overview of it 👍
    I can definitely see that *a lot* of work was put into this video, keep it up !

    • @korigamik
      @korigamik 8 months ago

      Would you share the source code for the animations?

  • @IllIl
    @IllIl 1 year ago +87

    Dude, your explanations are truly next level. This really opened my eyes to understanding transformers like never before. Thank you so much for making these videos. Really amazing resource that you have created.

  • @abdullahbaig7517
    @abdullahbaig7517 6 months ago +28

    This gem is underrated. This is the only video that after watching, I feel like I know how transformers work. Thanks!

    • @qwsafirkmc9093
      @qwsafirkmc9093 3 months ago

      you most likely don't. He didn't even show the attention formula

    • @BenjaminDorra
      @BenjaminDorra 3 months ago

      @@qwsafirkmc9093 I disagree.
      Everything in this video is simplified to the extreme, on purpose. Because it is the only way to understand the global behavior quickly.
      Yes the attention formula is not shown but the whole process is illustrated (including the softmax operation).
      The tokenizer is far more complicated in practice than a one-hot encoding at word level (and a good tokenizer is apparently quite important for good performance).
      The positional encoding is, you guessed it, not a one-hot encoding either. It may be complicated enough on its own to require a whole explanation video.
      Point is, the whole approach is to avoid details. And I think it works quite well.
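
For readers who want to see what a word-level one-hot encoding looks like, here is a minimal sketch. The toy vocabulary, the `max_len` parameter, and the choice to concatenate the word and position vectors are all assumptions made for illustration; as the comment above notes, practical tokenizers and positional encodings are more involved.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_sentence(words, vocab, max_len):
    """Encode each word as [one-hot word id | one-hot position].

    `vocab` (word -> id) and `max_len` are hypothetical; concatenating the
    two one-hot vectors is an assumption made for this sketch.
    """
    encoded = []
    for position, word in enumerate(words):
        encoded.append(one_hot(vocab[word], len(vocab)) + one_hot(position, max_len))
    return encoded


vocab = {"the": 0, "cat": 1, "sat": 2}
vectors = encode_sentence(["the", "cat", "sat"], vocab, max_len=4)
```

Each vector has `len(vocab) + max_len` entries, so with a vocabulary of tens of thousands of words almost all entries are zero.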

    • @qwsafirkmc9093
      @qwsafirkmc9093 3 months ago

      @@BenjaminDorra if I don't see how matrices are multiplied, the general shape of the tensor at each step and all that jazz, I barely understand anything. That simple explanation would've done wonders with some tensor multiplication shenanigans

    • @BenjaminDorra
      @BenjaminDorra 3 months ago

      @@qwsafirkmc9093 Ok fair.
      From my understanding the attention matmuls behave more like a vector outer product, so every possible pair of tokens (each token itself being a vector) is combined (through a simple element-by-element product), pretty much what is shown in the video.
      But yes it is not simple and I may be wrong. Math is hard !
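
Since this thread asks for the attention formula and the tensor shapes at each step, here is a minimal single-head sketch of scaled dot-product attention, softmax(QKᵀ/√d)V, in plain Python. The helper names, the toy input, and the identity projection matrices are assumptions chosen for readability; this is not the video's code. Note that each pairwise score is a dot product between two token vectors (an element-by-element product followed by a sum).

```python
import math


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (n_tokens x d_model); w_q, w_k, w_v: (d_model x d_head).
    """
    q = matmul(x, w_q)                      # (n_tokens x d_head)
    k = matmul(x, w_k)                      # (n_tokens x d_head)
    v = matmul(x, w_v)                      # (n_tokens x d_head)
    d_head = len(q[0])
    scores = matmul(q, transpose(k))        # (n_tokens x n_tokens): one score per token pair
    scores = [[s / math.sqrt(d_head) for s in row] for row in scores]
    weights = [softmax(row) for row in scores]   # each row sums to 1
    return matmul(weights, v), weights      # output: (n_tokens x d_head)


# Toy input: 3 tokens, d_model = 2, d_head = 2 (identity projections for readability).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
```

The `(n_tokens x n_tokens)` score matrix is where every pair of tokens gets combined; the softmax turns each row into importance weights that are then used to average the value vectors.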

  • @tdv8686
    @tdv8686 1 year ago +53

    Thanks for your explanation; This is probably the best video on YouTube about the core of transformer architecture so far, other videos are more about the actual implementation but lack the fundamental explanation. I 100% recommend it to everyone in the field.

  • @asier6734
    @asier6734 1 year ago +12

    I love the algorithmic way of explaining what mathematics does. Not too deep, not too shallow, just the right level of abstraction and detail. Please please explain RNNs and LSTMs, I'm unable to find a proper explanation. Thanks !

  • @StratosFair
    @StratosFair 9 months ago +19

    I am currently doing my PhD in machine learning (well, on its theoretical aspects), and this video is the best explanation of transformers I've seen on YouTube. Congratulations and thank you for your work

  • @Magnetic-Milk
    @Magnetic-Milk 11 months ago +2

    Not so long ago I was searching for hours trying to understand transformers. In this 18 min video I learned more than I learned in 3 hours of researching. This is the best computer science video I have ever watched in my entire life.

  • @chrisvinciguerra4128
    @chrisvinciguerra4128 1 year ago +3

    It seems like whenever I want to dive deeper into the workings of a subject, I always only find videos that simply define the parts of how something works, like it is from a textbook. You not only explained the ideas behind why the inner workings exist the way they do and how they work, but acknowledged that it was an intentional effort to take an improved approach to learning.

  • @anatolyr3589
    @anatolyr3589 7 months ago +4

    yeah! this "functional" approach to the explanation rather than "mechanical" is truly amazing 👍👍👍👏👏👏

  • @xt3708
    @xt3708 1 year ago +7

    Absolutely love how you explain the process of discovery, in other words figure out one part which then causes a new problem, which then can be solved with this method, etc. The insight into this process for me was even more valuable than understanding this architecture itself.

  • @atabhatti2844
    @atabhatti2844 4 months ago +3

    This is an excellent video! Highly underrated. While most videos explain algorithms, this explains the why, which gets me to understand the algorithm on a much deeper level. I wish this video would have ended with a summary of all the ideas covered and how those ideas are addressed by the transformer architecture. I was doing that in my head during the video, but not everyone may be as familiar. Thanks anyway. Please make many more videos!

    • @algorithmicsimplicity
      @algorithmicsimplicity  4 months ago +1

      Thanks for the feedback, I will keep it in mind for my next videos!

  • @ChrisCowherd
    @ChrisCowherd 1 year ago +2

    This video is by far the clearest and best explained I've seen! I've watched so many videos on how transformers work and still came away lost. After watching this video (and the previous background videos) I feel like I finally get it. Thank you so much!

  • @diegobellani
    @diegobellani 1 year ago +2

    Wow just wow. This video really makes you understand the reason behind the architecture, something that you don't really get even from reading the original paper.

  • @Abcdefghmnopqrstuvwxyz
    @Abcdefghmnopqrstuvwxyz 2 months ago +1

    Thank you for your work, I am currently doing a PhD in ML Systems and I learned several things from your video! Thank you for your service!

  • @jackkim5869
    @jackkim5869 8 months ago +2

    Truly this is the best explanation of transformers I have seen so far. In particular, the great logical flow makes it easier to understand difficult concepts. Appreciate your hard work!

  • @benjamindilorenzo
    @benjamindilorenzo 9 months ago +6

    This is the best video on Transformers I have seen on all of YouTube.

  • @Muhammed.Abd.
    @Muhammed.Abd. 1 year ago +6

    That is possibly the best explanation of Attention I have ever seen!

  • @RoboticusMusic
    @RoboticusMusic 1 year ago +16

    Thank you for not using slides filled with math equations. If someone understands the math they're probably not watching these videos, if they're watching these videos they're not understanding the math. It's incredible that so many YouTube teachers decide to add math and just point at it for an hour without explaining anything their audience can grasp, and then in the comments you can tell everybody golf clapped and understood nothing except for the people who already grasp the topic. Thank you again for thinking of a smart way to teach simple concepts.

    • @xt3708
      @xt3708 1 year ago +3

      amen. the power of out of the box teachers is infinite.

  • @gabrielpetersson3416
    @gabrielpetersson3416 1 month ago +2

    Incredible, every one of your videos is crazy good. Post more!

    • @algorithmicsimplicity
      @algorithmicsimplicity  1 month ago

      Thank you so much! I am working on the next one at the moment!

  • @y.shrestha6936
    @y.shrestha6936 3 months ago +2

    Amazing presentation. Thanks!

  • @CharlieZYG
    @CharlieZYG 1 year ago +3

    Wonderful video. Easily the best video I've seen on explaining transformer networks. This "incremental problem-solving" approach to explaining concepts personally helps me understand and retain the information more efficiently.

  • @ondrejbelan1463
    @ondrejbelan1463 19 days ago +1

    Great video, I am just starting with Transformers, but never thought about them in relation to convolutional networks

  • @tunafllsh
    @tunafllsh 1 year ago +4

    This video is exactly what I needed. Despite knowing what a transformer's made of, I still felt incompleteness and didn't know the motivation behind it. And your video answered this question perfectly. Now understanding why it works is another question.

  • @ItsRyanStudios
    @ItsRyanStudios 1 year ago +4

    This is AMAZING
    I've been working on coding a transformer network from scratch, and although the code is intuitive, the underlying reasoning can be mind bending.
    Thank you for this fantastic content.

  • @declanbracken2577
    @declanbracken2577 5 months ago +1

    There are many explanations of what a transformer is and how it works, but this one is the best I've seen. Really good work.

  • @mek_Morok
    @mek_Morok 1 month ago +1

    This is the best transformer explanation video on youtube! Everything is so clear now!

  • @igNights77
    @igNights77 1 year ago +2

    Explained thoroughly and clearly from basic principles and practical motivations. Basically the perfect explanation video.

  • @TTTrouble
    @TTTrouble 1 year ago +1

    I’ve watched so many video explainers on transformers and this is the first one that really helped show the intuition in a unique and educational way. Thank you, I will need to rewatch this a few times but I can tell it has unlocked another level of understanding with regard to the attention mechanism that has evaded me for quite some time (darned KQV vectors…). Thanks for your work!

  • @Alpha_GameDev-wq5cc
    @Alpha_GameDev-wq5cc 6 months ago +9

    I still remember when all the cool acronyms I had to deal with were just FNNs, CNNs, ADAM, RNNs, LSTMs and the newest kid on the block, GANs.

    • @newbie8051
      @newbie8051 6 months ago +1

      Damn, FNNs and CNNs are basic stuff we were taught in the 4th semester of our undergrad. Adam and RNNs were in the "additional resources" section of an introductory course on Deep Learning I took in the same semester.
      Encountered LSTMs through personal projects lol
      Still haven't used GANs and Autoencoders, but they were the talk of the town back then due to the diffusion models.

    • @Alpha_GameDev-wq5cc
      @Alpha_GameDev-wq5cc 6 months ago

      @@newbie8051 yeah, I did an FNN from scratch in high school. I was really hopeful about getting into AI research, and then the transformers arrived in my college year…

    • @xxyyzz8464
      @xxyyzz8464 3 months ago

      You remember 2014 - 2015 too?!? 😂

  • @yoavtamir7707
    @yoavtamir7707 1 month ago +1

    This is BY FAR the BEST explanation I have seen on this topic. You, sir, are extremely talented! Keep up the great work and thank you!

  • @adityachoudhary151
    @adityachoudhary151 9 months ago +2

    really made me appreciate NN even more. Thanks for the video

  • @ryhime3084
    @ryhime3084 1 year ago +1

    This was so helpful. I was reading through how other models like ELMo work, and it makes sense how they came up with ideas for those, but the transformer just seemed like it popped out of nowhere with random logic. This video really helps to understand their thought process.

  • @halflearned2190
    @halflearned2190 11 months ago +3

    Hey man, I watched your video months ago, and found it excellent. Then I forgot the title, and could not find it again for a long time. It doesn't show up when I search for "transformers deep learning", "transformers neural network", etc. Consider changing the title to include that keyword? This is such a good video, it should have millions of views.

  • @dmlqdk
    @dmlqdk 9 months ago +3

    Thank you for answering my questions!!

    • @algorithmicsimplicity
      @algorithmicsimplicity  9 months ago

      Thanks for the tip! I'm always happy to answer questions.

  • @shantanuojha3578
    @shantanuojha3578 6 months ago +3

    Awesome video bro. I always like an intuitive explanation.

  • @RalphDratman
    @RalphDratman 1 year ago +2

    This is by far the best explanation of the transformer architecture. Well done, and thank you very much.

  • @corydkiser
    @corydkiser 1 year ago +13

    This was top notch. Please do one for RetNets and Liquid Neural Nets.

  • @TropicalCoder
    @TropicalCoder 1 year ago +3

    Very nicely done. Your graphics had a calming, almost hypnotic effect.

  • @terjeoseberg990
    @terjeoseberg990 1 year ago +10

    I wasn’t aware that they were using a convolutional neural network in the transformer, so I was extremely confused about why the positional vectors were needed. Nobody else in any of the other videos describing transformers pointed this out. Thanks.

    • @Hexanitrobenzene
      @Hexanitrobenzene 1 year ago +6

      "they were using a convolutional neural network in the transformer"
      No no, Transformers do not have any convolutional layers, the author of the video just chose CNN as a starting point in the process "Let's start with the solution that doesn't work well, understand why it doesn't work well and try to improve it, changing the solution completely along the way".
      The main architecture in natural language processing before transformers was the RNN (recurrent neural network). Then in 2014 researchers improved it with an attention mechanism. However, RNNs do not scale well, because they are inherently sequential, and scale is very important for accuracy. So, researchers tried to get rid of RNNs and succeeded in 2017. CNNs were also tried, but, to my not-very-deep knowledge, were less successful. Interesting that the author of the video chose a CNN as a starting point.

    • @terjeoseberg990
      @terjeoseberg990 1 year ago

      @@Hexanitrobenzene, I suppose I’ll have to watch this video again. I’ll look for what you mentioned.

    • @Hexanitrobenzene
      @Hexanitrobenzene 1 year ago

      @@terjeoseberg990
      A little off topic, but... Not long ago I noticed that YouTube deletes comments with links. Ok, automatic spam protection. (Still, the fact that it does this silently is very frustrating...) But does it also delete comments where links are separated into words with "dot" between them? I tried to give you a resource I learned this from, but my comment got dropped two times...

    • @Hexanitrobenzene
      @Hexanitrobenzene 1 year ago

      ...Silly me, I figured I could just give you the title you can search for: "Dive into deep learning". It's an open textbook with code included.

    • @terjeoseberg990
      @terjeoseberg990 1 year ago

      @@Hexanitrobenzene, The best thing to do when YouTube deletes comments is to provide a title or something so I can find it. A lot of words are banned too.

  • @SahinKupusoglu
    @SahinKupusoglu 1 year ago +1

    This video was all I needed for LLMs/transformers!

  • @TeamDman
    @TeamDman 7 months ago +2

    I keep coming back to this because it's the best explanation!!

  • @ArtOfTheProblem
    @ArtOfTheProblem 1 year ago +3

    Really well done, I haven't seen your channel before and this is a breath of fresh air. I've been working on my GPT + transformer video for months and this is the only video online which is trying to simplify things through an independent realization approach. Before I watched this video my 1 sentence summary of why Transformers matter was: "They contain layers that have weights which adapt based on context" (vs. using deeper networks with static layers), and this video helped solidify that further. Would you agree?
    I also wanted to boil down the attention heads as "mini networks" (or linear functions) connected to each token which are trained to do this adaptation. One network pulls out what's important in each word given the context around it, the other network combines these values to decide the importance of those two words in that context, and this is how the 'weights adapt'.
    I still wonder how important the distinction of linear layer vs. just a single layer is; I like how you pulled that into the optimization section. I know how hard this stuff is to make clear and you did well here.

    • @maxkho00
      @maxkho00 1 year ago

      My one-sentence summary of why transformers matter would be "they are standard CNNs, except the words are first re-ordered in a way that makes the CNN's job easier before being fed in".
      Also, a single NN layer IS a linear layer; I'm not sure what you mean by saying you don't know how important the distinction between the two is.

    • @ArtOfTheProblem
      @ArtOfTheProblem 1 year ago

      @@maxkho00 thanks

  • @nayanbaishya4077
    @nayanbaishya4077 13 days ago +1

    Very nice video. Name of your channel reflects in the content of the video. Thank you.🙏🙏

  • @nara260
    @nara260 11 months ago +1

    Thanks a lot! This visual lecture cleared the dense fog over my cognitive picture of the transformer.

  • @Muuip
    @Muuip 1 year ago +2

    Great concise visual presentation!
    Thank you, much appreciated!
    👍👍

  • @IzUrBoiKK
    @IzUrBoiKK 1 year ago +1

    As both a math enthusiast and a programmer (who obv also works on AI), I rly liked this vid. I can confirm that this is one of the best and most genuine explanations of transformers...

  • @yonnn7523
    @yonnn7523 1 year ago +1

    best explainer of transformers I saw so far, thnx!

  • @giphe
    @giphe 1 year ago +1

    Wow! I knew about attention mechanisms but this really brought my understanding to a new level. Thank you!!

  • @rogerzen8696
    @rogerzen8696 10 months ago +1

    Good job! There was a lot of intuition in this explanation.

  • @MalTramp
    @MalTramp 6 months ago +1

    This was an excellent video on the global design structure for transformer. Love all your videos!

  • @MaribSultan
    @MaribSultan 1 year ago +1

    Can't wait for more content from your channel. Brilliantly explained.

  • @panizzutti
    @panizzutti 1 month ago +1

    The best explanation I have ever seen, thank you

  • @grjesus9979
    @grjesus9979 2 months ago +2

    Thanks! Now, what do you do with the output of each self-attention layer? As it now has the appropriate information about its context, do you pass it through a CNN? Or what do you do with the output vector for each word?

    • @algorithmicsimplicity
      @algorithmicsimplicity  2 months ago +1

      You feed it through another self-attention layer! A transformer consists of multiple self attention + MLP layers (usually hundreds of layers for a large model). At the end, you run the output vectors from the final layer through a linear classifier to predict whatever the label is (e.g. the next word in the input text for language modelling).
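
The layer stacking described in this reply can be sketched in plain Python as follows. This is a toy under stated assumptions, not the author's implementation: the weights are random, the sizes (`d_model`, `d_hidden`, `n_layers`, `n_classes`) are hypothetical, residual additions are included, and layer normalisation and multiple heads are left out for brevity.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def mlp(x, w1, w2):
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]  # ReLU
    return matmul(hidden, w2)


def transformer(x, layers, w_out):
    # Each layer: self-attention, then an MLP, each added back onto its input.
    for layer in layers:
        x = add(x, self_attention(x, *layer["attn"]))
        x = add(x, mlp(x, *layer["mlp"]))
    return matmul(x, w_out)  # final linear classifier: one row of scores per token


d_model, d_hidden, n_layers, n_classes, n_tokens = 4, 8, 2, 3, 5
layers = [
    {
        "attn": [rand_matrix(d_model, d_model) for _ in range(3)],
        "mlp": [rand_matrix(d_model, d_hidden), rand_matrix(d_hidden, d_model)],
    }
    for _ in range(n_layers)
]
w_out = rand_matrix(d_model, n_classes)
x = rand_matrix(n_tokens, d_model)
logits = transformer(x, layers, w_out)
```

The output has one row of `n_classes` scores per input token; for language modelling, the classes would be the vocabulary and the scores would rank candidate next words.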

    • @grjesus9979
      @grjesus9979 2 months ago +2

      @@algorithmicsimplicity Thank you! Appreciate it! Again, very high quality content. Thanks for your time and effort.

  • @JunYamog
    @JunYamog 10 months ago +1

    Your visualization and explanation are very good. Helped me understand a lot. I hope you can put out more videos; it must not be easy, otherwise you would have done it already. Keep it up.

  • @briancase6180
    @briancase6180 1 year ago +1

    This is a truly great introduction. I've watched other, also excellent, introductions, but yours is superior in a few ways. Congrats and thanks! 🤙

  • @antonkot6250
    @antonkot6250 6 months ago

    The best explanation I found so far!

  • @mvlad7402
    @mvlad7402 6 months ago +2

    Excellent explanation! All kudos to the author!

  • @MichaelBrown-gt4qi
    @MichaelBrown-gt4qi 5 months ago +2

    I've started binge watching all your videos. 😁

  • @DanOneOne
    @DanOneOne 6 months ago +2

    So what does it really classify? The image recognition network needed to output a label for the image. What does this transformer output after processing the text?

    • @algorithmicsimplicity
      @algorithmicsimplicity  6 months ago +1

      Whatever you train it to. People have trained transformers to categorize text, predict the sentiment of sentences, all sorts of things. ChatGPT is specifically trained to predict the next word that comes after a partial piece of text. It turns out that you can use this to generate new text from scratch by repeatedly applying it to its own output. This technique is known as 'auto-regression' and I explain it in more detail in this video: th-cam.com/video/zc5NTeJbk-k/w-d-xo.html
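
The auto-regression loop mentioned in this reply can be sketched as below. The `next_word` callable stands in for a trained model; the lookup table used here is a hypothetical toy, not how a real language model chooses words.

```python
def generate(next_word, prompt, max_new_words):
    """Repeatedly append the predicted next word to the text so far."""
    words = list(prompt)
    for _ in range(max_new_words):
        word = next_word(words)
        if word is None:   # the predictor signals that the text is finished
            break
        words.append(word)
    return words


# Hypothetical stand-in for a trained model: a lookup table keyed on the last word.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "down"}


def toy_next_word(words):
    return TOY_TABLE.get(words[-1])


result = generate(toy_next_word, ["the"], max_new_words=10)
```

The key point is that each prediction is appended to the input before the next prediction is made, so the model is applied to its own output.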

  • @ronakbhatt4880
    @ronakbhatt4880 10 months ago +1

    What a simple but perfect explanation!! You deserve hundreds of times more subscribers.

  • @domasvaitmonas8814
    @domasvaitmonas8814 8 months ago +3

    Thanks. Amazing video. One question though - how do you train the network to output the "importance score"? I get the other part of the self-attention mechanism, but the score seems a bit out of the blue.

    • @algorithmicsimplicity
      @algorithmicsimplicity  8 months ago

      The entire model is trained end-to-end to solve the training task. What this means is you have some training dataset consisting of a bunch of input/label pairs. For each input, you run the model on that input, then you change the parameters in the model a bit, evaluate it again and check if the new output is closer to the training label, if it is you keep the changes. You do this process for every parameter in all layers and in all value and score networks, at the same time.
      By doing this process, the importance score generating networks will change over time so that they produce scores which cause the model's outputs to be closer to the training dataset labels. For standard training tasks, such as predicting the next word in a piece of text, it turns out that the best way for the score generating networks to influence the model's output is by generating 'correct' scores which roughly correspond to how related 2 words are, so this is what they end up learning to do.
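
The "change the parameters a bit and keep the change if the output gets closer to the label" loop described in this reply can be sketched on a toy problem. Everything here is an assumption for illustration: the model is a two-parameter line rather than a transformer, and the step size and step count are arbitrary. In practice, neural networks are usually trained with gradient descent and backpropagation, which finds good parameter changes far more efficiently than trial and error.

```python
import random


def loss(params, data):
    """Mean squared error of the toy model y = w * x + b."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(data, steps=5000, step_size=0.05, seed=0):
    rng = random.Random(seed)
    params = [0.0, 0.0]
    best = loss(params, data)
    for _ in range(steps):
        # Nudge every parameter at once by a small random amount.
        candidate = [p + rng.uniform(-step_size, step_size) for p in params]
        candidate_loss = loss(candidate, data)
        if candidate_loss < best:   # keep the change only if outputs got closer to the labels
            params, best = candidate, candidate_loss
    return params, best


# Input/label pairs generated from the line y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in range(-3, 4)]
initial_loss = loss([0.0, 0.0], data)
params, final_loss = train(data)
```

Nothing in the loop tells the parameters what they should mean; they drift toward values that reduce the loss, which is the same reason the score networks end up producing useful importance scores.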

  • @AdhyyanSekhsaria
    @AdhyyanSekhsaria 1 year ago +1

    Great explanation. Haven't found this perspective before.

  • @jcorey333
    @jcorey333 9 months ago +2

    This is one of the genuinely best and most innovative explanations of transformers/attention I've ever seen! Thank you.

  • @TeamDman
    @TeamDman 1 year ago +1

    I've had to watch this a few times, great explanation!

  • @lakshay510
    @lakshay510 9 months ago +2

    Halfway through the video and I pressed the subscribe button. Very intuitive and easy to understand. Keep up the good work man :)
    1 suggestion: Change the title of video and you'll get more traction.

    • @algorithmicsimplicity
      @algorithmicsimplicity  9 months ago

      Thanks, any title in particular you'd recommend?

  • @christrifinopoulos8639
    @christrifinopoulos8639 10 months ago

    The visualisation was amazing.

  • @_MrKekovich
    @_MrKekovich 1 year ago

    FINALLY I have some basic understanding. Thank you so much!

  • @clray123
    @clray123 1 year ago +2

    Great video, maybe you could cover retentive network (from the RetNet paper) in the same fashion next - as it aims to be a replacement for the quadratic/linear attention in transformer (I'm curious as to how much of the "blurry vector" problem their approach suffers from).

  • @subhadeepchatterjee1528
    @subhadeepchatterjee1528 3 months ago +1

    Finally!!!! Exactly the video I wanted!!!!

  • @pravinkool
    @pravinkool 1 year ago

    Fantastic! Loved it! Exactly what I needed.

  • @rishikakade6351
    @rishikakade6351 6 months ago +2

    Insane that this website is free. Thanks!

  • @c1tywi
    @c1tywi 6 months ago +2

    This video is gold!
    Subscribed.

  • @buchhibaburachakonda5646
    @buchhibaburachakonda5646 3 months ago +2

    Thanks!

  • @Baigle1
    @Baigle1 1 year ago +1

    I think they were actually used publicly as far back as 2006 or even earlier, in compression algorithm competitions

  • @kul6420
    @kul6420 6 months ago +1

    I may be too late to the party but glad I found this channel.

  • @anilaxsus6376
    @anilaxsus6376 1 year ago

    Best explanation I have seen so far.
    Basically, the transformer is a CNN with a lot of extra upgrades. Good to know.

  • @quocanhad
    @quocanhad 8 months ago +1

    you deserve my like bro, really awesome video

  • @iandanforth
    @iandanforth 1 year ago

    I wish this had tied in specifically to the nomenclature of the transformer such as where these operations appear in a block, if they are part of both encoder and decoder paths, how they relate to "KQV" and if there's any difference between these basic operations and "cross attention".

    • @ArtOfTheProblem
      @ArtOfTheProblem 1 year ago

      I'll be doing this, but in short, the little networks he showed connected to each pair are KQ (word pair representation) and the V is the value network. All of this can be done in a decoder-only model as well, and cross attention is the same thing but you are using two separate sequences looking at each other (such as two sentences in a translation network). It's nice to know that GPT, for example, is decoder-only, and so doesn't even need this.

  • @minhsphuc12
    @minhsphuc12 1 year ago

    Thank you so much for this video.

  • @yash1152
    @yash1152 1 year ago +1

    2:36 wow, just 50k words... that sounds pretty easy for computers. amazing.

  • @palyndrom2
    @palyndrom2 1 year ago +3

    Great video

  • @Tigerfour4
    @Tigerfour4 1 year ago +2

    Great video, but it left me with a question. I tried to compare what you arrived at (16:25) to the original transformer equations, and if I understand it correctly, in the original we don't add the red W2X matrix, but we have a residual connection instead, so it is as if we would add X without passing it through an additional linear layer. Am I correct in this observation, and do you have an explanation for this difference?

    • @algorithmicsimplicity
      @algorithmicsimplicity  1 year ago +1

      Yes that's correct, the transformer just adds x without passing it through an additional linear layer. Including the additional linear layer doesn't actually change the model at all, because when the result of self attention is run through the MLP in the next layer, the first thing the MLP does is apply a linear transform to the input. Composition of 2 linear transforms is a linear transform, so we may as well save computation and just let the MLP's linear transform handle it.
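
The fact this reply relies on, that two linear transforms applied in a row equal one merged linear transform, can be checked numerically. The matrices below are arbitrary illustrative values, and the names `w2` and `w_mlp` are labels chosen for this sketch; it demonstrates only the composition property, not the full transformer block.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


w2 = [[1.0, 2.0], [0.0, 1.0]]      # the extra linear layer applied to x
w_mlp = [[3.0, 0.0], [1.0, 1.0]]   # first linear layer of the next MLP
x = [4.0, 5.0]

two_steps = matvec(w_mlp, matvec(w2, x))   # apply W2, then W_mlp
one_step = matvec(matmul(w_mlp, w2), x)    # apply the single merged matrix
```

Both routes give the same vector, so a learned `w_mlp` can absorb whatever `w2` would have done, and the extra multiplication can be skipped.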

  • @AlbertoMoccardi
    @AlbertoMoccardi 10 months ago

    Amazing, continue like this.

  • @benjamingoldstein14
    @benjamingoldstein14 1 month ago +1

    What books do you recommend to learn about this rigorously? I just graduated with a math degree and did some pure math research as an undergrad but never studied machine learning. I have a very solid linear algebra and analysis background. I want to start self-studying so I can work as an ML developer/researcher.
    Great video by the way! I would be really interested in your reading recommendations.

    • @algorithmicsimplicity
      @algorithmicsimplicity  1 month ago +1

      This is an unpopular opinion, but I don't believe there is such a thing as rigorous machine learning. Almost always new ideas in machine learning come from intuition and experimentation, and then mathematical justifications are tacked on after-the-fact (and the justifications are often incorrect or incomplete).
      If you want to get into machine learning you first need to learn the practical skills. For this I would recommend a course like fast.ai by Jeremy Howard, which gives practical experience implementing ML models for real use-cases.
      If you want to do research there really isn't any better way than just reading papers. Pick out some core ML techniques (such as diffusion, transformers, etc) and read the papers behind them. If there's anything you don't understand in the papers, then do further research until you do understand. Having a background in math should help a lot with understanding papers. I would also recommend just picking a bunch of papers from recent top conferences (such as NeurIPS, ICML, ICLR, CVPR) that you think are interesting to read.

  • @TaranovskiAlex
    @TaranovskiAlex 1 year ago

    thank you for the explanation!

  • @ramanShariati
    @ramanShariati 25 days ago +1

    never thought of attention as pair-wise convolution! interesting.

  • @AN-ch3ly
    @AN-ch3ly 8 months ago +1

    Great video, but I was wondering how one aspect of the transformer is handled in the real world. How are importance scores assigned to pairs in order to determine their importance? Basically, on a massive scale, how can importance scores be automatically assigned in order to get the correct importance for a pair for a given sentence?

    • @algorithmicsimplicity
      @algorithmicsimplicity  8 months ago

      The entire model is trained end-to-end to solve the training task. What this means is you have some training dataset consisting of a bunch of input/label pairs. For each input, you run the model on that input, then you change the parameters in the model a bit, evaluate it again and check if the new output is closer to the training label, if it is you keep the changes.
      By doing this process, the score generating networks will change over time so that they produce scores which cause the model's outputs to be closer to the training dataset labels. It turns out that the best way for the score generating networks to influence the model's output is by generating 'correct' scores which roughly correspond to how related 2 words are, so this is what they end up learning.

  • @hadadvitor
    @hadadvitor 1 year ago

    fantastic video, congratulations on and thank you for making it

  • @Supreme_Lobster
    @Supreme_Lobster 1 year ago

    Thanks. I had read the original Transformer paper and I barely understood the underlying ideas.

  • @rafa_br34
    @rafa_br34 6 months ago +1

    I'd love to see you explain how KANs work.

  • @marcfruchtman9473
    @marcfruchtman9473 1 year ago

    Very interesting. Thank you for the video.

  • @GaryBernstein
    @GaryBernstein 1 year ago

    Can you explain how the NN produces the important-word-pair information-scores method described after 12:15 from the sentence problem raised at 10:17?
    Well it’s just another trained set of values. I suppose it scores pairs’ importance over the pairs’ uses in ~billions of sentences.

    • @algorithmicsimplicity
      @algorithmicsimplicity  1 year ago +2

      The importance-scoring neural network is trained in exactly the same way that the representation neural network is. Roughly speaking, for every weight in the importance-scoring neural network you increase the value of that weight slightly and then re-evaluate the entire transformer on a training example. If the new output is closer to the training label, then that was a good change so the weight stays at its new value. If the new output is further away, then you reverse the change to that weight. Repeat this over and over again on billions of training examples and the importance-scoring neural network weights will end up set to values so that the produced scores are useful.

  • @iustinraznic5811
    @iustinraznic5811 1 year ago

    Amazing explanations and video!

  • @komalsinghgurjar
    @komalsinghgurjar 1 year ago

    Sir I like your videos very much. Love from India ♥️♥️.

  • @CC1.unposted
    @CC1.unposted 5 months ago +1

    So why don't we just train a model that has the ability to change its own weights and biases? It would be a great way for the model itself to adapt its own architecture. The best way, I think, would be to have small NN models that, instead of using labeled data, each take a cost value; each NN can give feedback to the others, and they are all connected to each other. Then there would be a slightly bigger NN or RNN that receives our cost value and gives a different cost value to each of the smaller NNs. Note that because the small NNs can communicate with each other, they can also change their weights as well. We can do this in multiple layers, where each layer has multiple groups, say a group that uses another, slightly lower group. The best part is that we get so much more dynamic and conceptual room, and it can also remember by itself. I actually tried it in JavaScript (yes, I know it's not a good language for this kind of work); the results were mixed, but that may be due to my programming skills.

    • @algorithmicsimplicity
      @algorithmicsimplicity  5 months ago +1

      All neural nets change their weights during training. If you mean specifically changing the weights depending on the current input, that is called 'dynamic weights' and it has been explored. The main issue is that weight matrices are d×d and inputs are length-d vectors, so if you use a linear function to generate the new weight matrices, it needs to have d^3 weights, which quickly becomes computationally infeasible. In the Mamba architecture, where the recurrent weights are only size d, they do use dynamic weights and it helps a lot.
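The d^3 figure above comes from a linear map that takes a length-d input and has to produce all d×d entries of a weight matrix. A quick arithmetic sketch (the function names and the example values of d are chosen here for illustration):

```python
def static_weight_count(d: int) -> int:
    """Number of entries in one ordinary d x d weight matrix."""
    return d * d

def dynamic_weight_generator_count(d: int) -> int:
    """Weights in a linear map from a length-d input to all d * d entries of a generated weight matrix."""
    return d * (d * d)

# Compare the two counts for a few example widths.
for d in (64, 512, 4096):
    print(f"d={d}: static={static_weight_count(d):,}  generator={dynamic_weight_generator_count(d):,}")
```

If the generated weights are only a length-d vector, as described for Mamba above, the same kind of generator needs d * d weights rather than d^3.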

    • @CC1.unposted
      @CC1.unposted 5 months ago

      @@algorithmicsimplicity No, I meant a very different architecture. I derived it myself over the course of 3 months (not because it's complex, but because I'm very busy). It is a bunch of small models with 5 to 10 hidden layers, where every model connects to every other model, including their fitness input, which is the average of what all the models think. The fitness is just a single value (cost or loss) that is used to change the model's parameters. Then there's a head neuron, slightly bigger, with 15 to 20 hidden; it has a single input and multiple outputs. The single input is a way to tell it the entire model's current cost, loss, etc., and the outputs connect to every small NN directly, so it can basically give a different cost to each NN. It is also trained by the main cost input, where its input is also the cost. Now imagine this whole architecture as a single Group. In the next Group we use small Groups instead of NNs (so that information is processed in modules), and then we can have Layering, where there are multiple Layers containing Groups or combinations of Groups. The idea is that each NN can or will change depending on the inputs; Groups inside Groups mean that information is processed as modules, like mini functions doing smaller tasks; every NN eventually gets the same input; and the Layers make it high-dimensional processing while understanding all the data carefully. The idea is that with something like this you don't even need to give inputs; the models will figure them out automatically (predicting the next input and the one after, and so understanding as well). Even though it will be more computationally intensive while training, it will run in parallel when it is being used. It would work with sequential or simple data, and because it can dynamically change its weights it is easy for it to learn, store, etc. In fact it can create its own architecture, which could be more efficient for the input/output labels. (It was inspired by the brain and by trying to make it programming friendly.) It is fully dynamic. For the computational part, I ran 15 NNs, 5 Groups and 5 Layers, where each NN had 5 hidden and the head NN had 19 NN. The results were mixed, but I only trained for 30 epochs; these models would need more epochs because they are time dependent. It ran on the web on CPU using browser JS with no library (it was blocking the main thread because I wrote it as synchronous code, but it did work, taking just a few seconds to train and run).

    • @CC1.unposted
      @CC1.unposted 5 months ago

      This was a summary of the architecture I derived. I don't think anyone is going to read it, but it's nice to have a description.

  • @christianjohnson961
    @christianjohnson961 1 year ago

    Can you do a video on tricks like layer normalization, residual connections, byte pair encoding, etc.?

  • @The_DorkLord
    @The_DorkLord 6 months ago +1

    13:00 This may be a silly question, but would it be possible for the transformer to encounter a sentence where all words would have a score of 0.0, creating an issue with simply using an exponential function? I imagine it would be vanishingly rare, but something along the lines of Chomsky's "Colorless green ideas sleep furiously" would seem like the type of sentence that would create such an issue. I assume that this is not a real problem, but I am curious as to why it isn't one.

    • @algorithmicsimplicity
      @algorithmicsimplicity  6 months ago +1

      It's almost impossible for that to happen in practice because we compare words against themselves. So if one word has no relationship with any other word in the sentence, it will still have a large score for itself: so the normalized weight will be 1 for itself and 0 for all other words. Which means that its vector won't include information from any other words, but that's kind of what you want if it really doesn't have any relationship to any other words.
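A small sketch of the two cases discussed above, assuming the scores are normalized by exponentiating and dividing by the sum (softmax). The score values are made up for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick added here, not something taken from the video.

```python
import math

def softmax(scores):
    """Exponentiate each score and normalize so the weights sum to 1."""
    m = max(scores)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Every score is 0.0: exp(0) = 1 for each word, so the weights are uniform and nothing divides by zero.
all_zero = softmax([0.0, 0.0, 0.0, 0.0])

# A word with a large score for itself and no relationship to the others (hypothetical values).
self_dominant = softmax([20.0, 0.0, 0.0, 0.0])

print(all_zero)
print(self_dominant)
```

Since the exponential of any finite score is strictly positive, the sum in the denominator is never zero, so even an all-zero row of scores produces valid weights.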

    • @The_DorkLord
      @The_DorkLord 6 months ago +1

      @@algorithmicsimplicity Right, of course, that makes sense. I hadn't thought about words having weights for themselves. Thanks! Your channel is really great, I love the level of depth you go into while still keeping the material approachable.