What are Transformer Neural Networks?

  • Published Nov 29, 2024

Comments • 127

  • @CodeEmporium
    @CodeEmporium 3 years ago +60

    Love the 3blue1brown-esque style. Keep em coming

    • @ariseffai
      @ariseffai  3 years ago +4

      Thank you! Working on it :)

    • @anadianBaconator
      @anadianBaconator 3 years ago +5

      Appreciate both of you guys' work in this field!

    • @baltofarlander2618
      @baltofarlander2618 1 year ago

      It's called manim, sir.

    • @keeperofthelight9681
      @keeperofthelight9681 1 year ago +1

      3blue1brown has become a classic on TH-cam. We describe art as Escheresque or poetry as Shakespearean, but when a creator defines a style that other work gets compared to as a subgenre, that is a remarkable achievement by 3blue1brown.

    • @mhc4124
      @mhc4124 1 year ago +1

      Wish more videos were coming

  • @Mutual_Information
    @Mutual_Information 3 years ago +109

    I think I speak for everyone when I say... please keep posting! This is excellent stuff!

  • @justusmzb7441
    @justusmzb7441 2 years ago +1

    The best help in understanding transformers while reading „Attention is all you need“ that I have found.

  • @googleyoutubechannel8554
    @googleyoutubechannel8554 1 year ago +1

    This video is a mix of actual explanation about the nature of transformers and long tangents about implementation details, mixed together so perfectly that it's impossible to follow or even to know whether the author of this video understands how transformers work.

    • @aw4704
      @aw4704 1 year ago +2

      The attention part gave me the most headaches, I almost cried, felt so stupid. If the attention part is also what's tripping you up, I found a video from AssemblyAI that explains it better.

  • @jwine1957
    @jwine1957 4 months ago

    Your videos exhibit exceptional quality. Thank you for this outstanding contribution.

  • @PaperTigerLive
    @PaperTigerLive 1 year ago +1

    I put 2 months into studying this. I’m a Cultural Science major and I can finally say that I understood every part of this video.
    I’m really proud of myself and very thankful for your excellent didactic style.

  • @adrewkin8375
    @adrewkin8375 1 year ago

    Our community is the best!!! 💪💪 Thank you very much for the amazing review!!!

  • @29konna
    @29konna 2 years ago +112

    Thanks for the nice content. I am not a beginner, but unfortunately it was hard for me to follow this video. There were many concepts/terms mentioned without a brief explanation, and the pace was rather fast. If you could publish the same video with additional examples and clarifications, it would be much appreciated. I understand that one would need to look up some topics and references while watching the video, but in this case it felt like I had to look things up very often. Thanks again for your effort!

    • @brianbagnall3029
      @brianbagnall3029 1 year ago +10

      Absolutely. It started well then went off the rails.

    • @UpperM3
      @UpperM3 1 year ago +5

      You spoke for all of us, thank you for voicing our opinions. I'm a data science graduate and I'm very familiar with deep learning. Watching this video, I wasn't learning anything; I couldn't even follow along. I literally started questioning my linguistic and academic abilities. Then I realized that this guy is just summarizing the dreaded research paper "Attention is all you need", which is a bad idea if you're not going to explain the terminology and equations.

    • @thomassiby8198
      @thomassiby8198 1 year ago +5

      @@UpperM3 not really, to learn this concept it is necessary to at least have a basic idea of the concepts related to it. I think this is the best explanation of transformers on YouTube yet

    • @GodofStories
      @GodofStories 1 year ago +3

      @@UpperM3 this is more than data science, it's computer science. You need to have some understanding of comp sci concepts - parallelization, graphs, vectors, encodings, decodings. If you are "very familiar" with deep learning, you should know all this. But yes, this explanation is not geared towards beginners to deep learning/computer science. You need to understand neural networks, representations, communication paths, lengths etc., so the definitions are verbose at times. And you can always pause and google the terms for more examples. This is heavy on technical and math jargon, but it never claimed to be explained for beginners. We can all spend some time to become more "expert". I suggest you take notes and fill in the gaps with Google.
      E.g., "They leverage what is called self-attention, to compute updated representations for each sequence in parallel. The attention mechanism is going to allow each representation to kind of differentially consider the representation in every other position. And the communication paths will have the same length for all pairs of elements"

    • @GodofStories
      @GodofStories 1 year ago

      @@UpperM3 there are also numerous videos that explain transformers without getting too technical - th-cam.com/video/SZorAJ4I-sA/w-d-xo.html

  • @phafid
    @phafid 1 year ago +1

    Thank YOU! This is the most precise and straightforward explanation of attention that I have ever had. When you said the word compatibility, I finally understood why I need to take the dot product between queries and keys.
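
    For anyone wanting to poke at that "compatibility" idea directly, here is a minimal NumPy sketch of scaled dot-product attention with toy shapes (names and sizes are illustrative, not taken from the video):

    ```python
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.shape[-1]
        # Each query-key dot product is a compatibility score between two positions.
        scores = Q @ K.T / np.sqrt(d_k)                             # (n_queries, n_keys)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)              # softmax over the keys
        return weights @ V                                          # weighted sum of the values

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
    out = scaled_dot_product_attention(Q, K, V)                     # shape (4, 8)
    ```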

  • @bijan1316
    @bijan1316 2 years ago +1

    incredible, insightful, delightful, enlightening

  • @samirelzein1095
    @samirelzein1095 2 years ago +1

    I am at 1/3rd of the video, stopped to come here to express gratitude before getting back to it! 3B1B is a proven style, no harm in using it, and later on adding to it.

  • @PrinceKumar-hh6yn
    @PrinceKumar-hh6yn 1 year ago

    Really Attention is all we need.

  • @AIdevel
    @AIdevel 1 year ago

    Transformers are incredibly complicated, but you tried to simplify them for us. Anyway, it does need another look
    Thank you very much indeed

  • @andreasgian3075
    @andreasgian3075 3 years ago

    Maybe the best transformer explanation out there

  • @alexandrsavochkin9442
    @alexandrsavochkin9442 1 year ago

    Thanks for the vid. This is exactly the level of detail and the explanation style I needed. Many other explanations are either too vague and miss important details, or too hardcore and hard to follow. This is ideal: most of the technical details are here, but it's still easy to follow.

  • @mattiapalano1352
    @mattiapalano1352 2 years ago

    best tutorial about transformers on YouTube by far 🙌🏻

  • @theJeet8
    @theJeet8 3 years ago +10

    Thank you for the clear and concise explanation! I understood this more than any other video I've seen yet (although I'm still learning). Looking forward to more videos like this!

  • @jcorey333
    @jcorey333 2 years ago +1

    I watched a bunch of videos on this and I feel like after this, I actually understand it. Thank you!

  • @SaheelGodhane_TheTramp
    @SaheelGodhane_TheTramp 1 year ago +5

    Exceptionally explained! Please make more content! This stuff is worth paying for ;)

  • @rmac9498
    @rmac9498 3 years ago +5

    These are great! I can see this channel easily becoming as popular as Yannic Kilcher’s. Thank you for all your work, your explanations give a lot of clarity without sacrificing depth!

  • @a0nmusic
    @a0nmusic 1 year ago +3

    I thought I was smart - but then I started getting my head around all this amazing machine learning and I was humbled. Thanks for sharing

  • @aaryanbhagat4852
    @aaryanbhagat4852 3 years ago +3

    Much needed video, if you have time I would request you to make a video series on transformers. My personal request would be on "Do Transformer Modifications Transfer Across Implementations and Applications?" and "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth - "
    Again, I thank you for your work!

  • @chrisogonas
    @chrisogonas 1 year ago

    Very well illustrated! Thanks

  • @tuongnguyen9391
    @tuongnguyen9391 3 years ago

    I want to thank you for your effort in making these videos. I think your channel deserves more attention.

  • @mostafamohsen250
    @mostafamohsen250 1 year ago

    Great explanation, but when you were explaining the self-attention mechanism before multi-head attention, you forgot to mention that the way you get Q, K, V is by multiplying the input X by the corresponding weight matrix WQ, WK, or WV. When you said that "q, k, v are all learned" it made it sound like we're somehow learning q, k, v directly, whereas in fact we are learning the weight matrices that help generate them
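
    A minimal sketch of that point, with toy shapes and assumed names (the projection matrices are the learned parameters; Q, K, V are recomputed from X for every input):

    ```python
    import numpy as np

    d_model, d_k = 512, 64
    rng = np.random.default_rng(0)

    # Learned parameters: the projection matrices, not Q/K/V themselves.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

    X = rng.normal(size=(10, d_model))     # representations of 10 tokens
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # each (10, 64), derived from X at run time
    ```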

  • @GS-vt9id
    @GS-vt9id 1 year ago +1

    Thanks ! Beautiful work.

  • @digitalorca
    @digitalorca 1 year ago

    Outstanding high-quality video. Thanks!

  • @anchyzas
    @anchyzas 8 months ago

    I also feel the residual connection is definitely RNN inspired.

  • @alilotfirezaabad6995
    @alilotfirezaabad6995 1 year ago +1

    Thanks Ari! Very useful.

  • @mrmatthewleigh
    @mrmatthewleigh 3 years ago +9

    Hi Ari, Seriously great stuff! Really appreciate the videos.

  • @adrewkin8375
    @adrewkin8375 1 year ago

    Definitely, TH-cam should review their suggestion algorithm. What do you think?

  • @Ekami67
    @Ekami67 1 year ago

    This channel is an incredible gem, thanks so much for your work!

  • @amanpreetchander7386
    @amanpreetchander7386 1 year ago

    Thanks for putting a lot of effort into making this video. I know how much hard work it takes to make videos like this. However, as a viewer I would recommend explaining with an example from the beginning, as it is really hard to understand with standard notation alone. I was really not able to follow after a while.

  • @shis10
    @shis10 1 year ago

    Amazing video.
    Please make more content!

  • @I77AGIC
    @I77AGIC 2 years ago

    best explanation I've found! great job

  • @gurudevilangovan
    @gurudevilangovan 1 year ago

    Excellent explanation! Thank you!!

  • @pranayp1950
    @pranayp1950 3 years ago +2

    I really like your videos. Please keep uploading !

  • @saprativa
    @saprativa 2 years ago

    Thanks for a wonderful explanation.

  • @alfcnz
    @alfcnz 3 years ago +46

    I appreciate the effort to put this together! 🤩 It is really beautiful! 😍
    Unfortunately, the original formulation, equations, and diagram for the transformer are rather cryptic 😞

    • @ariseffai
      @ariseffai  3 years ago +9

      Thanks Alfredo! Means a lot. I hope people can gain some clarity here.

  • @kevon217
    @kevon217 1 year ago

    Very well explained. Nice video!

  • @Yikina7
    @Yikina7 2 years ago

    You did such a good job explaining a not very simple topic, thanks!

  • @sloperclimbing5369
    @sloperclimbing5369 3 years ago

    Fantastic voice and description. Keep posting

  • @faysoufox
    @faysoufox 1 year ago

    I really liked your video, I finally can say I'm starting to get how it works, thank you !

  • @TheMeltone1
    @TheMeltone1 2 years ago

    Outstanding explanation!

  • @cit0110
    @cit0110 1 year ago

    this is amazing, well done!!

  • @franklinfeng3565
    @franklinfeng3565 1 year ago +1

    Thanks!

  • @gugusafe
    @gugusafe 1 year ago

    Thank you very much for sharing this. Wonderful work!

  • @samllanwarne6512
    @samllanwarne6512 8 months ago

    Great video imagery... good job. Love seeing the flow of information, and the little opinions about why things are happening. It got a bit confusing when it moved away from images to pure equations; could you do images and boxes with the equations?

  • @CS_n00b
    @CS_n00b 1 year ago +1

    Very good

  • @Skynet_the_AI
    @Skynet_the_AI 1 year ago

    Thanks for the video!

  • @TheBeatle49
    @TheBeatle49 1 year ago +2

    The transformer architecture diagram resembles one of Jordan Peterson's book illustrations. Except the transformer is actually coherent and useful.

  • @karigucio
    @karigucio 11 months ago

    Why do we sum the positional and input embeddings? Wouldn't concatenating make more sense? How would that play with dimensions?
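
    For reference, a small sketch of the sinusoidal encoding from the paper, assuming toy sizes: the encoding has the same dimension d_model as the token embeddings, which is what makes summing possible, whereas concatenation would grow the model dimension that every downstream weight matrix has to match.

    ```python
    import numpy as np

    def positional_encoding(n_positions, d_model):
        # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
        # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
        pos = np.arange(n_positions)[:, None]
        i = np.arange(d_model // 2)[None, :]
        angles = pos / np.power(10000.0, 2 * i / d_model)
        pe = np.zeros((n_positions, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    seq_len, d_model = 20, 512
    embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))
    x = embeddings + positional_encoding(seq_len, d_model)   # shapes match, so the sum is well defined
    ```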

  • @ethanl9845
    @ethanl9845 2 years ago

    Love the style

  • @kukuster
    @kukuster 2 years ago

    2:45
    Let's take a moment to note: Dark Reader rules! :D

  • @v1hana350
    @v1hana350 2 years ago +1

    What is the meaning of fine-tuning and Pre-trained in Transformers?

  • @ScienceFactsDE
    @ScienceFactsDE 3 years ago

    Great explanation

  • @andrewelaryan1944
    @andrewelaryan1944 2 years ago

    Great explanation of the Transformer. Thank you!

  • @juanete69
    @juanete69 1 year ago

    During training you can feed your inputs to both the encoder and the decoder. What happens during prediction with new data? You don't know the output yet, so how do you feed the decoder?
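
    In case it helps: at inference the decoder is typically fed its own previous outputs, one token at a time, starting from a start-of-sequence token. A rough sketch of greedy decoding (the model object and token ids are hypothetical stand-ins, not a real API):

    ```python
    def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=50):
        memory = model.encode(src_tokens)        # encoder runs once on the source sequence
        out = [bos_id]                           # decoder input starts as just <bos>
        for _ in range(max_len):
            logits = model.decode(out, memory)   # decoder attends to what it has produced so far
            next_id = int(logits[-1].argmax())   # greedy choice of the next token
            out.append(next_id)
            if next_id == eos_id:
                break
        return out[1:]
    ```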

  • @jeepaholic326
    @jeepaholic326 3 years ago

    Oh my brother, the rabbit hole you have sent me down.......... I'm very content. Send food from time to time please. Thank you for this video btw. I've had it on repeat for about a day. Well, ya-know, I ain't that smart. I did get CUDA installed and working in Python though, so that's a win. Have a good day!

  • @whowto6136
    @whowto6136 3 years ago +1

    Great video! But I still can't understand it thoroughly..... What prerequisite knowledge do I need? Thanks!🙏

  • @saichandsharma4162
    @saichandsharma4162 2 years ago

    Just awesome!

  • @addisonweatherhead2790
    @addisonweatherhead2790 3 years ago +3

    Around 7:20 you say Q, K, and V are all identical. What exactly do you mean? If we have separate matrices that generate these for each head (with the i indexing you mentioned), then doesn't that mean we will have h different WQ matrices, meaning we have h different Q vectors for each word (i.e. h different Q matrices for the sequence of words)?

    • @ariseffai
      @ariseffai  3 years ago +2

      This part could've been a little clearer. What I think can cause some confusion is that the letters Q, K and V are used in two different ways:
      1) in the signature of the original Attention function (6:05)
      2) in the signature of the MultiHead function (7:20)
      You're exactly right that we will have h different projected Q vectors for each word (same with K and V). What I'm referring to is that, for the encoder, the three arguments of the function MultiHead(Q, K, V) are all identical. The Q argument of MultiHead in layer i+1 of the encoder will be the stacked word representations output by layer i. This will also be the case for K and V, thus Q=K=V. In cross-attention, this is not the case: Q comes from the previous decoder layer while K and V are produced by the encoder.
      When we're doing multi-head attention, the three inputs to the original Attention function are *not* identical to each other, regardless of if we're doing self- or cross-attention. These inputs are produced by applying linear transformations to the arguments of MultiHead via the weight matrices.
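
      A sketch of the distinction drawn above, with toy shapes and assumed names (the final output projection is omitted for brevity): the three arguments of MultiHead can be the same tensor, while the inputs to each per-head Attention call are always different projections of them.

      ```python
      import numpy as np

      def attention(q, k, v):
          scores = q @ k.T / np.sqrt(q.shape[-1])
          w = np.exp(scores - scores.max(axis=-1, keepdims=True))
          w /= w.sum(axis=-1, keepdims=True)
          return w @ v

      def multi_head(Q, K, V, heads):
          # heads: one (W_q, W_k, W_v) triple per head; each head projects the arguments first
          return np.concatenate([attention(Q @ W_q, K @ W_k, V @ W_v)
                                 for W_q, W_k, W_v in heads], axis=-1)

      d_model, d_k, h, n = 512, 64, 8, 10
      rng = np.random.default_rng(0)
      heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]

      X = rng.normal(size=(n, d_model))        # stacked outputs of encoder layer i
      enc_self = multi_head(X, X, X, heads)    # encoder self-attention: Q = K = V = X
      dec_q = rng.normal(size=(n, d_model))    # from the previous decoder layer
      cross = multi_head(dec_q, X, X, heads)   # cross-attention: K and V come from the encoder
      ```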

  • @newbie8051
    @newbie8051 months ago

    15:40 Although attention networks are expensive, we use them because they can be trained in parallel, right?
    Please correct me if I am wrong

  • @stlo0309
    @stlo0309 1 year ago

    Hi, a small doubt: when you say the dot product of the query (Q) and key vector (K), do you mean "the element-wise multiplication of the elements of Q and K followed by the sum of all these element-wise products, so the result is a single number", or simply "the element-wise multiplication of the elements of Q and K, so the result is still a vector of the same dimension as Q and K"?
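
    If it helps, it is the former: each query-key dot product collapses to a single scalar score per pair of positions. A tiny check with toy vectors:

    ```python
    import numpy as np

    q = np.array([1.0, 2.0, 3.0])
    k = np.array([0.5, -1.0, 2.0])
    score = np.dot(q, k)   # 1*0.5 + 2*(-1) + 3*2 = 4.5: a single number, not a vector
    ```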

  • @marcinstrzesak346
    @marcinstrzesak346 6 months ago

    I have a question about the fragment at 06:58. I'm not sure, but I suppose that R's dimensionality should be d_k*D, not D*d_k

  • @andrea-mj9ce
    @andrea-mj9ce 2 years ago

    Here when you use "sequence", do you mean a sentence or the whole text?

  • @ccd2927
    @ccd2927 1 year ago

    Great video! But I started to get lost from queries, keys, values lol

  • @andrea-mj9ce
    @andrea-mj9ce 2 years ago

    8:35 I don't understand what _pos_ and _i_ are here. pos is the position in the sequence (0 for the first word, ...), but what is i?

  • @wolfram77
    @wolfram77 1 year ago

    So the transformers came back to their home planet actually :)

  • @ansharora3248
    @ansharora3248 3 years ago

    I can't thank you enough. :)

  • @aquienleimporta9096
    @aquienleimporta9096 2 years ago

    How does the decoder match the size of the encoder's output with its own input at every step, so that the matrix multiplications work out?

  • @georgewbushcenterforintell147
    @georgewbushcenterforintell147 1 year ago

    Me Brain be melted in a good way .

  • @lexflow2319
    @lexflow2319 1 year ago

    It looks like 256 and higher values of i in the positional encoding all go to 1 no matter the position value. So half the elements of the word embedding are wasted
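
    A quick way to sanity-check this, assuming d_model = 512 and the standard sinusoidal formula: for the highest dimensions the wavelengths are so long that, at ordinary positions, the sine entries sit near 0 and the cosine entries near 1, though they are not strictly constant for very large positions.

    ```python
    import numpy as np

    d_model = 512
    def pe(pos, dim):   # one entry of the sinusoidal positional encoding
        angle = pos / 10000 ** (2 * (dim // 2) / d_model)
        return np.sin(angle) if dim % 2 == 0 else np.cos(angle)

    print(pe(50, 510), pe(50, 511))           # roughly 0.005 and 1.0 at position 50
    print(pe(100000, 510), pe(100000, 511))   # but the values do vary for very large positions
    ```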

  • @esaliya
    @esaliya 3 years ago

    great content!

  • @agusavior_channel
    @agusavior_channel 2 years ago

    9:51
    This seems to be an error.
    The positional embedding is concatenated to the word embedding.
    It is not a sum.
    Also, the positional embedding may have an arbitrary dimension.
    EDIT: I was wrong

    • @ariseffai
      @ariseffai  2 years ago +1

      From the paper: "The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed."
      i.imgur.com/pJU5O9n.png

    • @agusavior_channel
      @agusavior_channel 2 years ago

      @@ariseffai Oh you are right! Thank you very much.
      I am surprised I didn't realize this before. I have to change my code haha.
      It seems weird to me because if you sum the positional encoding into the word embedding, you change the information carried by each particular dimension of the word embedding. I mean: suppose the first number of the word embedding is 0.5, and suppose that dimension encodes something like the happiness of the word, so the word has a happiness level of 0.5. If you add the positional embedding, you change that meaning: you are mixing the happiness information with positional information, and you do this to EVERY dimension of the word embedding. This seems messy to me. If you concatenate instead of summing, this does not happen.
      But okay, I really don't know anything.

  • @travisdriessen735
    @travisdriessen735 3 years ago

    Good! Liked & Subscribed

  • @softerseltzer
    @softerseltzer 3 years ago

    Good stuff!

  • @PasseScience
    @PasseScience 2 years ago

    Hi, thanks for the video! There are several things that are still unclear to me. First, I do not understand well how the architecture is dynamic with respect to the size of the input. I mean, what changes structurally when we change the size of the input? Are there some inner parts that have to be repeated in parallel, or does the architecture fix a maximum window size that we hope will be larger than any input sequence?
    The other question is the most important one: every explanation of the transformer architecture I have found so far focuses on what we WANT a self-attention or attention layer to do, but never says a word about WHY, after training, those attention layers will do, by emergence, what we expect them to do. I guess it has something to do with the chosen structure of the data at the input and output of those layers, as well as the data flow that is forced on them, but I have not had the revelation yet.
    If you could help me with those, that would be great!

  • @YarkoFFXI
    @YarkoFFXI 1 year ago

    Great video! Do you know if layer normalization is performed exclusively along the features axis, or if it's done both along the feature axis and the tokens (words) axis? Different sources say different things :( thank you
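
    For what it's worth, in the original transformer the normalization statistics are computed over the feature (d_model) axis, independently for each token; a minimal sketch of that convention (learned gain and bias omitted), without ruling out the variants other sources describe:

    ```python
    import numpy as np

    def layer_norm(x, eps=1e-5):
        # x: (seq_len, d_model); mean/variance are taken per token, over features only
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    x = np.random.default_rng(0).normal(size=(10, 512))
    y = layer_norm(x)   # every row (token) now has ~zero mean and unit variance
    ```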

  • @ChocolateMilkCultLeader
    @ChocolateMilkCultLeader 2 years ago

    Do you have a Twitter? I always share excellent videos I find there, and would like to give you credit.

    • @ariseffai
      @ariseffai  2 years ago

      @ari_seff :)

  • @ad13979
    @ad13979 3 years ago +1

    What animation software did you use?

    • @ariseffai
      @ariseffai  3 years ago +1

      This one used a combination of matplotlib, keynote, and FCP. I've also used manim in a couple videos.

  • @npr1m991
    @npr1m991 2 years ago +1

    Dude I was looking for a channel like yours for weeks !

  • @pmemoli9299
    @pmemoli9299 11 months ago

    Badass

  • @joliver1981
    @joliver1981 3 years ago

    13:06 is incorrect. The queries and keys come from encoder and the values come from previous attention block.

    • @LukaszWiklendt
      @LukaszWiklendt 2 years ago

      Authors write in the paper: "In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models". Of course, you're free to try out alternatives, but the video is not incorrect.

  • @brianbagnall3029
    @brianbagnall3029 1 year ago

    Did everybody get that? 😅

  • @AdrianBoyko
    @AdrianBoyko 1 year ago

    I’m going to guess that they are robots in disguise?

  • @garyzhai9540
    @garyzhai9540 3 years ago +1

    If this were for a university lecture, it would be marvelous; however, most of the audience are laymen, and the explanation gets in the way of easy understanding.

  • @artiekushner6849
    @artiekushner6849 1 year ago

    algorithm, more!

  • @debasishraychawdhuri
    @debasishraychawdhuri 1 year ago

    It is not clear from the visualization what you are talking about. A lot of the words are not defined and the diagrams are missing a lot of stuff.

  • @ulamss5
    @ulamss5 1 year ago

    Almost none of the terms and jargon were explained. People who need this video won't understand it, people who don't won't click.

  • @DavodAta
    @DavodAta 9 months ago

    We turn off the music when we talk

  • @StephenGillie
    @StephenGillie 1 year ago

    This is probably a good video...
    ...
    ...
    ...
    Recorded with the
    ---
    Pensive pauses trademarked by 3 blue 1 brown.
    ---
    ---
    People say you're supposed to think during them
    ---
    ---
    But I think while the person is talking
    ---
    ___
    __
    And so I have all of this time
    ...
    ...
    ..
    Between the points the video makes
    ---
    ---
    ---
    Alone and with no new info.
    ---
    ---
    While everyone else thinks.
    ----
    ---
    Teacher, is there a version of this that's already complete, where I could read it long-form without waiting for others to think?

  • @yabdelm
    @yabdelm 3 years ago +1

    Someone needs to make an intuitive introduction for COMPLETE beginners - those who have no experience in AI, math, linear algebra, etc. Almost all the 'simple' tutorials I've come across never really give an intuitive explanation for the details, just a lot of jargon and math.

    • @lbognini
      @lbognini 2 years ago +1

      Completely agree with you. The most important thing is the intuition behind it, not how things are calculated.
      I'm not that much of a novice in AI, but I confess I understood very little here. For instance, how come we're using sin and cos to encode a position?
      We know the properties of those functions. So there's no point recalling them.
      What is the intuition behind keys, values and queries?
      We know how to project vectors and how to deal with linear transformations.
      And last, we're trying to translate a sentence. How come the input of the decoder is the target sentence we're trying to obtain?
      Those are the things a video should try to clarify. Otherwise, there's no value added.

  • @gapsongg
    @gapsongg 1 year ago

    I like the animation style and the way you talk, but it was still really hard to follow. I did not understand much even though I'm studying CS in a master's program.

  • @tribuiduonguc7788
    @tribuiduonguc7788 1 year ago

    This is an introduction to almost everything; I can't catch up on or understand all of it. Maybe this is for someone who already understands it all

  • @laalbujhakkar
    @laalbujhakkar 2 years ago

    Why does this video have background music?

  • @dancar2537
    @dancar2537 3 years ago

    What a disaster these transformers are. They did not know what they were doing ha ha ha

  • @patpearce8221
    @patpearce8221 1 year ago +1

    Too much jargon without explanation mate

  • @DrJanpha
    @DrJanpha 1 year ago

    I lost my attention halfway through

  • @DrLouMusic
    @DrLouMusic 1 year ago

    Whyyyyyy the annoying piano!!!!!