The Position Encoding In Transformers

  • Published on Sep 30, 2024
  • Transformers and self-attention are powerful architectures that enable large language models, but we need a mechanism for them to understand the order of the tokens we input into the model. Position encoding is that mechanism! There are many ways to encode positions, but let me show you the way it was developed in the "Attention is all you need" paper. Let's get into it!

Comments • 5

  • @TemporaryForstudy
    @TemporaryForstudy 2 months ago

    Nice, but I have one doubt: how does adding sine and cosine values ensure that we are encoding the positions? How did the authors come to this conclusion?
    Why not other values?

    • @TheMLTechLead
      @TheMLTechLead  2 months ago +1

      The sine and cosine functions provide smooth, continuous representations, which helps the model learn relative positions: for example, the encodings for positions k and k+1 are similar, reflecting their proximity in the sequence.

      The frequency-based sinusoidal functions also allow the encoding to generalize to sequences of arbitrary length without re-learning positional information for each sequence length, so the model can handle relative positions beyond those seen during training.

      The combination of sine and cosine ensures that each position gets a unique encoding, and the near-orthogonality of these functions helps distinguish positions even in long sequences. The different frequencies let the model capture both short-term and long-term dependencies: higher-frequency components capture local relationships, while lower-frequency components capture global structure.

      Also, sinusoidal functions are differentiable, which is crucial for backpropagation during training, so the model can learn to use the positional encodings effectively through gradient-based optimization.
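
      For anyone who wants to see this concretely, here is a minimal NumPy sketch of that sinusoidal scheme (the function name and the seq_len / d_model values below are just illustrative, not from the video):

      import numpy as np

      def sinusoidal_position_encoding(seq_len, d_model):
          """Build the (seq_len, d_model) matrix of sinusoidal position encodings."""
          pos = np.arange(seq_len)[:, np.newaxis]       # positions 0 .. seq_len-1
          i = np.arange(d_model // 2)[np.newaxis, :]    # pair index i = 0 .. d_model/2 - 1
          angles = pos / 10000 ** (2 * i / d_model)
          pe = np.zeros((seq_len, d_model))
          pe[:, 0::2] = np.sin(angles)   # even columns 2i get the sine
          pe[:, 1::2] = np.cos(angles)   # odd columns 2i+1 get the cosine
          return pe

      pe = sinusoidal_position_encoding(seq_len=50, d_model=16)
      # Neighbouring positions get similar encodings, distant ones less so:
      print(pe[10] @ pe[11], pe[10] @ pe[40])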

  • @math_in_cantonese
    @math_in_cantonese 2 months ago

    I have a question: for pos=0 and "horizontal_index"=2, shouldn't it be PE(pos,2) = sin(pos/10000^(2/d_model))?
    I believe you used the same symbol "i" for two different ways of indexing, right?
    7:56

    • @TheMLTechLead
      @TheMLTechLead  2 months ago

      Yeah, you are right; I realized I made that mistake. I need to reshoot it.

    • @AlainDrolet-e4z
      @AlainDrolet-e4z 13 days ago

      Thank you Damien, and math_in_cantonese
      I'm in the middle of writing a short article discussing position encoding.
      Damien, feel proud that you are the first reference I quote in the article!
      I was just going crazy trying to nail the exact meaning of "i".
      In Damien's video it is clear that he means "i" to be the dimension index, and the values shown with sin/cos match.
      But then I could not reconcile this understanding with the equation formulation below:
      PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
      PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
      If we see this as PE(pos, 0) referring to the first column (column zero)
      and, say, PE(pos,5) as referring to the sixth column (column 5), with 5 = 2i+1 => i = (5-1)/2 = 2.
      So "i" is more like the index of a (sin,cos) pair of dimensions. Its range is d_model/2.
      The original sin (😄, pun intended) is in the "Attention is all you need" paper.
      There they simply state:
      > where pos is the position and i is the dimension
      This seems wrong: 2i and 2i+1 are the dimensions.
      In any case, a big thank you, Damien; I have watched many of your videos.
      They are quite useful in ramping me up on LLMs and the rest.
      Merci beaucoup
      Alain
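
      To make that indexing concrete, here is a small standalone sketch (the pos and d_model values are just illustrative); for column 2 it reproduces sin(pos / 10000^(2/d_model)), i.e. the pair index i = 1:

      import math

      d_model = 8
      pos = 3

      # i indexes (sin, cos) pairs of columns, so it runs over 0 .. d_model/2 - 1.
      for i in range(d_model // 2):
          angle = pos / 10000 ** (2 * i / d_model)
          print(f"column {2 * i}: PE({pos},{2 * i}) = sin(pos / 10000^({2 * i}/{d_model})) = {math.sin(angle):.5f}")
          print(f"column {2 * i + 1}: PE({pos},{2 * i + 1}) = cos(pos / 10000^({2 * i}/{d_model})) = {math.cos(angle):.5f}")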