Comments •

  • @pengbo87
    @pengbo87 1 year ago +1

    Man, I just love you for sharing this and for such an easy-to-understand explanation.

    • @gabrielmongaras
      @gabrielmongaras 1 year ago +1

      Glad to hear that you're finding my explanations helpful!

  • @AlexKrentsel
    @AlexKrentsel 11 months ago

    Thanks for this, great explanation.

  • @mateuszk3210
    @mateuszk3210 4 months ago

    Interesting paper. I'm having a similar problem: I'm training on 300-token sequences that I need to extend to 1000, and I'm using RoPE. Do you know if this interpolation can be used with RoPE, or should I look into something like ALiBi? I recall reading that ALiBi also has some issues and its accuracy is worse. There is also LongRoPE.

  • @Skinishh
    @Skinishh 1 year ago

    Thank you for another great video! 🙏
    Does this also work for ALiBi?

    • @gabrielmongaras
      @gabrielmongaras 1 year ago

      I don't think it works on ALiBi, since ALiBi deals with constant integer values, and ALiBi basically already solves the problem this paper wants to solve, which is extending the context length. The difference is that this paper does so after training, while ALiBi leverages properties of the added bias mask during training.

    • @Skinishh
      @Skinishh 1 year ago

      @@gabrielmongaras But in ALiBi they also trained the model with a min/max constant being multiplied into the QK values, right? So the same idea as in this paper could be implemented there? I.e., instead of multiplying by the constant, just multiply by constant/scaling_factor?

    • @gabrielmongaras
      @gabrielmongaras 1 year ago +1

      Definitely would be worth a try! ALiBi can already get decent accuracy when extrapolating. I'm curious whether combining the two methods could reach even longer sequences than either ALiBi or this method could achieve on its own.
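
A rough PyTorch sketch of the scaling idea from this exchange. The `alibi_bias` helper and the `scaling_factor` argument are illustrative names, not from ALiBi's or this paper's code; whether a trained model actually tolerates rescaled slopes is exactly the open question above.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int, scaling_factor: float = 1.0) -> torch.Tensor:
    """Additive ALiBi bias of shape (num_heads, seq_len, seq_len)."""
    # Standard ALiBi slopes for a power-of-two head count: 2^-1, 2^-2, ..., 2^-8 for 8 heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Relative distances |i - j|; causal attention only ever uses the j <= i half.
    positions = torch.arange(seq_len)
    distances = (positions[None, :] - positions[:, None]).abs().float()
    # Dividing by scaling_factor weakens the slopes, so doubled distances
    # produce the same bias magnitudes the model saw during training.
    return -slopes[:, None, None] * distances[None, :, :] / scaling_factor

# Half-strength bias, analogous to doubling the context (e.g. 2048 -> 4096);
# a small seq_len here just keeps the example cheap to run.
bias = alibi_bias(seq_len=8, num_heads=4, scaling_factor=2.0)
print(bias.shape)  # torch.Size([4, 8, 8])
```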

  • @kibrutemesgen1759
    @kibrutemesgen1759 1 year ago

    Amazing explanation, I'm guessing you're doing a PhD. Do you have any idea how we can implement this method in code to finetune LLaMA 2? Any resources would be appreciated.

  • @HarisJabbar
    @HarisJabbar 1 year ago +1

    Question: The model can still take only 2048 tokens, so we still have to chunk 4096 tokens into two blocks, right? PI only deals with modifying the positional embedding; it cannot help with the fact that the attention is still over a window of 2048 tokens.

    • @gabrielmongaras
      @gabrielmongaras 1 year ago +2

      The attention "window" is the max size the model was trained on. If I really wanted to, I could give the model 1 million tokens and it would still work (assuming I have enough RAM), but the model doesn't generalize to positional encodings beyond what it was trained on (in this case 2048). So, this method interpolates the positional encodings used in the model. If I interpolate by a factor of two, then I can put 4096 tokens in the model and the positional encodings at position 4096 would correspond with the original signal at 2048, thus extending the "window" of tokens the model has seen.
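
A minimal PyTorch sketch of that interpolation, assuming a simplified rotary (RoPE) setup; `rope_angles` and its arguments are illustrative names, not the paper's implementation.

```python
import torch

def rope_angles(seq_len: int, dim: int, base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    """Rotary-embedding angles of shape (seq_len, dim // 2), with optional position interpolation."""
    # Interpolated (possibly fractional) positions: with scale = 0.5 the
    # positions become 0, 0.5, 1.0, ..., staying inside the trained range.
    positions = torch.arange(seq_len, dtype=torch.float32) * scale
    # Standard RoPE inverse frequencies, one per pair of channels.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    # Angle for every (position, frequency) pair.
    return torch.outer(positions, inv_freq)

# Trained on 2048 positions, run on 4096 tokens: interpolate by a factor of 2,
# so token 4095 lands at interpolated position 2047.5, inside the trained range.
angles = rope_angles(seq_len=4096, dim=128, scale=2048 / 4096)
```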

    • @HarisJabbar
      @HarisJabbar 1 year ago +2

      @@gabrielmongaras Thanks for the reply. What I meant was that the embedding matrix remains fixed (2048 x hidden_size in this case), so the input still has to be (batch_size x 2048), unless I am missing something.

    • @HarisJabbar
      @HarisJabbar 1 year ago +1

      @@gabrielmongaras The point where I was stuck was the 'traditional' position_ids used in the input of a model. They basically limit the number of input tokens. That's not the case in LLaMA, as it uses ALiBi for position encoding and thus has no input embedding limit. Thanks again for the video and the explanation!

    • @acasualviewer5861
      @acasualviewer5861 9 months ago

      @@HarisJabbar I'm confused about this as well.

    • @AurobindoTripathy
      @AurobindoTripathy 3 months ago

      @@HarisJabbar The LLaMA paper says RoPE... Is there a newer paper with a different PE?
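
For reference, LLaMA does use RoPE, and a small PyTorch sketch shows why there is no fixed position-embedding table capping the input length: the cos/sin values are computed on the fly for whatever sequence length arrives, so the 2048-token limit comes only from which positions the model saw during training (the `rope_cos_sin` name is illustrative).

```python
import torch

def rope_cos_sin(seq_len: int, dim: int, base: float = 10000.0):
    """cos/sin tables for RoPE at an arbitrary sequence length (no learned parameters)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)
    return angles.cos(), angles.sin()

# Works for any length the hardware can hold, e.g. 8192 tokens,
# even though the model may only have been trained on positions 0..2047.
cos, sin = rope_cos_sin(seq_len=8192, dim=128)
```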