What is LoRA? Low-Rank Adaptation for finetuning LLMs EXPLAINED

  • Published on 19 Jun 2024
  • How does LoRA work? Low-Rank Adaptation for Parameter-Efficient LLM Finetuning explained. Works for any other neural network as well, not just for LLMs.
    ➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
    📜 "LoRA: Low-Rank Adaptation of Large Language Models", Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W., 2021. arxiv.org/abs/2106.09685
    📚 sebastianraschka.com/blog/202...
    📽️ LoRA implementation: • Low-rank Adaption of L...
    Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
    Dres. Trost GbR, Siltax, Vignesh Valliappan, Mutual Information, Kshitij
    Outline:
    00:00 LoRA explained
    00:59 Why finetuning LLMs is costly
    01:44 How LoRA works
    03:45 Low-rank adaptation
    06:14 LoRA vs other approaches
    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
    🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
    Patreon: / aicoffeebreak
    Ko-fi: ko-fi.com/aicoffeebreak
    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
    🔗 Links:
    AICoffeeBreakQuiz: / aicoffeebreak
    Twitter: / aicoffeebreak
    Reddit: / aicoffeebreak
    YouTube: / aicoffeebreak
    #AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
    Music 🎵 : Meadows - Ramzoid
    Video editing: Nils Trost
  • Science & Technology

Comments • 71

  • @MikeTon
    @MikeTon 5 months ago +9

    Insightful: especially the comparison of LoRA to prefix tuning and adapters at the end!

    • @AICoffeeBreak
      @AICoffeeBreak  5 months ago +1

      Thank you! Glad you liked it.

  • @rockapedra1130
    @rockapedra1130 8 months ago +5

    Perfect. This is exactly what I wanted to know. "Bite-sized" is right!

  • @wholenutsanddonuts5741
    @wholenutsanddonuts5741 9 months ago +20

    I’ve been using LoRAs for a while now but didn’t have a great understanding of how they work. Thank you for the explainer!

    • @wholenutsanddonuts5741
      @wholenutsanddonuts5741 9 months ago +3

      I assume this works the same for diffusion models like stable diffusion?

    • @AICoffeeBreak
      @AICoffeeBreak  9 months ago +8

      For any neural network. You just need to figure out, based on your application, which matrices you should reduce and which not.
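
      A minimal sketch, assuming plain PyTorch and a hypothetical target module, of what "reducing" one chosen weight matrix looks like: the pretrained linear layer stays frozen and only the small matrices A and B are trained.

      import torch
      import torch.nn as nn

      class LoRALinear(nn.Module):
          """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A (illustrative sketch)."""
          def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
              super().__init__()
              self.base = base
              for p in self.base.parameters():
                  p.requires_grad = False                    # the pretrained weight W stays frozen
              self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # trained
              self.B = nn.Parameter(torch.zeros(base.out_features, r))         # trained; zero init => update starts at 0
              self.scale = alpha / r

          def forward(self, x):
              # W x + scale * (B A) x -- only A and B receive gradients
              return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

      # Hypothetical usage: wrap only the matrices you chose to adapt, e.g. a query projection:
      # model.attn.q_proj = LoRALinear(model.attn.q_proj, r=4)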

    • @wholenutsanddonuts5741
      @wholenutsanddonuts5741 9 months ago +1

      @@AICoffeeBreak So super easy then! 😂 Seriously though, that's awesome to know!

  • @DerPylz
    @DerPylz 9 months ago +6

    Yay, thanks!

  • @butterkaffee910
    @butterkaffee910 9 months ago +5

    I love LoRA ❤ even for ViTs

  • @pranav_tushar_sg
    @pranav_tushar_sg 6 months ago +2

    Thanks!

    • @AICoffeeBreak
      @AICoffeeBreak  6 months ago +2

      You're welcome!

  • @soulfuljourney22
    @soulfuljourney22 5 days ago +1

    The concept of the rank of a matrix, taught in such an effective way.

  • @SoulessGinge
    @SoulessGinge 4 months ago +3

    Very clear and straightforward. The explanation of matrix rank was especially helpful. Thank you for the video.

    • @AICoffeeBreak
      @AICoffeeBreak  4 months ago +1

      Thank you for the visit! Hope to see you again soon!

  • @keshavsingh489
    @keshavsingh489 9 months ago +4

    Such a simple explanation, thank you so much!!

  • @outliier
    @outliier 9 months ago +2

    What a great topic!

  • @minkijung3
    @minkijung3 3 months ago +2

    Thanks, Letitia. Your explanation was very clear and helpful for understanding the paper.

    • @AICoffeeBreak
      @AICoffeeBreak  3 months ago +1

      I'm so glad it's helpful to you!

  • @user-ig3rp7fk9c
    @user-ig3rp7fk9c 5 months ago +2

    Firstly, thanks for the amazing video. Can you also make a video about QLoRA?

  • @AnthonyGarland
    @AnthonyGarland 8 months ago +3

    Thanks!

    • @AICoffeeBreak
      @AICoffeeBreak  8 months ago +1

      Wow, thanks a lot! 😁

  • @deviprasadkhatua
    @deviprasadkhatua 5 months ago +3

    Excellent explanation. Thanks!

    • @AICoffeeBreak
      @AICoffeeBreak  5 months ago +1

      Glad you enjoyed it!

  • @michelcusteau3184
    @michelcusteau3184 4 months ago +4

    By far the clearest explanation on YouTube

    • @AICoffeeBreak
      @AICoffeeBreak  4 months ago +1

      Thank you very much for the visit and for leaving this heartwarming comment!

    • @elinetshaaf75
      @elinetshaaf75 4 months ago +1

      true!

  • @kindoblue
    @kindoblue 9 months ago +3

    Loved the explanation. Thanks

  • @jarj5313
    @jarj5313 1 month ago +1

    Thanks, that was a great explanation!

  • @karndeepsingh
    @karndeepsingh 8 months ago +2

    Thanks again for the amazing video. I would also request a detailed video on Flash Attention. Thanks!

    • @AICoffeeBreak
      @AICoffeeBreak  8 months ago +2

      Noted. It's on The List.
      Thanks! 😄

  • @ambivalentrecord
    @ambivalentrecord 8 months ago +2

    Great explanation, Letitia!

    • @AICoffeeBreak
      @AICoffeeBreak  8 months ago +2

      Glad you think so! 😄

  • @thecodest2498
    @thecodest2498 3 months ago +1

    Thank you sooooo much for this video. I started reading the paper and was very terrified by it, so I thought I should watch some YouTube videos. I watched one and fell asleep halfway through. When I woke up I stumbled across your video, your coffee woke me up, and now I get LoRA. Thanks for your efforts.

    • @AICoffeeBreak
      @AICoffeeBreak  3 months ago +2

      Wow, this warms my coffee heart, thanks!

  • @amelieschreiber6502
    @amelieschreiber6502 9 months ago +2

    LoRA is awesome! It also helps with overfitting in protein language models. Cool video!

  • @m.rr.c.1570
    @m.rr.c.1570 4 months ago +1

    Thank you for clearing up my concepts regarding LoRA

  • @Lanc840930
    @Lanc840930 9 months ago +2

    Very comprehensive explanation! Thank you

    • @Lanc840930
      @Lanc840930 9 months ago +1

      Thanks a lot. And I have a question about "linear dependence": is this mentioned in the original paper?

    • @AICoffeeBreak
      @AICoffeeBreak  9 months ago +2

      The paper talks about the rank of a matrix, so about linear dependence between rows / columns.
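
      A tiny NumPy illustration of that idea, using a made-up 3x3 example: a matrix whose rows are linearly dependent has a rank lower than its dimensions, which is the property LoRA assumes for the update ΔW.

      import numpy as np

      M = np.array([[1., 2., 3.],
                    [4., 5., 6.],
                    [5., 7., 9.]])       # third row = first row + second row
      print(np.linalg.matrix_rank(M))    # prints 2, not 3: the rows are linearly dependent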

    • @Lanc840930
      @Lanc840930 9 months ago +1

      oh, I see! Thank you 😊

  • @bdennyw1
    @bdennyw1 9 months ago +3

    Fantastic video as always. QLoRA is even better if you are GPU poor like me.

  • @deepak_kori
    @deepak_kori 5 months ago +1

    You are just amazing >>> so beautiful so elegant just wow😇😇

  • @alirezafarzaneh2539
    @alirezafarzaneh2539 26 days ago

    Thanks for the simple and educational video!
    If I'm not mistaken, prefix tuning is pretty much the same as embedding vectors in diffusion models! How cool is that? 😀

  • @yacinegaci2831
    @yacinegaci2831 6 months ago +2

    Great explanation, thanks for the video!
    I have a lingering question about LoRA: is it necessary to approximate the low-rank matrices of the difference weights (the ΔW in the video), or can we reduce the size of the original weight matrices? If I understood the video correctly, at the end of LoRA training I have the full parameters of the original model + the difference weights (in reduced size). My question is: why can't I learn low-rank matrices for the original weights as well?

    • @AICoffeeBreak
      @AICoffeeBreak  6 months ago +2

      Hi, in principle you can, even though I would expect you could lose some model performance. The idea of finetuning with LoRA is that the small finetuning updates should have low-rank matrices. BUT there is work using LoRA for pretraining, called ReLoRA. Here is the paper 👉 arxiv.org/pdf/2307.05695.pdf
      There is also this discussion on Reddit going on: 👉 www.reddit.com/r/MachineLearning/comments/13upogz/d_lora_weight_merge_every_n_step_for_pretraining/

    • @yacinegaci2831
      @yacinegaci2831 3 months ago +1

      @@AICoffeeBreak Oh, that's amazing. Thanks for the answer, for the links, and for your great videos :)

  • @dr.mikeybee
    @dr.mikeybee 9 months ago +4

    If we knew what abstractions were handled layer by layer, we could make sure that the individual layers were trained to completely learn those abstractions. Let's hope Max Tegmark's work on introspection gets us there.

  • @floriankowarsch8682
    @floriankowarsch8682 9 months ago +3

    As always amazing content! 😌
    It's perfect to refresh knowledge & learn something new.
    What I find interesting about LoRA is how strongly it actually regularizes fine-tuning: is it possible to overfit when using a very small matrix in LoRA? Can LoRA also harm optimization?

    • @TheRyulord
      @TheRyulord 9 months ago +4

      Still possible to overfit, but more resistant to overfitting compared to a full finetune. All the work I've seen on LoRAs says that it's just as good as a full finetune in terms of task performance as long as your rank is high enough for the task. What's interesting is that the necessary rank is usually quite low (around 2) even for relatively big models (LLaMA 7B) and reasonably complex tasks. At least that's the case for language modelling. Might be different for other domains.

  • @ArunkumarMTamil
    @ArunkumarMTamil 1 month ago

    How does LoRA fine-tuning track changes by creating two decomposition matrices? How is ΔW determined?

  • @Micetticat
    @Micetticat 9 months ago +2

    LoRA: how can it be so simple? 🤯

    • @AICoffeeBreak
      @AICoffeeBreak  9 months ago +4

      Kind of tells us that fine-tuning all parameters in an LM is overkill.

  • @terjeoseberg990
    @terjeoseberg990 6 months ago +3

    I thought this was long-range wide-band radio communications.

  • @ayyship
    @ayyship 9 months ago +3

    Why use weight matrices to start with if you can use the LoRA representation? Assuming you gain space, the only downside I can think of is the additional compute to get back the weight matrix. But that should be smaller than the gain from the speed-up of backward propagation.

    • @AICoffeeBreak
      @AICoffeeBreak  7 months ago +1

      Thanks for this question. You do not actually start with the weight matrices; you learn A and B directly, from which you reconstruct the ΔW matrix. Sorry this was not clear enough in the video.
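
      A hedged sketch of that point with made-up dimensions: during training only A and B exist as trainable tensors; ΔW = BA is reconstructed, if at all, only when merging it back into the frozen weight for inference.

      import torch

      d_out, d_in, r = 768, 768, 8
      W = torch.randn(d_out, d_in)       # frozen pretrained weight
      A = torch.randn(r, d_in) * 0.01    # learned directly during finetuning
      B = torch.zeros(d_out, r)          # learned directly (zero init)

      delta_W = B @ A                    # reconstructed ΔW, shape (d_out, d_in)
      W_merged = W + delta_W             # W' = W + BA, so inference costs the same as the original model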

  • @kunalnikam9112
    @kunalnikam9112 1 month ago

    In LoRA, W_updated = W_0 + BA, where B and A are decomposed matrices with low rank. I wanted to ask: what do the parameters of B and A represent? Are they both parameters of the pre-trained model, are they both parameters of the target dataset, or does one (B) represent the pre-trained model parameters and the other (A) the target dataset parameters? Please answer as soon as possible.

  • @alislounge
    @alislounge 1 month ago

    Which one is the most and which one is the least 'compute efficient': adapters, prefix tuning, or LoRA?

  • @onomatopeia891
    @onomatopeia891 4 months ago +1

    Thanks! But how do we determine the correct rank? Is it just trial and error with the value of r?

    • @AICoffeeBreak
      @AICoffeeBreak  4 months ago +1

      Exactly. At least so far. Maybe some theoretical understanding will come up in time.

  • @davidromero1373
    @davidromero1373 8 months ago +1

    Hi, a question: can we use LoRA just to reduce the size of a model and run inference, or do we always have to train it?

    • @AICoffeeBreak
      @AICoffeeBreak  8 months ago +2

      LoRA just reduces the number of trainable parameters for fine-tuning. The number of parameters of the original model stays the same.
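
      Back-of-the-envelope numbers, assuming a single hypothetical 4096x4096 weight matrix: full fine-tuning updates about 16.8M parameters, while LoRA with rank r=8 trains only about 65k, and the frozen 16.8M original parameters are still stored and used at inference.

      d, r = 4096, 8
      full_ft = d * d               # 16_777_216 trainable parameters with full fine-tuning
      lora = r * d + d * r          # 65_536 trainable parameters with LoRA (A: r x d, B: d x r)
      print(full_ft, lora, full_ft // lora)   # ratio: 256x fewer trainable parameters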

  • @ryanhewitt9902
    @ryanhewitt9902 8 months ago

    Aren't we effectively using the same kind of trick when we train the transformer encoder / self-attention block? Assuming row vectors, we can use the form W_v⋅v.T⋅k⋅W_k.T⋅W_q⋅q.T. Ignoring the *application* of attention and focusing on its calculation, we get the form k⋅W_k.T⋅W_q⋅q.T. Since W_k and W_q are projection matrices from the embedding length to dimension D_k, we have the same sort of low-rank decomposition, where D_k corresponds to "r" in your video. Is that right?

  • @mkamp
    @mkamp 9 months ago +2

    Absolutely awesome explanation. Would like to get your take on LoRA vs (IA)**3 as well. It seems that people still prefer LoRA over (IA)**3 even though the latter has a slightly higher performance?

  • @mesochild
    @mesochild 1 month ago

    What do I have to learn to understand this? Help please.

  • @dineth9d
    @dineth9d 8 months ago +2

    Thanks!