Lecture 32: Unsloth

  • Published Oct 22, 2024

Comments • 18

  • @danielhanchen
    @danielhanchen 1 day ago +4

    Thanks for inviting me! If anyone has any questions, feel free to comment below or ask on the GPU MODE or Unsloth server!

  • @robertjalanda
    @robertjalanda 1 day ago +4

    Daniel is such a gem and Unsloth is the best. Would not be able to afford or do proper fine-tuning without unsloth

  • @mfc1190
    @mfc1190 1 day ago +6

    This dude is awesome.

  • @waynelau3256
    @waynelau3256 2 days ago +3

    WAS WAITING FOR THIS THANKS🎉

  • @nvbkdw
    @nvbkdw 1 day ago +2

    heroes!

  • @NoorR-ox5im
    @NoorR-ox5im 1 day ago +2

    Wait, around 1:16:00 I thought the question was about expectation, as in minibatches working because of linearity of expectation? That should be correct as far as I know, but this variable-input-length issue maybe should be looked into w.r.t. minibatches as well! Also, it was fairly standard to do full-batch training until it became impossible :)

    • @NoorR-ox5im
      @NoorR-ox5im 1 day ago

      Also about the next question, perhaps muP resolves some of those concerns?

    • @danielhanchen
      @danielhanchen 1 day ago

      Oh I was a bit unsure on the exact question so I thought it was related to the grad accum bug

  • @TheQu3tzalify
    @TheQu3tzalify 1 day ago

    Gradient accumulation IS mathematically equivalent to full batch training. My implementations have always returned the same results for both (as everyone should see if they did testing!). The original bug comes from a poor quality implementation of gradient accumulation.

    • @danielhanchen
      @danielhanchen 1 day ago +2

      For non-sequence models, it's fine. For LLMs where all the sequence lengths are the same, it's also fine - both cases can use the generally accepted GA formulas. The blog post we wrote up proved mathematically that the old GA formulation was incorrect, especially for padded LLM finetuning and pretraining.
      This is also not an issue of grad accumulation being implemented poorly in one particular trainer - the problem exists in nearly all trainers that use grad accum.

    • @TheQu3tzalify
      @TheQu3tzalify 23 hours ago

      ​@@danielhanchen How do you go from:
      L = 1/m_bar * L1 + 1/m_bar * L2 + 1/m_bar * L3 + 1/m_bar * L4 = 1/m_bar * (L1 + L2 + L3 + L4)
      to:
      L = G * 1/m_bar * (L1 + L2 + L3 + L4) ?
      It seems like when you wrote "Let's first set them to the mean length of the entire document to make our calculations easier", it actually hid the problem.
      Then in the "Extra - mathematical proof" section what prevents you from having precalculated the proper denominator m1 + m2 + m3 + m4 and then doing (L1 + L2) / sum + (L3 + L4) / sum? Because that's the original and proper formulation of gradient accumulation for padded sequences.

    • @danielhanchen
      @danielhanchen 22 hours ago

      @@TheQu3tzalify Oh actually you're correct - it's a mistake in the formulation. I forgot to write that it's not L1/m_bar + L2/m_bar + L3/m_bar + L4/m_bar (which gets you (L1+L2+L3+L4)/m_bar); rather, we also use the average loss L_bar, i.e. L_bar/m_bar + L_bar/m_bar + L_bar/m_bar + L_bar/m_bar = 4 * L_bar/m_bar = G * L_bar/m_bar, and so we divide by G in gradient accumulation to get back L_bar/m_bar.
      (1/n * sum(Li)) / (1/n * sum(mi)) cancels the 1/n, leaving mean(L)/mean(m).
      The incorrect version gets G * mean(L)/mean(m), and so we have to divide by G.
      So the *G is still there - I just skipped some steps and should have explained better. huggingface.co/docs/accelerate/en/usage_guides/gradient_accumulation has more details on why there's a *G and a division by G to fix it up.

    • @danielhanchen
      @danielhanchen 21 hours ago +1

      @@TheQu3tzalify Sorry, I skipped some steps - I updated the blog to make the first part clearer.
      Yes, your second point is correct - that's what most trainers should do, but nearly all do not. Most implementations call torch.nn.CrossEntropyLoss directly with the mean reduction, which means you can't pre-derive the denominator (which you have to, as you mentioned). Instead you set the reduction to sum, then, as you said, compute the denominator manually and divide at the end.
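The fix discussed in this thread can be sketched in a few lines of PyTorch. This is an illustrative example only (not Unsloth's or any trainer's actual code; the function names are made up): `naive_ga_loss` averages per-micro-batch mean losses, which is the buggy formulation when micro-batches have different numbers of non-padded tokens, while `fixed_ga_loss` uses a sum reduction and divides once by the total token count across all micro-batches, matching the full-batch loss.

```python
# Sketch of the gradient-accumulation normalization issue (hypothetical
# helper names; assumes HF-style label padding with ignore index -100).
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # padded label positions excluded from the loss

def naive_ga_loss(logits_list, labels_list):
    """Buggy formulation: mean loss per micro-batch, then average of means."""
    losses = [
        F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                        ignore_index=IGNORE_INDEX, reduction="mean")
        for logits, labels in zip(logits_list, labels_list)
    ]
    return sum(losses) / len(losses)

def fixed_ga_loss(logits_list, labels_list):
    """Correct formulation: sum all losses, divide once by total valid tokens."""
    total_loss = 0.0
    total_tokens = 0
    for logits, labels in zip(logits_list, labels_list):
        total_loss = total_loss + F.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1),
            ignore_index=IGNORE_INDEX, reduction="sum")
        total_tokens += (labels != IGNORE_INDEX).sum().item()
    return total_loss / total_tokens
```

The two formulations agree only when every micro-batch contributes the same number of non-padded tokens; with variable-length padded sequences, the naive version weights short sequences' tokens more heavily than long ones'.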

  • @tomtyiu
    @tomtyiu 1 day ago +2

    Can we have Vision support?