Full Fine-Tuning with Fewer GPUs - GaLore, Optimizer Tricks, Adafactor

  • Published 11 Nov 2024

Comments • 18

  • @andpoul
    @andpoul several months ago +1

    Very up to date! Includes GaLore, etc.

  • @darkmatter9583
    @darkmatter9583 several months ago

    Thanks

  • @imranullah3097
    @imranullah3097 several months ago

    Can you implement a few papers in PyTorch, like Grad-TTS and others?

  • @padmasrivaddiparthi7287
    @padmasrivaddiparthi7287 several months ago

    Hey Trelis! Can you help me set up a **multi-node, multi-GPU** training infra using RunPod? I figured this out using the community cloud option, where I can set a public IP for my pods and expose the TCP ports with the same internal and external port numbers. However, I'm not able to add a shared disk across my community pods to save checkpoints in case of node failure. I totally failed to set up communication between two different pods when I launched them in the secure cloud, but the secure cloud does allow a network volume that can be shared across different pods.
    Can you help me set up infra for a multi-node, multi-GPU setup in the secure cloud? In Paperspace this was easy, but I am not able to figure it out using RunPod. Any suggestions are welcome.

    • @TrelisResearch
      @TrelisResearch several months ago +1

      Did you ask RunPod support?
      Try that and let me know. I'll see if I can help.

    • @padmasrivaddiparthi7287
      @padmasrivaddiparthi7287 several months ago

      @@TrelisResearch RunPod support asked me to select a machine with more GPUs instead of going multi-node. But that isn't what I'm planning to achieve; I want to run a bunch of experiments on a multi-node, multi-GPU setup.
      In RunPod, when launching multiple pods, sometimes they get allocated on different secure clouds (with different public IPs) and sometimes on the same public IP. The latter case is not a problem, because the pods can communicate with each other over their private IPs and exposed TCP ports if they are behind the same public IP. In the former case, however, I failed to figure out how to establish communication over the TCP ports. I tried port forwarding, but I get prompted for some password which I don't have.
      Is it just me, or is RunPod not configured to allow port forwarding for communication across different public IPs? Any other ideas to solve this? Paperspace is much better for a multi-node, multi-GPU setup, but it is postpaid and I'm afraid I might run into insane cloud bills. RunPod has a prepaid option, which is much safer in my case.

    • @TrelisResearch
      @TrelisResearch several months ago

      @@padmasrivaddiparthi7287 yeah, you're right, what you want is not just adding more GPUs.
      I haven't done any multi-node training, so I'm unsure.
      Perhaps you can look at Latitude? They are also pre-paid (although you have to prepay 100 bucks to get started).
      I'll see if I do a vid at some point, but it's not a high priority right now.

    • @padmasrivaddiparthi7287
      @padmasrivaddiparthi7287 several months ago

      @@TrelisResearch thanks for the suggestion. I checked Latitude; it is indeed expensive!
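
For context on the thread above: multi-node training with PyTorch does not need anything provider-specific, it only needs every node to reach one rendezvous address and TCP port. A minimal sketch, assuming two pods that can reach each other on one exposed port (the address, port, and script name are placeholders, not a RunPod recipe):

```python
# Sketch of the networking that multi-node DDP actually requires.
# Launch the same script on EACH node with torchrun, e.g. (hypothetical values):
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#            --master_addr=<public IP of node 0> --master_port=29500 train.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT.
    # The only cross-node requirement is that MASTER_ADDR:MASTER_PORT on node 0
    # is reachable from every other node -- that is the port the pods must expose.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank, world = dist.get_rank(), dist.get_world_size()
    print(f"rank {rank}/{world} joined via "
          f"{os.environ.get('MASTER_ADDR')}:{os.environ.get('MASTER_PORT')}")
    # ... build the model and wrap it in torch.nn.parallel.DistributedDataParallel ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```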

  • @VijayEranti
    @VijayEranti several months ago

    Can we convert a full fine-tuned model to a LoRA (SVD on the delta weights)?

    • @TrelisResearch
      @TrelisResearch several months ago +1

      You could try, but you'd probably lose too much quality.

    • @TrelisResearch
      @TrelisResearch several months ago

      Also, you have to re-set up the model, which is not trivial.
      Maybe I'll try it in a video some time.
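
On the question above, the rough idea is: take the delta between the fine-tuned and base weights, run a truncated SVD, and keep the top-r factors as the LoRA B and A matrices. A minimal sketch, assuming both checkpoints fit in memory; the rank and module filter are illustrative, and as noted the approximation can cost quality:

```python
# Sketch: approximate a full fine-tune as a rank-r LoRA via SVD on weight deltas.
import torch

def delta_to_lora(w_base: torch.Tensor, w_ft: torch.Tensor, r: int = 16):
    """Return (A, B) such that B @ A approximates (w_ft - w_base) with rank r."""
    delta = (w_ft - w_base).float()                  # [out_dim, in_dim]
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    sqrt_s = torch.sqrt(S[:r])
    B = U[:, :r] * sqrt_s                            # [out_dim, r]
    A = sqrt_s[:, None] * Vh[:r, :]                  # [r, in_dim]
    return A, B

# Usage sketch (model loading and module selection are placeholders):
# for (name, p_base), (_, p_ft) in zip(base.named_parameters(), ft.named_parameters()):
#     if p_base.ndim == 2:                           # only 2-D projection weights
#         A, B = delta_to_lora(p_base.data, p_ft.data, r=16)
```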

  • @mdrafatsiddiqui
    @mdrafatsiddiqui several months ago

    Hi. Will this work for continued pretraining on textbooks for domain-specific adaptive learning? All I see on the internet are LoRA videos. I have seen your video on FFT, and that's what I want for my use case.

    • @TrelisResearch
      @TrelisResearch several months ago +2

      Yup, this is full fine-tuning. It can be used for pretraining or continued pretraining.

    • @mdrafatsiddiqui
      @mdrafatsiddiqui several months ago +1

      @@TrelisResearch I tried GaLore with Subspace Descent and off the bat, it had better eval/loss than any of the earlier methods. How could this perform better than AdamW?

    • @TrelisResearch
      @TrelisResearch several months ago +1

      @@mdrafatsiddiqui haha nice. Well, using GaLore helps prevent overfitting, so that could be the reason.
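
For anyone wanting to try the same thing: recent Hugging Face transformers releases expose GaLore as an optimizer option in TrainingArguments, so a full fine-tuning / continued-pretraining run on raw text can look like the sketch below. The model name, data file, target-module patterns and hyperparameters are placeholders; check your transformers version's docs for the exact supported optim values (e.g. galore_adamw, galore_adafactor):

```python
# Sketch: full fine-tuning / continued pretraining with GaLore via the HF Trainer.
# Assumes `pip install galore-torch` and a recent transformers release.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen2.5-0.5B"                     # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Plain-text corpus (e.g. textbooks) tokenized for causal LM training.
ds = load_dataset("text", data_files={"train": "textbooks.txt"})["train"]
ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=1024),
            remove_columns=["text"])

args = TrainingArguments(
    output_dir="galore-cpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    optim="galore_adamw",                            # GaLore low-rank optimizer states
    optim_target_modules=["attn", "mlp"],            # project 2-D weights in these modules
    bf16=True,
)

Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```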