Very up to date! Includes GaLore, etc.
Cheers thanks
Thanks
Can you implement a few papers in PyTorch, like Grad-TTS and others?
What others?
Hey Trelis! Can you help me set up a **multi-node, multi-GPU** training infra using RunPod? I figured this out using the Community Cloud option, where I can set a public IP for my pods and expose TCP ports with the same internal and external port numbers. However, I'm not able to add a shared disk across my Community Cloud pods to save checkpoints in case of node failure. I totally failed to set up communication between two different pods when I launched them in the Secure Cloud, but Secure Cloud does allow a network volume that can be shared across different pods.
Can you help me set up the infra for a multi-node, multi-GPU setup in Secure Cloud? In Paperspace this was easy, but I'm not able to figure it out with RunPod. Any suggestions are welcome.
Did you ask RunPod support?
Try that and let me know. I’ll see if I can help
@TrelisResearch RunPod support asked me to select a machine with more GPUs instead of going multi-node, but that isn't what I'm trying to achieve. I want to run a bunch of experiments on a multi-node, multi-GPU setup.
In RunPod, when launching multiple pods, sometimes they get allocated on different Secure Cloud machines (with different public IPs) and sometimes on the same public IP. The latter case is not a problem, because the pods can communicate with each other using their private IPs and exposed TCP ports. In the former case, though, I couldn't figure out how to establish communication over TCP ports. I tried port forwarding, but I'm getting prompted for a password which I don't have.
Is it just me, or is RunPod not configured to allow port forwarding for communication across different public IPs? Any other ideas to solve this? Paperspace is much better for a multi-node, multi-GPU setup, but it is postpaid and I'm afraid I might run into insane cloud bills. RunPod has a prepaid option, which is much safer in my case.
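For context on what actually has to be reachable: a PyTorch multi-node job only needs every pod to reach one rendezvous address and port. A minimal sketch, untested on RunPod, where the IP, port, and RANK values are placeholders you'd replace with the first pod's public IP and an exposed TCP port:

```python
import os
import torch.distributed as dist

# Run this on every pod, setting RANK=0 on the first pod and RANK=1 on
# the second. MASTER_ADDR / MASTER_PORT are placeholders: use the first
# pod's *public* IP and an exposed TCP port reachable from the other pod.
os.environ.setdefault("MASTER_ADDR", "203.0.113.10")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="nccl",                # GPU collectives; use "gloo" to test on CPU
    init_method="env://",          # rendezvous via MASTER_ADDR / MASTER_PORT
    world_size=2,                  # one process per node in this sketch
    rank=int(os.environ["RANK"]),
)
print(f"rank {dist.get_rank()} / {dist.get_world_size()} connected")
dist.destroy_process_group()
```

If the rendezvous hangs, the master port is usually the thing that isn't reachable across the two public IPs, which matches the symptom described above.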
@padmasrivaddiparthi7287 Yeah, you're right, what you want is not just adding more GPUs.
I haven't done any multi-node, so I'm unsure.
Perhaps you can look at Latitude? They are also prepaid (although you have to prepay 100 bucks to get started).
I'll see if I do a vid at some point, but it's not high priority right now.
@TrelisResearch Thanks for the suggestion. I checked Latitude; it is expensive indeed!
Can we convert a full fine-tuned model to a LoRA (SVD on the delta weights)?
You could try, but you'd probably lose too much quality.
Also, you have to re-set up the model, which is not trivial.
Maybe I'll try it for a video some time.
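If you want to experiment anyway, the core of the idea is a truncated SVD of each layer's delta weights. A minimal per-matrix sketch, where `extract_lora_from_delta` is a hypothetical helper and the rank is illustrative:

```python
import torch

def extract_lora_from_delta(w_base: torch.Tensor,
                            w_ft: torch.Tensor,
                            rank: int = 16):
    """Approximate a fine-tune delta with a rank-r LoRA pair via
    truncated SVD. The quality loss mentioned above is whatever the
    discarded singular values carried."""
    delta = w_ft - w_base                       # (out, in) delta weights
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    # Keep the top-`rank` components; absorb singular values into B.
    B = U[:, :rank] * S[:rank]                  # (out, rank)
    A = Vh[:rank, :]                            # (rank, in)
    return A, B                                 # delta ≈ B @ A

# Toy usage: how much of the delta does a rank-16 factorization keep?
# A random delta is nearly full-rank, so the error here is large; real
# fine-tune deltas are often closer to low-rank.
w_base = torch.randn(768, 768)
w_ft = w_base + torch.randn(768, 768) * 0.01
A, B = extract_lora_from_delta(w_base, w_ft, rank=16)
err = torch.linalg.norm(w_ft - w_base - B @ A) / torch.linalg.norm(w_ft - w_base)
print(f"relative reconstruction error: {err:.3f}")
```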
Hi. Will this work for continued pretraining on textbooks for domain-specific adaptive learning? All I see on the internet are LoRA videos. I have seen your video on FFT, and that's what I want for my use case.
Yup, this is full fine-tuning. It can be used for pretraining or continued pretraining.
@TrelisResearch I tried GaLore with Subspace Descent and, off the bat, it had a better eval/loss than any of the earlier methods. How could this perform better than AdamW?
@mdrafatsiddiqui Haha, nice. Well, GaLore constrains updates to a low-rank subspace, which helps prevent overfitting, so that could be the reason.
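For anyone wanting to try this, here is a minimal sketch of plain GaLore usage based on the galore-torch README. The hyperparameters are illustrative, and the Subspace Descent variant mentioned above is a separate method this sketch does not cover:

```python
import torch
import torch.nn as nn
from galore_torch import GaLoreAdamW  # pip install galore-torch

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# GaLore projects gradients of 2D weight matrices into a low-rank
# subspace; 1D params (biases etc.) get ordinary AdamW updates.
galore_params = [p for p in model.parameters() if p.dim() == 2]
regular_params = [p for p in model.parameters() if p.dim() != 2]

param_groups = [
    {"params": regular_params},
    {"params": galore_params,
     "rank": 128,              # rank of the gradient subspace
     "update_proj_gap": 200,   # steps between projector refreshes
     "scale": 0.25,            # GaLore scaling factor
     "proj_type": "std"},      # standard projection
]
optimizer = GaLoreAdamW(param_groups, lr=1e-2)

# One toy step: the training loop itself is unchanged.
loss = model(torch.randn(8, 512)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```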