You are the best, sir. I went through the paper, but it was very difficult for me to understand even after 3-4 readings, because the terminology was too complex for someone only two months into practice like me. You described the whole paper so seamlessly. I am so grateful for your time and knowledge. Keep up the good work, sir. Love from NY, bhaiyaa
Hello, how did they come up with these values: [-1.0, -0.696, ..., 1.0]? I mean, the difference between 0 and 0.080 is less than that between -1.0 and -0.696, and so on. Did they do this intentionally because most of the values are usually near 0, so we want more precision near 0?
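For reference, here is a minimal sketch of why quantile-based codes like NF4 come out denser near zero. It is an approximate reconstruction from normal quantiles, not the paper's exact construction (the paper uses an asymmetric scheme so that exact zero is one of the levels), so the printed numbers will only roughly match the published codebook.

```python
# Sketch: build an NF4-like 4-bit codebook from quantiles of N(0, 1).
# Approximate construction for intuition, not the exact QLoRA algorithm.
import numpy as np
from scipy.stats import norm

def approx_nf4_levels(bits=4, offset=0.99):
    n = 2 ** bits
    # Evenly spaced probabilities, pulled slightly away from 0 and 1
    # so the quantiles stay finite.
    probs = np.linspace(1 - offset, offset, n)
    q = norm.ppf(probs)            # quantiles of the standard normal
    return q / np.abs(q).max()     # normalize into [-1, 1]

levels = approx_nf4_levels()
print(np.round(levels, 3))
# Gaps between neighboring levels shrink toward 0: equal probability
# mass covers a narrower value range where the density is highest.
print(np.round(np.diff(levels), 3))
```

Because pretrained weights are roughly normally distributed around 0, placing the levels at equal-probability quantiles automatically spends more resolution near 0, which is exactly the intuition in the question.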
Regular fine-tuning would need 780+ GB for a 65B model. Considering 16-bit precision, wouldn't it be a little less? (65 × 2 = 130 GB for the weights) + some memory for the optimizer + some memory for gradients and activations (say ~150-160 GB total)?
Thanks, Parth. arxiv.org/pdf/2305.14314.pdf does say that it needs 780 GB of GPU RAM. For the model weights, yes, it will take around 130 GB. But then there are activations for every layer (which depend on sequence length and batch size). The authors assumed batch size 1 and sequence length 512. The memory for activations is much larger.
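For a rough sense of how the static training state alone can reach the 780 GB figure, before any activations, here is a back-of-envelope sketch. It assumes 16-bit weights and gradients plus fp32 Adam moments; the paper does not spell out this exact breakdown, so treat it as an estimate.

```python
# Back-of-envelope memory for full 16-bit fine-tuning of a 65B model.
# Assumes fp16/bf16 weights and gradients plus fp32 Adam m and v.
# Activations come on top of this and grow with batch size and sequence length.
params = 65e9

weights_gb = params * 2 / 1e9   # 16-bit weights          ~130 GB
grads_gb   = params * 2 / 1e9   # 16-bit gradients        ~130 GB
adam_gb    = params * 8 / 1e9   # fp32 Adam moments (m,v) ~520 GB

print(weights_gb + grads_gb + adam_gb)  # ~780 GB before activations
```

Some mixed-precision setups also keep an fp32 master copy of the weights, which would push this even higher.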
@dlByManish Thanks, Manish. Makes sense. I didn't know activations take up so much memory while fine-tuning :)