Hi, thanks for the video. Just one question: around minute 14:00 you said that the student’s timesteps are between 1 and 4, but in the paper the authors state that the final timestep (tau_n) for the student must be 1000 (so equal to the teacher's). So what do you think? Should the student’s timesteps be something like {1,2,3,1000}, or what?
I think they do that so they can use the same scheduler for both models and keep a consistent SNR. Timestep 1000 represents 100% noise, which is where you always start from. I'm guessing they use roughly uniform steps below that to cover a wide range of SNR values: {1, 250, 500, 1000}
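To make the SNR point concrete, here is a rough sketch of how you could check the SNR at each of those guessed timesteps. Everything in it is an assumption for illustration, not something taken from the paper: the linear beta schedule with DDPM-style defaults (beta from 1e-4 to 0.02 over 1000 steps), the helper names `alpha_bar` and `snr`, and the timestep set {1, 250, 500, 1000} itself, which is only my guess from above.

```python
def alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an ASSUMED linear beta schedule."""
    values, prod = [], 1.0
    for i in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_train_steps - 1)
        prod *= 1.0 - beta
        values.append(prod)
    return values


def snr(timesteps, ab):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t); timesteps are 1-indexed."""
    return {t: ab[t - 1] / (1.0 - ab[t - 1]) for t in timesteps}


# Guessed student timesteps (hypothetical, not confirmed by the paper).
student_timesteps = [1, 250, 500, 1000]

ab = alpha_bar()
snrs = snr(student_timesteps, ab)
for t in student_timesteps:
    print(f"t={t:4d}  SNR={snrs[t]:.3e}")
```

With this schedule the SNR drops steadily as t grows, and at t=1000 it is close to zero, which is the "almost pure noise" starting point. A set like {1, 2, 3, 1000} would put three of the four steps at nearly the same, very high SNR, so it would cover much less of the range.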
Gabriel is the GOAT
Wow thanks love your channel
Would you be able to do a video on the Mamba SSM paper? Your videos help me understand much better.