You've been talking about knowledge distillation for a while, and now I know why! This will permit deployment on hardware of arbitrary specifications in exchange for a far smaller penalty in inference accuracy. Beautiful.
Finally you got it!
Awesome! Knowledge distillation definitely seems to be one of the most interesting / promising ideas in Deep Learning right now!
Thank you for such a clear presentation! This is an amazing summary of the four papers and saves me a lot of time :)
Thank you! Awesome explanation.
Thank you so much!!
This is awesome, I'm excited to try this on my Raspberry Pis
You should make a video about that if you have time! I would be very interested in seeing if this could work on Raspberry Pis!
Great video about knowledge distillation, now I need to go to the lab!
It is interesting to look at noisy students together with knowledge distillation. But it is important to note that noisy student is NOT a distillation technique. It is exactly the opposite: it leverages unlabeled data to make models larger and larger.
Thank you! In this case we are just using "distillation" to describe predicting the labeled output of another network, particularly with some kind of temperature smoothing in the softmax as well (a minimal sketch of that loss is below this thread). I agree the term is misleading; self-training might be better.
@@connor-shorten True. I like the term "knowledge transfer". I see it as an umbrella term for any student-teacher dynamic, regardless of whether the student is smaller than the teacher (e.g. distillation, compression) or larger (e.g. network morphisms, noisy students).
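For anyone curious what the "temperature smoothing in the softmax" mentioned above looks like in practice, here is a minimal sketch of a distillation loss in PyTorch. The temperature T, the weighting alpha, and the toy tensor shapes are illustrative assumptions, not values from any of the papers in the video; the T^2 scaling of the soft term follows Hinton et al.'s distillation paper.

```python
# Minimal sketch of a temperature-scaled knowledge distillation loss.
# T, alpha, and the toy shapes below are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soften both distributions with temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between softened distributions, scaled by T^2 (Hinton et al.).
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    # Ordinary cross-entropy against the hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: random logits for a batch of 8 examples over 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```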
Thank you!
Thank you!