I really love the way you explain. You are using very standard language which the old pre-deeplearning guys are familiar with.
This was great! Your explanation of batch normalization is by far the most intuitive one I've found.
Love your content! Some of the best explanations on the internet. Would be amazing if you could go through the Neural ODE or Taskonomy paper next.
thanks for putting effort in explaining in simpler manner
This is really cool explanation , would love to hear more from you.
That's the best video on youtube about batchnorm, thanks for going over the paper.
Absolutely amazing explanation!
Thank you so much for your intro.
I had a hard time grasping the concepts, this helped a lot. Thank you :)
Why don’t you have more subscribers, so helpful!!!
@Yannic Kilcher Thanks, it has been the best explanation. You simplified the maths as well. Would request you to explain all recent papers in the same way. Thanks
Thank you for the review. I like to watch your videos instead of reading the papers.
Hi, thanks a lot for these amazing paper reviews. Can you make a video about layer normalization as well, and explain why it is more suited for recurrent nets than batch normalization?
Thanks for going over the paper!
Absolutely beautiful, man. I love how you mentioned the model.train and model.eval coding implications as well. Out of curiosity: 1) What software are you using to show the paper (not Adobe, right)? 2) What kind of drawing pad are you using? I have a Wacom, but since I cannot see what I'm doing on it, it is really annoying to teach with.
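(Side note for anyone curious what that model.train()/model.eval() implication looks like in practice: here is a minimal PyTorch sketch, with a made-up layer size and toy input. In train mode BatchNorm normalizes with the statistics of the current batch and updates its running estimates; in eval mode it uses the stored running mean/variance instead.)

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm1d(4)            # 4 features: per-feature gamma/beta and running stats
    x = torch.randn(8, 4) * 3 + 5     # toy batch with non-zero mean and non-unit variance

    bn.train()                        # uses batch statistics, updates running_mean/running_var
    y_train = bn(x)

    bn.eval()                         # uses the stored running statistics instead
    y_eval = bn(x)

    print(bn.running_mean)            # nudged toward the batch mean (~5) by the train() pass
    print(y_train.mean(dim=0))        # ~0 per feature in train mode
    print(y_eval.mean(dim=0))         # not ~0, the running estimates are still warming up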
What app are you using to edit and write on the paper?
Batch normalization doesn't reduce internal covariate shift; see "How Does Batch Normalization Help Optimization?" (arXiv:1805.11604).
Awesome thanks!
I really like it, thank you! :)
Could you make a video about group normalization from FAIR?
Wouldn't it be cool for some professors to make the students derive the derivatives in the test =).
I had to do that in my DL exams. Just feedforward though, nothing this involved =)
Why do we normalize the data and then multiply by gamma and add beta? I understand it produces a better distribution of the data, but can't we just multiply by gamma and add beta without even doing the normalization part?
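(A rough way to see it, with a toy NumPy sketch and made-up numbers: gamma and beta alone are just a fixed linear map, so the output mean/variance still drift with whatever the incoming batch looks like. Normalizing first makes the output mean and std depend only on beta and gamma, no matter how the input distribution shifts.)

    import numpy as np

    rng = np.random.default_rng(0)
    gamma, beta = 2.0, 1.0

    for shift in (0.0, 50.0):                          # simulate the input distribution drifting
        x = rng.normal(loc=shift, scale=5.0, size=1000)

        scaled_only = gamma * x + beta                 # no normalization: output drifts with the input
        x_hat = (x - x.mean()) / np.sqrt(x.var() + 1e-5)
        normalized = gamma * x_hat + beta              # batchnorm style: mean ~beta, std ~gamma

        print(round(scaled_only.mean(), 1), round(normalized.mean(), 1))
    # the first column jumps from ~1.0 to ~101.0, the second stays ~1.0 in both cases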
But why do you need the gamma/beta? What's wrong with just shifting to average 0, variance 1? And also, how do you train them? You mean they are part of the network and so they are trained, but I thought we want things not to be shaky, and you are actually adding parameters that add to the 'shakiness'... what's the point?
The idea is to learn a better representation: identity, normalized, or something in between. Think of it as data preprocessing.
Agreed, it does question the original hypothesis/definition of normalisation at the input layer as well.
It's not that shaky. It's just another layer trying to learn better data dimensions. With images, identity mappings work well, so the batchnorm learning can effectively reverse the mean/variance shift if that is what helps (see the quick sketch below).
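(To make the "it's another layer" point concrete, a small PyTorch sketch, assuming the standard nn.BatchNorm1d layer: gamma and beta are ordinary learnable parameters, initialized to 1 and 0 so the layer starts out as "just normalize", and they get gradients like every other weight, so the optimizer tunes them jointly with the rest of the network rather than adding uncontrolled noise.)

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm1d(3)

    print(bn.weight)                    # gamma: Parameter of shape (3,), initialized to ones
    print(bn.bias)                      # beta:  Parameter of shape (3,), initialized to zeros

    x = torch.randn(16, 3)
    bn(x).sum().backward()              # gradients flow into gamma and beta as usual
    print(bn.weight.grad.shape)         # torch.Size([3])
    print(bn.bias.grad.shape)           # torch.Size([3])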
Can someone clear up the following doubt: will the gamma and beta values be different for each input feature in a particular layer?
yes
@YannicKilcher Thanks a lot!!
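(For anyone else with the same doubt, a quick check in PyTorch, assuming the standard layers: there is one gamma and one beta per input feature, or per channel in the conv case, shared across all spatial positions.)

    import torch.nn as nn

    print(nn.BatchNorm1d(10).weight.shape)   # torch.Size([10]): one gamma per input feature
    print(nn.BatchNorm1d(10).bias.shape)     # torch.Size([10]): one beta per input feature
    print(nn.BatchNorm2d(64).weight.shape)   # torch.Size([64]): one gamma per channel,
                                             # shared across all spatial positions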
Nice video! You might also want to check out "How Does Batch Normalization Help Optimization?" (arxiv.org/abs/1805.11604), presented at NeurIPS18, which casts doubt on the idea that batchnorm improves performance through reduction in internal covariate shift.
YOU DA MAN
LONG LIVE YANNIC KILCHER
I really can't understand it.
All of the cool kids use SELU
You got lost in the weeds on this one.
The backprop was a bit unclear and perhaps the hardest bit.
The explanation was almost as complicated as in the paper, if not more so.
lol