Once the previous layer is covered, everything becomes clear. Brilliant explanation. Batch normalization works similarly to the way input standardization works.
Wow, this is the best explanation I've seen so far! I really like Andrew Ng; he has an amazing talent for explaining even the most complicated things in a simple way, and when he has to use mathematics to explain a concept, he does it so brilliantly that it becomes even simpler to understand, not more complicated as with some tutors.
I like this guy - he has a calm voice and patience.
Great work, you have the natural talent to make difficult topics easily learnable
This guy makes it look so easy... one has to love him
Beautifully explained, classic Andrew Ng
This video is just pure gold!
Best explanation of batch norm
The "covariant shift" explanation has been falsified as an explanation for why BatchNorm works. If you are interested check out the paper "How does batch normalization help optimization?"
God bless you
Amazing explanation
When the distribution of X changes (even though the true mapping f(X) = y stays the same), you can't expect the same model to perform well (e.g., X1 is pictures of black cats only, with y = 1 for cats and y = 0 for non-cats; if X2 is pictures of cats of all colors, the model won't do too well). This is covariate shift.
This covariate shift is tackled during training through input standardization (at the input layer) and batch normalization (inside the network).
Batch normalization keeps the mean and variance of the distribution of the previous layer's hidden unit values (the z values) stable, so those values can't shift around too much.
It doesn't allow the values to change too much, which reduces the coupling between the parameters of different layers, increases their independence, and hence speeds up learning.
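To make the mechanics concrete, here is a minimal numpy sketch of the batch norm forward computation for one layer during training (the function name, shapes, and epsilon value are my own assumptions, not from the video):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize pre-activations Z of shape (n_units, m) over the mini-batch axis."""
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the m examples
    var = Z.var(axis=1, keepdims=True)        # per-unit variance over the m examples
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance per unit
    return gamma * Z_norm + beta              # learnable rescale and shift

# toy usage: 4 hidden units, mini-batch of 64 examples
Z = np.random.randn(4, 64) * 3.0 + 5.0
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
Z_tilde = batch_norm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1), Z_tilde.var(axis=1))  # roughly 0 and 1 for each unit
```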
Keras people need to watch this video!
Thank you.
Thanks for sharing this great video, explained in a simple and clear manner.
Beautifully Explained
The original paper that introduced batch normalization (by Sergey Ioffe and Christian Szegedy) says that removing dropout speeds up training without increasing overfitting, and there are also recommendations not to use dropout together with batch normalization, since it adds noise to the statistics being computed (mean and variance)... so should we really use dropout with batch norm?
"don't use it for regularization" - just use it all the time for general good practice, or are there times when I shouldn't use it?
I think problems may arise if you don't have all of your training data ready and plan to do transfer learning (training on new data) later, since batch norm is very domain dependent - as hinted at by the mini-batch-size regularization effect, but more importantly by the batch norm parameters learned from the training data. I would still always try to implement it. It seems ironic that it generalizes well yet is constrained by the statistics prescribed by the training data.
In some regression problems it hurts the absolute scale of the outputs, which can be critical.
thank you
Since gamma and beta are parameters that get updated, how can the mean and variance remain unchanged?
Good for understanding, but some numerical calculations would show the effect more clearly.
Wow, great explanation! Thanks!
Thank you!
Is batch normalization always used in a neural network?
Great explanation
Ingenious!
Great explanation. Thank you.
A great explanation - much better than Mu Li's (李沐).
6:00 - I have a question: don't the values of beta[2] and gamma[2] also change during training? Then the distribution of the hidden unit values z[2] also keeps changing, so isn't the covariate shift problem still there?
Or maybe I should convince myself that beta[2] and gamma[2] don't change much?
7:55 - why don't we use the mean and variance of the entire training set instead of just those of a mini-batch? Wouldn't this reduce the noise further (similar to using a larger mini-batch size)? Unless we actually want that noise for its regularizing effect?
Larger batch sizes can be detrimental. As Yann LeCun once said:
"Training with large minibatches is bad for your health.
More importantly, it's bad for your test error.
Friends don't let friends use minibatches larger than 32."
As far as I understand it, with bigger batches you tend to end up stuck in narrow, sharp optima, while the noisier updates from smaller batches help you generalize better and get pushed out of those optima.
There's still a lot of debate about this, though, especially in cases with very noisy data like predicting stock prices.
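As a rough, self-contained illustration of why smaller batches give noisier statistics (my own toy example, not from the video): a mean estimated from a small sample fluctuates far more than one estimated from a large sample.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100_000)    # stand-in for activations

for batch_size in (8, 32, 512):
    # mean computed on many independent mini-batches of this size
    means = [rng.choice(data, size=batch_size).mean() for _ in range(1000)]
    print(batch_size, np.std(means))                   # spread shrinks roughly as 1/sqrt(batch_size)
```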
@lupsik1 great response!
Don't activation functions such as the sigmoid in each node already normalize the outputs of the neurons, for the most part?
But the outputs are not zero-centered.
Generally, sigmoids are not used because of saturation and because their outputs are not zero-centered; ReLUs are used instead.
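A quick check (my own snippet) of the point above: a sigmoid squashes values into (0, 1) but does not zero-center or unit-scale them, which is part of why it is no substitute for normalization.

```python
import numpy as np

z = np.random.randn(10_000) * 4.0        # wide-ranging pre-activations
a = 1.0 / (1.0 + np.exp(-z))             # sigmoid activations
print(a.min(), a.max())                  # bounded in (0, 1)
print(a.mean(), a.std())                 # mean near 0.5, not zero-centered
```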
I am not able to grasp how batch norm works. Please help me...
I'm confused. Is it normalizing all neurons within each layer, or normalizing all the values computed from a mini-batch for one neuron?
The second one.
To be precise, it's neither exactly, though it's closer to the second. When training your network, you normalize Z[l], the pre-activations of the l-th layer: Z[l] = W[l] * A[l-1], where W[l] is the current layer's weight matrix and A[l-1] is the previous layer's activations. The mean and variance are computed per neuron, across the examples in the mini-batch.
So you normalize numbers that are not yet the activations of the current layer, but are computed from the current layer's weights and the previous layer's activations.
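A small numpy sketch of that axis choice (shapes and variable names are my own assumptions): the statistics are taken per neuron, over the examples of the mini-batch, on Z before the activation function.

```python
import numpy as np

m, n_prev, n_l = 32, 5, 4                   # mini-batch size, previous-layer units, current-layer units
A_prev = np.random.randn(n_prev, m)         # previous layer activations, one column per example
W = np.random.randn(n_l, n_prev)            # current layer weights
Z = W @ A_prev                              # (n_l, m): pre-activations, not yet passed through g()

mu = Z.mean(axis=1, keepdims=True)          # one mean per neuron, over the m examples
sigma2 = Z.var(axis=1, keepdims=True)       # one variance per neuron
Z_norm = (Z - mu) / np.sqrt(sigma2 + 1e-8)  # normalized before the activation function is applied
```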
Great explanation.
I am watching it again
Andrew Ng is really good at math.
You said that Batch Norm limits how much the values in the 3rd layer (or, more generally, any deeper layer) change due to the parameters of earlier layers. However, when performing gradient descent, the new Batch Norm parameters (gamma and beta) are also being learned and keep changing with the learning rate, so the mean and variance of the earlier layers' outputs also change and are not fixed to 0 and 1 (or, more generally, whatever you set them to). So I can't build the intuition for how fixing the mean and variance of the earlier layers prevents covariate shift. Can anyone help me out?
My understanding is this: imagine the 4 neurons in hidden layer 3 represent the features [shape of the cat's head, shape of the cat's body, shape of the cat's tail, color of the cat]. The first 3 dimensions will have high values as long as there is a cat in the image, but the color varies a lot. So when you normalize, the changes in color contribute less to the prediction; relatively, the features that really matter (the first three dimensions) have a larger influence.
Having the same doubt
Batch normalization adds two trainable parameters per normalized unit in each layer: the normalized output is multiplied by a "standard deviation" parameter (gamma) and a "mean" parameter (beta) is added. In other words, batch normalization lets SGD "denormalize" an activation by changing only these two weights, instead of destabilizing the network by changing all the weights.
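A tiny sketch (my own) of that point: with the right gamma and beta, the layer can undo the normalization entirely, so no expressive power is lost, and gradient descent can move a unit's mean and scale through just these two parameters.

```python
import numpy as np

z = np.random.randn(1, 64) * 3.0 + 5.0        # one unit over a mini-batch of 64
mu, var = z.mean(), z.var()
z_norm = (z - mu) / np.sqrt(var + 1e-8)

gamma, beta = np.sqrt(var + 1e-8), mu         # the "identity" setting of the two parameters
z_tilde = gamma * z_norm + beta
print(np.allclose(z_tilde, z))                # True: the normalization is fully undone
```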
Page 316 of www.deeplearningbook.org/contents/optimization.html has an answer to this, I guess:
> At first glance, this may seem useless - why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH'+β is determined solely by β. The new parametrization is much easier to learn with gradient descent.
@@bryan3792 thanks
What is 'z' in this video?
What does the color of the image have to do with the location of the data points in the graph?
pixel values
That is just another way to show the difference in distribution between the training and testing data. With the images, the distribution difference is shown by one set having black cats and the other having non-black cats; in the graph, it is shown by the different positions of the positive and negative data points. In short, these are two different examples highlighting the same issue, i.e., covariate shift.
If the mini-batch size is only 1, does BN still work?
A mini-batch of size 1 is not really a mini-batch - that's just using each data point separately. You cannot batch norm with size = 1.
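A quick illustration (my own) of why: with a single example the mini-batch variance is zero, so the normalized value collapses to zero (or to beta after the shift) regardless of the input.

```python
import numpy as np

z = np.array([[2.7]])                      # "mini-batch" of one example, one unit
mu, var = z.mean(), z.var()                # var is exactly 0 for a single example
z_norm = (z - mu) / np.sqrt(var + 1e-8)
print(z_norm)                              # 0.0 no matter what the input value was
```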
According to th-cam.com/video/DtEq44FTPM4/w-d-xo.html , the covariate shift explanation (which was proposed by the original batch norm paper) has since been debunked by more recent papers. I don't know much about this though, if someone else would like to elaborate.
Is he the GOD?