RMSProp (C2W2L07)
- Published Oct 31, 2024
- Take the Deep Learning Specialization: bit.ly/2PFq843
Check out all our courses: www.deeplearni...
Subscribe to The Batch, our weekly newsletter: www.deeplearni...
Follow us:
Twitter: / deeplearningai_
Facebook: / deeplearninghq
Linkedin: / deeplearningai
I seldom comment on videos, but this is finally one video that actually explains the algorithm rather than just rambling about general ideas - so helpful. I started writing my own gradient descent algorithms and was desperately searching for ideas on how to speed them up. Thanks a lot.
Hi Dr Andrew. Big fan of all your work. BTW, I think you should add the keyword "neural network optimization" to your video title so the YouTube indexer can index it and show this video in search results more often. I hope more people can benefit from your videos. Thanks!
I only wrote "RMSprop" and this is what I got :P
It is worth mentioning that we are no longer stepping in the direction of steepest descent. The sign information in the gradient appears to be sufficient not only to find the optimum, but also (apparently) to improve convergence time compared to the steepest-descent direction.
I watch your videos over and over. Great man
I think it's safe to say that it doesn't matter whether the denominator's epsilon is inside or outside the square root?
Andrew, I think there's a problem with your audio/mic because I hear high-frequency squeaks throughout the entire video.
Thanks for the explanation. I had one question:
What if the updates are more horizontal than vertical? In that case, wouldn't we be dampening progress in the direction we actually want to move faster in?
Do we have reason to believe that the projection of the gradients onto the orthogonal axis (orthogonal to the one pointing exactly toward the minimum) is likely to be larger than the projection onto the axis aligned with the minimum?
It simply divides larger gradients by larger values and smaller gradients by smaller values, to equalize the effective alpha. This algorithm solves the problem where some weights have large gradients and others have small ones: you cannot use the same alpha for both, so in effect you need a different alpha for each of them.
The horizontal/vertical inclination isn't something preset, it actually depends on the type of optimization function you choose.
If you were to use plain stochastic gradient descent, there would be a lot of oscillation and it could not learn fast.
But by using momentum/RMSProp, these oscillations are reduced, which in effect allows a larger learning rate and faster learning.
Usually there are more than two dimensions. So what it actually does is dampen the updates in the unwanted directions and keep the convergence path on one of the paths to the global minimum of the convex function.
In those unwanted directions the gradient (dW) would be large, hence it would automatically get dampened.
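As a quick numerical sketch of this equalization (the numbers below are assumptions, not values from the video): dividing each gradient component by the square root of its own running average makes the effective step sizes comparable, no matter how different the raw gradients are.

```python
import numpy as np

# Assumed toy gradients: a small horizontal gradient dw and a large
# vertical gradient db, as in the bowl-shaped contour diagram.
dw, db = 0.1, 10.0
s_dw, s_db = dw ** 2, db ** 2   # pretend the running averages have settled
alpha, eps = 0.01, 1e-8

print("plain GD steps:", alpha * dw, alpha * db)   # 0.001 vs 0.1 (100x apart)
print("RMSprop steps :",
      alpha * dw / (np.sqrt(s_dw) + eps),
      alpha * db / (np.sqrt(s_db) + eps))          # both roughly alpha = 0.01
```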
I have the same question; after reading all the responses to your question, I don't think anyone has actually addressed it.
Thank you !
(Doubt) Intuition behind RMSprop and Momentum:
RMSprop:
Sdw = β.Sdw + (1-β).dw²
Sdb = β.Sdb + (1-β).db²
W = W - α.dw/(√Sdw) , b = b - α.db/(√Sdb)
At 3:00, we see that dw is relatively small and db is large.
So by dividing, we are able to update W at a higher rate compared to b
and that way, we move quicker horizontally.
Momentum:
Vdw = β.Vdw + (1-β).dw
Vdb = β.Vdb + (1-β).db
W = W - α.Vdw , b = b - α.Vdb
If dw and db behave the same way here, we update b at a higher rate than W.
This way, we would move more vertically and slower horizontally?
Any help would be highly appreciated, thanks.
You are right about the RMSprop algorithm but wrong about the momentum algorithm. Specifically, for momentum, if db oscillated strongly over iterations 1 to t, then Vdb at time t+1 will be close to 0 (because db swings from negative to positive and back, so the moving average of db, i.e. Vdb, is close to 0). Thus, with the update formula b = b - α.Vdb, b at times t and t+1 will be roughly equal; in other words, b is updated at a slow rate in future iterations.
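If it helps, here is a minimal NumPy sketch of the two update rules quoted above (the function names, default hyperparameters, and the placement of epsilon are my own assumptions): momentum averages the gradient itself, so oscillating components cancel toward zero, while RMSprop averages the squared gradient, so large or oscillating components get their step size scaled down but never cancel.

```python
import numpy as np

def momentum_step(w, dw, v_dw, alpha=0.01, beta=0.9):
    # Momentum: moving average of the raw gradient. Components that keep
    # flipping sign average toward zero, so their updates shrink.
    v_dw = beta * v_dw + (1 - beta) * dw
    return w - alpha * v_dw, v_dw

def rmsprop_step(w, dw, s_dw, alpha=0.01, beta=0.9, eps=1e-8):
    # RMSprop: moving average of the squared gradient. Components with a
    # large magnitude are divided by a large sqrt(s_dw), so their
    # effective step shrinks whether or not they flip sign.
    s_dw = beta * s_dw + (1 - beta) * dw ** 2
    return w - alpha * dw / (np.sqrt(s_dw) + eps), s_dw
```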
I have finally understood it. Let me know if you haven't figured it out yet.
Amazing. Beautifully explained.
I am totally confused about using b and w as directions of the gradient descent steps. The whole explanation seems to depend on the parameter b controlling the vertical component of each gradient step while the parameter w controls the horizontal component. But in real life, is that really the case? I believe the parameters w and b both include components in each direction. What am I missing here? The whole explanation doesn't make sense to me at all.
Kind of in the same situation here. I cannot translate this idea to a k dimensional case.
No, you are not correct in saying "I believe parameter w and b both include components in each direction": every parameter adds a new dimension to the space. If you have 2 weights and 1 bias, that's a 3-dimensional space, for example; so with 1 w and 1 b you have a 2-dimensional space (or a plane, as seen in the video).
@@luisleal4169 why is dw small and db large? Why not the opposite?
@@osiris1102 Look at the axes (which lie in the horizontal plane): the surface in the diagram is steeper in the "b" direction, therefore dL/db (i.e. db) is large and dL/dw (i.e. dw) is small.
I can see how large values of Sdw slow down learning, but doesn't it also slow down learning just as much when we're not oscillating?
s values seem to be estimating a moving variance?
And then the square root gives us a moving standard deviation.
So it's like negative feedback:
the noisier the gradient,
the bigger the std will be,
which will shrink the gradient more, making it less noisy.
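A toy sketch of that feedback loop (the distributions and constants below are assumptions chosen for illustration): the noisier gradient component accumulates a larger s, so 1/√s scales its updates down more than the steady component's.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, eps = 0.9, 1e-8
s_noisy = s_steady = 0.0

for _ in range(200):
    g_noisy = rng.normal(0.0, 5.0)   # oscillates around zero, high variance
    g_steady = 0.5                   # small but consistent gradient
    s_noisy = beta * s_noisy + (1 - beta) * g_noisy ** 2
    s_steady = beta * s_steady + (1 - beta) * g_steady ** 2

# The noisy component's steps get shrunk far more than the steady one's.
print("scale on noisy component :", 1 / (np.sqrt(s_noisy) + eps))
print("scale on steady component:", 1 / (np.sqrt(s_steady) + eps))
```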
Great Lecture Andrew!!!
Hello! Great video. Why isn't bias correction used in this algorithm, while it is used in Adam, which incorporates part of RMSprop? Also, what is a good default value for beta in RMSprop?
How do we decide which dimension to be the b(referred to in the video) in multidimensional space?
Yeah, how? Anyone?
What is the key difference between gradient descent with momentum and RMSprop? The visualizations look the same, though.
Doubt: if the square root of Sdb is less than 1, then instead of damping the oscillation in the b direction, it will have the opposite effect, won't it?
great explanation, thanks!
Is dW the derivative of the loss function respect to W ?
yes
Why is it called root mean squared prop?
5:35
Look at the video closely: what are we using in the formula?
Thank you, sir :) I am grateful to you.
Great explanation
But need to watch again right?
@@osiris1102 I am watching it again
How is db high dimensional? From my understanding it is the bias term.
Consider that this is a simple example with just one bias; but even with a small neural network, for example one with a hidden layer of 5 neurons, you already have a 5-dimensional space (and that's just counting the biases; remember that both weights and biases are trainable parameters, and your space involves all of them).
You know the video is going to be amazing when you see this man with his computer screen.
1:27 If it's the first update, then what past value would Sdw or Sdb have?
0, maybe
They start at zero, but in the Adam optimization video he explains bias correction, which fixes some of the issues that come from initializing the moving average at zero.
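A tiny sketch of what that bias correction does (the constant gradient below is an assumption chosen to make the effect obvious): dividing by 1 - beta**t rescales the early, zero-initialized estimates so they are no longer biased toward zero.

```python
beta = 0.999
s = 0.0        # moving average initialized at zero
dw = 1.0       # pretend the gradient is constantly 1, so dw**2 = 1

for t in range(1, 6):
    s = beta * s + (1 - beta) * dw ** 2
    s_corrected = s / (1 - beta ** t)   # divide out the zero-initialization bias
    print(t, round(s, 6), round(s_corrected, 6))
# Without correction, s stays near 0 for small t; the corrected value
# estimates dw**2 = 1 from the very first iteration.
```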
1:00
YANG GAN
Is b here the bias term in z = wx + b?
Yes