Awesome explanation. The tutor is brilliant at covering the deep, hidden answers by explaining the "why" behind things. The best deep learning tutor so far.
Best explanation of activation functions. Thank you, sir.
Beautifully explained
3:10 Problems with sigmoid
Saturation of sigmoids causes gradients to vanish.
8:00 Initializing the weights to large values makes the pre-activations very large and saturates the neurons faster.
8:40 Sigmoids are not zero-centered; the output lies between 0 and 1.
Issue with non-zero-centered activations:
they restrict the directions of the weight updates and therefore make convergence take longer to achieve.
14:00 Sigmoids are computationally expensive to evaluate (they involve an exponential)
15:00 Tanh activation function
Improves over sigmoid, is zero centered.
Does not mitigate saturation of neurons.
Computationally even more expensive
16:00 ReLU, a piecewise-linear (but overall non-linear) activation function
Does not saturate in the positive region
Cheap to compute
Can cause dead neurons: once a neuron's pre-activation goes negative for all inputs, its output and gradient are zero, so all further updates stop; the neurons connected to it also stop receiving updates through it.
A large number of neurons die off if the learning rate is set too high
23:00 Leaky ReLU
Allows a small gradient to flow for negative pre-activations,
to avoid the dead neuron issue (see the sketch of these activations below)
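A minimal NumPy sketch of these activations and their gradients, to make the notes above concrete; the 0.01 leaky slope is just a common default, not necessarily the value used in the lecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # ~0 for large |x|: saturation, vanishing gradient

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2    # zero-centered output, but still saturates

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)    # exactly 0 for x <= 0: the "dead" region

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # a small gradient still flows for x < 0

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))       # near 0 at both extremes (saturation)
print(relu(x), relu_grad(x)) # gradient is [0, 0, 1, 1]
print(leaky_relu_grad(x))    # [0.01, 0.01, 1, 1]
```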
really nice
Hi, brilliant lecture and course.
How can I get access to the slides? It's really cumbersome to write out the handout ourselves.
Thanks!
Sir, please also teach us Machine Learning.
Can we think of a dead neuron (ReLU activation) as a form of dropping out a neuron, as in Dropout regularization?
Because the weights on both the incoming and outgoing edges of a dead neuron will not get updated in that iteration (and, in fact, they never get updated again).
This leads to my next question: is ReLU acting as some form of regularization?
Dropout is stochastic in nature; dead neurons, by contrast, also cut off updates to a lot of the neurons they are connected with.
Also, in dropout layers, some neurons are muted just for that iteration so that the neurons learn independently; they are muted only temporarily (for that particular training iteration).
So I do not think ReLU acts like a regularizer, but your thoughts are welcome 😀
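To make the contrast concrete, here is a tiny sketch, assuming inverted dropout with a made-up keep probability of 0.8: the dropout mask is resampled every iteration, while a dead ReLU unit outputs zero on every pass.

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.8
h = np.ones(5)                      # some hidden-layer activations

# Dropout: a fresh random mask each training iteration (inverted dropout),
# so a muted unit comes back in later iterations.
for step in range(3):
    mask = (rng.random(h.shape) < p_keep) / p_keep
    print("dropout step", step, h * mask)

# Dead ReLU: if the pre-activation is negative for every input,
# the output (and therefore the gradient) is zero on every iteration.
pre_activation = np.array([-3.0, -1.5, -0.2, -4.0, -2.2])
print("dead ReLU output:", np.maximum(0.0, pre_activation))  # all zeros, permanently
```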
How do the ReLU units die if we set the learning rate too high?
20:34
b' = b - eta * grad(b)
If b was originally small, choosing a high value for eta blows up the second term; with a positive grad(b), b' ends up very negative.
Once the bias is that negative, the pre-activation is negative for essentially every input, the ReLU gradient is zero, and the neuron stops updating. (That is what causes a neuron to die.)
@mr_law886 What if grad(b) < 0?
Choosing a high value of eta would then make -eta * grad(b) large and positive, so b' would most likely end up positive.
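A small numeric sketch of that update rule, with made-up values for b, grad(b), and eta, just to illustrate the two cases discussed above:

```python
# One gradient step on a bias: b' = b - eta * grad_b
def update(b, grad_b, eta):
    return b - eta * grad_b

b = 0.1

# Large positive gradient + high learning rate: b is pushed far negative,
# the pre-activation goes negative for typical inputs, ReLU outputs 0,
# its gradient is 0, and the unit never updates again (a dead neuron).
print(update(b, grad_b=2.0, eta=10.0))    # -19.9

# grad(b) < 0 (the case raised in the reply): the same large step pushes b
# far positive instead, so the unit stays active.
print(update(b, grad_b=-2.0, eta=10.0))   # 20.1
```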