Thank you. One of the best explanations of L1 vs L2 regularization!
I have had courses and put a lot of effort into reading material online, but your explanation is by far the one that will remain indelible in my mind. Thank you
Oh my G. After 5 years of confusion, I finally understood Lp regularization!
Thank you so much Alex!
Best explanation of regularization I ever saw! Concise, just detailed enough, and it covers all the practically important aspects. Thank you, Sir!
Nice video, this is what I dig on YouTube: an actual concise, clear explanation worth any paid course.
Very few videos online give some key concepts here, like what we're truly trying to minimize with the penalty expression. Most just give the equation but never explain the intuition behind L1 and L2. Kudos man
@Emm-- not sure how / if I can reply to your comment.
An iso-surface is the set of points such that a function f(x) has constant value, e.g. all x such that f(x) = c. For a Gaussian distribution, for example, this is an ellipse, shaped according to the eigenvectors and eigenvalues of the covariance matrix.
So, the iso-surfaces of theta1^2 + theta2^2 are circles, while the iso-surfaces of |theta1|+|theta2| look like diamonds. The iso-surface of the squared error on the data is also ellipsoidal, with a shape that depends on the data.
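For readers who want to see those shapes concretely, here is a minimal sketch (not from the video; it assumes numpy and matplotlib are available) that plots iso-lines of both penalties: the L2 contours come out as circles and the L1 contours as diamonds, as described above.

```python
# Minimal sketch of the penalty iso-surfaces (assumes numpy + matplotlib).
import numpy as np
import matplotlib.pyplot as plt

t1, t2 = np.meshgrid(np.linspace(-2, 2, 401), np.linspace(-2, 2, 401))
l2_penalty = t1**2 + t2**2            # iso-lines are circles
l1_penalty = np.abs(t1) + np.abs(t2)  # iso-lines are diamonds

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].contour(t1, t2, l2_penalty, levels=[0.25, 1.0, 2.25])
axes[0].set_title("L2: theta1^2 + theta2^2")
axes[1].contour(t1, t2, l1_penalty, levels=[0.5, 1.0, 1.5])
axes[1].set_title("L1: |theta1| + |theta2|")
for ax in axes:
    ax.set_aspect("equal")
plt.show()
```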
Alpha scales the importance of the regularization term in the loss function, so higher alpha means more regularization.
I didn't prove the sparsity assertion in the recording, but effectively, the "sharpness" of the diamond shape on the axes (specifically, the discontinuous derivative at e.g. theta1=0) means that the optimum of the sum (data + regularization) can sit at a point where some of the parameters are exactly zero. If the function is differentiable at those points, this will effectively never happen -- the optimum will effectively always be at some (possibly small, but) non-zero value.
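To see the sparsity effect numerically, here is a rough illustration on a made-up toy dataset (it assumes scikit-learn is installed, and the alpha value is arbitrary): the L1-penalized fit (Lasso) drives the irrelevant coefficients to exactly zero, while the L2-penalized fit (Ridge) only shrinks them to small non-zero values.

```python
# Rough illustration of L1 sparsity vs. L2 shrinkage on synthetic data
# (assumes scikit-learn; alpha chosen arbitrarily for the demo).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_theta = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])  # only 2 informative features
y = X @ true_theta + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # squared L2 penalty

print("Lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0))  # typically 8
print("Ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0))  # typically 0
```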
Great video even after 10 years! Thanks! :)
Whoa, I wasn't ready for the superellipse, that's a nice surprise. That helps me understand the limit case of p -> inf. Also exciting to think about rational values of p, such as the 0.5 case.
Major thanks for the picture at 7 minutes in. I learned about the concept of compressed sensing the other day, but didn't understand how optimization under regularized L1 norm leads to sparsity. This video made it click for me. :)
Best explanation yet on what ridge regression does.
Wonderful video to give some intuition on L1 vs L2. Thank you!
This really is an incredible explanation of the idea behind regularization. Thanks a lot for your insight!
First heard of this via more theoretical material. Very cool to see a discussion from a more applied (?) perspective.
I just found your videos now; thank you for such a wonderful explanation, it really helped me understand this term.
Thank you! That's a very clear and concise explanation.
Great presentation with very reasonable depth!
Thank you!! This really helped to understand the difference between L1 and L2.
Wow, that was such a great explanation. Thank you.
Very clearly explained, helped a lot, thanks Alex!
What excellent videos you posted! Congratulations!
Why don’t we draw concentric circles and diamonds as well, to represent the optimization space of the regularization term?
This is superb. Thanks for putting it together.
Thanks! This helped me understand the regularization term a lot.
Many thanks for the brilliant video !!
Thank you Alexander - very well explained !
I learned a lot from this video. Thank you!
OMG! This stuff is just way too cool! I love maths.
Next time, I'd love it if you included the effect lambda has on regularization, including visuals!
As my old friend Borat would say: Very Nice!
awesome explanation. thank you
Thanks for the videos, I really enjoy learning from them!
Awesome description, thanks 🙏
That was a great video Alex!
How do you identify the extremities of the ellipse from the equation?
Apologies, but what is the rationale for the concentric ellipses? I understood the L1/L2 area, though.
Thank you, that was an elegant explanation.
Awesome Explanation sir. Thanks much!
Sometimes I wish some profs would present a YouTube playlist of good videos instead of giving their lectures themselves. This is so much better explained. There are so many good resources on the net, why are there still so many bad lectures given?
Fuck, that's true but depressing.
Thank you for the great explanation. Some questions:
1. At 2:09 the slide says that the regularization term alpha x theta x thetaTranspose is known as the L2 penalty. However, going by the formula for Lp norm, isn't your term missing the square root? Shouldn't the L2 regularization be: alpha x squareroot(theta x thetaTranspose)?
2. At 3:27 you say "the decrease in the mean squared error would be offset by the increase in the norm of theta". Judging from the tone of your voice, I would guess that statement should be self-apparent from this slide. However, am I correct in understanding that this concept is not explained here; rather, it is explained two slides later?
+RandomUser20130101 "L2 regularization" is used loosely in the literature to mean either Euclidean distance, or squared Euclidean distance. Certainly, the L2 norm has a square root, and in some cases (L2,1 regularization, for example; see en.wikipedia.org/wiki/Matrix_norm) the square root is important, but often it is not; it does not change, for example, the isosurface shape. So, there should exist values of alpha (regularization strength) that will make them equivalent; alternatively, the path of solutions as alpha is changed should be the same.
Offset by increase: regularization is being explained in these slides generally; the (squared) norm of theta is introduced as a notion of "simplicity" in the previous slides, and I think it is not hard to see (certainly if you actually solve for the values) that getting the regression curve in the upper right of the slide at 3:27 requires high values of the coefficients, causing a trade-off between the two terms. Two slides later is the geometric picture in parameter space, which also illustrates this trade-off.
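To make the "same path of solutions" point concrete, here is a quick numerical sketch of my own (assuming numpy and scipy; the data and the strength a are made up): solve the un-squared-norm problem at strength a, then re-solve the squared-norm problem at the matched strength a' = a / (2 * ||theta*||) obtained from the first-order conditions; the two recover essentially the same coefficients.

```python
# Numerical check that the un-squared and squared L2 penalties trace the same
# solutions, just at different regularization strengths (assumes numpy + scipy).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

a = 5.0  # arbitrary strength for the un-squared penalty
obj_norm = lambda th: np.sum((y - X @ th) ** 2) + a * np.linalg.norm(th)
theta_star = minimize(obj_norm, np.zeros(3)).x

# Match the first-order conditions at theta_star: a' = a / (2 * ||theta_star||)
a_sq = a / (2 * np.linalg.norm(theta_star))
obj_sq = lambda th: np.sum((y - X @ th) ** 2) + a_sq * np.sum(th ** 2)
theta_sq = minimize(obj_sq, np.zeros(3)).x

print("theta (un-squared penalty):", np.round(theta_star, 4))
print("theta (squared penalty):   ", np.round(theta_sq, 4))  # should match closely
```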
+Alexander Ihler Thank you for the info.
Great video! Did help me a lot!
Beautiful!
The most perfect video
Finally I know what those isosurface diagrams found in PRML mean.
Nice, clear explanation. Thnx.
Thank you for excellent video
Thanks, you make great videos :)
English major: Brevity is the soul of wit.
Statistics/math major: Verbal SCAD-type regularization is the soul of wit.
How is L1 regularization performed?
Just replace the "regularizing" cost term that is the sum of squared values of the parameters (L2 penalty), with one that is the sum of the absolute values of the parameters.
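As a small sketch of that swap (my own illustration in plain numpy, with hypothetical function names): the data term stays the same and only the penalty changes from a sum of squares to a sum of absolute values.

```python
# The data (squared-error) term is identical; only the penalty term differs.
import numpy as np

def l2_regularized_loss(theta, X, y, alpha):
    # squared error + alpha * sum(theta_j^2)   -- ridge / L2 penalty
    return np.sum((y - X @ theta) ** 2) + alpha * np.sum(theta ** 2)

def l1_regularized_loss(theta, X, y, alpha):
    # squared error + alpha * sum(|theta_j|)   -- lasso / L1 penalty
    return np.sum((y - X @ theta) ** 2) + alpha * np.sum(np.abs(theta))
```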
Awesome!
Thank you so much!!!
I love your accent
Thank You!
6:47
Sorry, I can give only one like :)
Lasso gives sparse parameter vectors. QUOTE OF THE DAY. Go ahead and finish the report :P
Which is the best in the real world? Why does your boss keep paying you?