As soon as you explained the results from the Bayesian approach, my jaw was wide open for like 3 minutes. This is so interesting.
Love you, bro. I got my joining letter from NASA as a Scientific Officer-1. Believe me, your videos always helped me in my research work.
This video is a true gem, informative and simple at once. Thank you so much!
Glad it was helpful!
Read it in a book. Didn't understand jack shit back then. Your videos are awesome: rich, short, concise. Please make a video on Linear Discriminant Analysis and how it's related to Bayes' theorem. This video will be saved in my data science playlist.
Amazing, you kept it simple and showed how the regularization terms in linear regression originate from the Bayesian approach!! Thank you!
Notes for my future revision.
*Prior β*
10:30
The prior on β is normally distributed. The byproduct of using a Normal distribution is regularisation, because the prior values of β won't stray too far (too large or too small) from the mean.
Regularisation keeps the values of β small.
Regardless of how they were really initially devised, seeing the regularization formulas pop out of the bayesian linear regression model was eye-opening - thanks for sharing this insight
Yes. This really blew my mind. Boom.
Unbelievable, you explained linear regression, explained Bayesian stats in simple terms, and showed the connection in under 20 min... Perfect
Man I'm going to copy-paste your video whenever I want to explain regularization to anyone! I knew the concept but I would never explain it the way you did. You nailed it!
For me, the coolest thing about statistics is that every time I do a refresh on these topics, I get some new ideas or understandings. It's lucky that I came across this video after a year; it also explains why we need to "normalize" the X (zero-centered, with stdev = 1) before we feed it into an MLP model, if we use regularization terms in the layers.
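Standardization matters here because a single penalty (i.e. a single prior variance) treats every coefficient the same way. A minimal sketch of that point (my own illustration, not from the video; scikit-learn and made-up data are assumed):

```python
# Minimal sketch (not from the video): why standardizing X matters when the
# penalty treats every coefficient equally. Data and feature scales are made up.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
x_small = rng.normal(0, 1, n)        # feature on a small scale
x_large = rng.normal(0, 1000, n)     # same kind of signal, huge scale
y = 2 * x_small + 0.002 * x_large + rng.normal(0, 1, n)

X = np.column_stack([x_small, x_large])

# Same penalty strength, raw vs standardized features
ridge_raw = Ridge(alpha=10.0).fit(X, y)
ridge_std = Ridge(alpha=10.0).fit(StandardScaler().fit_transform(X), y)

print("raw X coefficients:         ", ridge_raw.coef_)
print("standardized X coefficients:", ridge_std.coef_)
# After standardization both coefficients live on the same scale, so a single
# lambda (equivalently, a single prior variance tau^2) penalizes them comparably.
```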
I've seen everything in this video many, many times, but no one had done as good a job as this in pulling these ideas together in such an intuitive and understandable way. Well done and thank you!
This is my favorite video out of a large set of fantastic videos that you have made. It just brings everything together in such a brilliant way. I keep getting back to it over and over again. Thank you so much!
Man .. I absolutely love the way you explain the math and the breakdown of these concepts! Really really fantastic job ❤
Thanks a ton!
Brilliant and clear explanation, I was struggling to grasp the main idea for a Machine Learning exam but your video was a blessing. Thank you so much for the amazing work!
This is incredible. Clear, well paced and explained. Thank you!
Best of all the videos on Bayesian regression; other videos are so boring and long, but this one has quality as well as ease of understanding. Thank you so much!
Really good explanation. I really like how you gave context and connected all the topics together, and it makes perfect sense, while maintaining the perfect balance b/w math and intuition. Great work. Thank you!
Amazing video! Really clearly explained! Keep em coming!
Glad you liked it!
Awesome explanation! Especially the details on the prior were so helpful!
Glad it was helpful!
This was an excellent introduction to Bayesian Regression. Thanks a lot!
I used to be afraid of Bayesian Linear Regression until I saw this vid. Thank you sooo much
Awesome! You're welcome
Man! What a great explanation of Bayesian Stats. It's all starting to make sense now. Thank you!!!
Wow, killer video. This was a topic where it was especially nice to see everything written on the board in one go. Was cool to see how a larger lambda implies a more pronounced prior belief that the parameters lie close to 0.
I also think it’s pretty cool 😎
This is the best explanation of L1 and L2 I've ever heard
Great video, I learned exactly what I was looking for. I have years of experience with machine learning but Bayesian approaches not so much. In a world full of poorly explained concepts, this video stands out as an exemplar, very well done.
A few thoughts I had as I watched this. I always viewed regularization as a common-sense approach, almost a heuristic. When you consider that you're trying to minimize the loss function while putting some constraint on the betas, it seems like a natural solution to simply add the magnitude, or some function of the magnitude, of the betas to that loss function: by doing that you're making the value of the loss function bigger, so in order for the algorithm to increase the value of beta it would really have to be worthwhile on the error term. Lasso and Ridge use the absolute value and the square, but the key is that they must be a measure of magnitude, i.e. they must be positive, so we could use a 4th degree or 6th degree or any even degree. I'm curious whether each of these would have a Bayesian counterpart?
Also, sigma/tau is given in the Bayesian approach, while lambda is tuned or solved for in the regularization approach, so while the functional form is the same there's no guarantee that lambda will equal (sigma/tau)^2. I do wonder if E(lambda) = (sigma/tau)^2? I.e., if you solved for lambda over many samples from a population, would the average be (sigma/tau)^2, which would mean lambda is an estimator of (sigma/tau)^2?
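The functional-form equivalence itself can be checked numerically. A rough sketch (my own, with made-up data; sigma and tau are treated as known) showing that ridge with lambda = sigma^2/tau^2 matches the posterior mode under a Gaussian prior:

```python
# Sketch: ridge with lambda = sigma^2 / tau^2 coincides with the posterior
# mode (= mean) of Bayesian linear regression with a N(0, tau^2 I) prior.
# The data, sigma, and tau values are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
sigma, tau = 1.0, 0.5
y = X @ beta_true + rng.normal(0, sigma, n)

lam = sigma**2 / tau**2

# Ridge estimate: argmin ||y - X beta||^2 + lam * ||beta||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Bayesian posterior (Gaussian): Cov = (X'X/sigma^2 + I/tau^2)^-1,
# mean = Cov @ X'y / sigma^2, and mode = mean.
post_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(p) / tau**2)
beta_map = post_cov @ (X.T @ y) / sigma**2

print(np.allclose(beta_ridge, beta_map))  # True: identical up to float error
```

Whether a tuned lambda behaves as an estimator of (sigma/tau)^2 on average is a separate question that this sketch does not answer.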
It just blew my mind too. I can feel you, brother. Thank you!
Thanks a lot! Great! I am reading Elements of Statistical Learning and did not understand what they were talking about. Now I got it.
I'd never considered a Bayesian approach to linear regression let alone its relation to lasso/ridge regression. Really enlightening to see!
Thanks!
This is brilliant, man! Brilliant! Literally solved where the lambda comes from!
My mind exploded with this video. Thanks
At last!! I could find an explanation for the lasso and ridge regression lambdas!!! Thank you!!!
Happy to help!
At last!!! Now I can see what lambda was doing in the lasso and ridge regression!! Great video!!
Glad you liked it!
This is truly cool. I had the same thing with the lambda. It’s good to know that it was not some engineering trick.
One of the best explanations out there, thanks :)
Excellent tutorial! I have applied the ridge penalty in the loss function of different models.
However, this is the first time I've understood the mathematical meaning of lambda. It is really cool!
Very cool, the link you explained between regularization and the prior.
Crystal clear! Thank you so much, the explanation is very structured and detailed.
Awesome video. I didn't realize that the L1, L2 regularization had a connection with the Bayesian framework. Thanks for shedding some much needed light on the topic. Could you please also explain the role of MCMC Sampling within Bayesian Regression models? I recently implemented a Bayesian Linear Regression model using PyMC3, and there's definitely a lot of theory involved with regards to MCMC NUTS (No U-Turn) Samplers and the associated hyperparameters (Chains, Draws, Tune, etc.). I think it would be a valuable video for many of us.
And of course, keep up the amazing work! :D
good suggestion!
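Until such a video exists, a minimal PyMC3 sketch (my own; the priors, data, and sampler settings are illustrative only, not a recipe) showing where draws, tune, and chains enter:

```python
# Rough PyMC3 sketch of Bayesian linear regression sampled with NUTS.
# All prior choices and sampler settings here are illustrative.
import numpy as np
import pymc3 as pm

rng = np.random.default_rng(2)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5]) + rng.normal(0, 0.3, n)

with pm.Model() as model:
    beta = pm.Normal("beta", mu=0.0, sd=1.0, shape=p)  # Gaussian prior -> ridge-style shrinkage
    sigma = pm.HalfNormal("sigma", sd=1.0)             # noise scale
    mu = pm.math.dot(X, beta)
    pm.Normal("y_obs", mu=mu, sd=sigma, observed=y)

    # NUTS is the default sampler for continuous models here:
    # tune  = warm-up iterations that are discarded,
    # draws = kept samples per chain,
    # chains = independent runs used for convergence diagnostics.
    trace = pm.sample(draws=1000, tune=1000, chains=2, random_seed=0)
    # trace now holds posterior samples of beta and sigma for inspection.
```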
Your videos are a true gem, and an inspiration even. I hope to be as instructive as you are if I ever become a teacher!
Awesome explanation!
Thank you for sharing this fantastic content.
Glad you enjoy it!
you are so good at this, this video is amazing
Thank you so much!!
Your videos are great. Love the connections you make so that stats is intuitive as opposed to plug and play formulas.
This video is super informative! It gave me the actual perspective on regularization.
Mind blown on the connection between regularization and priors in linear regression
Thanks, man. A really good and concise explanation of the approach (together with the video on Bayesian statistics).
Love this content! More examples like this are appreciated
More to come!
You got a subscriber, awesome explanation. I spent hours learning it from other sources, but no success. You are just great.
Your videos are a Godsend!
Max( P(this is the best vid explaining these regressions | YouTube) )
Super informative and clear lesson! Thank you very much!
Thank you so much for this.
Can you please please do a series on the categorical distribution, multinomial distribution, Dirichlet distribution, Dirichlet process, and finally non-parametric Bayesian tensor factorisation, including clustering of streaming data. I will personally pay you for this. I mean it!!
There are a few videos on these things on YouTube; some are good, some are way too high-level. But no one can explain the way you do.
This simple video has such profound importance!!
This is sooo clear. Thank you so much!
This blew my mind. Thanks
Most insightful! L1 as a Laplace prior toward the end was a bit skimpy, though. Maybe I should watch your LASSO clip. Could you do a video on elastic net? Insight on balancing the L1 and L2 norms would be appreciated.
Yeah, elastic net and a comparison to Ridge/Lasso would be very helpful.
Great video, do you have some sources I can use for my university presentation? You helped me a lot 🙏 thank you!
truly excellent explanation; well done
What a wonderful explanation!!
Glad you think so!
This was incredible, thank you so much.
Incredible explanation!
This is an awesome explanation
Thank you for this amazing video, It clarified many things to me!
In the end I understood it too, finally. A hint for people who also struggle with Bayesian regression like me: do a Bayesian linear regression in Python from any tutorial that you find online; you are going to understand it, trust me. I think that one of the initial problems for a person who faces the Bayesian approach is the fact that you are actually obtaining a posterior *of the weights*! Now it looks kinda obvious, but at the beginning I was really stuck; I could not understand what the posterior was actually doing.
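To make the "posterior of the weights" point concrete, here is a small self-contained sketch (my own; conjugate Gaussian case only, with made-up data and sigma, tau assumed known) that computes that posterior in closed form:

```python
# Sketch: with a Gaussian prior beta ~ N(0, tau^2 I) and Gaussian noise, the
# posterior over the weight vector beta is itself Gaussian and has a closed
# form -- no sampling needed. All values below are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, n)

sigma, tau = 0.5, 1.0                     # assumed known for this sketch
post_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(p) / tau**2)
post_mean = post_cov @ (X.T @ y) / sigma**2

print("posterior mean of beta:", post_mean)
print("posterior std of beta: ", np.sqrt(np.diag(post_cov)))
# The result is a distribution over the weights themselves: a point estimate
# (the mean) plus an uncertainty around each coefficient.
```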
This was awesome, thanks a lot for your time :)
Such a nice explanation. I mean, that's the first time I actually understood it.
Thanks for the video. It's really helpful. I was trying to understand where the regularization terms come from. Now I get it. Thanks.
You are a great teacher thank you for your videos!!
This video is amazing!!! so helpful and clear explanation
You are THE LEGEND
Thanks, that was a good one. Keep up the good work!
Thanks a lottttt! I had so much difficulty understanding this.
There is an error at the beginning of the video: in frequentist approaches X is treated as non-random covariate data and y is the random part, so the high variance of OLS should be expressed as small changes to y => big changes to the OLS estimator.
Changes to the covariate matrix causing big changes to the OLS estimator is more like non-robustness of OLS w.r.t. outlier contamination.
Also, the lambda should be 1/(2τ²), not σ²/τ², since:
ln(P(β)) = -p·ln(τ√(2π)) - ||β||₂²/(2τ²)
Overall this was very helpful cheers!
Thank you, I saw this before but I didn't understand it. Please, where can I find the complete derivation? And maybe you could do a complete series on this topic.
you are a great teacher!!!🏆🏆🏆
Thank you! 😃
You are the go-to for me when I need to understand topics better. I understand Bayesian parameter estimation thanks to this video!
Any chance you can do something on the difference between Maximum Likelihood and Bayesian parameter estimation? I think anyone that watches both of your videos will be able to pick up the details but seeing it explicitly might go a long way for some.
Great video. The relation between the prior and the LASSO penalty was a "wow" moment for me. It would be helpful to see an actual computational example in Python or R. A common problem I see in Bayesian lectures is too much focus on the math rather than showing how much the resulting parameters actually differ, and especially when to consider the Bayesian approach over OLS.
Great video with a very clear explanation. Could you also do a video on Bayesian logistic regression?
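In that spirit, a tiny side-by-side comparison (my own sketch; the data are made up, deliberately small and collinear so that the shrinkage is easy to see):

```python
# Sketch: compare plain OLS with scikit-learn's BayesianRidge on a small,
# noisy, nearly collinear dataset, where shrinkage from the prior shows up.
import numpy as np
from sklearn.linear_model import LinearRegression, BayesianRidge

rng = np.random.default_rng(4)
n, p = 30, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + rng.normal(0, 0.05, n)     # nearly collinear pair
beta_true = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(0, 1.0, n)

ols = LinearRegression().fit(X, y)
bayes = BayesianRidge().fit(X, y)

print("true:          ", beta_true)
print("OLS:           ", np.round(ols.coef_, 2))    # typically erratic on the collinear pair
print("BayesianRidge: ", np.round(bayes.coef_, 2))  # pulled toward zero / stabilized
```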
Legendary video
I wonder if this is related to BIC, Bayesian Information Criterion. It's about choosing the simpler model with fewer variables, similar to regularization.
thank you so much for the great explanation
Thanks a lot for this clear explanation!
Your videos are awesome, so much better than my prof's.
05:29 - "Given the known parameter vector y" - did you misspeak here? Or are you saying the observed data is also called a parameter vector? I would have thought that would be reserved terminology for the betas. It just seems weird to talk about observed data being called parameters, when you have more explicit model parameters (betas) in the same equation.
perfect explanation thank you
Excellent!
Thank you! Cheers!
My mind is blown.....woow...
Nice i never thought that 👍🏼👍🏼
Thanks from Korea, I love you!
You're welcome!!!
Great video! Just a question: where can I find some examples of the algebra?
Beautiful!
Thank you! Cheers!
wonderful stuff! thank you
Thank you very much. Pretty helpful video!
Great thanks! .. was feeling the same discomfort about the origin of these...
It's a great video. Few people manage to boil things down to that point while retaining some of the key steps involved and you nailed it. Now, a few points -- mostly for the benefit of your viewers.
First, reading Tibshirani's original paper, it seems like the Bayesian interpretation is more of a happy coincidence than the primary motivation, and it would make sense because a Bayesian statistician most likely wouldn't bother looking for a maximum a posteriori estimator. They almost always use the mean, median or mode of their marginal posteriors -- or just show you the whole distribution.
Second, there is a frequentist justification for LASSO: see, for example, Zou (2006) "The Adaptive Lasso and Its Oracle Properties." Zou shows that there are some conditions under which LASSO will *correctly* do what it was intended to do -- that is, jointly solve your model selection and estimation problems. However, in general, there's a tension between getting consistent model selection and consistent estimation. Fortunately, Zou also gives a very simple solution that involves only a very mild modification of LASSO: (1) allow each parameter to be penalized slightly differently and (2) cleverly choose the penalty weights (a rough code sketch of this idea follows at the end of this comment). If you do that, LASSO will choose the correct model and yield asymptotic normality (and root-n convergence). If you care about what the coefficients mean and not just the forecast, that might be important. That said, adaptive LASSO will occasionally perform *better* at forecasting.
Third, when you mention linear regression, you should include an asterisk somewhere: you don't need Gaussian iid errors for OLS to have some desirable properties. I'm sure you're well aware of all of that, but I'll include a few examples for the benefit of your viewers:
1. Conditionally mean zero errors (E(e|X)=0) gives you absence of bias (E(bhat|X) = b, the true value);
2. (1) and homoskedastic errors (E(e(i)^2|X) = sigma^2 for all i = 1,...,N) show that OLS is the lowest-variance unbiased linear estimator (Gauss-Markov theorem);
3. (1), (2) and e ~ iid P, where P is another elliptically symmetric distribution (say, a Student t), then your scale-invariant statistics like t and F retain their exact finite-sample distribution (that's in King's 1979 thesis);
4. And there's a whole host of situations where none of the above applies, but you can invoke asymptotic arguments to justify some properties as approximately holding in finite samples. Since your viewers seem to be interested mostly in forecasting, say all X's are covariance stationary (*unconditional* means, variances and covariances are all finite and don't depend on time) and the error term follows a weak white noise process (not serially correlated, not contemporaneously correlated with the X's, and homoskedastic). Then both X'e/N and X'X/N satisfy a law of large numbers, so OLS will be consistent by a continuous mapping argument. Similarly, X'e/sqrt(N) satisfies a central limit theorem, so it will also be asymptotically normal. In other words, you get a property kind of like (1) and another similar to (3), except it applies in a much broader setting.
Fourth, if people are curious, I have two published papers with my coauthors that look into deep comparisons of many forecasting tools in the context of macroeconomic forecasting. Variants of LASSO are included and, for macro data, that concern of dimension reduction seems to be best handled using some kind of factor model (think, PCA or something like that). They can look me up on Scholar to find them.
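On the adaptive LASSO point above, one common way to implement it (a rough sketch, not Zou's exact procedure; the initial estimator, gamma, and alpha below are illustrative choices) is to rescale each column by an initial estimate and run an ordinary LASSO:

```python
# Sketch of adaptive LASSO via feature rescaling: fit an initial estimate
# (OLS here, since n > p), weight each column by |beta_init|^gamma, run an
# ordinary LASSO on the rescaled columns, then map the coefficients back.
# gamma and alpha are illustrative choices, not tuned values.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(5)
n, p = 200, 6
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta_true + rng.normal(0, 1.0, n)

# Step 1: initial consistent estimate
beta_init = LinearRegression().fit(X, y).coef_

# Step 2: per-coefficient weights w_j = |beta_init_j|^gamma
gamma = 1.0
w = np.abs(beta_init) ** gamma

# Step 3: ordinary LASSO on the rescaled design X_j * w_j, then rescale back,
# which is equivalent to penalizing each |beta_j| by 1 / |beta_init_j|^gamma.
lasso = Lasso(alpha=0.1).fit(X * w, y)
beta_adaptive = lasso.coef_ * w

print("true:     ", beta_true)
print("adaptive: ", np.round(beta_adaptive, 2))  # true zeros kept at (near) zero
```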
Wonderfully explained! Mathematicians should be more subscribed to!
Fantastic! You are my savior!
I have a question: why will a beta_j follow a distribution according to a prior? Isn't beta a parameter and hence a constant, so there won't be any distribution of betas; rather, there will be a prior distribution of beta hats, i.e. our estimates of beta? Please reply ❤
Amazing! But where did Ridge and Lasso start from? Were they invented with Bayesian statistics as a starting point, or is that a duality that came later?
Holy shit! This is amazing. Mind blown :)