How can we deal with variables in a data frame that are not continuous, such as categorical and ordinal variables, when using glmnet? I have read many papers, and all of them say the data frame should be converted into a matrix before running glmnet. But, you know, in a matrix all the variables are the same type. So?? Thank you.
Let's say your data looks like this:

Type  Value
A     1
A     2
B     3
B     2
C     4
C     3

and we call this "Data". We want to predict Value based on Type, which is a group (or factor) variable. In R, using glmnet, you can use the command: X
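One way to handle a factor like Type is to expand it into indicator columns before handing the data to glmnet. The sketch below is an assumption about the general approach, not a reconstruction of the truncated command above; it uses base R's model.matrix() on the small table from the comment, and the names Data, X and y are only for illustration.

```r
# The toy data from the comment above.
Data <- data.frame(
  Type  = factor(c("A", "A", "B", "B", "C", "C")),
  Value = c(1, 2, 3, 2, 4, 3)
)

# model.matrix() expands the factor into 0/1 indicator columns.
# Dropping the first column removes the intercept, which glmnet adds itself.
X <- model.matrix(Value ~ Type, data = Data)[, -1]
y <- Data$Value

print(X)  # numeric matrix with columns TypeB and TypeC; Type A is the baseline
```

X and y could then be passed to glmnet() or cv.glmnet(), which need a numeric matrix with at least two columns.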
Hi Josh... An awesome video.. Thanks! @10:15 - the attribute lambda.1se = the lambda that resulted in the simplest model (fewest non-zero parameters) and is within 1 std error of the lambda that had the least sum. I request a better understanding here.... When we say within 1 std error, do we mean the model for which the predicted values are within 1 std error from y? Is that correct? Please clarify. Also, what is a cross validation error?
lambda.min represents the value for lambda that resulted in the lowest cross validation error (the average of the sums of the squared residuals between the observed and predicted values for each iteration of cross validation). lambda.1se is the value for lambda that results in the simplest model such that the cross validation error is within one standard error of the minimum. Choice of lambda.1se vs lambda.min boils down to this...Statistically speaking, the cross validation error for lambda.1se is indistinguishable from the cross validation error for lambda.min, since they are within 1 SE of each other. So we can pick the simpler model without much risk of severely hindering the ability to accurately predict values for 'y' given values for 'x'.
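The one-standard-error rule described in this reply can be checked by hand. This is a minimal sketch, assuming the glmnet package is installed and using a much smaller simulated dataset than the one in the video (the sizes n, p and real_p are made up for speed); cvm and cvsd are the per-lambda mean cross-validation error and its standard error stored in the cv.glmnet object.

```r
library(glmnet)

set.seed(42)
n <- 200; p <- 50; real_p <- 5   # hypothetical sizes, far smaller than the video's
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- apply(x[, 1:real_p], 1, sum) + rnorm(n)

cv.fit <- cv.glmnet(x, y, type.measure = "mse", alpha = 1, family = "gaussian")

# Find the lambda with the lowest mean cross-validation error,
# then add one standard error to that minimum to get the threshold.
best <- which.min(cv.fit$cvm)
threshold <- cv.fit$cvm[best] + cv.fit$cvsd[best]

# lambda.1se is the largest lambda (most shrinkage, simplest model)
# whose cross-validation error is still at or below the threshold.
lambda.1se.by.hand <- max(cv.fit$lambda[cv.fit$cvm <= threshold])

c(lambda.min = cv.fit$lambda.min,
  lambda.1se = cv.fit$lambda.1se,
  by.hand    = lambda.1se.by.hand)
```

Calling plot(cv.fit) draws the same picture: the two dotted vertical lines mark lambda.min and lambda.1se.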
Thanks a lot! I want to ask Do we need to scale the testing and validation set for prediction? What if we only have one sample to use? There’s no way to scale it to obtain a risk score..?
Very intuitive way of teaching. I used lasso for a bunch of categorical variables and it's giving one Beta estimate for each, unlike glm or lm. E.g. the Education variable has many levels (No education, High school, Graduate, Masters and Doctorate) but lasso has given one coefficient, -0.254. How do I interpret these Betas?
You have to transform your categorical variables via one-hot-encoding. See: stats.stackexchange.com/questions/136085/can-glmnet-logistic-regression-directly-handle-factor-categorical-variables-wi/210075
@StatQuest with Josh Starmer Is there a way we can look up the coefficients for the parameters (variables) of the model? To look up which variables are kept in the model and which shrink.
@@galan8115 You use the "coef()" function. The parameters for the "coef()" function are the same as they are for the "predict()" function. For example, here is how to get the parameters for the Ridge regression: coef(alpha0.fit, s=alpha0.fit$lambda.1se)
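To see which variables were kept, the coef() output can be filtered for non-zero entries. A minimal sketch, assuming glmnet is installed; the simulated data is a scaled-down stand-in for the video's dataset, and it fits Lasso (alpha = 1) because Ridge never sets coefficients exactly to zero. The names alpha1.fit, beta.vec and kept are only for illustration.

```r
library(glmnet)

set.seed(42)
n <- 200; p <- 50; real_p <- 5   # hypothetical sizes, far smaller than the video's
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- apply(x[, 1:real_p], 1, sum) + rnorm(n)

alpha1.fit <- cv.glmnet(x, y, type.measure = "mse", alpha = 1, family = "gaussian")

# coef() returns a sparse one-column matrix: the intercept plus one row per variable.
betas <- coef(alpha1.fit, s = alpha1.fit$lambda.1se)

# Convert to a plain named vector and keep only the non-zero entries.
beta.vec <- as.matrix(betas)[, 1]
kept <- beta.vec[beta.vec != 0]
print(kept)
```

When x has no column names, glmnet labels the variables V1, V2, and so on, so the names in kept map back to column positions in x.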
Excellent video, it helped me a lot to understand these regressions. I have a question, in the Elastic Net example we have manipulated the alpha values from 0 to 1 and it gave us that Lasso is still the best. But can you change the lambda values to find different Elastic Net regressions and see if any of them are better than Lasso? What is the value of lambda worked in the last example?
Thanks for the helpful video. In the last part you do a 10-fold CV for each lambda value and for each alpha value, since you are looping and running glmnet for each combination of lambda and alpha. What is the purpose of doing an additional 66 - 34 training-testing split and evaluating the models again? Why not just take the MSEs from the results of the cross-validation?
It is common to reserve a separate set of data, which I call x.test and y.test, that was not used in training at all to give a sense of long term performance. Why? To quote from: datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set "You cannot use the cross validation set to measure performance of your model accurately, because you will deliberately tune your results to get the best possible metric, over maybe hundreds of variations of your parameters. The cross validation result is therefore likely to be too optimistic."
Could you please answer my question? If you want to use omics data, such as a protein/gene matrix, to build a machine learning model, should you remove the correlated variables with Elastic-Net Regression or some method like it?? Thank you a lot!
@@statquest If it is in modeling, the multicollinearity variable should be removed. However, when doing pathway enrichments during the differential expression analysis, it is desirable to obtain the clusters of similar variables. Should the two situations be treated differently?
@@statquest So in the omics field, we also do not include redundant variables in the model, even though those variables are significantly differentially expressed, if they are highly correlated? Thank you for your answer! LOVE U~
@@赵宛冰 It really depends on what you are trying to accomplish. If you are interested in pathway analysis, then all you use are the differentially expressed genes - all of them - even correlated ones. If you are interested in separating samples - using PCA or LDA or k-means clustering or whatever, again, the differentially expressed genes - all of them, even the correlated ones - are very useful. However, if you are trying to use gene expression to predict if someone will develop cancer or heart disease, then it's not clear if the correlated genes will help or not. My guess is that they would still help, and Elastic-Net regression does the best in that situation - it treats correlated variables as a group and reduces their influence as a group.
Hi stat community and Josh! What if I want to compare elastic net and naive elastic net results according to Zou and Hastie (2005) approach which is a simple rescaling of the coefficient, what I have to do? Is the cv.glmnet function the naive version of the elastic net or is it adjusted? Thank you!
Thanks for the informative video! I was wondering why we would split up the sample into a testing and training set and also use the k-fold cross-validation method? Is this standard procedure? It is my understanding that one either uses k-fold cross-validation or the validation set approach. Thanks for your help!
@@statquest Follow up question on this: Am I understanding it correctly that the K-fold validation here is only happening on the train set to estimate the best value for the lambda parameter? Would it make sense to, for example, also use K-fold CV while splitting the data into test/train (like in your video on cross validation)? So in practice, if we divide the data into 5 folds: use 4 folds to do the CV to determine lambda and train the model & 1 fold for testing, then use 4 different folds to do the CV to determine lambda and the fifth one for testing, and so on. Hope I made myself clear enough. Thanks a lot in advance!!
@@wenjiechen101 Use the coef.glmnet() function. For example, to get the coefficients for the first model in this example, we would use: coef.glmnet(alpha0.fit, s=alpha0.fit$lambda.1se) NOTE: This will print out all 5000 coefficients! So you might try head(coef.glmnet(alpha0.fit, s=alpha0.fit$lambda.1se)) to just look at the first 6.
Great Video. Can you give a real life case study example like for linear regression to predict the amount spent by a customer on a e-commerce site or for logistic regression whether the person will default on loan payment, etc.
Really nice video! I would love to know how I can extract the adjusted R² for the linear regression. Thank you Josh, you do a great job with these videos, they are really useful!
Thank you so much for your videos, they always help a lot to understand what really happens behind the formulas! I've been wondering whether one needs to use the foldid argument in the cv.glmnet function in the first (fitting) loop of the elastic net. The documentation says that if alpha is being cross-validated, one may use a fixed foldid vector to make the folding comparable for all alpha values. Is that one of the issues with the glmnet update? Thanks! :)
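The foldid idea from this question can be sketched as follows, assuming glmnet is installed. Every row is assigned to a fold once, and the same assignment is reused for each alpha so the cross-validation errors are computed on identical folds. The data sizes are made up for speed, and this is one possible way to set it up, not the script from the video.

```r
library(glmnet)

set.seed(42)
n <- 200; p <- 50; real_p <- 5   # hypothetical sizes, far smaller than the video's
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- apply(x[, 1:real_p], 1, sum) + rnorm(n)

# Assign every row to one of 10 folds once, then reuse the assignment.
foldid <- sample(rep(1:10, length.out = n))

list.of.fits <- list()
for (i in 0:10) {
  fit.name <- paste0("alpha", i / 10)
  list.of.fits[[fit.name]] <- cv.glmnet(x, y, type.measure = "mse",
                                        alpha = i / 10, family = "gaussian",
                                        foldid = foldid)
}

# Cross-validation error at lambda.1se for each alpha, now on identical folds.
cv.mse <- sapply(list.of.fits, function(fit) fit$cvm[fit$lambda == fit$lambda.1se])
print(round(cv.mse, 3))
```

Without a fixed foldid, each call to cv.glmnet() draws its own random folds, so part of the difference between alphas is just fold-to-fold noise.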
Technically, we don't need the double brackets for assignment, but it helps me keep in mind that this list isn't just a named array. When we want to extract the values from the list, we use the double brackets, '[[', ']]', to say that we just want the value stored there, and not the name and the value.
I'm new to R and though I've looked I wasn't able to find the answer to this: how do I extract the fitted weights from the fit object? also: I've performed this (with a slight modification: I did logistic regression by making the necessary changes according to the video) on my data but the MSE values do not change for different alphas at all - how to interpret this?
This page describes how to access the coefficients: web.stanford.edu/~hastie/glmnet/glmnet_alpha.html Basically, you use the print() function, as in print(alpha0.fit), to determine the value of lambda that you are interested in, and then you use the coef() function to extract those coefficients, as in coef(alpha0.fit, s=0.1) (where 's' is the value for lambda. I'm not sure why it's called 's').
It's a combination of those two goals. We want the simplest model that gives us the lowest MSE. If two different MSEs are indistinguishably low (i.e. within one standard error of each other) then we pick the one with the simpler model. The "within one standard error" threshold suggests that if we collected a new dataset, the model that just got the lowest score on the original dataset may not get the lowest score on the new dataset (since there will be a different MSE for every dataset).
I have a question: LASSO is useful for feature selection, but how did you know from the start that only 15 features (out of the 5000) would be informative? I want to use LASSO to find and use the informative genes.
This video is intended to show how LASSO works and thus, the datasets were created in such a way to highlight feature selection. Thus, we created a dataset where 15 of the features were useful.
This is GREAT, Josh!! Thanks for the video. Do you have by any chance any video or source I can watch/read for running this script when using spatial data? I have a shapefile and I need to run a Lasso model. Any help would be greatly appreciated!!!!
The value for lambda that gives us the lowest mean squared error does not always result in the simplest model. So we find the value for lambda that has a mean squared error that is statistically indistinguishable from the lowest mean squared error (in other words, both mean squared errors are within 1 standard error of each other) that gives us the simplest model. All things being equal (i.e. there is no statistical difference between the mean squared errors) we would like the value for lambda that gives us the simplest model, so that is what we use. Does that make sense? I just woke up from a nap and my brain might not be working yet....
Hey Josh. So after running both a ridge and lasso regression models on my data; the MSE values are the same. What does this mean/say about my models..? I tried looking this up but I can't really find anything.
@@statquest So in other words, my best model is simply just a least squares regression? I also tried using range of alpha values from 0.0 to 1.0, and all 11 MSE values are the same :
I think this video is for a cleaned dataset wherein you have already imputed missing values , gotten rid of outliers to name a few. What you are talking about comes in data preprocessing step
Calculating mse = ((y.test - predicted)^2) ..... Does this mean that we are calculating the variance and then squaring it and finally dividing by the number of test samples ?
@@statquest I am asking in general: is mse = ((y.test - predicted)) a variance??? Because we are subtracting predicted from the real test value.... and in the test value we get variance, right?
Hi Josh, if it doesn't sound too greedy and pushy :-), perhaps would you also consider a video comparing different mediation analysis packages on R, such as "mediation", "mbess", "processr" (R version of the SPSS macro) etc. I (and some other folks) have been trying to figure out their differences and respective advantages and will certainly appreciate your perspective.
Hi Josh, What is "y" here? I see you are taking y=apply(x[,1:real_p],1,sum)+rnorm(n) But why do we need the sum? I'm working on dataset "divorce_margarine" available in package "dslabs" In this case, what should be my "y"?
'y' is the dependent variable. In other words, 'y' is the thing we are trying to predict. In this equation, we are making 'y' a function of the first 15 "independent variables" (the things that are making the predictions). We are using Regularization to filter out the other variables that are not related to 'y'. In your case, 'y' should be whatever you are trying to predict.
Can you add a code to see the predicted Y for each participant? Such that we can compare the actual-y vs. predicted-y? Can you also add the code for the residuals? I am interested to categorize participants based on their residuals. Thank you so much (I keep my fingers crossed that you see my comment and help me to figure it out). Samaneh
The predicted y values are stored in alpha0.predicted (or alpha1.predicted, or alphaWhateverWeSetAlphaTo.predicted). The residuals for alpha0.predicted = y.test - alpha0.predicted.
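Putting the pieces of this reply together, a side-by-side table of actual values, predictions and residuals might look like the sketch below. It assumes glmnet is installed and uses a scaled-down simulated dataset; the variable names follow the ones mentioned in the thread (x.train, y.test, alpha0.fit, alpha0.predicted), while results is a made-up name.

```r
library(glmnet)

set.seed(42)
n <- 300; p <- 50; real_p <- 5   # hypothetical sizes, far smaller than the video's
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- apply(x[, 1:real_p], 1, sum) + rnorm(n)

# Hold out roughly one third of the rows for testing.
train_rows <- sample(1:n, round(0.66 * n))
x.train <- x[train_rows, ];  x.test <- x[-train_rows, ]
y.train <- y[train_rows];    y.test <- y[-train_rows]

alpha0.fit <- cv.glmnet(x.train, y.train, type.measure = "mse",
                        alpha = 0, family = "gaussian")
alpha0.predicted <- predict(alpha0.fit, s = alpha0.fit$lambda.1se, newx = x.test)

# One row per test observation: observed value, prediction, and residual.
results <- data.frame(actual    = y.test,
                      predicted = as.numeric(alpha0.predicted),
                      residual  = y.test - as.numeric(alpha0.predicted))
head(results)

mean(results$residual^2)  # the test-set mean squared error
```

Sorting or binning results by the residual column is one way to group observations by how well the model predicted them.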
Thank you for your video! It's very helpful! Is it possible that we might get different results because of the update of the package or functions? I got slightly different results, and that altered which method was best. When alpha = 0.9, we got the lowest value at 1.182, compared to mse being 1.184 when alpha = 1. Is there any possible explanation? Thank you!
@@Jas-ti7hr I just reran my own script and got the same values you got, so there must be a change in the updated version of the package. This is a little disconcerting, but it is what it is. Thanks for pointing it out.
NOTE: There seems to be an update to the elastic net package and if you run the code, your results might not be exactly what I got in the video. However, the concepts are still the same.
Support StatQuest by buying my books The StatQuest Illustrated Guide to Machine Learning, The StatQuest Illustrated Guide to Neural Networks and AI, or a Study Guide or Merch!!! statquest.org/statquest-store/
ok
Really appreciate how you explain every argument of the function. Such a life saver!
Thanks!
This is literally one of the best channels on youtube! This channel will be massive in a couple of years.
Thank you so much! I really hope that it continues to grow. I have a lot of fun working on these videos.
greetings from th future. You weren't wrong!
Thank you so much for this amazing work you are doing and your wonderful explanations! You cannot imagine the help you are providing for stat-ungifted students like me! Greetings from Belgium!
Wow, thank you!
I love how every time I feel super anxious trying to find out solutions for my questions, you being a lifesaver and also make me laugh lol
BAM! :)
StatQuest killing it yet again. Literally using every video to supplement my grad degree.
Awesome!!! I'm glad my videos are so helpful. :)
Just a beginner to explore Elastic net regression. Your videos are the best I found to get around with all the concepts. Thanks for your works. They will help me with my research! All the best :)
Glad it was helpful!
I really love your videos! They are so easy to understand! I could hardly understand what lecturers taught in lectures, but I could quickly understand your video with lively pictures and detailed annotations! I love your beautiful song as well!
Happy to hear that!
Oh my god... I've been struggling for hours and read so many VERY THICK books and gone through so many videos that I was honestly just sick of things! Then I found you. Where have you been all my life? XD Thank you so much! This was both silly enough that it cheered me up after so much frustration AND it was slow and direct enough for even me to understand! Thank you so much!
Awesome!!! I’m glad you like the videos. :)
Most comprehensive explanation of how to implement ridge / lasso in R I can find. Thanks!
Hooray! :)
one of the best videos ive watched for my upper division statistics classes
Thank you!
My entire project for courses done based on your concepts. Thank you very much
Bam! :)
Josh this is so clear, I don't know why you don't have many reviews with R. I hope we get similar R content! Thank you.
Thank you very much! :)
Thank you so much Josh! before watching your videos this was literally impossible for me to learn. I really appreciate your work.
Happy to help!
You are the literal 🐐 of learning anything I’ve ever needed to know for statistics and modeling in R
Thank you!
Fantastic video. Thanks Josh. You have made it so simple and easy to understand.
Thank you! :)
Have you considered making a book that includes your explanations of concepts along with code examples? I really think it would make its way to the top along with ESL and ISLR. Thanks for your work!
Wow! That's a huge complement. Maybe one day I'll make a book. Right now I only have enough spare time to make these videos - but maybe they will be successful enough that I can work on teaching stats and ML full time.
@@statquest You definitely should! There is a huge amount of people getting into Analytics and Machine Learning without a proper quantitative background who struggle with textbooks like ISLR.
And with a CD in the back to play the introduction songs for each statistical test or chapter. Like in the good old days!
Thanks so much for your series of videos, and for this tutorial paradigm. You are always BAM!!!!
Hooray! Thank you!
I am a devoted fan of your channel, thank you very much.
Thank you!
For anyone who is a bit confused, giving a concrete example, say you're trying to predict if someone will commit fraud. Y contains records of people who have and have not committed fraud. Meanwhile, X contains the "features" about those people, like their sex, age, income, and so on. Each feature is a column in X, and each row is a person. You are trying to predict if someone will commit fraud, so you put in these features (x.train) into a linear regression algorithm, with if those people actually did commit fraud (y.train).
If anyone is curious why glmnet requires x.train to be a 2 column+ matrix (two or more features), the package maintainer Trevor Hastie said, "glmnet is designed to select variables from a (large) collection. Allowing for 1 variable would have created a lot of edge case programming, and I was not interested in doing that. Sorry!"
This is a great comment. Thank you! :)
@@statquest You're welcome! ^_^
Excellent Explanations. This was very much useful for my assignment. Thanks a million!
Glad it was helpful!
You just make it very clear and understandable. Thank you Josh!
Would have been nice if you showed the final model and prediction accuracy over a holdout test set
I can't love your videos more
pls keep making videos for us!
greetings from Germany
Thank you very much!!! :)
My wish came true! Thank you Josh!
Hooray! It took one week longer than I hoped, but better late than never! :)
You're being far too generous, man. Does anyone else explain things this well?
Thank you! :)
amazing I learnt so much which i could not learn in class,
Glad it was helpful!
the best channel ever! you save my life!
Bam! :)
i like how the seed is the same as the hitchhiker's guide's answer to life... lol.
Exactly! That’s my favorite seed.
Top, as usual! Exactly what I needed! Thanks, my friend
Hooray! :)
BAAAAAAM! Thanks a lot Josh Starmer!
StatQuest gang rise up
:)
Clearly explained video!!!
Hi, I am doing an elastic-net regression to logistic regression to see whether the result is yes or no. My question is that at the video 16:17, how can I calculate the deviance instead of mse by using categorical "y.test" and numeric "predicted"?
Hope to see your reply soon. Thanks!!!
See: www.sthda.com/english/articles/36-classification-methods-essentials/149-penalized-logistic-regression-essentials-in-r-ridge-lasso-and-elastic-net/
@@statquest Thanks!!! This helped me a lot!!!
This video is awesome, as the others. Thank you!
Thanks! :)
Truly amazing work 👏 🙌 👌
Thank you so much 😀!
This is amazing! Thank you for making these!
I agree with the opinion that this is one of the best channels on statistical YouTube!
I'd like to ask how you do the same in logistic and Cox regression. Specifically, how do you obtain the line -> mean
Example: With Ridge Cox Regression
alpha0.fit
Unfortunately I've only used 'mse' and haven't tried logistic or cox regression.
Hi @shattowsky!! Did you ever figure out how to do this step for logistic regression??? I'm in the same boat - I really hope you see this and respond!!
11:35 , When choosing `family = "multinomial"`, should I check deviance rather than MSE?
Why do we divide the data into training and testing if cv.glmnet already does cross validation? Or in other words, shouldn't we put all the data into the cv.glmnet function and set it to 3-fold?
You could do it that way, but often people like to reserve a small amount of data for validation, as done here.
Sir you must be a full time professor in any reputed university. You can explain the math to a nonmath person. I have found your lecture the best lecture till date. thanks a lot for posting it.
sir I have a query could you please guide me where should I start to study for applying the lasso and ridge for panel data.. god bless you sir.. Sir please help me...
Great explanation!
Thanks!
Wow nice music and lecture!
Thanks! :)
Hey Josh.. Your videos are too good.. Simple yet explanatory.. We are lucky to have you here.. i wanted to ask a question to you on this video.. why do we generally use to have training set as 2/3 or maybe 70% of total data.. why not any other number.. suppose if we have 10 million rows and i want to train a model then 50% of the data as training set still gives us a good amount of data to train.. then why always 70:30..
70/30 is just a convention, it's not a rule. 70/30 tends to work well in practice, but that's the only justification for using it.
Please could you do a lesson to explain us how to perform the elastic net with logistic regression? Should we use differences in likelihood instead of mean squared error? Thanks!!
Although I'm not certain how it is done, I think you are correct - that we simply replace the SSR with the log likelihood.
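For what it's worth, glmnet can be asked to do this directly. The sketch below is one possible setup, assuming glmnet is installed: family = "binomial" fits a penalized logistic regression and type.measure = "deviance" makes cross-validation use deviance (minus twice the log-likelihood) instead of MSE. The yes/no outcome is simulated, and names such as prob, p.hat, y01 and test.deviance are made up for illustration.

```r
library(glmnet)

set.seed(42)
n <- 400; p <- 20; real_p <- 5   # hypothetical sizes
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
# Hypothetical yes/no outcome driven by the first 5 columns.
prob <- 1 / (1 + exp(-apply(x[, 1:real_p], 1, sum)))
y <- factor(ifelse(runif(n) < prob, "yes", "no"))

train_rows <- sample(1:n, round(0.66 * n))
x.train <- x[train_rows, ];  x.test <- x[-train_rows, ]
y.train <- y[train_rows];    y.test <- y[-train_rows]

# family = "binomial" fits logistic regression; the CV error is now deviance.
fit <- cv.glmnet(x.train, y.train, type.measure = "deviance",
                 alpha = 0.5, family = "binomial")

# type = "response" returns the predicted probability of the second level ("yes").
p.hat <- as.numeric(predict(fit, s = fit$lambda.1se, newx = x.test, type = "response"))

# Mean binomial deviance on the test set, the counterpart of test-set MSE.
y01 <- as.numeric(y.test == "yes")
test.deviance <- -2 * mean(y01 * log(p.hat) + (1 - y01) * log(1 - p.hat))
test.deviance
```

Lower is better, just as with MSE, so the same loop over alpha values can compare test.deviance in place of the mean squared error.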
Loved it! Thank you so much:)
Thanks! :)
You are simply the best !!!!
Thank you! :)
Thank you so much for the video!
I know the video is 4 years old, but just in case someone reads this, at 11:03 as you explained in previous videos, Ridge doesn't eliminate parameters. What is really happening here? I couldn't understand
Thank you again!
So, Lasso and Elastic-Net can both remove parameters, and using lambda.1se gives us the model that performs within 1 standard error of the absolute best, but has the fewest parameters. However, we will also use lambda.1se for Ridge, even though ridge can't remove parameters, just to be consistent.
Apologies if this is an inane question, but is it actually necessary to do the partitioning of a training set (as opposed to simply "ideal" to do it)? I watched the creator of the package's webinar and have looked over the package documentation, and it didn't seem to be a requirement that I partition my data like you do in this example (ie, deliberately creating a training subset). It appears that he performs the cross-validation on the same dataset as the elastic net regression. I really don't think I have enough data to create a training set...I have an unfortunately small sample size (despite extensive efforts - it's a tough field) and a lot of explanatory variables (many of which are correlated). (And it's a categorical DV, so I'm doing a multinomial model, fwiw.)
Always just do what you can with the data you have. If you don't have enough data for separate training and testing datasets, then don't split the data up.
@@statquest thank you, I really appreciate your reply!!
How can I compare models in terms of the importance of variables?
Thanks for this video!!!
Love your videos! Would be great if you had this one for python coders also
Hi again - I was wondering if you knew how to do this using LOOCV. I emailed Trevor Hastie, explaining my small dataset issue, and he said that LOOCV would make more sense for me, then. But I've looked everywhere and can't find any tutorials or example code that show how to do this. I mean, I know that I'd set the number of folds to be the same as my sample size, but I don't know how else to set the R code up, which steps to skip, etc...all of the examples seem to do the train/test splitting. I understand if it's too much to ask, but any guidance at all would be greatly appreciated!!
Unfortunately I can't help you with your code.
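For readers with the same small-sample problem, a rough sketch of leave-one-out cross validation with cv.glmnet(), assuming glmnet is installed. The data are simulated and the object names are hypothetical. Setting nfolds to the number of rows gives one observation per fold, and grouped = FALSE is the option meant for folds this small. No train/test split is made here; all rows go into the cross validation.

```r
library(glmnet)  # assumes the glmnet package is installed

set.seed(42)
n <- 40   # deliberately small sample
p <- 10
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- x[, 1] + x[, 2] + rnorm(n)

# nfolds = n means each fold holds exactly one observation (LOOCV)
loocv.fit <- cv.glmnet(x, y, alpha = 0.5, nfolds = n, grouped = FALSE,
                       type.measure = "mse", family = "gaussian")

loocv.fit$lambda.min               # lambda with the lowest LOOCV error
coef(loocv.fit, s = "lambda.min")  # coefficients at that lambda
```

For a categorical outcome, family = "multinomial" would replace family = "gaussian", and a different type.measure would probably be needed; that part is untested here.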
Josh, thank you for this great video. How can we extract 15 parameters used for predicting outcome from the fit model?
Uhuuu Thanks again for opening new windows for us. Question: for Logistic regression, can I just use classification accuracy to compare models? Logloss would be the counterpart for MSE, but you know... try to tell a CEO the logloss is higher for a given alpha lol
You can use a confusion matrix for logistic regression and all associated metrics (accuracy) th-cam.com/video/Kdsp6soqA7o/w-d-xo.html . You can also use ROC/AUC: th-cam.com/video/4jRBRDbJemM/w-d-xo.html
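A small sketch of the confusion matrix and accuracy in base R. To keep it self-contained it uses a plain glm() logistic regression on simulated data rather than a glmnet fit, so the data and names are assumptions; the same table() call should work on predictions from any model.

```r
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n))
d$y <- rbinom(n, size = 1, prob = 1 / (1 + exp(-2 * d$x1)))

train <- d[1:140, ]
test  <- d[141:200, ]

fit <- glm(y ~ x1, data = train, family = binomial)

# Predicted probabilities, then classes using a 0.5 cutoff
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Confusion matrix: rows are observed classes, columns are predicted classes
conf <- table(observed  = factor(test$y, levels = 0:1),
              predicted = factor(pred, levels = 0:1))
conf

# Accuracy = fraction of test samples classified correctly
accuracy <- mean(pred == test$y)
accuracy
```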
How can we deal with variables in a data frame that are not continuous, such as categorical and ordinal variables, when using glmnet? I have read many papers, and all of them say the data frame should be converted into a matrix before running glmnet. But in a matrix, all the variables have the same type. So what should I do? Thank you.
Let's say your data looks like this:
Type Value
A 1
A 2
B 3
B 2
C 4
C 3
and we call this "Data"
And we want to predict Value based on Type, which is a group (or factor) variable. In R, using glmnet, you can use the command:
X
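The command in the reply above appears to have been cut off. One common way to turn a factor into a numeric matrix in base R is model.matrix(); the sketch below applies it to the example data. This is an assumption about a workable approach, not necessarily the command the reply originally contained.

```r
Data <- data.frame(
  Type  = factor(c("A", "A", "B", "B", "C", "C")),
  Value = c(1, 2, 3, 2, 4, 3)
)

# model.matrix() expands the factor into 0/1 dummy columns.
# [, -1] drops the intercept column, since glmnet fits its own intercept.
X <- model.matrix(Value ~ Type, data = Data)[, -1]
y <- Data$Value

X  # columns TypeB and TypeC; level A is the baseline (both columns 0)
```

X is now an all-numeric matrix that can be passed as the x argument, with y as the response.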
Sorry for the late reply, thank you very much.
Hi Josh... An awesome Video.. Thanks
@10:15 - the attribute lambda.1se = the lambda that resulted in the simplest model (fewest non-zero parameters) and is within 1 standard error of the lambda that had the smallest sum.
I'd like a better understanding here...
When we say within 1 standard error, do we mean the model for which the predicted values are within 1 standard error of y? Is that correct?
Please clarify.
Also, what is a cross validation error?
lambda.min represents the value for lambda that resulted in the lowest cross validation error (the average of the sums of the squared residuals between the observed and predicted values for each iteration of cross validation). lambda.1se is the value for lambda that results in the simplest model such that the cross validation error is within one standard error of the minimum.
Choice of lambda.1se vs lambda.min boils down to this...Statistically speaking, the cross validation error for lambda.1se is indistinguishable from the cross validation error for lambda.min, since they are within 1 SE of each other. So we can pick the simpler model without much risk of severely hindering the ability to accurately predict values for 'y' given values for 'x'.
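To make the "within one standard error" rule concrete, here is a sketch that recomputes lambda.1se by hand from the cross validation results stored in a cv.glmnet object (cvm is the mean cross validation error for each lambda, and cvsd is its estimated standard error). It assumes glmnet is installed and uses simulated data; manual.1se and threshold are made-up names.

```r
library(glmnet)  # assumes the glmnet package is installed

set.seed(42)
n <- 200
p <- 50
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rowSums(x[, 1:5]) + rnorm(n)

cv.fit <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")

# Position of the lowest mean cross validation error (this is lambda.min)
i.min <- which.min(cv.fit$cvm)

# Threshold: the minimum error plus one standard error of that minimum
threshold <- cv.fit$cvm[i.min] + cv.fit$cvsd[i.min]

# lambda.1se: the largest lambda (strongest penalty, simplest model)
# whose cross validation error is still at or below the threshold
manual.1se <- max(cv.fit$lambda[cv.fit$cvm <= threshold])

c(lambda.min = cv.fit$lambda.min,
  lambda.1se = cv.fit$lambda.1se,
  manual.1se = manual.1se)
```

So the standard error refers to the cross validation error itself, not to the distance between individual predictions and y.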
Thank You Very Much
You're a hero man
Thank you! :)
Thank you Josh!
You're welcome! :)
Super great video!
Is the MSE formula the same for logistic regression?
Thanks a lot!
I want to ask
Do we need to scale the testing and validation set for prediction?
What if we only have one sample to use? There’s no way to scale it to obtain a risk score..?
You can remember the scaling coefficients from the training data and apply them to the testing data.
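A base R sketch of that idea, with simulated data and made-up names. scale() stores the column means and standard deviations it used as attributes, and those can be reused on new data, even on a single sample.

```r
set.seed(42)
x.train <- matrix(rnorm(50 * 3, mean = 10, sd = 2), ncol = 3)
x.test  <- matrix(rnorm(1 * 3, mean = 10, sd = 2), ncol = 3)  # one new sample

# Scale the training data and remember the coefficients that were used
x.train.scaled <- scale(x.train)
centers <- attr(x.train.scaled, "scaled:center")  # training column means
scales  <- attr(x.train.scaled, "scaled:scale")   # training column SDs

# Apply the TRAINING means and SDs to the test data
x.test.scaled <- scale(x.test, center = centers, scale = scales)
x.test.scaled
```

Note that glmnet standardizes the predictors internally by default (standardize = TRUE), so manual scaling like this matters mainly when you scale the data yourself before fitting.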
Very intuitive way of teaching. I used lasso for a bunch of categorical variables and it's giving one Beta estimate for each variable, unlike glm or lm. E.g., the Education variable has many levels - No education, High school, Graduate, Masters and Doctorate - but lasso has given one coefficient, -0.254. How do I interpret these Betas?
You have to transform your categorical variables via one-hot-encoding. See: stats.stackexchange.com/questions/136085/can-glmnet-logistic-regression-directly-handle-factor-categorical-variables-wi/210075
@StatQuest with Josh Starmer Is there a way we can look up the coefficients for the parameters (variables) of the model, to see which variables are kept in the model and which shrink?
Sure, just compare the optimal coefficients to the original least-squares fit.
@@statquest Yes, but how do I access the coefficients, I mean :D.
@@galan8115 You use the "coef()" function. The parameters for the "coef()" function are the same as they are for the "predict()" function. For example, here is how to get the parameters for the Ridge regression: coef(alpha0.fit, s=alpha0.fit$lambda.1se)
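Building on the coef() call above, a sketch of how to list only the variables that a Lasso fit kept (the non-zero coefficients). It assumes glmnet is installed; the simulated data and the names betas and kept are for illustration only.

```r
library(glmnet)  # assumes the glmnet package is installed

set.seed(42)
n <- 200
p <- 50
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rowSums(x[, 1:5]) + rnorm(n)   # only the first 5 columns are useful

alpha1.fit <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")

# coef() returns a sparse matrix with p + 1 rows (intercept first)
betas <- coef(alpha1.fit, s = alpha1.fit$lambda.1se)

# Convert to a plain named vector and keep only the non-zero entries
betas <- betas[, 1]
kept <- betas[betas != 0]
kept
```

For Ridge regression (alpha = 0) the same code runs, but essentially every coefficient will be non-zero, since Ridge shrinks coefficients without removing them.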
@@statquest Thank you!
Excellent video, it helped me a lot to understand these regressions. I have a question, in the Elastic Net example we have manipulated the alpha values from 0 to 1 and it gave us that Lasso is still the best. But can you change the lambda values to find different Elastic Net regressions and see if any of them are better than Lasso?
What is the value of lambda worked in the last example?
cv.glmnet() automatically tests different values for lambda for us and uses cross validation to find the best one. See: 9:11
@@statquest You are totally correct, I forgot that part. Thank you very much, the video is perfect and very well explained!!!!
Suppose I am fitting an ordinary Least Squares model to my data set and found it has multicollinearity. Can i use the steps that you just discussed?
Yes!
Thanks for the helpful video. In the last part you do a 10-fold CV for each lambda value and for each alpha value, since you are looping and running glmnet for each combination of lambda and alpha. What is the purpose of doing an additional 66 - 34 training-testing split and evaluating the models again? Why not just take the MSEs from the results of the cross-validation?
It is common to reserve a separate set of data, which I call x.test and y.test, that was not used in training at all to give a sense of long term performance. Why? To quote from: datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set
"You cannot use the cross validation set to measure performance of your model accurately, because you will deliberately tune your results to get the best possible metric, over maybe hundreds of variations of your parameters. The cross validation result is therefore likely to be too optimistic."
@@statquest This makes sense, thanks for responding quickly and for the answer!
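A sketch of the two numbers being discussed, assuming glmnet is installed and using simulated data: the cross validation error that cv.glmnet() reports for the chosen lambda, and the MSE on a test set that played no part in choosing lambda. The names cv.mse and test.mse are made up.

```r
library(glmnet)  # assumes the glmnet package is installed

set.seed(42)
n <- 300
p <- 50
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rowSums(x[, 1:5]) + rnorm(n)

# Hold out about a third of the rows; they are never used for fitting
train_rows <- sample(1:n, round(0.66 * n))
x.train <- x[train_rows, ]
x.test  <- x[-train_rows, ]
y.train <- y[train_rows]
y.test  <- y[-train_rows]

fit <- cv.glmnet(x.train, y.train, alpha = 0.5, type.measure = "mse")

# Cross validation error at the chosen lambda (used for tuning)
cv.mse <- fit$cvm[fit$lambda == fit$lambda.1se]

# Error on data that was not used for training or tuning
predicted <- predict(fit, s = fit$lambda.1se, newx = x.test)
test.mse <- mean((y.test - predicted)^2)

c(cv.mse = cv.mse, test.mse = test.mse)
```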
Could you please answer my question? If you want to use omics data, such as a protein/gene matrix, to build a machine learning model, should you remove the correlated variables with Elastic-Net Regression or a similar method?
Thank you a lot!
It depends on the method. Removing noise from your data usually helps, though, so it's not a bad idea to try it.
@@statquest If it is in modeling, the multicollinearity variable should be removed. However, when doing pathway enrichments during the differential expression analysis, it is desirable to obtain the clusters of similar variables. Should the two situations be treated differently?
@@赵宛冰 Of course, those are two separate problems.
@@statquest So in the omics field, we also do not include redundant variables in the model, even if those variables are significantly differentially expressed, when they are highly correlated? Thank you for your answer! LOVE U~
@@赵宛冰 It really depends on what you are trying to accomplish. If you are interested in pathway analysis, then all you use are the differentially expressed genes - all of them - even correlated ones. If you are interested in separating samples - using PCA or LDA or k-means clustering or whatever, again, the differentially expressed genes - all of them, even the correlated ones - are very useful. However, if you are trying to use gene expression to predict if someone will develop cancer or heart disease, then it's not clear if the correlated genes will help or not. My guess is that they would still help, and Elastic-Net regression does the best in that situation - it treats correlated variables as a group and reduces their influence as a group.
Hi stat community and Josh! What if I want to compare elastic net and naive elastic net results according to Zou and Hastie (2005) approach which is a simple rescaling of the coefficient, what I have to do? Is the cv.glmnet function the naive version of the elastic net or is it adjusted? Thank you!
Hopefully someone else can answer this question! :)
Thanks for the informative video! I was wondering why we would split the sample into a testing and a training set and also use the k-fold cross-validation method. Is this standard procedure? It is my understanding that one uses either k-fold cross-validation or the validation set approach.
Thanks for your help!
It's actually quite common to combine both methods.
@@statquest Thanks for the quick response. This information is very helpful!
@@statquest Follow up question on this: Am I understanding it correctly that the K-fold validation here is only happening on the train set to estimate the best value for the lambda parameter? Would it make sense to, for example, also use K-fold CV while splitting the data into test/train (like in your video on cross validation)? So in practice, if we divide the data into 5 folds: use 4 folds to do the CV to determine lambda and train the model, and 1 fold for testing, then use 4 different folds to do the CV to determine lambda and the fifth one for testing, and so on. Hope I made myself clear enough. Thanks a lot in advance!!
@@Pablovgd If you have a lot of data, you can do it that way.
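The scheme described in the question is often called nested cross validation. A rough sketch, assuming glmnet is installed and using simulated data (outer.fold and test.mse are hypothetical names): the outer loop rotates which fifth of the data is held out for testing, and cv.glmnet() runs its own inner cross validation on the remaining four fifths to choose lambda.

```r
library(glmnet)  # assumes the glmnet package is installed

set.seed(42)
n <- 200
p <- 30
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rowSums(x[, 1:5]) + rnorm(n)

# Randomly assign every row to one of 5 outer folds
outer.fold <- sample(rep(1:5, length.out = n))
test.mse <- numeric(5)

for (k in 1:5) {
  test.idx <- which(outer.fold == k)

  # Inner cross validation (on the other 4 folds) picks lambda
  fit <- cv.glmnet(x[-test.idx, ], y[-test.idx],
                   alpha = 0.5, type.measure = "mse")

  # Evaluate on the held-out fold
  pred <- predict(fit, s = fit$lambda.1se, newx = x[test.idx, ])
  test.mse[k] <- mean((y[test.idx] - pred)^2)
}

test.mse        # one test MSE per outer fold
mean(test.mse)  # average performance estimate
```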
Thank you so much for the video! May I ask how we can know what the fitted model looks like after regularization?
I'm not really sure what you mean by "looks like". Are you asking how to extract the specific parameter estimates or how to draw a graph of the model?
@@statquest Yes, I meant how to extract the specific parameter estimates. Especially when we need to do the interpretation.
@@wenjiechen101 Use the coef.glmnet() function. For example, to get the coefficients for the first model in this example, we would use: coef.glmnet(alpha0.fit, s=alpha0.fit$lambda.1se) NOTE: This will print out all 5000 coefficients! So you might try head(coef.glmnet(alpha0.fit, s=alpha0.fit$lambda.1se)) to just look at the first 6.
For more details, see: cran.r-project.org/web/packages/glmnet/glmnet.pdf
@@statquest Thank you so much!
Great Video. Can you give a real life case study example like for linear regression to predict the amount spent by a customer on a e-commerce site or for logistic regression whether the person will default on loan payment, etc.
Really nice video!
I would love to know how can I extract the adjusted R² for the linear regression.
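For an ordinary least-squares fit made with lm(), the adjusted R² is stored in the model summary. A short base R sketch with simulated data (the data and names are made up); this covers plain linear regression only, not the penalized glmnet fits from the video.

```r
set.seed(42)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)
s <- summary(fit)

s$r.squared      # ordinary R^2
s$adj.r.squared  # adjusted R^2

# The same value by hand, where k is the number of predictors
k <- 2
manual.adj <- 1 - (1 - s$r.squared) * (n - 1) / (n - k - 1)
manual.adj
```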
Thank you Josh, you do a great job with these videos, they are really useful!
Hey Josh could you please do a video on comparing Bagging, Boosting and Stacking?
Yes! That is on the to-do list. Hopefully I can get to it soon.
@@statquest Awesome!
Hi, just God bless you.
Thank you!
Thank you so much for your videos, they always help a lot to understand what really happens behind the formulas!
I've been wondering whether one needs to use the foldid argument in the cv.glmnet function in the first (fitting) loop of the elastic net. The documentation says that if alpha is being cross-validated, one may use a fixed foldid vector to make the folding comparable for all alpha values. Is that one of the issues with the glmnet update? Thanks! :)
Possibly!
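For reference, a sketch of what using foldid could look like, assuming glmnet is installed and using simulated data. The fold assignments are created once and passed to every cv.glmnet() call, so each alpha is evaluated on exactly the same folds. Whether this explains the changed results after the package update is not confirmed.

```r
library(glmnet)  # assumes the glmnet package is installed

set.seed(42)
n <- 200
p <- 50
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rowSums(x[, 1:5]) + rnorm(n)

# One fixed assignment of rows to 10 folds, reused for every alpha
foldid <- sample(rep(1:10, length.out = n))

alphas <- c(0, 0.5, 1)
fits <- lapply(alphas, function(a) {
  cv.glmnet(x, y, alpha = a, foldid = foldid, type.measure = "mse")
})
names(fits) <- paste0("alpha", alphas)

# Lowest cross validation error for each alpha, on identical folds
sapply(fits, function(f) min(f$cvm))
```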
At 14:55 I don't understand the notation of the double list.
Technically, we don't need the double brackets for assignment, but it helps me keep in mind that this list isn't just a named array. When we want to extract the values from the list, we use the double brackets, '[[', ']]', to say that we just want the value stored there, and not the name and the value.
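A tiny base R illustration of the difference between single and double brackets on a list. The list stores stand-in numbers instead of fitted models, and the names only imitate the naming pattern, so treat them as assumptions.

```r
list.of.fits <- list()

for (i in 0:2) {
  fit.name <- paste0("alpha", i / 2)   # "alpha0", "alpha0.5", "alpha1"
  list.of.fits[[fit.name]] <- i / 2    # stand-in value instead of a model
}

# Single brackets return a (smaller) list: the name and the value
single <- list.of.fits["alpha0.5"]
class(single)   # a list of length 1

# Double brackets return just the value stored under that name
double <- list.of.fits[["alpha0.5"]]
class(double)   # the stored number itself
```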
I'm new to R and though I've looked I wasn't able to find the answer to this: how do I extract the fitted weights from the fit object?
also: I've performed this (with a slight modification: I did logistic regression by making the necessary changes according to the video) on my data but the MSE values do not change for different alphas at all - how to interpret this?
This page describes how to access the coefficients: web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
Basically, you use the print() function, as in print(alpha0.fit), to determine the value of lambda that you are interested in, and then you use the coef() function to extract those coefficients, as in coef(alpha0.fit, s=0.1) (where 's' is the value for lambda. I'm not sure why it's called 's').
@@statquest thanks a bunch. really useful vids! very generous of you to make them.
I suppose our end goal is not the simplest model but minimal test data error during the cross validation
It's a combination of those two goals. We want the simplest model that gives us the lowest MSE. If two different MSEs are indistinguishably low (i.e. within one standard error of each other) then we pick the one with the simpler model. The "within one standard error" threshold suggests that if we collected a new dataset, the model that just got the lowest score on the original dataset may not get the lowest score on the new dataset (since there will be a different MSE for every dataset).
I have question, LASSO is useful for feature selection, how did you know from the start that only 15 feature (out of the 5000) will be informative ? I want to use LASSO to find and use the informative genes.
This video is intended to show how LASSO works and thus, the datasets were created in such a way to highlight feature selection. Thus, we created a dataset where 15 of the features were useful.
clear ever! thank you!
Thank you so much!
You're welcome!
This is GREAT, Josh!! Thanks for the video. Do you have by any chance any video or source I can watch/read for running this script when using spatial data? I have a shapefile and I need to run a Lasso model. Any help would be greatly appreciated!!!!
Not that I know of. :(
@@statquest Thanks anyway :) Just for the record: seems that the function "glmnetcv" package "spm2" works for that.
@@AdrianaCastilloC bam!
Hello (: about the "real_p", is there a reason you choose 15 instead of other numbers?
What time point, minutes and seconds, are you asking about?
I could not understand the reason why you used the lambda.1se
The value for lambda that gives us the lowest mean squared error does not always result in the simplest model. So we find the value for lambda that has a mean squared error that is statistically indistinguishable from the lowest mean squared error (in other words, both mean squared errors are within 1 standard error of each other) that gives us the simplest model. All things being equal (i.e. there is no statistical difference between the mean squared errors) we would like the value for lambda that gives us the simplest model, so that is what we use. Does that make sense? I just woke up from a nap and my brain might not be working yet....
I appreciate your detailed explanations, with great visuals and examples. Can you also upload the code in Python?
I hope to do that one day.
Hey Josh. So after running both a ridge and lasso regression models on my data; the MSE values are the same.
What does this mean/say about my models..? I tried looking this up but I can't really find anything.
It may mean that the penalty is 0.
@@statquest So in other words, my best model is simply just a least squares regression?
I also tried using range of alpha values from 0.0 to 1.0, and all 11 MSE values are the same :
Very nice, but could you show an example with a dataset with missing values?
I think this video is for a cleaned dataset, where you have already imputed missing values and gotten rid of outliers, to name a few steps.
What you are talking about comes in the data preprocessing step.
Would this run faster or more efficiently on a large dataset than just a GLM?
I'd be surprised if it was faster, since we have to use cross validation to find the best values for the hyperparameter.
Does anyone know how to extract feature importance out of these models? Perhaps in terms of p-values or another metric like SHAP values?
I think SHAP is probably the way to go.
@@statquest Thanks for the response, Josh. Do you happen to have a resource on how to do this in R?
@@PhilipFreda Not yet. :(
When we add useful predictors, the MSE also gets higher. Is that normal?
That sounds backwards.
I will definitely buy your book, and if you want to sell it in Spanish I can do the translation for you.
WOW!!! Thank you very much!!!
Hi Josh, I am watching from Brazil, your videos are really awesome. In this example, is it possible to use regression with a non-numeric variable?
By "non-numeric", do you mean "categorical"? If so, then yes, you can use those variables as well.
Calculating mse = ((y.test - predicted)^2) ..... Does this mean that we are calculating the variance, then squaring it, and finally dividing by the number of test samples?
What time point are you asking about (minutes and seconds).
@@statquest I am asking in general: is mse = ((y.test - predicted)) a variance? Because we are subtracting the predicted value from the real test value... and in the test values we get variance, right?
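A worked example in base R may help here, using four made-up test values. The difference y.test - predicted is a residual (one per sample), not a variance; each residual is squared, and the mean of those squares is the MSE.

```r
# Made-up observed and predicted values for 4 test samples
y.test    <- c(3.0, 5.0, 2.5, 7.0)
predicted <- c(2.5, 5.5, 2.0, 8.0)

# Step 1: residuals (observed minus predicted), one per sample
res <- y.test - predicted
res      # 0.5 -0.5 0.5 -1.0

# Step 2: square each residual
res^2    # 0.25 0.25 0.25 1.00

# Step 3: average the squared residuals
mse <- mean(res^2)
mse      # 1.75 / 4 = 0.4375
```

The MSE resembles a variance in that both average squared differences, but the MSE measures differences from the predictions rather than from the mean of the data.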
Give you thumbs-up!
Thank you! :)
Thanks, but I got an error when I tried to create alpha0.fit for the ridge regression.
It says: "error in rep(1,N) invalid 'times' argument"
That happens when x.train isn't 2-dimensional. You probably have a typo in x.train.
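A quick way to check for this in base R, shown on a small made-up matrix. A common cause is subsetting a single column, which silently turns the matrix into a plain vector with no dimensions.

```r
set.seed(42)
x <- matrix(rnorm(20), nrow = 10, ncol = 2)

# Selecting one column drops the result to a plain vector (no dimensions)
bad <- x[1:5, 1]
is.matrix(bad)   # FALSE
dim(bad)         # NULL, so there is no row count to work with

# drop = FALSE keeps the result 2-dimensional
good <- x[1:5, 1, drop = FALSE]
is.matrix(good)  # TRUE
dim(good)        # 5 1

# Selecting the rows correctly keeps both columns
x.train <- x[1:5, ]
dim(x.train)     # 5 2
```

Running is.matrix(x.train) and dim(x.train) before fitting is a cheap sanity check. As far as I know, glmnet also expects x to have at least two columns.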
Hi Josh, would you consider a series on multilevel modelling? Love your videos.
I've added Multilevel Models to my "To-Do" list, although it might be a while before I can get to it.
looking forward to it!
Hi Josh, if it doesn't sound too greedy and pushy :-), perhaps would you also consider a video comparing different mediation analysis packages on R, such as "mediation", "mbess", "processr" (R version of the SPSS macro) etc. I (and some other folks) have been trying to figure out their differences and respective advantages and will certainly appreciate your perspective.
I'll put that on the to-do list as well.
Hi Josh,
What is "y" here?
I see you are taking y=apply(x[,1:real_p],1,sum)+rnorm(n)
But why do we need the sum?
I'm working on dataset "divorce_margarine" available in package "dslabs"
In this case, what should be my "y"?
'y' is the dependent variable. In other words, 'y' is the thing we are trying to predict. In this equation, we are making 'y' a function of the first 15 "independent variables" (the things that are making the predictions). We are using Regularization to filter out the other variables that are not related to 'y'. In your case, 'y' should be whatever you are trying to predict.
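To show why the sum is there, a base R sketch of the simulation. The number of columns p is reduced here (an assumption made only to keep the example fast). For each row, apply(..., 1, sum) adds up the first real_p columns, and rnorm(n) adds noise, so y depends on those 15 columns and on nothing else.

```r
set.seed(42)
n <- 1000      # number of samples
p <- 50        # number of columns (kept small here for speed)
real_p <- 15   # number of columns that actually determine y

x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# For each row, sum the first 15 columns, then add random noise
signal <- apply(x[, 1:real_p], 1, sum)
y <- signal + rnorm(n)

# rowSums() gives the same row-wise sums
all.equal(signal, rowSums(x[, 1:real_p]))

# y is correlated with a useful column but not with a useless one
cor(y, x[, 1])
cor(y, x[, p])
```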
@@statquest Got it. Thanks a lot. I'm watching every video of yours and they are amazing :)
@@rishiprasana8285 Thanks! :)
Can you add a code to see the predicted Y for each participant? Such that we can compare the actual-y vs. predicted-y?
Can you also add the code for the residuals? I am interested to categorize participants based on their residuals.
Thank you so much (I keep my fingers crossed that you see my comment and help me to figure it out).
Samaneh
The predicted y values are stored in alpha0.predicted (or alpha1.predicted, or alphaWhateverWeSetAlphaTo.predicted). The residuals for alpha0.predicted = y.test - alpha0.predicted.
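A sketch of how those pieces could be collected into one table, assuming glmnet is installed. The data are simulated and the results data frame is a made-up name; the fit and predict calls follow the pattern described in this thread for the Ridge model.

```r
library(glmnet)  # assumes the glmnet package is installed

set.seed(42)
n <- 200
p <- 50
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rowSums(x[, 1:5]) + rnorm(n)

train_rows <- sample(1:n, round(0.66 * n))
x.train <- x[train_rows, ]
x.test  <- x[-train_rows, ]
y.train <- y[train_rows]
y.test  <- y[-train_rows]

alpha0.fit <- cv.glmnet(x.train, y.train, type.measure = "mse",
                        alpha = 0, family = "gaussian")
alpha0.predicted <- predict(alpha0.fit, s = alpha0.fit$lambda.1se,
                            newx = x.test)

# One row per test sample: observed y, predicted y, and the residual
results <- data.frame(
  actual    = y.test,
  predicted = as.numeric(alpha0.predicted),
  residual  = y.test - as.numeric(alpha0.predicted)
)
head(results)
```

From here, samples could be grouped by residual, for example with cut(results$residual, breaks = 3).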
I love your videos!
Thank you for your video! It's very helpful!
Is it possible that we might get different results because of an update to the package or functions? I got slightly different results, and that changed which method was best. When alpha = 0.9, we got the lowest value, 1.182, compared to an MSE of 1.184 when alpha = 1. Is there any possible explanation? Thank you!
I'm not sure. Did you use my code or write your own?
@@statquest Yes, I tried to do it on my own step by step, and then I downloaded your script, but I got the same results as mine :(
@@Jas-ti7hr I just reran my own script and got the same values you got, so there must be a change in the updated version of the package. This is a little disconcerting, but it is what it is. Thanks for pointing it out.
@@statquest thank you very much for the checking!
why is alpha set to 0 for ridge regression?? Isn't that just regular OLS then?
Not for this package. The details are explained at 1:34 . You must have missed that part.
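For reference, the objective that glmnet minimizes for ordinary (Gaussian) regression, as given in the package documentation, written out in LaTeX. Here N is the number of samples, λ is the penalty strength chosen by cross validation, and α is the mixing parameter.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{T}\beta\right)^{2}
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
+\alpha\,\lVert\beta\rVert_1\right]
```

Setting α = 0 removes the Lasso term and leaves only the Ridge penalty, and α = 1 does the reverse. The fit only reduces to ordinary least squares when λ = 0, regardless of α.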