Prof Raschka, this is fantastic content! I've been following your course for a while now and always end up learning something new. Really appreciate all your hard work that goes into creating this material and sharing it with the world.
Really glad to hear :)
Thank you, it helped me a lot in my major project.
Glad it helped! That's really cool to hear!
Thank you for this great lecture. In your other lecture about sequential feature selection, you showed that backward selection (SBS) is superior to forward selection (SFS) according to a study. How does recursive feature elimination compare to SBS and SFS?
Thank you so much for the lectures, Prof Sebastian!
I was wondering: what is the difference between the L1-regularized method and RFE? I understand that one is embedded and the other is a wrapper, but they look pretty similar. Thanks in advance!
Let's take logistic regression as an example.
So for a logistic regression classifier with L1 regularization, you modify the loss function. This modification will push some of the weights towards zero.
RFE works a bit differently. You train a regular logistic regression classifier, and after training, you remove the feature with the smallest weight (effectively setting that weight to zero). You can repeat this until you reach a user-specified number of rounds. E.g., if you have 5 rounds, 5 weights will be set to 0.
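Here's a rough scikit-learn sketch of that difference (the dataset and hyperparameter values below are just made up for illustration):

```python
# Rough illustration only; dataset and hyperparameters are arbitrary.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X = StandardScaler().fit_transform(X)  # put features on the same scale

# Embedded: the L1 penalty pushes some weights to exactly zero during training
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("weights zeroed by L1:", int((l1_model.coef_ == 0).sum()))

# Wrapper: RFE retrains the plain model and drops the smallest-weight feature each round
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("features kept by RFE:", rfe.support_)
```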
Thank you very much!
1) Can you explain wrapper methods a bit more: can I use only a fraction of the training set, since using the whole dataset can sometimes be too expensive? If yes, what are the best practices?
2) Is there any use of the test data for feature selection, or should I select features based on the training set only?
Is it a must to split data into x_train, x_test, y_train, y_test when using RFE?
Don't we need to do hyperparameter tuning of the estimator inside the RFE? If not, why not?
Yes, you can tune the estimator inside as well. In practice, for linear models, it is probably not necessary as there is not much to tune except regularization penalties (and they would be weird to combine with RFE). And for random forests you usually also don't have to tune much in practice. For other estimators you might consider tuning its parameters.
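If you did want to tune the inner estimator anyway, here is a minimal sketch of what that could look like with a grid search (the parameter values are placeholders, not recommendations):

```python
# Sketch only; the grid values are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000))
param_grid = {
    "estimator__C": [0.1, 1.0, 10.0],   # hyperparameter of the inner logistic regression
    "n_features_to_select": [3, 5],     # RFE's own setting
}
search = GridSearchCV(rfe, param_grid, cv=5).fit(X, y)
print(search.best_params_)
```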
Hello Sebastian, really great content! I have two questions and it would be great to get an answer.
1. Is RFE actually model agnostic, why or why not?
2. Could I use Random Forest, Gradient Boosting, or Neural Networks as the RFE core algorithm, and is it recommendable? If not, why?
Thank you a lot!!!
It's model agnostic to some extent, but not completely. You need parameters for the selection, and tree-based models don't have those.
@@SebastianRaschka Thank you! Also, I think it would be great if you could give us some kind of roadmap, recommendations, suggestions to achieve our goal and become machine learning expert. I am lost in the ocean of tutorials and there is no full path from 0 to end.
@@sasaglamocak2846 That's a very good point, and at the same time this is a very hard thing to make a recommendation about, because the path is different for everyone. I have many successful colleagues who only studied the computational aspects, and I have many successful colleagues who mostly studied mathematical topics. Both are successful in their own way. I think the best path is completing 1 or 2 introductory courses or books and then seeing what interests you most and then working on that (and reading more along the way)
@@SebastianRaschka Thank you!
Doesn't RFE work with non-linear models like decision trees, extra-trees classifiers, random forests, and k-NN, since every model has coefficients/weights?
Yeah. Originally, it was based on the weight coefficients of a linear model via .coef_. I think it was modified to work more generally now via .feature_importances_. So, yeah, you could combine it with decision tree classifiers etc. now, but I'm not sure how "good" the results are.
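For example, here's a quick sketch of RFE with a random forest (toy data, arbitrary settings), which works because the forest exposes .feature_importances_:

```python
# Toy example only; dataset and settings are arbitrary.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=4).fit(X, y)
print(rfe.ranking_)  # rank 1 = selected; larger ranks were eliminated earlier
```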
@@SebastianRaschka Yes, a correlation-coefficient feature ranking method was used with SVMs when the paper was first published. But I am still confused: is the correlation feature ranking method the same for all the other classifiers (estimators) in scikit-learn's RFE? Thank you. Your answer is so valuable.
@@rajeshkalakoti2434 Afaik, RFE-based feature selection is usually based on model coefficients. Correlation-based feature selection is usually based on p-values (the p-values here are in the context of the null hypothesis that the feature has no effect on the target variable). This is usually also done with a linear regression model, but you could maybe also use an SVM regression model for that. Also note that what I am describing above is more of a general description and not specific to a paper (sorry, I haven't read the paper you are referring to).
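To illustrate the distinction I mean (a toy sketch, not tied to any specific paper):

```python
# Toy regression data; just to contrast the two ranking ideas.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE, f_regression

X, y = make_regression(n_samples=200, n_features=6, n_informative=3, random_state=0)

# RFE: ranking driven by the fitted model's coefficients
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X, y)
print("RFE ranking:", rfe.ranking_)

# Correlation-based: univariate F-test p-values
# (null hypothesis: the feature has no effect on the target)
_, p_values = f_regression(X, y)
print("p-values:", p_values.round(4))
```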
Hi Prof Raschka. Can we use coefficients for feature importance even when features are normalized? It feels like the wrong parameter to use when p-values are available.
To me, it's more the opposite: if you want to use the coefficients for feature importance, I'd recommend normalizing the features. E.g., if you have a feature like meters vs. kilometers, the meters number is always 1000 times higher. So if you use meters instead of kilometers in your model, the corresponding weight coefficient will probably end up about 1000x smaller. So, if you want to compare feature coefficients, it makes sense to me to have the features all on the same scale.
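Here's a tiny sketch of that effect with synthetic numbers (just to show how the unit changes the coefficient size):

```python
# Synthetic data; only meant to show how units affect the coefficient size.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
km = rng.uniform(1, 100, size=(200, 1))   # distance in kilometers
meters = km * 1000                        # same information, different unit
y = 3.0 * km.ravel() + rng.normal(size=200)

# Unscaled: the "meters" coefficient is ~1000x smaller than the "km" one
print(LinearRegression().fit(km, y).coef_, LinearRegression().fit(meters, y).coef_)

# Standardized: both versions end up with the same coefficient
print(LinearRegression().fit(StandardScaler().fit_transform(km), y).coef_,
      LinearRegression().fit(StandardScaler().fit_transform(meters), y).coef_)
```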