0:00 introduction
0:40 variance thresholding
3:20 univariate feature selection
4:57 recursive feature elimination
6:35 SelectFromModel
Thanks a lot
Best series about Scikit-learn on TH-cam. Thanks a ton
Thanks man for this great video. The likes don't truly reflect how educative this video is.
how did you show the description for the functions? I'm using jupyter notebook
If you put a question mark before or after a function name (e.g. ?func or func?), you can see the docs in the notebook.
Check out some other magic commands here: www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
'Cell Markdown' is what you might want to search
Hi great video, thanks! Any idea how to perform these selection techniques on categorical columns?
How do we select the best algorithm for RFECV? I have seen some example using SVC. I tried with Decision Tree. You have used Random Forest
Honestly, like most things in data science, the idea is to try out a couple. Of course, if you are planning on using a tree-based model as the final classifier, it's best to use one for RFECV, and similarly with linear models. But which one you use, and the hyperparameters associated with it, are all hyperparameters themselves :)
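To make "try out a couple" concrete, here is a minimal sketch that runs RFECV with two different estimators and compares how many features each keeps. The synthetic dataset, the two candidate models, and their settings are assumptions for illustration, not what was used in the video.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 10 features, only 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Candidate estimators to drive the elimination; RFECV needs coef_ or feature_importances_.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, estimator in candidates.items():
    # Remove one feature per step, scoring each subset with 3-fold cross validation.
    selector = RFECV(estimator, step=1, cv=3)
    selector.fit(X, y)
    results[name] = selector.n_features_
    print(name, "kept", selector.n_features_, "features")
```

The two estimators may well disagree on how many features to keep, which is the point: the choice of estimator behaves like one more hyperparameter to tune.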
@@DataTalks Thanks for your reply.
How do we know which columns were selected by the feature reduction models? So we can include them in our model.
Great question! So when we use these as part of a pipeline we don't need to know what the final features are (and if we do cross validation the features may change in each validation). But if you want to know what the features are as a side step you can:
In [1]: sel.get_support()
Out[1]: array([False, True, True], dtype=bool)
So these models have a get_support() function that will tell you what features they use. Hope that helps!
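A self-contained version of the snippet above might look like the following. The toy data and the threshold are assumptions (they follow the usual variance-threshold example with three binary features); `sel` is just the name used in the reply.

```python
from sklearn.feature_selection import VarianceThreshold

# Toy data (assumption): six samples, three binary features.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

# For binary features the variance is p * (1 - p), so this threshold drops
# features where one value appears in more than 80% of the samples.
sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
sel.fit(X)

mask = sel.get_support()              # boolean mask, one entry per input column
kept = sel.get_support(indices=True)  # integer positions of the kept columns
print(mask)
print(kept)
```

The first column is 0 in five of six rows, so it falls below the threshold and is the only one dropped.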
testing on training data with 99.3%. lol
priceless
@Data Talks
Dear Sir,
I am working on feature selection using "Removing features with low variance". I have 1452 features and the code is returning 454 features, but with no feature labels, i.e. column headers. Now I am unable to tell which features have been accepted. So my question: how can I retain column headers in my output?
Great question! Unfortunately a little complex though :/
If you are using pandas you can use the following
new_cols = df.columns[vt.get_support()]
new_values = vt.transform(df)
new_df = pd.DataFrame(new_values, columns=new_cols)
So the important function here is this one:
scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold.get_support
(let me know if there is a typo in the above :)
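Here is the same idea as a runnable sketch. The small DataFrame and its column names are made up for illustration; `vt` is fitted here with the default threshold, which only removes constant columns.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical frame: column "c" is constant, so its variance is zero.
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [0.5, 0.1, 0.9, 0.3], "c": [7, 7, 7, 7]}
)

vt = VarianceThreshold()                 # default threshold=0.0
new_values = vt.fit_transform(df)        # plain NumPy array, headers are lost here
new_cols = df.columns[vt.get_support()]  # recover the headers from the boolean mask
new_df = pd.DataFrame(new_values, columns=new_cols, index=df.index)
print(list(new_df.columns))
```

As far as I know, newer scikit-learn releases (1.2 and later) also let you call `vt.set_output(transform="pandas")` so that the transform returns a DataFrame with headers directly, but the mask approach works on older versions too.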
Can u pls illustrate with a real example like some iris dataset
This is a great video. The way you explain is very easy to understand. I just have a few questions to ask...
How do you do feature selection on categorical variables?
Is it a good idea to one hot encode them and then for example use the SelectKBest algorithm? (I've read that it isn't because it's not a good idea to remove dummy variables unless you drop only the first one)
So yeah, are there any special algorithms that you use for feature selection for categorical variables or a mix of categorical and numerical variables in the dataset?
In practice, do you first do feature selection and then one hot encode the variables?
Great question - this question is generally answered by either using tree based models or by using embeddings. With tree based models you only need to map your categories to integers to do feature selection. Then your feature selection can act on the entire categorical column.
There is nothing inherently wrong with dropping a dummy variable - at least not in the function approximator view of ML. When all of the other dummies are 0, the model is assuming the sample distribution's probability that the dropped dummies are on or off. There would obviously be a problem with doing causal inference or some explainability techniques - but I'd caution against those in the first place with linear models. You're most likely making more assumptions than you know, and the causal inference you're doing is best done in different ways: see th-cam.com/play/PLgJhDSE2ZLxaIAU_C1j0Cw70f2bQ2Sa6D.html
Great question!
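A rough sketch of the tree-based route described above: map each category to an integer, then let a tree ensemble score whole categorical columns. The data is synthetic and the column names are hypothetical; the label is deliberately constructed to depend only on `color`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import OrdinalEncoder

# Synthetic categorical data (assumption): only "color" carries signal.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame(
    {
        "color": rng.choice(["red", "green", "blue"], size=n),
        "size": rng.choice(["S", "M", "L"], size=n),
        "shape": rng.choice(["round", "square"], size=n),
    }
)
y = (df["color"] == "red").astype(int)

# Map categories to integers; each categorical column stays a single feature.
X_int = OrdinalEncoder().fit_transform(df)

# Keep columns whose importance is above the mean importance (the default threshold).
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X_int, y)
kept = list(df.columns[selector.get_support()])
print(kept)
```

One caveat worth keeping in mind: the integer codes impose an arbitrary order on the categories. Trees can usually split around that, but linear models generally cannot, so this trick is best kept to tree-based selectors.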
@@DataTalks Thank you so much for clarifying! I understand now. :)
I like your videos but they don't answer the "why" at all.
For example why did you choose chisquare instead of f_classif?
I mean we can get that info from website, read instruction and implement it.
Which feature selection is common or have you used?
I hate to be a tease, but I'm teaching an introduction to ML course at USF (videos will be posted here with a bit of a delay) where we will be going over this question exactly!
can these feature selection methods be applied before train_test_split?
Great question! The answer is almost always no. If they are being used for data exploration (not ML) on a massive dataset then perhaps, but even then there is little harm leaving some data in the test set to test for generalization.
@@DataTalks can't follow your answer ...for Dataset Preparation or Filtering, I can use Feature Selection on the whole dataset. am I right? After Dataset is modified (some features deleted) then I'll use it for ML
Sorry for the lack of clarity! You should *not* use feature selection on the whole dataset. You should only use it on the training set.
You should think of feature selection and transformation as part of the ML model itself - you are using the data to set parameters.
Hope that helps!
@@DataTalks If some of the features are removed from the dataset and then Model Development is done by Splitting the dataset into Training Set and Validation Set ...What is the harm?
So you are using the data itself to tell you which features to remove. Let's say one feature is particularly important in your test set but not important in your training set. If you did feature selection on the full dataset, that feature would not be removed and you would have higher accuracy on your test set. If you did the split first you would have lower accuracy on the test set.
Now you might say: well high accuracy here is better right?
The problem is that in the real world, you don't have access to your test set. So if you are working in a regime where feature importances shift over time (which is not a crazy assumption), then doing the feature selection could actually result in lower accuracy than not doing it.
So doing feature selection on the whole dataset will overestimate how accurate your model will be in the real world.
The ideal case for splitting your data is to estimate how well your model will do in the real world.
Hope this makes sense :)
This is definitely one of the harder concepts in data science!
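One way to make it hard to get this wrong is to split first and put the selector inside a pipeline, so it is only ever fitted on training data. This is a minimal sketch on the iris dataset; the choice of `SelectKBest`, `k=2`, and logistic regression is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Split FIRST, so the test rows never influence which features are kept.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Feature selection is a step of the model, fitted on the training set only.
model = make_pipeline(SelectKBest(f_classif, k=2), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# At test time the pipeline reuses the features chosen during training.
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

The same pipeline object can be handed to cross validation, in which case the selection is refitted inside every fold.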
Every tutorial I’ve seen on SelectKBest shows how to train the model using the new features, but how do you put the remaining features into an array that can be used to predict in test?
Even on the test set you would generally just use the K best features. But if you wanted to select the remaining features, you could put both all the features and the K best features into python sets and do set subtraction :)
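The set subtraction could be sketched like this, using iris feature names as a stand-in for your own columns (the dataset and `k=2` are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
all_features = list(iris.feature_names)

selector = SelectKBest(f_classif, k=2).fit(iris.data, iris.target)

# Names of the K best features, read off the boolean support mask.
k_best = {name for name, keep in zip(all_features, selector.get_support()) if keep}

# Everything that was not selected.
remaining = set(all_features) - k_best
print(sorted(k_best))
print(sorted(remaining))
```

If you need the remaining columns as an array rather than as names, inverting the mask (`~selector.get_support()`) and indexing the data with it gives the same result.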
How do i implement mutual_info_classif for feature selection and then use only those columns returned by mutual_info_classif to classify data?
Great question!
You should use: scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
And pass in the mutual_info_classif as the score function
so where are the column names selected?
They are actually pared down in the transform step! So you don't need to select them yourself!
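A small sketch of that flow, assuming the iris dataset and a k-nearest-neighbours classifier purely for illustration. `functools.partial` is used only to fix `random_state`, since the mutual information estimate is otherwise slightly random.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each feature by estimated mutual information with the class label.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func, k=2)

# fit_transform both picks the columns and returns only those columns.
X_small = selector.fit_transform(X, y)

# The classifier then trains on the reduced matrix.
clf = KNeighborsClassifier().fit(X_small, y)
print(X_small.shape, selector.get_support(indices=True))
```

For new data, call `selector.transform(X_new)` before `clf.predict`, so the same columns are used.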
hello there i am trying this code on my data but it is raising an error of "Unknown label type: 'continuous'" what should i do
it basically means you're trying to apply a feature selection algorithm for classification to what is actually a regression problem. i.e., the technique doesn't work because it thinks the output variable is binary or categorical when it's actually ordinal or continuous
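If the target really is continuous, one option is to switch to a regression score function such as `f_regression` (or `mutual_info_regression`). A minimal sketch on synthetic data, where the dataset and `k=3` are assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression problem (assumption): continuous target, 3 informative features.
X, y = make_regression(
    n_samples=100, n_features=8, n_informative=3, noise=0.1, random_state=0
)

# f_regression scores each feature against a continuous target.
selector = SelectKBest(f_regression, k=3)
X_small = selector.fit_transform(X, y)
print(X_small.shape)
```

If the target is actually a set of class labels that happens to be stored as floats, converting it to integers or strings before fitting is probably the simpler fix.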
Sir, how can we find out the correlation between each feature and the target variable?
Simply add them to the same data frame and you can just call the method: .corr()
dataFrame.corr() will work
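For example, with a made-up frame where the column names are hypothetical:

```python
import pandas as pd

# Hypothetical data: "target" is exactly 2 * x1, so their correlation is 1.
df = pd.DataFrame(
    {"x1": [1, 2, 3, 4, 5], "x2": [5, 3, 4, 1, 2], "target": [2, 4, 6, 8, 10]}
)

# Full pairwise (Pearson) correlation matrix, then just the target column.
corr_with_target = df.corr()["target"].drop("target")
print(corr_with_target)
```

Keep in mind that the default is Pearson correlation, which only captures linear relationships.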
I love u man, you are the best, yours videos really help me
Can u pls clarify what X = [0,0,1][0,1,0] and so on is
Roger, X is a series of points with 3 binary features
hi, great video thank you for that. I have a stupid question tho (still new to this area) : What are the differences in the methods u showed and PCA ? I mean pca does feature selection on its core by selecting the variables with highest variance so that means that the first PCs are indeed the most important ones and we can ignore the rest of them.
Also from pca i know that it takes linear combinations of all variables from a dataset, and it cannot represent the original dataset well enough if the dataset is binary logic (Sparse PCA comes in and saves the day), so we cant apply pca on every dataset we find dunno if the methods u said are better off than pca because of that.
One last stupid question: lets say i have a dataset that is similar to the iris dataset. And i apply normalization (StandardScaler()) then pca and i select the first 3 PC's to keep for a prediction model to train. Now lets say i want to feed live data to the model ive trained to predict , do i have to get the new data through the process as before then feed it to the model ? stupid question the answer is yes but am asking anyway.
Sorry for my english i hope i explained my Q well. Thanks again
First off, none of these are stupid questions. They are all really great! Let me try to address some of them.
For the last question, you are absolutely right. If you did not normalize, then the construction of the PCA dimensions would be different. So pass the data through both StandardScaler and PCA for both fitting and transforming.
So there is a ton of research done with feature selection and a ton of techniques: univariate feature selection, recursive feature elimination, recursive feature addition, PCA, non-negative matrix factorization, Boruta, etc. They all have pros and cons. I'll say that I have found that PCA works well for me with recommendation datasets and datasets with lots of counts. Basically places where there is strong linear correlation in the features.
One of the big differences with PCA and NMF is that these methods return transformations of the original features making the results harder to interpret and debug. For me that is their main drawback.
I hope this helps a bit! For my most recent data science project I used variance thresholding followed by Boruta and then finally did recursive feature selection and plotted the ROC AUC of the models on the validation set as I removed features. This series of steps gave me a 3% boost in performance.
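On the live data question, wrapping the scaler, PCA, and the model in one pipeline means new rows automatically go through the same fitted steps. A minimal sketch, assuming iris as the dataset and logistic regression as a stand-in for whatever model you train:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold some rows back to play the role of "live" data arriving later.
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# Scaling and PCA are fitted on the training rows only.
model = make_pipeline(
    StandardScaler(), PCA(n_components=3), LogisticRegression(max_iter=1000)
)
model.fit(X_train, y_train)

# predict() applies the stored scaling and PCA projection before classifying.
predictions = model.predict(X_new)
print(predictions[:5])
```

Saving this one pipeline object (for example with joblib) keeps the preprocessing and the model together when it is deployed.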
@@DataTalks Thanks for the fast reply !!! This was really a big help for me thank you !!
You just gave me new things to study and a new pipeline regarding feature selection, because i only have seen the pca and wanted to learn more methods.
Ill investigate those methods and come back later on with some questions.
One more thing its a bit off topic, regarding preprocessing steps. (well ure the first data science with knowledge i find that reply so am gona have to take advantage of that :D am really sorry )
1. Import dataset, find whats the target and basically what do we need to do with them. (Regression, classification problem etc)
2. Transform features values in things that a model would understand. (Different types of data like ordinal etc )
3. Check for missing values. (Easy job when it's a time series dataset: find out how and why there is missing data (MAR, MCAR). When using multivariate data, though, I need something that will predict the missing values based on the rest of the variables/features, and I haven't found anything for it. Some ppl said use MICE but I haven't successfully applied it, so I skipped it :/ and just remove them.) And I learned how important this step was when I saw this guy: th-cam.com/video/2gkw2T5jAfo/w-d-xo.html , conference.scipy.org/proceedings/scipy2018/pdfs/dillon_niederhut.pdf
4. Check and filter the dataset for noise / outliers / extreme values. (Now for this task I studied: DBSCAN, LOF, Isolation Forest, Tukey IQR. I liked the IF the most because it's unaffected by the "curse of dimensionality". Dunno if there are better algorithms for this job, they are all good in different cases of a dataset.)
5. Now that I've managed to have a clean dataset (in my opinion "clean"), the next thing is to do feature selection. (To visualize the data, to understand it better, to find a way to represent the data in fewer variables, to speed up any model I may use for them.) For this step I only had knowledge of PCA; now I'll expand it with the +3 methods u suggested and am really happy about that.
6. Prepare the model. (For this one I've studied deep/machine learning, neural networks etc. and selected TensorFlow with Python, in theory only, never applied it because I gave most of my attention to the preprocessing. This means stuff like getting the data normalized if it isn't already from the prev steps, data split, hyperparameter tuning, layers, activation functions etc.)
7. Test test test the model in order to get the best performance. Save it and if necessary place it on a data stream. (I have no idea about this part: before feeding data into the model I have to do the same filtering and feature selection as I did before in the preprocessing steps, and it's all done?!)
Please tell me if i am missing something or my thinking is correct for the most of them, and if u want to suggest an additional method (algorithms, like u suggested me on the feature selection) in any of the steps above please do so !!!!
You have no idea how much i appreciate this, thank you so much !!!! Ure a life saver!!! Thanks again.
Hey Leo, this one is a pretty big question. A machine learning pipeline is going to be very different for each data science question that you will have. I've been working on a revamped data science and ML course that will try to answer this question, but it is going to be 20+ ipython notebooks at least. It will be about one year before that material will be up on YT. It will be gradually added to GitHub starting in the next couple of months though. So if you want an answer to this question, I'd suggest following me on GitHub: github.com/knathanieltucker.
Happy to answer things that are a bit more specific to the above video, otherwise I'll just have to say: hang in there!
@@DataTalks :] Gotcha , thanks for the reply . I followed and will check ur repos, thanks again u were very helpful.
too bad you did not zoom on your lines while explaining but good presentation nonetheless