What do you find hard or confusing about scikit-learn? Let me know in the comments, and maybe I can help! 🙌
Extracting the feature names after a pipeline that contains column transformations. I can see the feature importances but no way of understanding what those features are! :)
Thanks for sharing! Try slicing the Pipeline to select the ColumnTransformer step, and then use the get_feature_names method. That method has some limitations, but it may work in your case!
Here are relevant links:
nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/30_examine_pipeline_steps.ipynb
nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/38_get_feature_names.ipynb
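For example, here's a minimal sketch of the idea (the column names, step names, and toy data are made up for illustration; in newer scikit-learn releases the method is called get_feature_names_out):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: one categorical column and one numeric column (hypothetical names)
X = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                  'size': [1.0, 2.0, 3.0, 4.0]})
y = [0, 1, 0, 1]

ct = ColumnTransformer([('ohe', OneHotEncoder(), ['color'])])
pipe = Pipeline([('preprocessor', ct), ('clf', LogisticRegression())])
pipe.fit(X, y)

# Slice the Pipeline by step name to select the ColumnTransformer,
# then ask it for the names of the features it generated
print(pipe['preprocessor'].get_feature_names())
# e.g. ['ohe__x0_blue', 'ohe__x0_green', 'ohe__x0_red'] on older versions
```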
Does that help?
@dataschool thanks for responding Kevin. I'll take a look at these links now, but the problem I ran into was that several transformers don't provide the get_feature_names attribute, which prevents slicing the entire pipe in the way you describe, as it just throws an error. KBinsDiscretizer was the one I ran into. It seems strangely difficult to get feature names out of pipes when it's incredibly important for inspecting the model with Shapley values, feature importances, and so on. Will report back if I have any luck :)
I totally hear you! The scikit-learn core developers are very aware of this and are actively working on a solution to this problem. I've been following the discussions, and it turns out it's quite a big project with tons of implications. That's all to say that a solution will come at some point, and they are doing their best to get there quickly!
You saved me a lot of headache. I had no idea why my cross_val_score was so bad. Thankfully, I came across this video.
I'm glad it helped!
I had searched for all these questions before: does GridSearch apply shuffling, why are there different classes like KFold, StratifiedKFold, and cross_val_score, and when to use what. Five minutes covered everything. Great!
Glad to hear it was helpful!!
Hi, I am Sohana. I just finished reading a blog post on your Data School website. I am basically obsessed with data science, but right now I am engaged with my academic studies in dentistry. Your post was very inspiring to me. I would be very glad if you could share what my first step should be to start the journey into data science. Thank you in advance.
Hi Sohana, thanks for your comment! It's hard to give personalized advice, but this post might provide a helpful path for you: www.dataschool.io/launch-your-data-science-career-with-python/
Hope that helps!
Explained very well in short😄
Thank you!
4:40 I do not understand why stratification does not make sense. Say I have multiple numerical features and I do a linear regression. Furthermore, say that I have very few high values for the numerical label I want to predict. Wouldn't it then make sense to use a histogram with stratification to make sure that all training sets and test sets have a few very high values in the labels, instead of having one set contain most of these high values simply by chance?
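A rough sketch of the idea in this question (the data and bin edges are purely illustrative): bin the continuous target into a few histogram-style bins and pass the bins to StratifiedKFold, while still fitting the regressor on the original continuous target.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.lognormal(size=200)  # continuous target with a few very high values

# Bin the continuous target and stratify on the bins, so every fold
# gets some of the rare high values; the regressor is still fit on y itself
y_bins = np.digitize(y, bins=np.quantile(y, [0.25, 0.5, 0.75, 0.95]))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(X, y_bins):
    pass  # fit and evaluate a regressor on X, y within each fold
```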
Pretty nice! Thank you!
You're very welcome!
I cannot think of a scenario where your data is not arbitrarily ordered (randomly), since CV is done on the training set and this was obtained by randomly splitting the whole dataset into train and test. Can you elaborate more on this case?
Excellent question, thank you so much for asking! I disagree with your assertion that "CV is done on the training set." That is certainly one way of doing model evaluation, but there are many other valid ways, including cross-validation on the entire dataset.
It's true that holding out an independent test set gives you a more reliable estimate of out-of-sample performance, but the test set is unnecessary if your only goal of cross-validation is model selection (including hyperparameters). In addition, holding out an independent test set is not recommended if you have a small dataset, because the reduced size of the training set makes it harder for cross-validation (when done as part of a grid search) to locate the optimal model. Thus, there are valid reasons not to hold out an independent test set.
All of that is to say that the premise of this video is that you are performing cross-validation on the entire dataset, and as such, your data may not be arbitrarily ordered.
Hope that answers your question!
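For concreteness, here's a minimal sketch of cross-validation on the entire dataset (the dataset and model are just illustrative), with shuffling turned on because the samples happen to be ordered by class:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Shuffle the samples within the splitter, since the iris rows are ordered by class
kf = KFold(n_splits=5, shuffle=True, random_state=1)
print(cross_val_score(clf, X, y, cv=kf).mean())
```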
Hi, I was applying normal linear regression and its r2_score came out to 0.758965, but when I take the mean of cross_val_score it gives a negative value. I am not able to understand how this is possible. cross_val_score also reports the r2 score, just on different samples.
Randomizing input is a best practice (whether the algorithmic approach requires it or not). Also, after randomizing the first time, if you have to re-randomize to make model development work, that's usually a big red flag.
Re-randomizing should be looked at differently. Instead of using it to make a model work, use it to help ensure your model's working if your data is limited.
Thanks for your comment! To be clear, the goal of the shuffling was to ensure that the model evaluation procedure outputs reliable results, not to make the model work. The model itself will work regardless of whether the samples are shuffled.
Could you elaborate on this part of your comment: "use it to help ensure your model's working if your data is limited"? I'd love to understand better. Thank you!
Very nice post! Could i ask you a question?
I am working on a binary time series classification problem (given a time series, I want to detect whether a consumer has committed fraud in residential electricity consumption). For my input data, the rows represent consumer IDs (I have around 42000 consumers), and my columns are daily electricity consumption measurements for each of those consumers between 01/01/2014 and 12/12/2016. So, my output is a binary array indicating whether a consumer has committed fraud or not.
The database given to me is arranged so that all fraudulent consumers are in the first N rows, and the rest of the data are non-fraudulent users. I know I shouldn't shuffle my columns (since each column represents a single day's measurement in the time series, and shuffling those could mess things up); I need to shuffle only my rows, is that correct?
But since I am working with time series, I am using TimeSeriesSplit as the cv parameter inside cross_val_score. However, I don't think TimeSeriesSplit has a shuffle parameter (like KFold or StratifiedKFold here in your video). Any tips on how I could shuffle my rows at each cross-validation fold?
Sorry for the long post, and thanks again for the awesome video!
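A rough sketch of the row-shuffling part of this question (the data is a made-up stand-in, and shuffling once up front with sklearn.utils.shuffle is just one possible approach, not a full answer to the time-series setup):

```python
import numpy as np
from sklearn.utils import shuffle

# Illustrative stand-in data: rows = consumers, columns = daily measurements
rng = np.random.default_rng(0)
X_fake = rng.normal(size=(100, 30))
y_fake = np.array([1] * 20 + [0] * 80)  # fraudulent consumers stacked first

# Shuffle only the rows (consumers), keeping each row's columns in time order;
# X and y stay aligned, and the result can then be passed to cross_val_score
X_shuffled, y_shuffled = shuffle(X_fake, y_fake, random_state=1)
```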
Helpful..!!
Glad to hear! 🙌