Thanks for watching! 🙌 If you have any questions about cross-validation, grid search, or hyperparameter tuning, let me know! 💬 And if you're new to pipelines, I recommend starting with this video: th-cam.com/video/1Y6O9nCo0-I/w-d-xo.html
Amazing! Exactly what I was looking for
Great!
Excellent tutorial.
Thanks! Glad it was helpful to you!
2:25 cross_val_score splits the data and THEN applies the pipeline steps, versus preprocessing the data and then using cross-validation on just the model; 3:20 Preprocessing before splitting the data doesn't properly simulate reality; splitting then preprocessing does simulate reality.
Exactly! Thanks for pulling out these quotes, Will, and thanks also for joining as a channel member! 🙏
Welcome back sensei.
Thank you! 🙏
How would you set it up to try multiple classifiers to find the best model & hyperparameters?
Really good to know! Thanks!
You're very welcome! Glad it's helpful to you! 🙌
To those who wonder what Kevin was trying to say about the order of preprocessing and splitting the data, this is the clarification I got from ChatGPT. I can't validate its answer due to my limited knowledge, but hell yeah, it makes a lot of sense to me!
"In machine learning, it is common to use techniques like cross-validation to evaluate the performance of a model. Cross-validation involves splitting the data into several subsets, then using each subset to train the model and evaluate its performance.
It is better to apply data preprocessing steps (such as scaling or encoding categorical variables) after splitting the data into subsets, rather than before.
The reason for this is that if you preprocess the entire dataset before splitting, you may inadvertently introduce information leakage between the training and testing subsets. For example, if you scale the entire dataset before splitting it into subsets, the scaling factors will be influenced by the values in the testing subset, which should not be used during training.
By contrast, if you apply preprocessing steps after splitting the data, you can be sure that the preprocessing is only based on the training data and not the testing data. This better simulates the real-world scenario, where you only have access to the training data when building a model."
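For anyone who wants to see the difference in code, here's a minimal sketch with scikit-learn (the toy dataset and logistic regression are just placeholders, not from the video):

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=0)

# Leaky: the scaler is fit on ALL rows, so each CV test fold has already
# influenced the scaling factors used during training
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Proper: the pipeline refits the scaler on the training folds only,
# inside each cross-validation split
pipe = make_pipeline(StandardScaler(), LogisticRegression())
honest_scores = cross_val_score(pipe, X, y, cv=5)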
Hi Harry, thanks so much for posting this! The ChatGPT response is absolutely correct!
Thanks for the clear description~
I want to ask a question: do I need to split all the samples into training and test data with train_test_split before I do the grid search, or should I do the CV split first and then do the grid search?
Great content! Can you make a video about data scaling? Good to see you're back!
Thanks for the suggestion, and for your kind comment! 🙏
Good to see you after a long time, sir 🤘🤘
Good to be back! New videos coming every Tuesday and Thursday through the end of October 😄
Your teaching style and presentation are excellent. Quick question: how can you get feature importance from each model during cross-validation?
Sklearn has permutation feature importance built in; I think you could just run it on the fitted pipeline. If you're looking for feature selection, you can put something like sklearn's SequentialFeatureSelector at the end of a pipe, I think.
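If it helps, a rough sketch of running permutation importance on a fitted pipeline (the iris data is just a stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the whole pipeline, then measure how much shuffling each feature
# hurts the score on held-out data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
result = permutation_importance(pipe, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)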
I have a question. Are param names for GridSearchCV case-insensitive, or do we need to write the step names in lowercase?
Great question. Param names are case-sensitive! I'm using lowercase step names because make_pipeline automatically creates lowercase step names. Hope that helps!
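For anyone else wondering, a quick sketch showing where the auto-generated names come from:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression())
# make_pipeline names each step after its lowercased class name
print(list(pipe.named_steps))  # ['standardscaler', 'logisticregression']
# so the grid search param key is 'logisticregression__C'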
Great Video !!!
Thanks!
Thanks for the tutorial. Are you effectively doing nested cross validation here?
No, this is not nested cross-validation, though you can actually achieve that by running cross_val_score on the GridSearchCV object!
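In case it's useful, a rough sketch of what that would look like (the dataset and parameter grid here are just placeholders):

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
params = {'logisticregression__C': [0.1, 1, 10]}

# Inner loop: GridSearchCV tunes C; outer loop: cross_val_score evaluates
# the entire tuning procedure on held-out folds (nested cross-validation)
grid = GridSearchCV(pipe, params, cv=5)
nested_scores = cross_val_score(grid, X, y, cv=5)
print(nested_scores.mean())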
@@dataschool I see! Thanks for the tip.
Hello! Thank you so much for this, super useful. Can I ask: is there no need to do the train/test split outside of the cross-validation or pipeline? Thanks!
Depends on your goals... it's complicated to explain briefly, I'm sorry!
Is the step name not affected by how you define the transformer? For example, you wrote logisticregression__C and not clf__C. How do you know it should be logisticregression? Is there a way to check?
Great question! You examine the Pipeline step names.
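For example, something like this (a sketch; 'clf' is a hypothetical step name):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# With an explicit Pipeline, the names you choose determine the param keys
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
print([k for k in pipe.get_params() if k.endswith('__C')])  # ['clf__C']
# With make_pipeline, the step would be named 'logisticregression' instead,
# giving 'logisticregression__C'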
Thanks for the video!
Could you please clarify one thing: if I do the following steps, will standardization inside cross_val_score be applied only to X, or to both X and y? If it's applied to X and y, how can I make it apply only to X?
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score

scaler = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scaler), ('estimator', clf)])
cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv=cv)
Great question! The standardization will only be applied to X.
Thanks!
I'm looking for a solution to loop through a few models and run cross_val_score on each of them, but I can't seem to do that inside a pipeline. Any thoughts?
Your params can actually be a list of dictionaries, and one of the elements of that list should have the estimator that you want to try out. The way I think of it is the set of parameters passed needs to be applicable to a particular element in that list. In this way, you can vary both the preprocessing steps as well as the models themselves. I suspect you can also vary the features passed to various parts of ColumnTransformer too, but this part I haven't yet done.
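To make that concrete, here's a minimal sketch of the list-of-dicts approach, where each dict also swaps in a different final estimator (names and values are just illustrative):

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# 'clf' is a placeholder step; each dict in the list replaces it
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
params = [
    {'clf': [LogisticRegression(max_iter=1000)], 'clf__C': [0.1, 1, 10]},
    {'clf': [RandomForestClassifier()], 'clf__n_estimators': [100, 300]},
]
grid = GridSearchCV(pipe, params, cv=5)
grid.fit(X, y)
print(grid.best_params_)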
Bro, something's not correct here: if you have a lot of categorical data, cross-validation will throw an error, because on the second or third split it will find new categories that were not one-hot encoded.
Please correct this for your audience ♥
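If the error you're seeing is about categories that appear in a validation fold but not in the training folds, the usual workaround (as far as I know) is to tell OneHotEncoder to ignore unknown categories, e.g.:

from sklearn.preprocessing import OneHotEncoder

# Categories seen at transform time but not at fit time are encoded as all
# zeros instead of raising an error
ohe = OneHotEncoder(handle_unknown='ignore')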