Having problems with the code? I just finished updating the notebooks to use *scikit-learn 0.23* and *Python 3.9* 🎉! You can download the updated notebooks here: github.com/justmarkham/scikit-learn-videos
These videos are so well done, so clear and easy to follow, that they make ML seem like child's play. Congratulations, great teaching.
Thanks for your kind words!
Couldn't agree more! The best video resource on cross-validation on the internet.
I almost gave up on Python until I found your channel. You are my savior.
Wow, thank you! Good luck with your Python education! :)
Kevin, I appreciate the slow but thorough walk through, you and StatQuests are awesome people. Thank you.
Thank you so much!
I watched a dozen videos on this topic. I was pretty certain I understood it, but I still had a few questions. You're video cleared those questions up amazingly! Thank you.
That's awesome to hear! 🙌
I took the Coursera lesson twice but never got what was going on, and you, bro, walked me through it like I never expected! Thank you
Awesome! You are very welcome!
I really appreciate how easy this video series is to follow and that the notebooks are available so you can follow along. It would be excellent if the notebooks were updated to reflect that the cross_validation is now deprecated.
Thanks for the suggestion! Right now I'm on paternity leave from Data School, but it's on my to-do list :)
Congratulations!
It's deprecated, but it still works for me in Jupyter Notebook... I literally did the exact same cross validation
I recently updated the code to use Python 3.6 and scikit-learn 0.19.1. The updated code can be found here: github.com/justmarkham/scikit-learn-videos
Liking something is usually driven by spontaneity, but it is never without reasons. I liked all your videos with much enthusiasm, and the simple reason is that they are just BRILLIANT!
Thank you so much!
Excellent series! This is the first time I've studied machine learning. You are doing an outstanding job of transforming it from a science fiction term into a tangible subject. I really appreciate these videos!
BrothersFreedive You're very welcome! I greatly appreciate your kind comments!
I found lots and lots of videos about ML and cross-validation. I watched them all and tried to follow, but they were very hard to understand.
But you make it easier: I was very confused about cross-validation, and now it's more than clear.
Thank you very much for this video and for your channel
You're very welcome! Glad it was helpful to you!
Amazing video! Very instructive. And the presenter has a very clear voice and pace.
Thank you so much! I'm glad you liked it!
Sir, firstly I would like to thank you a lot, because you spent so much time making this video. This is really helpful for early-phase ML learners; keep it up, sir. I stopped this video midway to say thanks; you saved me many hours in understanding cross-validation.
That's awesome to hear! Thanks so much for letting me know! 🙌
Being a beginner in machine learning, I find this video lecture series a great help, providing a crystal-clear understanding of the concepts presented throughout the course.
Dear sir, please keep up the good work
That's great to hear! Good luck with your education!
This is the best sklearn tutorial I have come across.
Thank you!
Always eager to learn. You demystified the subject and even made it easy for a 75-year-old brain to comprehend ML methods. :) :) :)
Great to hear!
*Note:* This video was recorded using Python 2.7 and scikit-learn 0.16. Recently, I updated the code to use Python 3.6 and scikit-learn 0.19.1. You can download the updated code here: github.com/justmarkham/scikit-learn-videos
Updated link shows me this message :
Sorry, something went wrong. Reload?
Is this still relevant in 2020?
I am so glad I found you. I am an aspiring data scientist, and I find all your videos extremely useful and better than any documentation, since you explain the intricate details very well. Thanks!!
Wow, thanks so much for your kind comments! Good luck in your journey to become a data scientist!
Just like Andrew Ng, you are a genius. He gave a clear explanation of the theory, and you presented mesmerizing implementation techniques in an incredibly simple way. This is the best machine learning video I have watched so far. You helped me more than you could ever imagine. Thank you, sir.
Thanks very much for your kind words!
you are most welcome sir
Whenever I need a reference, I always end up at your videos after a long search. That proves you are THE best teacher.
What a nice thing to say! Thank you! :)
Sir, I really appreciate your work. I strongly think this video series is probably the best I have come across so far. I truly praise the way you teach; everything becomes utterly clear.
Thank you very much for these videos, and I hope to see more of them.
I'm glad to hear my videos have been helpful to you!
Just chiming in to thank you for the series; it really helps demystify things and fill in the gaps. Looking forward to working through it.
You're very welcome!
Good video. Maybe it's my level of English, but I don't understand why at 24:00 we compare the accuracy of KNN with linear regression, when KNN is used for classification and linear regression is used for regression. I know cross-validation works for both, but the response variable in the two cases should be different, since for KNN it should be categorical and for linear regression it should be continuous.
Great series, enjoying them so far. Thanks for the good content :)
I'm comparing KNN with Logistic Regression, which is used for classification. Hope that helps!
Thank you Kevin, this is fantastic material. You made it easy for a 60-year-old brain to comprehend ML methods. The cross-validation material was excellent. I will try to run it on my own datasets. Looking forward to the next one.
Regards
Richard
Richard Pacholski Awesome! Very glad to hear :) I'm looking forward to making the next one!
I usually don't subscribe to any channel, but you earned this sub from me. Keep going, lots of love.
Thank you! 🙌
Had to listen to it at 1.5x speed, but it is very clear and concise. Thanks.
Thanks!
You are awesome. You teach everything in a simple way, ask for feedback, and make the videos even better. And the best thing is you make everything (I REPEAT, EVERYTHING) easy for us. So sweet of you :)
That is so kind of you to say! Thank you so much 😄
It was confusing at first, but you made it so clear. I hope you will upload many other videos on ML. If you upload an ML application with Python, it will be great for everyone. Thank you. I love your way of teaching.
Thanks for your kind words! Here is my series on machine learning with Python: th-cam.com/play/PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A.html
Outstanding work once again Kevin. A treasure to newcomers in the area.
Thank you!
I am so happy to follow this kind of lecture, because your teaching style is engaging and your language is very clear, so I gained knowledge and input for my ML thesis, since I am working on a classification (prediction) problem.
Glad to hear that my videos are helpful to you! Good luck with your thesis!
The best explanation of cross-validation on the internet. Thank you!
Thank you!
I have watched several videos on this subject. This was the only one that met my expectations.
Thanks for your kind words!
HOW DOES THIS VIDEO NOT HAVE LIKE A MILLION VIEWS?!? So good. Thank you, man!
HA! Thank you :)
I have no words to use in order to show you my deep appreciation.
May God continue to uplift your knowledge.
Thank you so much! Glad to hear I was helpful to you!
Kevin, you are very good at explaining. I wish I had found you earlier. I just subscribed to receive all future videos, and thank you for all the explanations.
Thanks for your kind words! 🙏
This is the best explanation so far ... far better than my professor ... thumbs up ...
Great to hear! :)
You're soooooo good at explaining confusing concepts!!! I was always wondering about the negative sign on the loss function until now!!! Thank you!!!
You're very welcome! You might also want to read this post for an update: www.dataschool.io/how-to-update-your-scikit-learn-code-for-2018/
I love the simplicity of the videos. Thank you. I have learned some things that were confusing before, especially cross-validation.
Awesome! That's great to hear.
Amazingly explained. Sufficiently slow for a fresher in machine learning. Easily understandable. Keep it up.
Awesome, thank you! :)
Thanks. The notebook helped me a lot. I hope more topics get coverage like this one had. Please do a video on PCA.
Thanks for your suggestion! I'll consider it for the future.
Great videos. Please keep up the good work. We really need your lectures. Thank you so much.
Thanks for your kind words! I will definitely release more videos! :)
Waiting eagerly for a new series, because I have finished watching all your videos. Thank you so much for teaching me Python. You have educated me; you are a teacher to me and I respect you. Thank you so much.
Awesome! Thank you for watching and learning! :)
I will always be your audience. Your teaching saves me. lol
Glad to hear! :)
Thanks for the video. This is the best video on KNN cross-validation that I have watched. I appreciate your effort.
Thanks! :)
Best explanation ever done by anyone in the machine learning community. Hats off to your great work and effort in teaching us. May God bless you. Yes, we need more concepts on scikit-learn than pandas, but use pandas functionality when needed.
Thanks very much for your kind words!
Thank you very much for the series. Well done. You make the complex simple with your clear explanations, a true mark of your understanding of your subject. Respect to you, Chief. :) :)
+Andrew Kinsella Thank you so much! I have spent a lot of time figuring out how to explain this material clearly :)
Very clear and detailed explanations. Also the links after the videos are very helpful. Thanks.
You're welcome!
Thank you!
You're very welcome! Thank you so much for your support! 🙏
Your explanations are very straightforward. Thanks a lot.
Thanks!
All about cross validation in one video , THAAAAAAAAAAAAAAAAAAAAAANK YOU
You're very welcome! :)
Great videos! Everything is well explained, especially why (and this is so important) you are using a given function or model. Besides, you also have great resource material (and it shows you have done it with excellence). You are an awesome professor!
Congrats, from Brazil :)
Thanks so much for your kind words! I'm glad the videos have been helpful to you. Good luck with your machine learning education!
Natural born teacher. Bravo! Thank you!
Thank you! :)
Excellent explanation of cross-validation and its wonders; thanks for the improvement recommendations also
Thanks for your kind comment, and you're very welcome!
I actually used your method on my GBM model: something like a 10-fold stratified cross-validation on 80% of the data for hyperparameter tuning (e.g. max depth, min rows, etc.) in a search grid, and then kept 20% as a hold-out set. It works quite consistently :)
Great to hear!
The best explanation I have ever seen.
Wow, thank you so much!
I really, really appreciate your efforts. This series is so helpful for learning.
I'm glad to hear the series is helpful to you! :)
Thank you, sir, for such a detailed explanation. I was struggling with the topic, then luckily found your video, and every bit of it is full of knowledge.
Thanks again for making such informative videos.
Great to hear!
I really love your videos. They are so simple and to the point! Thanks for making such videos. :)
Thanks very much for your kind words!
This is all great material. Thank you very much for uploading these videos. Keep up the great work and know that they are all appreciated! :)
+MJoseph A Excellent! You are very welcome.
Sir, your lectures are out of this world. Please, please, please make a Seaborn tutorial series.
Thanks for your kind words, and your suggestion!
Great videos for learning about machine learning! Thanks, Kevin, for making this available.
You're welcome!
thank you for introducing me to ML, and also for helping me understand Python through your great pandas videos!
You are very welcome!
Thank you so much. Your tutorial and style of explaining is exceptional.
Thank you!
All your videos W.I.N - you’re the best
REALLY well done!! I found this video extremely helpful. Keep up the good work
Thanks for your kind comment! :)
First of all, let me thank you for the extraordinary work you are doing.
You explain extraordinary things in a very ordinary way.
I had a query regarding the FIT with cross-validation.
When we do linear regression with a SINGLE train/test split of the dataset, we get a SINGLE fitted model to predict on the test data.
Whereas when we do cross-validation (e.g. cv=10) in linear regression, we have 10 training datasets.
But do we also have 10 fitted models, one for each training set?
Or do we have an average of all 10 fitted models?
I am able to get the coefficients and intercept of the fitted model via a single training dataset:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
lm = LinearRegression()
fitting = lm.fit(X_train, y_train)
fitting.coef_
fitting.intercept_
How do I get the intercept and coefficients for the fitted model via cross-validation?
What is the significance of cross_val_predict? Does it have any relation to my query?
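(For anyone with the same question, a minimal sketch, not from the video: cross_val_score fits a separate model on each fold's training data and returns only the scores, so there is no single averaged model. In recent scikit-learn versions, cross_validate with return_estimator=True keeps the ten fitted models; X and y are assumed to be your feature matrix and target.)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_validate
lm = LinearRegression()
# return_estimator=True keeps the 10 fitted models, one per fold
results = cross_validate(lm, X, y, cv=10, scoring='neg_mean_squared_error', return_estimator=True)
for i, model in enumerate(results['estimator']):
    print(i, model.intercept_, model.coef_)  # a separate intercept and coefficient set per fold
# cross_val_predict is related but different: for each observation it returns the prediction
# made by the model that was fit on the folds NOT containing that observation
y_pred = cross_val_predict(lm, X, y, cv=10)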
Excellent. Thanks for sharing your expertise.
Thank you!
best explanation, easy to understand. thank you so much
Thank you!
Your video tutorial is very good. Thank you for helping me understand these topics
You're welcome!
Hi Kevin, you are such a great teacher. Love your videos. Can't thank you enough!!
Thanks very much for your kind words! You are very welcome :)
Too good, sir; it's really very helpful. I request you to make a video on "How to select models from the various available ones".
Thanks for your suggestion!
Thanks a ton, perfectly explained the concept and the code
Great to hear!
Can't wait to see next lesson ! Bravo
Thanks!
Please consider more lessons on Data Visualization and representation as well. Thank you!
+Rayed Bin Wahed Thanks for the suggestion!
Very well explained!!!! Explanation is very impressive...
Thanks!
Great ! Thank you very much for sharing such a clear explanation 🙌
You're very welcome!
Great videos. Good approach combining thought process and tools.
+Ian Melanson Thanks!
Great job Kevin. Your videos are really helpful
Ogheneovo Dibie Glad you're enjoying them!
Thanks Kevin. Do you know of any resources that make it easy to import and transform my own datasets before using scikit-learn? My data samples contain numerical, categorical, and boolean features. Thanks Kevin
Ogheneovo Dibie I mostly use Pandas for data reading and transformation. I demonstrate Pandas in this video: th-cam.com/video/3ZWuPVWq7p4/w-d-xo.html
Does that help?
These videos have been very useful and excellent! Thanks!
+Dinesh Kumar Murali Great, thanks for your kind words!
This is really a very good video. Easy to understand
Thanks!
Looking forward to the next video on this!
tabnaka Great! It will come out in about two weeks.
Best tutorial on YouTube
Thank you!
One of the best explanations. Thanks
Thanks for your kind words!
Awesome explanation....!
Thanks!
I couldn't understand the feature engineering concept at 31:22. Any example?
Dude, honestly, this is golden material. Where do you teach? I want to watch more of your tutorials. please send a link or something.
Thanks very much! Currently I teach online only. You can find more of my tutorials here, and also sign up for my newsletter: www.dataschool.io/
I'll be announcing new tutorials, webcasts, and/or courses in the coming months. Stay tuned! :)
Hi Kevin, how are you!! First of all, thanks for your videos; you are awesome for posting them for us, and also because you are a great teacher and very good at explaining.
I have two questions:
1. Why is cross-validation better than fitting (or training) a model with all the data?
2. Is cross-validation a useful method for time series?
Very Informative and well presented. Thanks a lot for sharing
You're very welcome! Hope you enjoy the rest of the series: th-cam.com/play/PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A.html
The difference between the average RMSE with the Newspaper feature and the 10-fold cross-validation RMSE without the Newspaper feature seems quite negligible to me. Could you walk through the logic behind deciding when the difference is small enough to drop a feature versus when to keep it? Thanks!
And, BTW: your video series is amazing! Keep it up!
Thanks for your kind words!
The simple answer is that you should always prefer a simpler model (fewer features), unless having more features provides a "meaningful" increase in performance. There's no strict definition of "meaningful"; it depends on context. Hope that helps!
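(For concreteness, a minimal sketch of that comparison, assuming the advertising DataFrame data from the video with columns TV, Radio, Newspaper, and Sales; in recent scikit-learn versions the scorer is named 'neg_mean_squared_error'.)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lm = LinearRegression()
y = data['Sales']
for cols in [['TV', 'Radio', 'Newspaper'], ['TV', 'Radio']]:
    X = data[cols]
    mse_scores = -cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
    print(cols, np.sqrt(mse_scores).mean())  # average RMSE; keep Newspaper only if it meaningfully lowers this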
Good video, it makes the concepts clear and easy to understand...
Thanks!
Great video! The only line of code that I needed to update is reshaping the data to pass into the binarize function and then flattening the returned ndarray:
y_pred_class_2 = binarize(y_pred_prob.reshape((192, 1)), threshold=0.3).flatten()
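(A simpler equivalent, a sketch assuming y_pred_prob is a 1-D NumPy array of predicted probabilities: the threshold can also be applied directly without reshaping.)
# apply the 0.3 threshold directly; values greater than 0.3 become class 1
y_pred_class_2 = (y_pred_prob > 0.3).astype(int)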
Great tutorial!! I am glued to my seat. Please, I want to be sure: should we make all the values (scores) positive even when only one is negative?
Thanks for your kind comment! I can't think of a case when some scores are positive and others are negative. Have you experienced that?
Nice to see your video after a long time. I have some confusion that I hope will be cleared up by your response.
1. Please correct me if I misunderstood: the drawback of train/test split is high variance, i.e. differences between the training and testing data will affect the testing accuracy.
2. I really want to know what the random_state parameter does when you change it from 4 to 3, 2, 1, and 0.
3. For classification, you mentioned stratified sampling to create the K folds. How does it affect the accuracy of the model (e.g., if out of 5000 observations in my collected dataset I have 80% ham and 20% spam mail, versus 50% ham and 50% spam)?
4. Since you have only numerical features, you used accuracy as the metric to select the best features. What do you suggest if the dataset contains object data types like dates, and string objects like text data?
I am a new student of data science, so I am sorry for the long comment.
unique raj Great questions! My responses:
1. The disadvantage of train/test split is that the resulting performance estimate (called "testing accuracy") is high variance, meaning that it may change a lot depending upon the random split of the data into training and testing sets.
2. Try removing the random_state parameter, and running train_test_split multiple times. Every time you run it, you will get different splits of the data. Now use random_state=1, and run it multiple times. Every time you run it, you will get the same exact split of the data. Now change it to random_state=2, and run it multiple times. Every time you run it, you will get the same exact split of the data, though it will be different than the split resulting from random_state=1. Thus, the point of using random_state is to introduce reproducibility into your process. It doesn't actually matter whether you use random_state=1 or random_state=9999. What matters is that if you set a random_state, you can reproduce your results. (A code sketch of this behavior appears after this reply.)
3. In this context, stratified sampling relates to how the observations are assigned to the cross-validation folds. The reason to use stratified sampling in this context is that it will produce a more reliable estimate of out-of-sample accuracy. It doesn't actually have anything to do with making the model itself more accurate.
4. My choice of classification accuracy as the evaluation metric is not actually related to the data types of the features. Your features in a scikit-learn model will always be numeric. If you have non-numeric values that you want to use as features, you have to transform them into numeric features (which I will cover in a future video).
Hope that helps!
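(A minimal sketch of the random_state behavior described in point 2, using the iris data as an example.)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
# no random_state: a different split every time this line runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# random_state=1: exactly the same split every time (reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# random_state=2: also reproducible, but a different split than random_state=1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2)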
Around 29:40, when you're redoing the cross-validation with just the TV and Radio columns, wouldn't you need to refit the linear model? The previous one was fitted using TV, Radio, and Newspaper, but this new one involves only TV and Radio.
Missed your videos for the past few weeks. Feels good to resume!
I have a question about testing metrics in general:
Once a model is checked with a train/test split or cross-validation, and the accuracy is judged to be good enough, do you build the final model on the entire dataset, or just use the model built earlier on a subset of the data?
Siddarth Jay Great! The next video will be out later this week :)
Yes, you should build the final model on all of your data, using the tuning parameters you selected via train/test split or cross-validation. Otherwise, you will be discarding valuable training data!
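(A minimal sketch of that final step, assuming X and y are the full iris feature matrix and target, and that n_neighbors=20 is the value your cross-validation selected; the specific value is illustrative.)
from sklearn.neighbors import KNeighborsClassifier
# refit the chosen model on ALL of the data before making real predictions
knn = KNeighborsClassifier(n_neighbors=20)  # use whatever value your cross-validation selected
knn.fit(X, y)  # train on the entire dataset
knn.predict([[3, 5, 4, 2]])  # predict the class of a new, out-of-sample observation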
Thanks!
you are amazing. Thank you for creating this course.
You're very welcome! I'm glad it's helpful to you!
@Data School I have a question about minute 29:44: you conclude that we choose the second model because its score is lower than the other one. But we already negated and flipped the sign of the mean_squared_error, so why do we choose the smaller number? I thought it should be the higher one. Thanks!
Any metric that is named "error" is something you want to minimize. Thus we chose the second model because it minimized RMSE. Hope that helps!
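(A minimal sketch of the sign handling, assuming X and y are the advertising features and target; in recent scikit-learn versions the scorer is named 'neg_mean_squared_error'.)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# scikit-learn negates MSE so that "higher is better" holds for every scorer
scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring='neg_mean_squared_error')
rmse = np.sqrt(-scores).mean()  # undo the negation, then convert MSE to RMSE
print(rmse)  # RMSE is an error metric again, so the model with the LOWER value wins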
Why does scoring='accuracy' work on the iris data and not on the sales data? I understand one is a categorical problem and the other is continuous, but how does the program know that, if you used linear regression for both?
BTW, your videos are remarkable, very clear and to the point. Serious teaching talent!
Glad you like the videos!
It's not that scikit-learn knows we used linear regression or another model. It's that certain model evaluation metrics only work for continuous data, and others only work for categorical data.
Hope that helps!
So if you tried accuracy evaluation with the sales data, I guess you would get a very low percentage, right?
Exactly! Possibly an accuracy of zero.
Your explanations are undoubtedly lucid and simple to understand. Thanks for the tutorial series.
During feature engineering, which is normally done before CV, the set of features is learned from the training set (i.e. suppose my model is dealing with linguistic data; in each cross-validation split, it should handle unknown words that weren't part of the training set). But in the K-fold method, to follow the above procedure the model will have to learn features k times, from k different training sets, which might be expensive. If instead all the features are learned beforehand from the entire available dataset, the model might get a slight advantage and won't handle unknown features accurately.
Could you give guidelines for effectively handling such cases?
Thank you again!
Glad you like the series! I think I understand most of your question... and this video will help to answer it:
th-cam.com/video/ZiKMIuYidY0/w-d-xo.html
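(One common way to handle this, a sketch rather than the video's exact approach: put the feature-learning step and the model into a single Pipeline, so the vocabulary is re-learned from only the training folds inside each split. texts and labels are hypothetical placeholders for a linguistic dataset.)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# the vectorizer is refit inside each fold, so words seen only in a test fold stay unknown
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
# texts: list of strings, labels: list of classes (hypothetical placeholders)
scores = cross_val_score(pipe, texts, labels, cv=10, scoring='accuracy')
print(scores.mean())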
At minute 22:15: why is a KNN model with a higher K less complex than one with a lower K? From the perspective of computational complexity it should be the opposite, because with a lower K fewer neighbours have to be found, shouldn't it?
And of course: Thanks for the great video.
Great question! I don't have a short answer, but this article should help explain: scott.fortmann-roe.com/docs/BiasVariance.html
Thank you for your super clear and thorough videos. Do you talk about the validation/dev set in your workflow course? I have a hard time with that concept, but my teachers require that we use it. It gets confusing for me when I try to select a model. Because I've heard you should tune your models to the validation set, and then get an "unbiased" score on the test set. But they say you're supposed to select your model according to the validation score. I don't understand why, because I thought that was biased because we used it to tune the models. Is the validation score the better of the two to use to select the model because if you use the test set to select the model, then we're biased to the test set and then have no unbiased score left? I almost feel like there should be one more split in the data, to create a second test set. So the first test set will be used to choose the model, and the second test set is used to show an unbiased score. There is honestly a lot of conflicting information about this entire topic on the internet. I've seen people say the model should be chosen based on the test set, and the validation set is only present to help tune the model. Cross validation is not always possible, because it can be very slow with big datasets.
Yes, I do talk about this issue in my newest ML course. It's a complicated topic, but the bottom line is that whether or not you "need" to do an extra split (train/test/validation instead of just train/test) depends on whether your goal is only to choose the optimal model or also to estimate that model's performance on out-of-sample data. Also, you need to be mindful that an extra split can have a negative side effect if you have an insufficient amount of training data.
If you decide you want to enroll, you can do so here: gumroad.com/l/ML-course?variant=Live%20Course%20%2B%20Advanced%20Course
Hope that helps!
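(For reference, one way to create the extra split described above, a sketch rather than the course's exact workflow, assuming X and y are your features and target.)
from sklearn.model_selection import train_test_split
# first split: hold out 20% of the data as the final, untouched test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# second split: carve a validation set out of the remaining 80% (0.25 of 80% = 20% overall)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1)
# tune and select models on X_val/y_val, then report one final score on X_test/y_test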
@@dataschool Thanks for the reply!
Hi Kevin,
Are you going to provide any tutorial on neural networks? The currently available public tutorials are great, but they mainly use massive datasets such as MNIST and also do not pay enough attention to how to decide on the architecture of the model, for instance, how to choose the number of inputs, outputs, hidden layers, and nodes in the hidden layers! Therefore a tutorial using smaller and simpler datasets, such as iris or wine, would be very highly appreciated.
Thanks so much for the suggestion! I will strongly consider creating this in the future.
I've just finished the new video, and it was as good as always.
I have some questions.
1. I understand that the concept of CV is to get a more accurate estimate of model performance by dividing the sample dataset into many folds. But CV by itself cannot make the model more accurate.
In fact, there's no relation between the CV value and the model's accuracy. Am I right?
2. So it means that merely using a high CV value does not give the model high accuracy.
3. But using a high CV value can make the model more general (because the model has to iterate over many small datasets), which means it can avoid overfitting. Am I right?
4. CV can give us a more accurate performance estimate, but it is not an automatic method.
5. Using GridSearchCV gives us an automatic way (searching for the right params, the right models...).
The above is my summary so far. Please comment if I have anything wrong.
Also, I'm very curious these days about feature selection, and about handling large amounts of data, like xxGB that cannot be loaded into a local PC's memory. Some googling taught me there is an out-of-core approach, but not all ML algorithms support it.
SungDeuk Park Thanks for your kind comment! In summary, I would say that the goal of cross-validation is to more accurately estimate the out-of-sample performance of the model. Just using cross-validation does not make the model itself more accurate, and varying the number of cross-validation folds does not make the model more accurate. Instead, using cross-validation allows you to choose between models in an intelligent fashion.
GridSearchCV does provide some automation, and I will cover it in the next video. However, there are computational limits on how many models and how many sets of parameters can be searched, so it's by no means guaranteed to find you the best model.
You are correct that not all models support out-of-core learning. Here's a useful article from scikit-learn on that topic: scikit-learn.org/stable/modules/scaling_strategies.html
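(A minimal GridSearchCV sketch ahead of that next video, assuming X and y are the iris features and target.)
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# GridSearchCV tries every value in the grid with 10-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)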
Data School Thank you for your kind and quick answer. I now understand more clearly the usage and limitations of CV, and when I should use it.
- And about the feature selection question I asked: should I approach it in a somewhat intuitive way? (Of course, I have to understand the meaning of every feature and also get to know the system's context...)
- And after reading scikit-learn's scaling strategies page, what should I do when I have to use algorithms that cannot support out-of-core or incremental learning? (In fact, I cannot understand the first two approaches, which are a way to stream instances and a way to extract features from instances.)
Is there any practical way to handle a large amount of data, or would summarizing the raw dataset into a smaller dataset be the only way?
SungDeuk Park Sorry, these are very complex questions that can't be answered simply! :)