How to find the best model parameters in scikit-learn

  • Published Oct 19, 2024

Comments • 344

  • @dataschool
    @dataschool  3 years ago +11

    Having problems with the code? I just finished updating the notebooks to use *scikit-learn 0.23* and *Python 3.9* 🎉! You can download the updated notebooks here: github.com/justmarkham/scikit-learn-videos

    • @compton8301
      @compton8301 3 years ago

      And thanks for updating it.

  • @MrGuruPuru
    @MrGuruPuru 7 years ago +81

    just brilliant. Not an inch of fluff in what you teach and you do it so beautifully.

    • @dataschool
      @dataschool  7 years ago +1

      Wow, thank you so much for your very kind words!

    • @preethamreddy6726
      @preethamreddy6726 5 years ago +1

      @@dataschool Thank you for such a nice explanation. Could you please tell me
      why I can't get the first element of the tuple? Thank you
      print(grid.cv_results_[0])
      KeyError Traceback (most recent call last)
      ----> 1 print(grid.cv_results_[0])
      KeyError: 0
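For anyone hitting the same `KeyError`: `cv_results_` is not a tuple or a list but a dictionary keyed by strings, so integer indexing fails. A minimal sketch of how to index into it instead (assuming the iris-based grid search from the video):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# a small grid over n_neighbors, in the spirit of the video
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [1, 5, 13]},
                    cv=5)
grid.fit(X, y)

# cv_results_ maps string keys to arrays, so grid.cv_results_[0] raises
# KeyError: 0; index the arrays stored under those keys instead
print(sorted(k for k in grid.cv_results_ if k.startswith('mean')))
first_mean = grid.cv_results_['mean_test_score'][0]  # score of the first candidate
print(round(float(first_mean), 3))
```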

  • @roshanrajsingh4838
    @roshanrajsingh4838 3 years ago +1

    This is the most straightforward tutorial I've ever seen. I haven't skipped a second of this video.

    • @dataschool
      @dataschool  3 years ago

      Glad it's helpful to you! 🙌

  • @AbhisarMohapatra
    @AbhisarMohapatra 8 years ago +24

    There's a humbleness in the way you teach and deliver the explanations. It helps me understand very very well whatever the topic is.

    • @dataschool
      @dataschool  8 years ago +1

      Wow, thank you! I really appreciate your comment.

    • @MattyTkernow
      @MattyTkernow 6 years ago

      That is exactly what I was thinking. If you can't explain it simply, you don't understand it well enough (Einstein). And you, sir, know what you are talking about!

  • @anefuoche1053
    @anefuoche1053 4 years ago +1

    At some point my eyes became teary while watching this series. I have never come across such an amazing and passionate teacher. You explain every single thing; even the questions that pop up in my mind, it feels like you foresee and address them. What's baffling is that you even tag them as questions before answering them. The additional resources are also pure gold. May God bless you, sir; may you live long and always be happy.

  • @mamacita5636
    @mamacita5636 2 years ago +1

    Pls don’t stop making these videos you’re saving my dissertation 😭😭 thank you so much

    • @dataschool
      @dataschool  2 years ago

      You're very welcome!

  • @jatingogia4633
    @jatingogia4633 4 years ago +1

    I like the way you explain everything so slowly and briefly. Thank you for such a quality content on ML!

    • @dataschool
      @dataschool  3 years ago

      Thanks for your kind words!

  • @kevinalkindy
    @kevinalkindy 3 years ago +1

    Thank you for making this video. What a crystal clear explanation!

    • @dataschool
      @dataschool  3 years ago

      You're very welcome! 🙏

  • @dhazra1
    @dhazra1 7 years ago

    Kevin, you are an awesome teacher. Not only your knowledge; your teaching style is also so good that it can make extremely happy both newcomers to machine learning and those already in the field who need to understand the concepts further. I have gone through quite a few of your videos on YouTube, and everywhere it's just outstanding!!!!

    • @dataschool
      @dataschool  7 years ago

      Wow, thank you very much for your kind comments! I'm so glad the videos have been helpful to you!

  • @alexisparenty9445
    @alexisparenty9445 5 years ago +3

    Wow! Brilliant. Very well explained; you speak very clearly too. You certainly have a gift for teaching. Keep up the good work.

    • @dataschool
      @dataschool  5 years ago

      Thanks very much for your kind words! :)

  • @perevales
    @perevales 9 years ago +2

    Your videos are excellent. You have the ability to explain ML using Python in a very accessible way.
    I usually use R, but now, thanks to your great tutorials, I will start using Python.
    Thanks and please don't stop producing these fantastic tutorials.

    • @dataschool
      @dataschool  9 years ago

      +Pedro Carmona Ibáñez Thanks for your kind comments! I just published a new tutorial (55 minutes): th-cam.com/video/85dtiMz9tSo/w-d-xo.html
      As well, I'm teaching a new course on Machine Learning with Text: www.dataschool.io/learn/

  • @shahriarrahman6482
    @shahriarrahman6482 6 years ago

    I watched the whole playlist in 1 day!! This is one of, if not the, best sklearn tutorials out there.

    • @dataschool
      @dataschool  6 years ago

      Awesome, thank you!

  • @pankajmathur1504
    @pankajmathur1504 9 years ago

    Just want to say, all the videos in the scikit-learn series are excellent. You have made it so easy to understand the complex topics of cross-validation and optimal parameter tuning. Keep up the good work. Can't wait for the next one...

    • @dataschool
      @dataschool  9 years ago

      +Pankaj Mathur Awesome! Thanks so much for your kind words.

  • @vl4n7684zt
    @vl4n7684zt 9 years ago +24

    If you are running Python 3.4 and getting a "Parameter values should be a list" error, then make the following adjustment to the code to properly read in the range:
    k_range = list(range(1,31))
    Also, as always, print statements in 3.4 require parentheses.

    • @dataschool
      @dataschool  9 years ago

      +Stacy H Thanks for passing along that tip!

    • @Dualphase90
      @Dualphase90 8 years ago

      Thanks Stacy! :)

    • @chetankv7218
      @chetankv7218 7 years ago

      We can use numpy.arange(1,31) instead of range.

    • @dataschool
      @dataschool  6 years ago

      Thanks! I recently updated the code to use Python 3.6 and scikit-learn 0.19.1. The updated code can be found here: github.com/justmarkham/scikit-learn-videos
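For later Python 3 readers, the fix discussed in this thread in runnable form: a sketch of the parameter-grid setup, assuming the video's range of k values.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# range() is lazy in Python 3; some older scikit-learn versions wanted a
# concrete list, so wrapping it in list() works in both Python 2 and 3
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
print(len(param_grid['n_neighbors']))  # 30 candidate values
```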

  • @nehathakar5622
    @nehathakar5622 5 years ago

    Your explanation makes each and every topic so simple and easy to understand. Really, I would like to thank you for your immense efforts for all your videos which are so informative and added resources help us to dig deeper in the related topics.

    • @dataschool
      @dataschool  4 years ago

      Thanks very much for your kind words! 😊

  • @dataschool
    @dataschool  6 years ago +14

    *Note:* This video was recorded using Python 2.7 and scikit-learn 0.16. Recently, I updated the code to use Python 3.6 and scikit-learn 0.19.1. You can download the updated code here: github.com/justmarkham/scikit-learn-videos

  • @musasall5740
    @musasall5740 6 years ago +2

    You bring it in every video and make it easy to understand!

    • @dataschool
      @dataschool  6 years ago +1

      Thanks for your kind comment!

  • @evanmiller29
    @evanmiller29 9 years ago

    Data School thank you so much for these videos. You're showing all the topics that no one has put the time to explain to us plebs. Keep the good work up!

    • @dataschool
      @dataschool  9 years ago

      Evan Miller You're very welcome!

  • @BobDuCharme
    @BobDuCharme 9 years ago

    These videos are all really great, and I particularly like how the iPython notebooks serve so well to help me review what was in any particular video.
    I'd like to put in a vote for one of the future videos to cover clustering.

    • @dataschool
      @dataschool  9 years ago

      Bob DuCharme Great to hear... I spend a lot of time crafting these notebooks! And, thanks for the feedback, I will certainly consider clustering.

  • @sidk5919
    @sidk5919 8 years ago +14

    You are an amazing teacher... thank you so much.

    • @dataschool
      @dataschool  8 years ago

      You're welcome! Thanks for your kind words :)

  • @syedmohdsohail3506
    @syedmohdsohail3506 8 years ago

    Kevin and Data School, you have been angels to me. I learned a hell of a lot about Machine Learning; thank you so much.

    • @dataschool
      @dataschool  8 years ago +1

      You're very welcome! I'm glad the videos have been helpful to you!

  • @NB19273
    @NB19273 5 years ago

    Succinct, self-contained, and very clear. The best video on YouTube I've found so far explaining parameter search. Thanks so much!

    • @dataschool
      @dataschool  5 years ago

      Excellent, that's great to hear!

  • @ohserra
    @ohserra 9 years ago

    A M A Z I N G! In one day I've learned what I need to get into machine learning in Python and scikit-learn. I've been using MATLAB for a while, and the possibilities of scikit-learn and the IPython notebook are overwhelming. Thank you again! Great job ;) I hope to keep hearing from you!

    • @dataschool
      @dataschool  9 years ago

      +Diogo Gonçalves Wow! What a kind and thoughtful comment! I greatly appreciate it.

  • @sinanwannous
    @sinanwannous 4 years ago +1

    Thank you so much, one of the best and clearest explanations I've ever come across!!

    • @dataschool
      @dataschool  4 years ago +1

      Great to hear!

  • @slowcoding
    @slowcoding 6 years ago

    Your lectures are very clear and easy to understand. Excellent!!!

    • @dataschool
      @dataschool  6 years ago

      Thanks for your kind comment!

  • @behnoushpejhanmanesh4353
    @behnoushpejhanmanesh4353 1 year ago

    Your teaching and explanations are amazing Kevin! Many thanks for all the effort you have put on preparing these tutorials.
    Do you have any tutorials about "DecisionTreeClassifier", to which you briefly refer in this video?

    • @dataschool
      @dataschool  1 year ago

      Thanks for your kind words! I don't have a video tutorial about decision trees, but I do have this lesson notebook: github.com/justmarkham/DAT8/blob/master/notebooks/17_decision_trees.ipynb
      Hope that helps!

  • @imshafay
    @imshafay 6 years ago +1

    You're the best, mate. I wanted to go through a Machine Learning basics revision, and I'm quite amazed; this is the best video I have ever watched on YouTube, in fact the best among all the reading platforms as well.

    • @dataschool
      @dataschool  6 years ago

      Awesome! Thank you so much for your kind words! If you want to support Data School, my Patreon campaign might interest you: www.patreon.com/dataschool/overview

  • @betulchamplin3642
    @betulchamplin3642 4 years ago

    your classes are simply AMAZING, thank you so much for all your efforts putting them together!

  • @compton8301
    @compton8301 3 years ago +1

    You're amazing. Thanks for the free knowledge.

  • @minchin8041
    @minchin8041 7 years ago

    Best grid search tutorial on YouTube. Thanks!

    • @dataschool
      @dataschool  7 years ago

      You're welcome! Glad it was helpful to you!

  • @abhisheksalvi2438
    @abhisheksalvi2438 5 years ago

    This is really helpful. Grateful for your efforts taken in creating this resource.

  • @xiangxinzhang1770
    @xiangxinzhang1770 1 year ago +1

    In a typical workflow, you would split your original dataset into a training set and a separate test set. Then, during cross-validation, you further divide the training set into multiple subsets (folds) and iteratively train and evaluate the model on these folds. The test set remains untouched and is used for final evaluation after the model selection or parameter tuning process.

    • @dataschool
      @dataschool  9 months ago

      Whether or not you need to do that depends on your goals: Whether you are solely trying to select the best model, or whether you also need an accurate estimate of how that best model will perform on out-of-sample data. The workflow you are proposing is a good one if you care about both goals and you have enough data such that the initial split does not compromise the first goal.
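The workflow described in this thread can be sketched like this (iris used as a stand-in dataset, and the split size is a hypothetical choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# hold out a test set that cross-validation never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# tune on the training set only: GridSearchCV splits X_train into folds
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(1, 31))},
                    cv=5)
grid.fit(X_train, y_train)

# final estimate of the chosen model's performance on untouched data
test_score = grid.score(X_test, y_test)
print(grid.best_params_, round(test_score, 3))
```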

  • @mdinesk
    @mdinesk 9 years ago

    I would love to see more such videos, especially on handling huge datasets with categorical values

    • @dataschool
      @dataschool  9 years ago

      +Dinesh Kumar Murali Thanks for the suggestion! I'll take it into consideration.

  • @russelllavery2281
    @russelllavery2281 4 years ago

    Fantastic. We should all join his support group. 5 a month is cheap

  • @AxlRulz666
    @AxlRulz666 7 years ago

    I like your talking speed. It gives me enough time to absorb the concepts. Maybe it's just me.
    Anyway, thank you very much!

    • @dataschool
      @dataschool  7 years ago

      Great to hear - you're very welcome! :)

  • @fernandonogueira2291
    @fernandonogueira2291 3 years ago

    You're the master! Thanks for such a well-explained video.

  • @RexhepShijaku
    @RexhepShijaku 3 years ago

    15:08 Just to add something, there is another piece of advice like this: it is better to select an odd value for k, since it may avoid ties during classification. In this case [in the example in the video] we have the best scores when k=13, k=18, k=20, so it probably picked 13 because it is an odd number. I may be wrong, but I heard it somewhere :)

  • @nastarankianersi104
    @nastarankianersi104 4 years ago +1

    Thank you so much for this clear and helpful tutorial and all the effort you put in to your work 🌸^^

    • @dataschool
      @dataschool  4 years ago

      You're very welcome!

  • @cansu4333
    @cansu4333 2 years ago +1

    I have a question but I should express my compliments before asking it. The way you explain things is marvelous, and one can understand where things come from clearly. Really, thank you so much.
    The question is: what about splitting the data at the beginning of the whole process, fitting the grid search on X_train, y_train, and, with the best parameter found, testing the model on X_test? (I did this just to pretend to have real out-of-sample data.) When I did it, the result was not that different. Now I wonder why that is so: is it because of the smallness of the data, or because what I did does not change the results much anyway?

    • @dataschool
      @dataschool  2 years ago +1

      Thank you for your kind words! 🙏
      As for your question: whether or not you need to split beforehand depends on your goals. It's complicated to explain briefly, I'm sorry! But I'll be covering it in my next ML course. Subscribe here for updates: www.dataschool.io/subscribe/

    • @cansu4333
      @cansu4333 2 years ago +1

      @@dataschool thank you so much. ☺️

  • @pookiechips5496
    @pookiechips5496 6 years ago

    Thank you abundantly for everything that you are doing for the world.

    • @dataschool
      @dataschool  6 years ago

      You are very welcome! :)

  • @brendensong8000
    @brendensong8000 3 years ago +1

    Another amazing video! Thank you!

  • @philscosta
    @philscosta 4 years ago

    Very good video, very good explanation. Thank you!

  • @asneogy
    @asneogy 9 years ago

    hello Kevin - really great videos, love your clear and precise way of explaining concepts yet keeping them accessible. a request - there are not too many seaborn tutorial videos out there. could you consider making some? thanks and keep rocking.

  • @SunilKalmady
    @SunilKalmady 8 years ago

    Brilliant! You have excellent teaching skills. Please help me understand this better: as with feature selection, should finding the optimal model (hyper)parameters also ideally be performed within each cross-validation iteration? If I am not mistaken, you recommend using the whole dataset (unsplit) for grid search in this video. Doesn't that constitute (some form of) training on the entire data? I would be happy if you can answer. Anyway, great stuff!!

    • @dataschool
      @dataschool  8 years ago +1

      +Sunil Kalmady Thanks for your kind words! You are correct that when searching for the optimal model, feature selection and feature engineering should *ideally* occur within each cross-validation fold. However, the question is always whether the added complexity is worth the increase in reliability of your performance estimates. This is a nice, short Q&A on that topic: stats.stackexchange.com/questions/92502/cross-validation-feature-information-outside-the-fold

    • @SunilKalmady
      @SunilKalmady 8 years ago

      +Data School Thank you for your answer!!

  • @greettheceo395
    @greettheceo395 7 years ago

    Ahhh, such an awesome tutorial!
    I'd been struggling with this part for very long; now it's clear.
    Thanks
    Regards

  • @AndreasMueller
    @AndreasMueller 8 years ago

    Very nice video series :) I'd be curious if you used any of my material apart from the video you linked to [everything I do is CC-0 unless published via a publisher so don't fear the lawyers]. I found the continuous talking / walking through for videos really hard, you do a much better job than me, I think.

    • @dataschool
      @dataschool  8 years ago

      +Andreas Mueller Thanks very much for the compliments - I'm flattered!
      No, I'm pretty sure I didn't use any outside materials, with the exception of some images. I basically created every notebook from scratch.
      Regarding the talking during videos: It takes some getting used to, for sure! I scripted this series (every word and every action), which made the recording easier, but took FOREVER (probably 150 hours of work for a 4-hour series!!)
      I changed tactics for my latest video series (pandas), which is mostly unscripted, and thus I can make videos a lot faster.
      Looking forward to your book! Tentative release date? :)

    • @AndreasMueller
      @AndreasMueller 8 years ago

      +Data School Summer. It was announced for June, but I don't think we'll make that. Stay tuned for an early release. I scripted my O'Reilly series (tried without a script first, totally failed) and it also took forever.
      I'll check out your pandas videos.
      Keep up the good work! I'm glad this material is openly available -- mine is not :-/

    • @dataschool
      @dataschool  8 years ago

      +Andreas Mueller Cool, I'll be on the lookout for the book this summer!

  • @JackSimpsonJBS
    @JackSimpsonJBS 9 years ago

    Thank-you so much for these amazing tutorials, I've learned so much from them and now I use scikit-learn frequently. Was this the final video or did you have any idea how many you were planning to make in this series?

    • @dataschool
      @dataschool  9 years ago

      ***** Excellent, glad to hear! I do have more videos planned, though I haven't yet decided how many more I will make. The next one will come out in a few weeks... stay tuned!

  • @DEEPAKKUMAR-sw6sb
    @DEEPAKKUMAR-sw6sb 5 years ago

    mesmerized by your explanations.
    Thank you so much!

    • @dataschool
      @dataschool  5 years ago

      You're very welcome!

  • @blocktopians
    @blocktopians 9 years ago

    Thanks. Great explanation. Your videos are very helpful and easy to understand.

    • @dataschool
      @dataschool  9 years ago

      Antony Mapfumo You're very welcome!

  • @ramlimbu886
    @ramlimbu886 7 years ago

    Pure gold! Thank you, Kevin!

    • @dataschool
      @dataschool  7 years ago

      You're very welcome! Glad this video was helpful to you!

  • @fablapp
    @fablapp 8 years ago +1

    Hi Kevin, thank you so much for sharing this with us. Enjoying every single minute of your videos. One question: I'm receiving a warning message when running the .predict method on the dataset: "Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample."
    Not sure I fully understand what it is about.
    Thanks again!

    • @dataschool
      @dataschool  8 years ago +1

      Glad you are enjoying the videos! Regarding the warning message, it's complicated to explain, but I discuss it in detail in this blog comment: www.dataschool.io/linear-regression-in-python/#comment-2521926219
      Hope that helps!

  • @lokeshpaladugula5793
    @lokeshpaladugula5793 5 years ago +1

    really awesome.... great work man... and thank youuuuu

  • @LonglongFeng
    @LonglongFeng 7 years ago

    At 19:26, the knn.fit(X, y) is to display the classifier parameters only, since we already instantiated the knn model with the best parameters (one step before the fitting).
    In other words, the 'fit' method will always give you the same output, no matter whether you fit(X_train, y_train), fit(X_test, y_test), or fit(X, y). It just shows the best parameters in the model, the same as print(knn).

    • @dataschool
      @dataschool  7 years ago

      knn.fit(X, y) is a necessary step in order to train the model with the data. Instantiating the model in the previous step does not train the model, rather it just prepares the model to be trained using certain parameters.
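Kevin's point can be verified directly: in current scikit-learn, predicting before `fit` raises `NotFittedError`, because instantiation only stores the parameters. A minimal sketch (iris and the video's sample row used as stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.exceptions import NotFittedError
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=13)  # instantiation: no training yet

try:
    knn.predict([[3, 5, 4, 2]])
except NotFittedError:
    print('predict fails before fit: the model holds parameters, not data')

knn.fit(X, y)  # this step actually trains the model
print(knn.predict([[3, 5, 4, 2]]))  # now prediction works
```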

  • @GuRuGeorge03
    @GuRuGeorge03 3 years ago

    Thank you so much for this. This is literally gold

  • @smaxwell89
    @smaxwell89 9 years ago

    Hi Kevin, thank you for making these videos as they have served as a very informative introduction to machine learning for me. I've watched each video of the series and studied the code thoroughly. Any recommendations as to what to do next as far as teaching myself machine learning? I'm planning to go back through the videos and study all of the supplemental resources you've provided and then complete the "An Introduction to Statistical Learning with Applications in R" course. Is this a good plan of action or should I focus more so on Python and scikit-learn? My main goal is to use my Physics background along with what I've been teaching myself to become a Data Scientist, but I would greatly appreciate your opinion on the matter. Thank you.

    • @dataschool
      @dataschool  9 years ago

      +smaxwell89 That sounds like a great plan! Here's some advice that I give my data science students: github.com/justmarkham/DAT7/blob/master/other/advice.md

  • @sandeeps_
    @sandeeps_ 9 years ago

    I found the video series very useful! Thank you! :) Do you plan to have more videos in the future?

    • @dataschool
      @dataschool  9 years ago +1

      ***** You're welcome! I will be creating more videos in the series, though it will be a few weeks from now before I have time to make the next one.

  • @MrMmahesh007
    @MrMmahesh007 7 years ago

    simply superb, you have amazing teaching skills. Great explanations and material suggested.
    I wanted to switch my career to MLearning and have been watching all your playlists. Once I complete all your videos, what is the best way to master?any suggestions.

    • @dataschool
      @dataschool  7 years ago

      Thanks so much for your kind words! My thoughts are here: www.dataschool.io/launch-your-data-science-career-with-python/

  • @swagatmishra9350
    @swagatmishra9350 4 years ago

    Thank you very much for such a very beautiful explanation!!!

    • @dataschool
      @dataschool  4 years ago

      Thanks for appreciating!

  • @augustinemalamsha9251
    @augustinemalamsha9251 4 years ago

    You are a great teacher. Optimization, grid search, and exhaustive search were a mist to me, but now, because of you, they are kind of at my fingertips.

  • @001JaNe100
    @001JaNe100 8 years ago

    For those, like me, who have an issue with the round line, I suggest: best_scores.append(round(float(rand.best_score_), 3)) # Maybe due to Python 3.5 ## And of course, "Merci Kevin"!

    • @dataschool
      @dataschool  8 years ago

      +Nebil Jabari That's interesting... rand.best_score_ should already be a float. What error do you get with the original line of code?

    • @001JaNe100
      @001JaNe100 8 years ago

      No error message shows up. It simply produces the array with all the numbers unrounded: 0.97999999999999998 instead of 0.98, no matter the value of the second argument (3, 4, 2, or 1...).
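The unrounded display in this thread is most likely a NumPy repr issue rather than a failure of `round`: `best_score_` can be a NumPy scalar, and older NumPy versions printed such scalars at full precision inside a list even after rounding. Converting to a plain Python float first, as suggested above, sidesteps it:

```python
import numpy as np

# what rand.best_score_ may hold: a NumPy scalar, not a builtin float
score = np.float64(0.97999999999999998)

# round() works either way, but casting to a builtin float first makes
# the rounded value display cleanly on every NumPy version
rounded = round(float(score), 3)
print(rounded)  # -> 0.98
```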

  • @hocinetedjani7376
    @hocinetedjani7376 7 years ago

    Hello,
    Thank you for your courses :D They are so clear.
    I have a question:
    Do you think I can use GridSearchCV with AdaBoostClassifier? Because I have a dataset with a low frequency of y=1.

    • @dataschool
      @dataschool  7 years ago +1

      Sure, GridSearchCV can be used with any model for parameter tuning. Good luck!
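A sketch of that combination (the grid values are hypothetical; with a rare positive class, passing a scoring metric such as `scoring='roc_auc'` may also help, so accuracy on the majority class doesn't dominate):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# toy imbalanced dataset standing in for the low-frequency y=1 case
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

param_grid = {'n_estimators': [50, 100], 'learning_rate': [0.5, 1.0]}
grid = GridSearchCV(AdaBoostClassifier(random_state=0),
                    param_grid, cv=5, scoring='roc_auc')
grid.fit(X, y)
print(grid.best_params_, round(float(grid.best_score_), 3))
```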

  • @hmscfch.7216
    @hmscfch.7216 6 years ago

    Hi, Kevin. I noticed that the code runs slower on my PC (which is a standard one). Can I ask what hardware requirements you recommend? Or any solutions to improve the processing speed? Thank you. And thank you for your 'quick-understanding' tutorials.

    • @dataschool
      @dataschool  6 years ago

      I'm not sure what hardware recommendations to make, I'm sorry!

  • @eldert1735
    @eldert1735 5 years ago

    Thanks for the video. I have a question. When you put the parameters in the GridSearchCV, you put knn as the estimator. 'knn' was instantiated with 'n_neighbors = k', looking at the code above. Does GridSearchCV ignore the things inside the parentheses of the estimator? Or do we have to put 'n_neighbors = k' inside KNeighborsClassifier, instead of a specific number, to make the code work?

    • @dataschool
      @dataschool  5 years ago

      The parameter grid passed to GridSearchCV will override the parameters passed to knn during instantiation. Hope that helps!
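A quick way to confirm the override behaviour described above (iris used as a stand-in dataset; the grid values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# n_neighbors=5 set at instantiation is only a placeholder here:
# every value in the param grid replaces it during the search
knn = KNeighborsClassifier(n_neighbors=5)
grid = GridSearchCV(knn, {'n_neighbors': [1, 13, 25]}, cv=5)
grid.fit(X, y)

print(grid.best_params_)                # chosen from the grid, not from knn
print(knn.get_params()['n_neighbors'])  # the original estimator is untouched
```

GridSearchCV clones the estimator for each candidate, which is why the original `knn` object keeps its instantiation-time value.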

  • @citizenR1203
    @citizenR1203 7 years ago

    Great ! Thanks for your ipython notebook.

  • @marathiManus10
    @marathiManus10 4 years ago

    Love the simplicity in your approach! Excuse my naive question: so how do I determine whether k=13 is optimal or k=17 is? If one were to use said KNN logic in a TRULY real-life situation, should he choose 13 or 17?

    • @dariuszspiewak5624
      @dariuszspiewak5624 4 years ago +1

      Not sure if I can take on Kevin's role here but I remember from his teachings that you should choose the parameter that makes your model SIMPLER. Since 17 gives you simpler decision boundaries, you should choose 17. In general, in KNN you should use the highest parameter out of the best performing ones.

  • @yangbadminlog
    @yangbadminlog 7 years ago

    Doesn't using range create a list? Why do we need to type list(range(1,31)) instead of range(1,31)?

    • @dataschool
      @dataschool  7 years ago

      The range function creates a list in Python 2, but it does not in Python 3. By explicitly converting the output to a list, you guarantee that the code will work in both Python 2 and 3.

  • @pradyparyal
    @pradyparyal 8 years ago

    I like your way of explanation. Thank you so much for your effort. Could you please upload the videos on unsupervised learning.

    • @dataschool
      @dataschool  8 years ago

      Thanks for your kind words! I'll consider your suggestion for the future.

  • @uniqueraj518
    @uniqueraj518 9 years ago

    Nice video.
    1. Please correct me if I am wrong: the concept of grid search is to select the best parameters, and cross-validation is for generalizing the model so that overfitting does not occur.
    2. If you train an SVM model using the scikit-learn library with a dataset other than iris, the training time is really high compared to other models, and the performance is also poor; I am confused about what the solution can be.
    3. Is there a way in IPython to use multiple clients at the same time (parallel computing)? I am lacking an idea of how it can be implemented here, I mean at the instantiating, fitting, or predicting stage. I would be happy to get an answer. I am waiting for your next video, about unsupervised learning (clustering).

    • @dataschool
      @dataschool  9 years ago

      +unique raj 1. Grid search is for parameter tuning, and cross-validation is for estimating how well your model will generalize. 2. Every model takes a different amount of time to train (depending on the model itself, the data, and the tuning parameters), and no particular model is guaranteed to work for any particular problem.

  • @DJH3891
    @DJH3891 3 years ago

    Very good explanation! I still have an issue understanding your statement at 19:35. I thought *not* training the model on all the data available was key, because otherwise you would have no test set to evaluate the model, or you would overfit it?

  • @deboratoshiekohara2723
    @deboratoshiekohara2723 4 years ago

    Thanks! Congrats, it's a great explanation! I was also checking some other videos of yours and a doubt appeared: I wanted to combine the pipe with the grid search. For instance, I tried to put the grid search within my pipeline and extract the results as pipe.cv_results_; however, I could not use .cv_results_ with pipe. Could you give any hint about combining a pipe for preprocessing and then a grid search? Maybe another topic for your videos. Thanks!

  • @silasmurithi4706
    @silasmurithi4706 5 years ago

    Awesome tutorial, I would like to know what happens if you tune parameters of different classification models to give similar predictions? Is it a good idea or a bad one? And how to go about it?

    • @dataschool
      @dataschool  5 years ago

      Glad you liked the tutorial! Not sure what you mean by "tuning to give similar predictions"... Instead, if you try tuning another model, your goal is to optimize for your chosen evaluation metric, not to make sure it gives similar predictions to any other model. Does that make sense?

  • @jrabyssdragon
    @jrabyssdragon 8 years ago

    Hi Kevin! I've a quick question. In this video you explained the SearchCV methods for tuning parameters. Also, I've found that there are some functions to perform feature selection with scikit learn. But I was wondering if there's a way to perform both feature selection and hyperparameter tuning within the same pipeline in scikit-learn. Any insight is appreciated.

    • @dataschool
      @dataschool  8 years ago +1

      Sure, you can do a GridSearchCV of a Pipeline that contains both a feature selector and a model. This example from the documentation shows something similar: scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
      Hope that helps!
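A sketch of that Pipeline-based combination (SelectKBest is chosen here purely as an illustrative selector, and the step labels 'select' and 'knn' are arbitrary names):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('select', SelectKBest(f_classif)),
                 ('knn', KNeighborsClassifier())])

# the 'stepname__param' syntax tunes the selector and the model together,
# with the feature selection re-fit inside each cross-validation fold
param_grid = {'select__k': [2, 3, 4],
              'knn__n_neighbors': [5, 13, 21]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```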

  • @saudnaeem
    @saudnaeem 5 years ago

    Learned a lot from this video. thank u

  • @sreemantokesh3999
    @sreemantokesh3999 5 years ago

    Your tutorials are so, so great. Thank you for everything, Kevin. I see that GridSearchCV and RandomizedSearchCV are kind of similar to Gradient Descent and Stochastic Gradient Descent. Am I wrong??? Can you do a video where you explain Stochastic Gradient Descent and its variations?

    • @dataschool
      @dataschool  5 years ago

      Thanks for your suggestion! I'll consider it for the future.

  • @aoife1902
    @aoife1902 7 years ago

    Absolutely brilliant, thank you so much!

    • @dataschool
      @dataschool  7 years ago

      You're very welcome!

  • @mallikarjunsuram4913
    @mallikarjunsuram4913 7 years ago

    Brilliant work dude

  • @balkiprasanna1984
    @balkiprasanna1984 7 years ago

    Thank you so much for this video. You're just amazing.

    • @dataschool
      @dataschool  7 years ago

      Wow, thanks so much! I really appreciate it :)

  • @pinkanrout818
    @pinkanrout818 6 years ago

    Simply awesome!!! Thanks a lot for all your videos.
    I have a question here: we use grid search to get the best hyperparameter, in this case n_neighbors. But how will it be helpful in linear/logistic regression? In linear regression we are not using any tuning parameter (linreg.fit(X, y)). In deep learning the role of the learning rate comes in, and maybe that's when it will be useful; I'm not sure how grid search helps with linear regression in scikit-learn.

    • @dataschool
      @dataschool  6 years ago

      You're right that linear regression has no tuning parameters. However, logistic regression does have a tuning parameter, so grid search is helpful there.
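For example, logistic regression's regularization strength `C` is a tunable parameter, so the same grid search pattern applies (a sketch with hypothetical grid values; iris as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# C controls regularization strength: smaller C = stronger regularization
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(float(grid.best_score_), 3))
```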

  • @marcelohanones9007
    @marcelohanones9007 6 years ago

    Great explanation!!!!! Thank you!!! I have a question. Why did GridSearchCV return 13 as the best param for n_neighbors while RandomizedSearchCV returned 17 (or 20, as in 08_grid_search.ipynb)? Looking at the plot, it seems one got the lowest possible value (13) and the other the highest (17 or 20).

    • @dataschool
      @dataschool  6 years ago

      I'm not sure I totally understand your question, sorry! But my general response would be that for such a small dataset, and on such an easy problem, the "best" result can vary slightly. Hope that is at least a bit helpful! :)
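
One illustration of why the two searches can disagree: RandomizedSearchCV only tries a sample of the grid, so on an easy dataset where several settings tie, the reported "best" can shift with the random sample. A sketch (dataset and parameter values illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_dist = {'n_neighbors': list(range(1, 31)),
              'weights': ['uniform', 'distance']}

# only n_iter of the 60 combinations are tried, so different
# random_state values can report different "best" parameters
rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist,
                          n_iter=10, cv=10, random_state=5)
rand.fit(X, y)

print(rand.best_params_, rand.best_score_)
```

Re-running with a different random_state (or comparing against an exhaustive GridSearchCV) often surfaces a different but nearly-tied winner.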

  • @chetakabra8
    @chetakabra8 9 years ago

    One of the best explanations using a Python notebook... can you please upload a video on feature selection?

    • @dataschool
      @dataschool  9 years ago

      +akash kabra I'll consider making one for the future, thanks!

    • @dataschool
      @dataschool  6 years ago

      You might be interested in my recent video about feature selection: th-cam.com/video/YaKMeAlHgqQ/w-d-xo.html

  • @jesuscortes9110
    @jesuscortes9110 8 years ago

    This is really wonderful material and really helpful for teaching. Is there any resource I might use for analytically calculating predicted probabilities? Thanks!

    • @dataschool
      @dataschool  8 years ago

      Thanks for your kind comments! I'm not sure what you mean by "analytically calculating predicted probabilities". Could you explain in more detail, or give an example? Thanks!

    • @jesuscortes9110
      @jesuscortes9110 8 years ago

      I mean, for a given classifier and a given feature vector, calculating the predicted probability for each class.

    • @dataschool
      @dataschool  8 years ago +1

      Thanks for clarifying! Yes, most scikit-learn classifiers include a 'predict_proba' method that outputs the predicted probabilities of class membership. I cover that in video 9 in this series: th-cam.com/video/85dtiMz9tSo/w-d-xo.html
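
For reference, a minimal sketch of predict_proba with the same KNN classifier used in the series (iris data used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# one row per sample, one column per class; each row sums to 1
proba = knn.predict_proba(X[:3])
print(proba.shape)
print(proba)
```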

  • @JCRMatos
    @JCRMatos 9 years ago

    Hello,
    Does grid.best_score_ take into account both the mean and the std or just the mean?
    Thanks for the series. Keep it up.
    JM

    • @dataschool
      @dataschool  9 years ago

      João Matos The documentation says it picks the "highest score", so I assume that means it only takes the mean into account (and not the standard deviation). However, if you find differently, please let me know!
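
In newer scikit-learn versions this is easy to check directly, since cv_results_ exposes both the mean and the standard deviation of the fold scores (iris data used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(KNeighborsClassifier(),
                    {'n_neighbors': [1, 5, 13]}, cv=10)
grid.fit(X, y)

# cv_results_ reports both statistics per parameter setting;
# best_score_ matches the highest mean_test_score (std is ignored)
print(grid.cv_results_['mean_test_score'])
print(grid.cv_results_['std_test_score'])
print(grid.best_score_)
```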

  • @MatthewAds
    @MatthewAds 9 years ago

    Super helpful videos!
    What do you do when you want to do classification on a dataset which consists of many categorical variables? As I understand it, scikit-learn models always expect numeric values? What is the best way to convert datasets full of categorical data represented by strings into integer representations for use in scikit-learn models? Is that even an approach you would advise?

    • @dataschool
      @dataschool  9 years ago

      Matt Adshead Great question! Yes, scikit-learn expects numeric values. For unordered categorical features, you generally represent them as dummy variables. For ordered categorical variables, you can represent them using "sensible" numeric values. More details are available in part 3 of this notebook: nbviewer.ipython.org/github/justmarkham/DAT7/blob/master/notebooks/12_advanced_model_evaluation.ipynb
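
A small sketch of both approaches with pandas (the column names and mappings here are made up for illustration): dummy variables for an unordered feature, "sensible" numeric codes for an ordered one.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green'],
                   'size':  ['S', 'M', 'L', 'M']})

# unordered categorical feature -> one dummy (0/1) column per category
dummies = pd.get_dummies(df['color'], prefix='color')

# ordered categorical feature -> numeric codes that preserve the order
df['size_num'] = df['size'].map({'S': 1, 'M': 2, 'L': 3})

print(dummies)
print(df[['size', 'size_num']])
```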

  • @karanjadhav6681
    @karanjadhav6681 5 years ago

    Hi Mark,
    It's very informative. Great video!!
    I have one doubt.
    I am working on a classification problem and have used 4 classifiers: logistic regression, SVM, decision tree, and KNN.
    Initially I used a normal train/test split to calculate the score, and it showed logistic regression with the highest accuracy.
    Then I used the cross-validation score, and it also showed logistic regression as the most accurate.
    But when I did parameter tuning using GridSearchCV and then calculated the cross-val score, it showed SVM with higher accuracy than logistic regression or KNN.
    Is this expected? Please help me with this doubt, Mark.

    • @dataschool
      @dataschool  5 years ago

      Results before tuning can be different from results after tuning. Does that help?

  • @tirthprakashbal3612
    @tirthprakashbal3612 8 years ago

    Hi Kevin,
    Thanks a lot for this awesome tutorial series. Can you suggest some datasets from Kaggle competitions on which I can practice this knowledge ?

    • @dataschool
      @dataschool  8 years ago

      I don't have any particular suggestions, but there are lots of past competitions in which the datasets are still available. I'd recommend choosing a competition in which the topic interests you!

  • @thisaintarf
    @thisaintarf 4 years ago

    Thank you very much sir, this video helped me a lot.

  • @gracezhao3168
    @gracezhao3168 6 years ago

    You are super helpful. Have to give you a thumbs up!

  • @josephkarpinski9586
    @josephkarpinski9586 5 years ago

    For those using Python 3.7, range(1, 31) may give you issues. If you print k_range and see range(1, 31) instead of a list of numbers,
    import numpy as np and replace range(1, 31) with np.arange(1, 31).

    • @dataschool
      @dataschool  5 years ago

      Thanks! You can also just wrap range in list: list(range(1, 31))
      You can find my code (which has been updated for Python 3) here: github.com/justmarkham/scikit-learn-videos

  • @kwenevdavidapine4661
    @kwenevdavidapine4661 3 years ago

    this video is very useful, thank you

    • @dataschool
      @dataschool  3 years ago

      Glad it was helpful!

  • @nathankong8732
    @nathankong8732 2 years ago

    Hi, quick question, why don’t you split the data before grid search? Also how would you do a confusion matrix using this model?

  • @suryagaur7440
    @suryagaur7440 5 years ago

    Brilliant as usual. Can you please make a video on nested and non-nested cross-validation?

    • @dataschool
      @dataschool  5 years ago

      Thanks for your suggestion!

  • @NR_Tutorials
    @NR_Tutorials 5 years ago

    Thanks for your videos and the nice voiceover. We love you, sir. Thanks!

  • @pengyan1906
    @pengyan1906 4 years ago

    You are doing great!

  • @mikemyers9524
    @mikemyers9524 6 years ago

    Many, many thanks. Just one question: if I do grid search, say, with an SVR over the parameter C [0.001, 0.01, 0.1, 1, 10, 100] and finally find an optimal parameter C, can I reduce the parameter space to this optimal C when I would like to do permutations for statistical testing? Or do I lose some information during the permutation procedure? THX!

    • @dataschool
      @dataschool  6 years ago

      It's a great question, but I'm not sure of the answer... sorry!

  • @JCRMatos
    @JCRMatos 9 years ago +1

    Hello,
    Can we save the gridsearch "knowledge" to predict at a later time?
    Isn't it strange that the RandomizedSearch repeated the combo 17, distance?
    It's assumed that it selects random combos, but it shouldn't select duplicates, should it?
    Thanks,
    JM

    • @dataschool
      @dataschool  9 years ago

      João Matos Great questions! Here is scikit-learn's documentation on model persistence, which should allow you to save the best model (grid.best_estimator_): scikit-learn.org/stable/modules/model_persistence.html
      Regarding duplicates, here is what the documentation says: "If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used." Since none of my parameters in that example were distributions, I would think that we shouldn't see any duplicates. However, either the documentation is incorrect on that point, or I'm misunderstanding what the documentation is trying to say.
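
To make the persistence part concrete, here is a sketch of saving grid.best_estimator_ with joblib (the approach described in scikit-learn's model persistence docs; the file path here is just illustrative):

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(KNeighborsClassifier(),
                    {'n_neighbors': [5, 13]}, cv=5)
grid.fit(X, y)

# persist only the refit best model, then reload it later to predict
path = os.path.join(tempfile.mkdtemp(), 'best_knn.joblib')
dump(grid.best_estimator_, path)
model = load(path)
print(model.predict(X[:3]))
```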

  • @ElectronicsInside
    @ElectronicsInside 5 years ago

    Do I have to split the data using the train_test_split function when using cross-validation?
    I have seen people using cross_val_score(SVC(), x=x_train, y=y_train, cv=10)
    But in the video we do cross_val_score(SVC(), x, y, cv=10)

    • @dataschool
      @dataschool  5 years ago +1

      I teach the latter, but it depends on your model evaluation process. There is not one "right" evaluation process, but some are better than others. Hope that helps!

    • @ElectronicsInside
      @ElectronicsInside 5 years ago

      @@dataschool I also want to know whether we have to scale our data before or after the train/test split. Some people told me that you should scale your data after the train/test split.

    • @dataschool
      @dataschool  5 years ago

      After
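
To illustrate that "after" answer: fit the scaler on the training set only, then apply the same fit to the test set, so no test-set information leaks into training. A sketch with StandardScaler (the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit on the training set only; reuse that fit to transform the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
```

Note that only the training set ends up with exactly zero mean; the test set is close but not exact, which is the intended behavior.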

  • @prakashyadav008
    @prakashyadav008 8 years ago

    amazing teaching..thanks

  • @ChandraShekhar-rn9ty
    @ChandraShekhar-rn9ty 7 years ago

    Hi Kevin:
    Thanks for all your awesome videos. I had an issue using the pytz module. I was wondering if you could help me out here.
    I am writing my code in IDLE, where I put import pytz at the top. However, I get an error saying: 'ImportError: No module named 'pytz''. I checked the status of the pytz module in my command prompt and it says the requirement is already satisfied. Interestingly, import pytz works in my Jupyter notebook.
    Any help will be highly appreciated.

    • @dataschool
      @dataschool  7 years ago

      Perhaps jupyter notebook and IDLE are accessing different Python instances, and one of them has pytz installed whereas the other does not? I'm not sure... good luck!

  • @sribastavrajguru304
    @sribastavrajguru304 6 years ago

    Great, thank you for your great work, it's really helpful!

    • @dataschool
      @dataschool  6 years ago +1

      Great to hear!

  • @user-cc8kb
    @user-cc8kb 6 years ago

    Nice. Thank you very much! :)

    • @dataschool
      @dataschool  6 years ago +1

      You're welcome! :)