Classification Trees in Python from Start to Finish

  • Published on Jul 30, 2024
  • NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: statquest.gumroad.com/l/tzxoh
    This webinar was recorded on 2020-05-28 at 11:00am (New York time).
    NOTE: This StatQuest assumes you are already familiar with:
    Decision Trees: • StatQuest: Decision Trees
    Cross Validation: • Machine Learning Funda...
    Confusion Matrices: • Machine Learning Funda...
    Cost Complexity Pruning: • How to Prune Regressio...
    Bias and Variance and Overfitting: • Machine Learning Funda...
    For a complete index of all the StatQuest videos, check out:
    statquest.org/video-index/
    If you'd like to support StatQuest, please consider...
    Buying my book, The StatQuest Illustrated Guide to Machine Learning:
    PDF - statquest.gumroad.com/l/wvtmc
    Paperback - www.amazon.com/dp/B09ZCKR4H6
    Kindle eBook - www.amazon.com/dp/B09ZG79HXC
    Patreon: / statquest
    ...or...
    YouTube Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshirt.com/statques...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer.bandcamp.com/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    5:23 Import Modules
    7:40 Import Data
    11:18 Missing Data Part 1: Identifying
    15:57 Missing Data Part 2: Dealing with it
    21:16 Format Data Part 1: X and y
    23:33 Format Data Part 2: One-Hot Encoding
    37:29 Build Preliminary Tree
    46:31 Pruning Part 1: Visualize Alpha
    51:22 Pruning Part 2: Cross Validation
    56:46 Build and Draw Final Tree
    #StatQuest #ML #ClassificationTrees

Comments • 582

  • @statquest
    @statquest  4 years ago +26

    NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: statquest.gumroad.com/l/tzxoh
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @ezzouaouia.r1127
      @ezzouaouia.r1127 4 years ago

      The site is offline. 11/07 12:00

    • @statquest
      @statquest  4 years ago

      Thanks for the note. It's back up.

    • @ezzouaouia.r1127
      @ezzouaouia.r1127 4 years ago

      @@statquest Thanks very much .

    • @dfinance2260
      @dfinance2260 3 years ago

      Still offline unfortunately. Would love to check the code.

    • @statquest
      @statquest  3 years ago

      @@dfinance2260 It should be back up now.

  • @funnyclipsutd
    @funnyclipsutd 4 years ago +67

    BAM! My best decision this year was to follow your channel.

    • @statquest
      @statquest  4 years ago +5

      BAM! :)

    • @mike19558
      @mike19558 3 years ago

      Mood, been so useful!

  • @renekokoschka707
    @renekokoschka707 3 years ago +7

    I just started my bachelor thesis and I really wanted to thank you!
    Your videos are helping me so much.
    You are a LEGEND!!!!!

    • @statquest
      @statquest  3 years ago +1

      Thank you and good luck! :)

  • @ccuny1
    @ccuny1 4 years ago +2

    I have already commented, but I watched the video again and I have to say I am even more impressed than before. Truly fantastic tutorial: not too verbose, yet with every action clarified and commented in the code, and beautifully presented (I have to work on my markdown; there are quite a few markdown formats you use that I cannot replicate... to study when I get the notebook). So all in all, one of the very top ML tutorials I have ever watched (including paid-for training courses). Can't wait for today's or tomorrow's webinars. I can't join in real time as I'm based in Europe, but I will definitely pick it up here and get the accompanying study guides/code.

    • @statquest
      @statquest  4 years ago

      Hooray!!! Thank you very much!!!

  • @montserratramirez4824
    @montserratramirez4824 4 years ago +7

    I love your content! Definitely my favorite channel this year
    Regards from Mexico!

    • @statquest
      @statquest  4 years ago +2

      Wow, thanks! Muchas gracias! :)

  • @1988soumya
    @1988soumya 4 years ago +3

    Hey Josh, it’s so good to see you are doing this, I am preparing for some interviews, it will help a lot

    • @statquest
      @statquest  4 years ago +1

      Good luck! :)

  • @jahanvi9429
    @jahanvi9429 1 year ago +5

    You are so so helpful!! I am a data science major and your videos saved my academics. Thank you!!

    • @statquest
      @statquest  1 year ago

      Happy to help!

  • @robertmitru7234
    @robertmitru7234 3 years ago +1

    Awesome StatQuest! Great channel! Make more videos like this one for the other topics. Thank you for your time!

    • @statquest
      @statquest  3 years ago

      Thanks! Will do!

  • @dhruvishah9077
    @dhruvishah9077 3 years ago +2

    I'm an absolute beginner and this is what I was looking for. Thank you so much for this. Much appreciated, sir!!

    • @statquest
      @statquest  3 years ago

      Glad it was helpful! :)

  • @3ombieautopilot
    @3ombieautopilot 4 years ago +2

    Thank you very much for this one! Your channel is incredible! Hats off to you.

  • @ccuny1
    @ccuny1 4 years ago +1

    Another hit for me. I will be getting the Jupyter notebook and some, if not all, of your study guides (I only just realised they existed).

    • @statquest
      @statquest  4 years ago

      BAM! :) Thank you very much! :)

  • @fuckooo
    @fuckooo 3 years ago +1

    Love your videos, Josh; the missing-values notebook sounds like a great one to do!

  • @beebee_0136
    @beebee_0136 2 years ago

    I'd like to thank you so much for making this stream cast available!

  • @nataliatenoriomaia1635
    @nataliatenoriomaia1635 3 years ago +1

    Great video, Josh! Thanks for sharing it with us. And I have to say: the Brazilian shirt looks great on you! ;-)

  • @jefferyg3504
    @jefferyg3504 3 years ago +1

    You explain things in a way that is easy to understand. Bravo!

    • @statquest
      @statquest  3 years ago

      Thank you! :)

  • @kaimueric9390
    @kaimueric9390 4 years ago +6

    I actually think it would be great if you created more videos for other ML algorithms. After teaching us almost every aspect of machine learning algorithms as far as the mechanics and related fundamentals are concerned, I feel it is high time to see them in action, and Python is, of course, the best way to go.

    • @statquest
      @statquest  4 years ago +4

      I'm working on them!!! :)

  • @liranzaidman1610
    @liranzaidman1610 4 years ago +10

    Josh,
    this is really great.
    Can you upload videos with some insights into your personal research and which methods you used?
    And some examples of why you prefer one method over another? I mean, not only because you get a better ROC/AUC result, but is there a "biological" reason for using a specific method?

    • @statquest
      @statquest  4 years ago +2

      Great suggestion!

  • @ozzyfromspace
    @ozzyfromspace 3 years ago +2

    I dunno how I stumbled on your channel a few videos ago, but you've really got me interested in statistics. Nice Work sir 😃

  • @xiolee7597
    @xiolee7597 4 years ago +4

    Really enjoy all the videos! Can you do a series about mixed models as well, random effects, choosing models, interpretation etc. ?

    • @statquest
      @statquest  4 years ago +4

      It's on the to-do list.

  • @aryamohan7533
    @aryamohan7533 3 years ago +1

    This entire video is a triple bam! Thank you for all your content, I would be lost without it :)

    • @statquest
      @statquest  3 years ago +1

      Glad you enjoyed it!

    • @lawrencegayundato8398
      @lawrencegayundato8398 3 years ago +1

      @@statquest This is Quadruple BAM!!!! Thank you Mr. Josh :)

  • @jonastrex05
    @jonastrex05 2 years ago +1

    Amazing video! One of the best out there for this Education! Thank you Josh

  • @anishchhabra5313
    @anishchhabra5313 2 years ago +1

    This is legen..... wait for it
    ....dary!! 😎
    This detailed coding explanation of decision trees is hard to find, but Josh, you are brilliant. Thank you for such a great video.

    • @statquest
      @statquest  2 years ago +1

      Glad you liked it!

  • @rhn122
    @rhn122 3 years ago +6

    Great tutorial! One question, by looking at the features included in the final tree, does it mean that only those 4 features are considered for prediction, i.e., we don't need the rest so we could drop those columns for further usage?

    • @statquest
      @statquest  3 years ago

      That is correct.

  • @DANstudiosable
    @DANstudiosable 4 years ago +5

    OMG... I thought you'd ignore me when I asked you to post this webinar on YouTube. I'm glad you posted it. Thank you!

  • @Mohamm-ed
    @Mohamm-ed 3 years ago +2

    This voice reminds me of listening to the radio in the UK. Love that. I want to go again.

  • @bayesian7404
    @bayesian7404 4 months ago +1

    You are fantastic! I'm hooked on your videos. Thank you for all your work.

    • @statquest
      @statquest  4 months ago

      Glad you like them!

  • @ericwr4965
    @ericwr4965 4 years ago +1

    I absolutely love your videos and I love your channel. Thanks for this.

  • @Kenwei02
    @Kenwei02 2 years ago +1

    Thank you so much for this tutorial! This has helped me out a lot!

    • @statquest
      @statquest  2 years ago

      Glad it helped!

  • @ravi_krishna_reddy
    @ravi_krishna_reddy 3 years ago +4

    I was searching for a tutorial related to statistics and landed here. At first, I thought this was just one among many low-quality tutorials out there, but I was wrong. This is one of the best statistics and data science channels I have seen so far, with wonderful explanations by Josh. Addicted to this channel and subscribed. Thank you, Josh, for sharing your knowledge and making us learn in a constructive way.

    • @statquest
      @statquest  3 years ago

      Thank you very much! :)

  • @gbchrs
    @gbchrs 2 years ago +1

    Your channel is the best at explaining complex machine learning algorithms step by step. Please make more videos!

    • @statquest
      @statquest  2 years ago

      Thank you very much!!! Hooray! :)

  • @user-lc8gc6vb3j
    @user-lc8gc6vb3j 10 months ago +2

    Thank you, this video helped me a lot! For anyone else following along in 2023, the way the confusion matrix is drawn here didn't work for me anymore. I replaced it with the following code:
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    import matplotlib.pyplot as plt

    cm = confusion_matrix(y_test, clf_dt_pruned.predict(x_test), labels=clf_dt_pruned.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Does not have HD', 'Has HD'])
    disp.plot()
    plt.show()

    • @statquest
      @statquest  10 months ago

      BAM! Thank you. Also, I updated the jupyter notebook.

  • @creativeo91
    @creativeo91 3 years ago +4

    This video helped me a lot with my Data Mining assignment. Thank you!

    • @statquest
      @statquest  3 years ago +1

      Glad it helped!

  • @utkarshsingh2675
    @utkarshsingh2675 2 years ago +1

    This is what I have been looking for on YouTube... thanks a lot, sir!!

  • @joaomanoellins2219
    @joaomanoellins2219 4 years ago +25

    I loved your Brazil polo shirt! Triple bam!!! Thank you for your videos. Regards from Brazil!

    • @statquest
      @statquest  4 years ago +20

      Muito obrigado!!!

    • @cindinishimoto9528
      @cindinishimoto9528 4 years ago +2

      @@statquest paying homage to Brazil!!

    • @statquest
      @statquest  4 years ago +5

      @@cindinishimoto9528 Eu amo do Brasil!

  • @umairkazi5537
    @umairkazi5537 4 years ago +1

    Thank you very much. This video is very helpful and clears up a lot of concepts for me.

  • @juniotomas8563
    @juniotomas8563 4 months ago +1

    Come on, buddy! I just saw a recommendation for your channel, and in the very first video you're wearing a Brazilian t-shirt. Nice surprise!

    • @statquest
      @statquest  4 months ago

      Muito obrigado! :)

  • @JoRoCaRa
    @JoRoCaRa 1 year ago +1

    Brooo... this is insane!! Thanks so much! This is amazing, saving me so many headaches.

    • @statquest
      @statquest  1 year ago

      Glad it helped!

  • @filosofiadetalhista
    @filosofiadetalhista 2 years ago +1

    Loved it. I am working on Decision Trees on my job this week.

  • @liranzaidman1610
    @liranzaidman1610 4 years ago +2

    Fantastic, this is exactly what I needed

  • @douglasaraujo9763
    @douglasaraujo9763 4 years ago +1

    Your videos are always very good. But today I’ll have to commend you on your fashion choice as well. Great-looking shirt! I hope you have had the opportunity to visit Brazil.

    • @statquest
      @statquest  4 years ago

      Muito obrigado! Eu amo do Brasil! :)

  • @naveenagrawal_nice
    @naveenagrawal_nice 6 months ago +1

    Love this channel, Thank you Josh

    • @statquest
      @statquest  6 months ago

      Glad you enjoy it!

  • @rajatjain7465
    @rajatjain7465 1 year ago +1

    Wowowowwo, the best course ever, even better than all those paid quests. Thank you @Josh Starmer for these materials.

  • @sameepshah3835
    @sameepshah3835 1 month ago +1

    I love you so much Josh. Thank you so much for everything.

    • @statquest
      @statquest  1 month ago

      Thanks!

  • @bessa0
    @bessa0 2 years ago +1

    Kind Regards from Brazil. Loved your book!

  • @magtazeum4071
    @magtazeum4071 4 years ago +2

    BAM...!!! I'm getting notifications from your channel again

  • @jihowoo9667
    @jihowoo9667 4 years ago +1

    I really love your video, it helps me a lot!! Regards from China.

    • @statquest
      @statquest  4 years ago

      Awesome! Thank you!

  • @junaidmalik9593
    @junaidmalik9593 3 years ago

    Hi Josh, one amazing thing about the playlist is the song you sing before starting each video; it refreshes me. You know how to keep the listener awake for the next video, hehe. And really, thanks for the amazing explanations.

    • @statquest
      @statquest  3 years ago

      Awesome thank you!

  • @chaitanyasharma6270
    @chaitanyasharma6270 3 years ago +1

    I loved your video "Support Vector Machines in Python from Start to Finish", and this one too!!! Can you make more on different algorithms?

  • @simaykazc1508
    @simaykazc1508 3 years ago +1

    Josh is the best. I learned a lot from him!

    • @statquest
      @statquest  3 years ago

      Wow! Thank you!

  • @ramendrachaudhary9784
    @ramendrachaudhary9784 3 years ago +2

    We need to see you play some tabla to one of your songs. Double BAM!! Great content btw :)

    • @statquest
      @statquest  3 years ago +1

      Maybe one day!

  • @pratyushmisra2516
    @pratyushmisra2516 4 years ago +4

    My intro song for this channel:
    " It's like Josh has got his hands on python right,
    He teaches Ml and AI really Well and tight ---- STAT QUEST"
    btw thanks Brother for so much wonderful content for free.....

    • @statquest
      @statquest  4 years ago +1

      Thank you! :)

  • @alexyuan1622
    @alexyuan1622 3 years ago +1

    Hi Josh, thank you so much for this awesome posting! Quick question: when doing the cross validation, should cross_val_score() use [X_train, y_train] or [X_encoded, y]? I'm wondering, if the point of cross validation is to let each chunk of the data set serve as testing data, should we then use the full data set, X_encoded and y, for the cross validation? Thank you!!

    • @statquest
      @statquest  3 years ago +1

      There are different ideas about how to do this, and they depend on how much data you have. If you have a lot of data, it is common to hold out a portion of the data to only be used for the final evaluation of the model (after optimizing and cross validation) as demonstrated here. When you have less data, it might make sense to use all of the data for cross validation.
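
      A minimal sketch of the hold-out pattern described above (the variable names and random_state follow the video, the alpha value is purely illustrative; with very little data you might instead pass X_encoded and y straight to cross_val_score):

      from sklearn.model_selection import train_test_split, cross_val_score
      from sklearn.tree import DecisionTreeClassifier

      # Hold out a final test set first.
      X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=42)

      # Tune with cross validation on the training portion only.
      clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=0.016)  # illustrative alpha
      print(cross_val_score(clf_dt, X_train, y_train, cv=5).mean())

      # Touch the held-out test set exactly once, for the final evaluation.
      clf_dt.fit(X_train, y_train)
      print(clf_dt.score(X_test, y_test))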

    • @alexyuan1622
      @alexyuan1622 3 years ago +1

      @@statquest Thanks for the quick response. That makes perfect sense.

  • @sharmakartikeya
    @sharmakartikeya 3 years ago +1

    Hurray! I saw your face for the first time! Nice to see one of the people I've subscribed to.

  • @josephgan1262
    @josephgan1262 2 years ago

    Hi Josh, thanks for the video again!! I have some questions I hope you don't mind clarifying, regarding pruning and hyperparameter tuning in general. I see that the video does the following to find the best alpha:
    1) After the train/test split, find the best alpha by comparing test and training accuracy (single split). @50:32
    2) Recheck the best alpha by doing CV @52:33. There is huge variation in the accuracy, which implies that alpha is sensitive to the choice of training set.
    3) Redo the CV to find the best alpha by taking the mean accuracy for each alpha.
    a) At step 2, do we still need to plot the training set accuracy to check for overfitting? (It is always mentioned that we should compare training and testing accuracy to check for overfitting.) But there is debate on this as well: others argue that for a model A with 99%/90% training/test accuracy vs. a model B with 85%/85%, we should pick model A, because 90% testing accuracy is higher than 85%, even though model B has no gap (overfitting) between train and test. What's your thought on this?
    b) What if I skip steps 1) and 2) and go straight to step 3)? Is this bad practice? Do I still need to plot the training accuracy to compare with the test accuracy if I skip steps 1 and 2? Thanks.
    c) I always see that the final hyperparameter is decided on the highest mean accuracy across all k folds. Do we need to consider the impact of variance across the folds? Surely we don't want our accuracy to jump all over the place in production. If yes, what is the general rule of thumb for when the variance in accuracy is considered bad?
    Sorry for the long posting. Thanks!

    • @statquest
      @statquest  2 years ago

      a) Ultimately the optimal model depends on a lot of things - and often domain knowledge is one of those things - so there are no hard rules and you have to be flexible about the model you pick.
      b) You can skip the first two steps - those were just there to illustrate the need for using cross validation.
      c) It's probably a good idea to also look at the variation.
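
      On point (c), a short sketch of reporting the spread alongside the mean for each alpha (this assumes the video's X_train/y_train and a ccp_alphas array from cost_complexity_pruning_path):

      from sklearn.model_selection import cross_val_score
      from sklearn.tree import DecisionTreeClassifier

      for alpha in ccp_alphas:
          scores = cross_val_score(
              DecisionTreeClassifier(random_state=42, ccp_alpha=alpha),
              X_train, y_train, cv=5)
          # A high mean with a large std can be riskier in production than a
          # slightly lower mean with a small std.
          print(f"alpha={alpha:.4f}  mean={scores.mean():.3f}  std={scores.std():.3f}")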

  • @avramdagoat
    @avramdagoat 9 months ago +1

    great insight and refresher, thank you for documenting

    • @statquest
      @statquest  9 months ago

      Glad you enjoyed it!

  • @teetanrobotics5363
    @teetanrobotics5363 3 years ago +1

    Amazing, man. I love your channel. Could you please put this video, SVMs, and XGBoost in the correct order in the playlist?

  • @SaurabhKumar-mr7lx
    @SaurabhKumar-mr7lx 4 years ago +1

    Hi Josh, I see that in sklearn all the tree-based ensemble algorithms have ccp_alpha as a tuning parameter. Is it advisable to tune it, and is it even feasible to do so for hundreds of trees (especially when the trees are randomly created)? Or should we tune standard parameters like the learning rate, number of trees, loss function, etc.?

    • @statquest
      @statquest  4 years ago +1

      In this video I tune ccp_alpha (starting at 46:31). It spares us the agony of tuning a lot of separate parameters.

    • @SaurabhKumar-mr7lx
      @SaurabhKumar-mr7lx 4 years ago

      @@statquest Just wondering if it is possible to tune this for a random forest, since we are creating hundreds of trees with randomly selected features for every tree. As far as I understand, ccp is a tree-specific parameter. Please give some insight on this in your next session. Hope my query is relevant 🙂

    • @statquest
      @statquest  4 years ago +2

      @@SaurabhKumar-mr7lx With Random Forests, the goal for each tree is different than when we just want a single decision tree. For Random Forest trees, we actually do not want an optimal tree. We only want something that gets it correct a little more than 50% of the time. So in this case, we just limit the tree depth to 3 or 4 or something like that, rather than optimize each tree with cost complexity pruning.

    • @SaurabhKumar-mr7lx
      @SaurabhKumar-mr7lx 4 years ago +1

      @@statquest got it ....... Thanks for explaining this Josh.
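
      A tiny sketch of the contrast described above (the parameter values are illustrative, not from the video): a tree meant to stand on its own gets pruned with ccp_alpha, while the many weak trees in a forest are simply kept shallow:

      from sklearn.tree import DecisionTreeClassifier
      from sklearn.ensemble import RandomForestClassifier

      # A single tree: optimize it with cost complexity pruning.
      single_tree = DecisionTreeClassifier(ccp_alpha=0.016, random_state=42)

      # A forest of weak trees that vote: cap the depth instead of pruning each one.
      forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)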

  • @Nico.75
    @Nico.75 3 years ago

    Hi Josh, such an awesome, helpful video, again! May I ask a basic question? If I do an initial decision tree build using a train/test split and evaluate the training and test accuracy scores, and then start over doing k-fold cross validation on the same training set and evaluate on the same test set as in the initial step, is that a proper method? Because I used the same test set for evaluation twice: first with the initial train/test split, and second with the cross validation. I read that you should use your test (or hold-out) set only once… One last question: should you use exactly the same training/test sets when comparing different algorithms (decision trees, random forests, logistic regression, kNN, etc.)? Thanks so much for a short feedback, and quest on! Thanks and BAM!!!

    • @statquest
      @statquest  3 years ago

      Yes, I think it's OK to use the same testing set to compare the model before optimization and after optimization.
      Ideally, if you are comparing different algorithms, you will use cross validation and pick the one that has the best, on average, score. Think of picking an algorithm like picking a hyperparameter.
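
      A sketch of "picking an algorithm like picking a hyperparameter" (the variable names follow the video; logistic regression is wrapped in a scaling pipeline because it usually benefits from standardized inputs):

      from sklearn.model_selection import cross_val_score
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      candidates = {
          'decision tree': DecisionTreeClassifier(random_state=42),
          'logistic regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
      }
      for name, model in candidates.items():
          # Pick the algorithm with the best average cross-validation score.
          print(name, cross_val_score(model, X_train, y_train, cv=5).mean())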

  • @mcmiloy3322
    @mcmiloy3322 3 years ago

    Really nice video. I thought you were actually going to implement the tree classifier itself, which would have been a real bonus but I guess that would have taken a lot longer.

  • @estebannantes8567
    @estebannantes8567 4 years ago +1

    Hi Josh. Loved this video. I have two questions: 1) Is there any way to save our final decision tree model to use later on unseen data without having to train it all again? 2) Once you have decided on your final alpha, why not train your tree on the full, unsplit dataset? I know you will not be able to generate a confusion matrix, but wouldn't your final tree be better if it were trained with all the examples?

    • @statquest
      @statquest  4 years ago +1

      Yes and yes. You can write the decision tree to a file if you don't want to keep it in memory (or want to back it up). See: scikit-learn.org/stable/modules/model_persistence.html
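
      For the first question, a minimal persistence sketch using joblib, one of the options on that scikit-learn page (clf_dt_pruned and X_test follow the video's names; the filename is made up):

      import joblib

      joblib.dump(clf_dt_pruned, 'clf_dt_pruned.joblib')   # save the fitted tree to disk
      clf_loaded = joblib.load('clf_dt_pruned.joblib')     # reload it later...
      print(clf_loaded.predict(X_test))                    # ...and predict without retraining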

  • @amalsakr1381
    @amalsakr1381 5 months ago +1

    Thank you for your powerful tutorial.

    • @statquest
      @statquest  5 months ago

      Glad it was helpful!

  • @srmsagargupta
    @srmsagargupta 3 years ago +1

    Thank you Sir for this wonderful webinar

  • @bardhrushiti184
    @bardhrushiti184 4 years ago

    Great video - thanks for sharing such valuable content.
    I have a question regarding the alpha/accuracy graph: in my dataset, the training and testing accuracies are relatively close (~100% and ~98%, respectively), and after plotting accuracy vs. alpha for training and testing, it seems that as alpha increases, the accuracy decreases as well. At alpha = 0, the accuracy is ~100% (train) and ~98% (test); at alpha = 0.011, it is ~92.5% (train) and ~92.1% (test); and it keeps decreasing. Should I still consider pruning with alpha, even though it seems that the model is doing okay?
    Thank you in advance!
    Keep posting awesome videos!

    • @statquest
      @statquest  4 years ago

      If the full sized tree performs best on your testing data, then you don't need to prune.

  • @_ahahahahaha9326
    @_ahahahahaha9326 2 years ago +1

    Really learn a lot from you

  • @fernandosicos
    @fernandosicos 2 years ago +1

    Greetings from Brazil!

    • @statquest
      @statquest  2 years ago

      Muito obrigado! :)

  • @amc9520
    @amc9520 1 year ago +1

    Thanks for making my life easy.

  • @danielw7626
    @danielw7626 3 years ago

    Hi Josh, thanks for your clear explanation; it's very helpful. One quick question: do we need to delete one column after performing one-hot encoding to avoid the dummy variable trap? Thank you in advance if you can clarify this for me, as I only started learning ML a month ago. Cheers

    • @statquest
      @statquest  3 years ago

      Not in this case.

  • @mahdimj6594
    @mahdimj6594 4 years ago +1

    Neural Network Pleaseee, Bayesian and LARS as well. And Thank you. You actually make things much easier to understand.

    • @statquest
      @statquest  4 years ago +1

      Thanks! :)

  • @pfunknoondawg
    @pfunknoondawg 3 years ago +1

    Wow, this is super helpful!

    • @statquest
      @statquest  3 years ago

      Glad you think so!

  • @floral7448
    @floral7448 3 years ago +1

    Finally have the honor to see Josh :)

  • @aalaptube
    @aalaptube 1 year ago

    You mentioned sklearn is not great for a lot of data. In terms of data size, how much is a lot: 1, 10, 100 GB? For those cases, what are the options?
    Also, what does the function cost_complexity_pruning_path do? How does it build the array of alpha values? In the other StatQuest video on using α, we just checked some specific values...

    • @statquest
      @statquest  1 year ago

      The answer to the first question depends on how much time you have on your hands. As for the second question, see: scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
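
      In short, cost_complexity_pruning_path() computes every "effective alpha" at which weakest-link pruning would remove another subtree, so you only need to test alphas that actually change the tree. A sketch, assuming the video's variable names:

      from sklearn.tree import DecisionTreeClassifier

      clf_dt = DecisionTreeClassifier(random_state=42)
      path = clf_dt.cost_complexity_pruning_path(X_train, y_train)
      ccp_alphas = path.ccp_alphas    # candidate alphas, from 0 up to the alpha that prunes everything
      impurities = path.impurities    # total leaf impurity of the pruned tree at each alpha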

  • @abdelrhmansayed5436
    @abdelrhmansayed5436 2 years ago +1

    Thank you for your great effort and simple explanation. I have only one question: why did you split the data into X_train and y_train and then give those to cross_val_score? Shouldn't cross validation work on all of X?

    • @statquest
      @statquest  2 years ago

      In theory we are trying to save some data for a final validation of the model.

  • @anbusatheshkumarpalanisamy8798
    @anbusatheshkumarpalanisamy8798 4 years ago

    Hi Josh, how are we getting 132 samples in the left node of the final tree? Shouldn't it be 118 from the root node?

    • @statquest
      @statquest  4 years ago

      132 of the samples (both those with heart disease and those without heart disease) have ca values < 0.5; the rest, with ca values > 0.5, go to the right. For more details, see: th-cam.com/video/7VeUPuFGJHk/w-d-xo.html

  • @paulovinicius5833
    @paulovinicius5833 3 years ago +1

    I know I'll love all the content, but I started liking the video immediately because of the music! haha

    • @statquest
      @statquest  3 years ago

      Thank you! :)

  • @sabyasachidas142
    @sabyasachidas142 2 years ago

    Thanks, Josh, for the awesome tutorial. I have one question: when one-hot encoding, we usually also pass drop_first=True as an argument to avoid multicollinearity when performing regression. But we didn't do that for this classification problem. Is it not required?

    • @statquest
      @statquest  2 years ago

      Multicollinearity is a problem with regression, but not with tree based methods.
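
      A small sketch of the point (hedged: the column list is an assumption, not a quote from the video): for trees, keep all the one-hot columns, since drop_first matters mainly for regression-style models:

      import pandas as pd

      # Keep every category column for a tree; the splits stay easy to read.
      X_encoded = pd.get_dummies(X, columns=['cp', 'restecg', 'slope', 'thal'])

      # For linear or logistic regression you might instead drop one level per category:
      # X_encoded = pd.get_dummies(X, columns=['cp', 'restecg', 'slope', 'thal'], drop_first=True)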

  • @Theviswanath57
    @Theviswanath57 3 years ago

    In the accompanying theory videos, you mentioned that to compute ccp_alphas we are supposed to use the full data?

    • @statquest
      @statquest  3 years ago

      We use the full testing dataset.

  • @dineshmuniandy9519
    @dineshmuniandy9519 4 years ago

    Hi Josh, is it possible to apply cost complexity pruning to regression problems (where the predicted target is continuous)? What modifications to the code are required?

    • @statquest
      @statquest  4 years ago +1

      Cost complexity pruning works great with regression trees. Here's a video that describes it: th-cam.com/video/D0efHEJsfHo/w-d-xo.html

    • @dineshmuniandy9519
      @dineshmuniandy9519 4 years ago

      @@statquest Just a quick question. What should this line be changed to:
      df = pd.DataFrame(data={'tree': range(5), 'accuracy': scores})
      if the target is not a class with 5 different ordinal values, but rather continuous?
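
      For anyone with the same question: that line just labels the 5 cross-validation folds, so it can stay as it is for a regressor; only the model and the meaning of the score change (cross_val_score uses R^2 by default for regressors). A hedged sketch, reusing the video's X_train/y_train names with a made-up alpha:

      import pandas as pd
      from sklearn.tree import DecisionTreeRegressor
      from sklearn.model_selection import cross_val_score

      reg = DecisionTreeRegressor(ccp_alpha=0.01, random_state=42)   # hypothetical alpha
      scores = cross_val_score(reg, X_train, y_train, cv=5)          # R^2 per fold by default
      df = pd.DataFrame(data={'tree': range(5), 'r2': scores})       # 'tree' is just the fold index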

  • @korcankomili7398
    @korcankomili7398 1 year ago +1

    I wish you were my uncle, Josh, or something.
    I can imagine how hard I would have lobbied my parents to spend time with my TRIPLE cool uncle.

  • @beibeima524
    @beibeima524 2 years ago

    Hi Josh, thanks so much for the video! My question is: should we do one-hot encoding before or after splitting the data into training and testing sets? Thanks!

    • @statquest
      @statquest  2 years ago

      As long as all categories are in both sets, it doesn't matter.

  • @michelchaghoury870
    @michelchaghoury870 2 years ago +1

    MANNNN, so useful, please keep going!

  • @khashayarsalehi6779
    @khashayarsalehi6779 2 years ago

    Thanks for this great tutorial! I have a question, though: I tried a decision tree regressor, but in the end the pruned tree returns the same high value, far out of range, for all inputs! Also, the accuracy for the train and test sets decreases as alpha increases! Can you help me understand how the tree can return the same unreasonable value for all inputs?

    • @statquest
      @statquest  2 years ago

      Unfortunately I don't have time to help you with your code... :(

    • @khashayarsalehi6779
      @khashayarsalehi6779 2 years ago +1

      @@statquest You've already helped me with your awesome tutorial! It's OK :) Triple BAM!

  • @krishanudebnath1959
    @krishanudebnath1959 2 years ago +1

    Love the tabla and your content.

    • @statquest
      @statquest  2 years ago

      Thanks! My father used to teach at IIT-Madras so I spent a lot of time there when I was young.

  • @TheKukun123
    @TheKukun123 3 years ago

    When I want to plot the confusion matrix, the following error occurs at the import stage: ImportError: cannot import name 'plot_confusion_matrix' from 'sklearn.metrics' (C:\Users\hp\Anaconda3\lib\site-packages\sklearn\metrics\__init__.py).
    What do I do to rectify this?

    • @statquest
      @statquest  3 years ago

      Presumably you need to install sklearn.metrics. You can do this with: "conda install scikit-learn"
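
      A note for later readers (based on scikit-learn's changelog, not on this thread): plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2, so on newer installs the import fails regardless. The ConfusionMatrixDisplay approach shown in an earlier comment is the replacement, e.g.:

      from sklearn.metrics import ConfusionMatrixDisplay

      # clf_dt_pruned, X_test, and y_test follow the video's names.
      ConfusionMatrixDisplay.from_estimator(clf_dt_pruned, X_test, y_test)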

  • @cageman301
    @cageman301 2 years ago

    I understand that Logistic Regression requires continuous variables to be separated into bins and coarse-classed so that the final model is built with binary variables only. Does this apply to Decision Trees as well?

    • @statquest
      @statquest  2 years ago

      If you want to learn more about logistic regression, and how it works, see: th-cam.com/play/PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe.html and if you'd like to learn more about decision trees, see: th-cam.com/video/_L39rN6gz7Y/w-d-xo.html

  • @xenofon939
    @xenofon939 2 years ago

    Can we hypertune the other parameters (not only alpha) with grid search (extra cross validation)? Maybe we can optimize the tree to work even better.
    So we could have the cross validation for alpha, plus a grid search over max_leaf_nodes, gini/entropy, and the min_samples settings.
    Thanks in advance
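
    This question went unanswered in the thread; a sketch of the idea with GridSearchCV (the grid values below are made up, and every combination, alpha included, is cross validated):

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'ccp_alpha': [0.0, 0.01, 0.02],         # hypothetical candidates
        'criterion': ['gini', 'entropy'],
        'max_leaf_nodes': [None, 10, 20],
        'min_samples_leaf': [1, 5, 10],
    }
    grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)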

  • @shindepratibha31
    @shindepratibha31 4 years ago

    I have almost completed the machine learning playlist and it was really helpful. One request: can you please make a short video on handling imbalanced datasets?

    • @statquest
      @statquest  4 years ago

      I've got a rough draft on that topic here: th-cam.com/video/iTxzRVLoTQ0/w-d-xo.html

  • @GokulSKumar-uz9dy
    @GokulSKumar-uz9dy 4 years ago +1

    Great video, sir :)
    I just have a doubt about one part: at 52:14, instead of using X_train and y_train, aren't we supposed to use the entire dataset (i.e., X_encoded and y) when implementing cross-validation?
    Also, later in the video at 52:54, the value for alpha was found by using only the X_train and y_train data in the cross-validation.

    • @statquest
      @statquest  4 years ago +1

      There are different schools of thought about what datasets you should use for cross validation. The most common one, however, is to do it as presented in this video.

    • @GokulSKumar-uz9dy
      @GokulSKumar-uz9dy 4 years ago

      @@statquest Thanks a lot!
      Just in case it might be useful: I tried using the entire dataset in cross-validation before splitting it into train and test. I could see from the corresponding confusion matrix that the model correctly predicted 90% of the people without heart disease, whereas there was no increase in the percentage for people with heart disease.
      Again, loved the video a lot. Waiting for the next webinar :)

    • @KeigoEdits
      @KeigoEdits 2 years ago

      @@statquest Hey Josh, after reading this comment I actually tried cross-validation with the whole dataset, since you mentioned above that we could use the whole dataset when data is scarce, and I personally think 297 data points counts as small. This gave me better results at alpha = 0.021097, varying between 0.74 and 0.88. What are your views on it?

    • @statquest
      @statquest  2 years ago

      @@GokulSKumar-uz9dy It really depends on how noisy your data is and what you hope to do with it.

  • @hanaj4870
    @hanaj4870 3 years ago +1

    Thank you sir!! Best ever!!!! BAM!!

    • @statquest
      @statquest  3 years ago

      Thank you very much! :)

  • @bressanini
    @bressanini 2 years ago +1

    Hey Josh, follow this equation: You + Brazilian Flag Polo Shirt + Awesome Content = TRIPLE BAM!!!

    • @statquest
      @statquest  2 years ago +1

      Muito bem! :)

  • @prnv5
    @prnv5 2 years ago

    Hi Josh! I'm an HS student trying to learn ML algorithms, and your videos are genuinely my saving grace. They're so concise, information-dense, and educational. I understand concepts perfectly through your StatQuests, and I'm really grateful for that.
    One quick question: is the algorithm used here to build the decision tree the CART algorithm? I'm writing a paper on the CART algorithm and would like to confirm. Thanks again!

    • @statquest
      @statquest  2 years ago +1

      Yes, this is the "classification tree" in CART.

    • @prnv5
      @prnv5 2 years ago +1

      @@statquest Thank you so much 🥰

  • @Phil36ful
    @Phil36ful 2 years ago

    Hi,
    which algorithm was used here to make the tree?
    Is it ID3?
    Is there any guide for C4.5?

    • @statquest
      @statquest  2 years ago

      This used CART, which I describe here: th-cam.com/video/_L39rN6gz7Y/w-d-xo.html

  • @6223086
    @6223086 3 years ago

    Hi Josh, I have a question: at 1:01:03, if we interpret the tree, on the right split from the root node we first went from a node with a Gini score of 0.346 (cp_4.0

    • @statquest
      @statquest  3 years ago +1

      For each split we calculate the Weighted Average of the individual Gini scores for each leaf and we pick the one with the lowest weighted average. In this case, although the leaf on the left has a higher Gini score than the node above it, it has fewer samples, 31, than the leaf on the right, which has a much lower Gini score, 0.126, and more samples, 59. If we calculate the weighted average of the Gini scores for these two leaves it will be lower than the node above them.
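
      To make that arithmetic concrete (31, 59, and 0.126 are the numbers quoted above; the left leaf's Gini is not given, so 0.45 is a made-up stand-in):

      n_left, gini_left = 31, 0.45      # hypothetical Gini for the left leaf
      n_right, gini_right = 59, 0.126   # numbers quoted in the reply
      weighted = (n_left * gini_left + n_right * gini_right) / (n_left + n_right)
      print(weighted)   # ~0.238, lower than the 0.346 of the node above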

  • @TalesLimaFonseca
    @TalesLimaFonseca 2 years ago +1

    Man, you are awesome! Vai BRASIL!!!

    • @statquest
      @statquest  2 years ago

      Muito obrigado!

  • @BeSharpInCSharp
    @BeSharpInCSharp 4 years ago

    Do we have any video on k-fold to generate random training and testing set?

    • @statquest
      @statquest  4 years ago

      I'm not sure what you're asking, but we have a video on cross validation: th-cam.com/video/fSytzGwwBVw/w-d-xo.html

  • @bjornlarsson1037
    @bjornlarsson1037 4 years ago

    Absolutely amazing work, Josh! You are definitely the best guy on the internet teaching this stuff! Just a question on reproducibility when using get_dummies vs. other methods of encoding. I used make_column_transformer together with make_pipeline. My pruned tree was different in that the node variables were different, but the numbers (cutoffs, Ginis, samples, values, class) were identical. I also got small differences in other places compared with your results. Given that I followed along with your code (and used the same random states as you did), should I get exactly the same results as you did (assuming I haven't made any errors, of course), or is it possible that results differ between methods? Thanks again, Josh!

    • @statquest
      @statquest  4 years ago

      It should be the same. Hmm... This is an interesting problem.

    • @bjornlarsson1037
      @bjornlarsson1037 4 years ago

      @@statquest Okay, I have now at least figured out why the pruned tree is different: the column names were out of order, because apparently make_column_transformer puts the dummy columns at the beginning of the dataset instead of at the end like get_dummies. But there are still differences: the last confusion matrix is identical to yours, but the first confusion matrix is slightly different, even though I called the methods in exactly the same way on both of them. Since you said in your reply that we should get identical results, it must be something I have done differently from you on the first one, but I can't really see what right now.

  • @patite3103
    @patite3103 3 years ago

    Thank you for this video! Would it be possible to do a similar video with random forests and regression trees?

    • @statquest
      @statquest  3 years ago

      I don't like the random forest implementation in Python. Instead, if you're going to use random forests, you should do it in R. And I have a video for that: th-cam.com/video/6EXPYzbfLCE/w-d-xo.html

  • @julescesar4779
    @julescesar4779 3 years ago +1

    Thank you so much, sir, for sharing.

  • @user-qo7bz5em3u
    @user-qo7bz5em3u 1 year ago

    Great tutorial! But unfortunately, I'm struggling at minute 48. How can it be that I get a negative ccp_alpha of -2.168404344971009e-19? The y values are 0 or 1 and all X values are positive. Does anyone have an idea what the reason could be?

    • @statquest
      @statquest  1 year ago

      Are you using my data and my code?
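
      A hedged note for anyone hitting this (not from the thread itself): an alpha of roughly -2e-19 is floating-point round-off of zero, and clamping the array before fitting one tree per alpha is a common workaround:

      import numpy as np

      # ccp_alphas comes from cost_complexity_pruning_path(); tiny negatives are round-off.
      ccp_alphas = np.clip(ccp_alphas, 0.0, None)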

  • @mrlfcynwa
    @mrlfcynwa 3 years ago

    Thanks for this! Just one quick piece of feedback: it would've been great had you touched upon how to interpret the leaves of the decision tree.

  • @abdelrazzaqabuhejleh6625
    @abdelrazzaqabuhejleh6625 7 months ago

    Thank you for this valuable explanation :D
    I have a question, though: what do we learn from the graph at 51:48?

    • @statquest
      @statquest  7 months ago

      This shows how different trees trained with different subsets of data have different accuracies.
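
      For reference, a sketch of how a graph like that can be drawn (the variable names and the alpha value are assumed from the video): 5-fold cross validation of one candidate tree, one accuracy point per fold:

      import pandas as pd
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.model_selection import cross_val_score

      clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=0.016)  # illustrative alpha
      scores = cross_val_score(clf_dt, X_train, y_train, cv=5)           # one score per fold
      df = pd.DataFrame(data={'tree': range(5), 'accuracy': scores})
      df.plot(x='tree', y='accuracy', marker='o', linestyle='--')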