Support Vector Machines in Python from Start to Finish.

  • Published 7 Nov 2024

Comments • 375

  • @statquest
    @statquest  4 years ago +23

    NOTE: At 31:25 we should use the mean and standard deviation from the training dataset to center and scale the testing data. The updated jupyter notebook reflects this change.
    ALSO NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: statquest.gumroad.com/l/iulnea
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
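The pinned correction above can be sketched like this; a minimal example with made-up data standing in for the video's X_train / X_test:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for the video's X_train / X_test.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit the scaler on the TRAINING data only...
scaler = StandardScaler().fit(X_train)

# ...then reuse the training mean and standard deviation for BOTH datasets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```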

    • @Dani-hh3qd
      @Dani-hh3qd 2 years ago

      By scaling do you mean data normalization?

    • @statquest
      @statquest  2 years ago

      @@Dani-hh3qd Normalization is a specific type of scaling.
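To make the distinction concrete, here is a minimal sketch (hypothetical one-column data) contrasting min-max normalization with the standardization used in the video:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A hypothetical one-column dataset.
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Normalization (min-max scaling): squeezes values into [0, 1].
normalized = MinMaxScaler().fit_transform(x)

# Standardization (what scale() does): mean 0, standard deviation 1.
standardized = StandardScaler().fit_transform(x)
```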

  • @baomiTV
    @baomiTV 3 years ago +64

    After eight years of employment after graduation, I got laid off in 2020. I went back to school to pursue my second master's in Data Science. I was still confused after my machine learning classes, but your videos, which covered the same topics as my classes, led me into a totally different world. The same concepts were taught by you in a much easier way. BAM!!!

    • @statquest
      @statquest  3 years ago +7

      I'm glad my videos are helpful! :)

    • @aurkom
      @aurkom 3 years ago +1

      A 2nd master's?
      How much has the curriculum changed in the past 8 years?

  • @mucahitdemirc
    @mucahitdemirc 3 years ago +48

    I will definitely donate to this channel as soon as I get a job! Thanks.

    • @statquest
      @statquest  3 years ago +2

      Thank you very much! :)

  • @ivnesapple479
    @ivnesapple479 3 years ago +5

    I really appreciate your slow speaking speed, which makes it possible for non-native English speakers like me (I'm Chinese) to learn.

  • @kenricktan5271
    @kenricktan5271 3 years ago +11

    I'm so happy to find out that saying BAM + DOUBLE BAM comes naturally to you (and was not just for the videos). Amazing walkthrough as usual, Josh!

    • @statquest
      @statquest  3 years ago +7

      Triple bam! :)

  • @t.t.cooperphd5389
    @t.t.cooperphd5389 4 years ago +60

    455 likes and 0 dislikes.... that's a double BAM!

  • @randommcranderson5155
    @randommcranderson5155 4 years ago +29

    You're a pretty amazing nerd, I love it. This is an amazing tutorial.

    • @statquest
      @statquest  4 years ago

      Thanks! 😃

  • @jack.1.
    @jack.1. 3 years ago +4

    Really amazing video. I've been in and around data science and ML for a while, but this is the first time I feel like I've gone the full way from mathematical concept -> working program (using medium-complexity ML methods) -> insight/question answered.

    • @statquest
      @statquest  3 years ago

      Glad you enjoyed it!

  • @lucianotarsia9985
    @lucianotarsia9985 4 years ago +4

    Hi from Argentina.
    Great video! It really was from start to finish; it covers every step with dedication.
    Thanks for sharing your knowledge!

    • @statquest
      @statquest  4 years ago

      Thank you very much! :)

  • @forrest404
    @forrest404 1 year ago +2

    I love this kind of webinar where you teach in real time and go through concrete examples. I just purchased the material package and can't wait to go through it with you. I hope you'll make more content like this in the future 😊 (I love the short and sweet vids too, but I learn by doing, so this helps solidify all the theory stuff!)

    • @statquest
      @statquest  1 year ago +1

      Thank you, and thank you for your support!

  • @ionut5316
    @ionut5316 3 years ago +1

    I purchased the notebook and I also watched the whole ad so you can make more money.

    • @statquest
      @statquest  3 years ago

      Thank you so much for your support! It means a lot to me. BAM! :)

  • @ashishgoyal4958
    @ashishgoyal4958 3 years ago +8

    Thank you so much for making this amazing code walkthrough for SVM. Looking forward to more code walkthroughs like this.

    • @statquest
      @statquest  3 years ago +2

      You're very welcome!

  • @RahulEdvin
    @RahulEdvin 4 years ago +6

    Josh, you are phenomenal! Love and respect from Madras!

    • @anjalivijay9577
      @anjalivijay9577 4 years ago +2

      Respect from Kerala too

    • @sidharths9416
      @sidharths9416 4 years ago +2

      @@anjalivijay9577 adhaan💥💪

    • @statquest
      @statquest  4 years ago +3

      Hooray!!! Thanks! :)

    • @anjalivijay9577
      @anjalivijay9577 4 years ago +2

      @@statquest 🤩🤩🤩🤩

    • @sidharths9416
      @sidharths9416 4 years ago +2

      @@statquest BAAAAM

  • @khaganieynullazada2794
    @khaganieynullazada2794 3 years ago +4

    Again, great work Josh, thanks so much. I actually worked at UNC-Chapel Hill, but I discovered you after moving to another university. I hope to meet you one day to thank you in person for the amazing content you are creating.

    • @statquest
      @statquest  3 years ago

      Wow! Thank you very much! :)

  • @NaumRusomarov
    @NaumRusomarov 1 year ago

    SVMs are kind of my favorite thing in ML. Very simple and mathematically concise, yet highly usable.

  • @moona5454
    @moona5454 4 years ago +1

    I am not an expert, but here's a small tip for everyone ^_^ : if you want to find the missing values very easily, you can type
    dataframe.isnull().sum() ; dataframe is the name of the object containing the data.
    And thank you Josh for the amazing webinar ♥
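The tip above, run on a tiny hypothetical DataFrame (the column names here are made up):

```python
import numpy as np
import pandas as pd

# A tiny hypothetical DataFrame with some missing values.
dataframe = pd.DataFrame({"LIMIT_BAL": [20000, np.nan, 90000],
                          "AGE": [24, 26, np.nan]})

# Count the missing values in each column.
missing_per_column = dataframe.isnull().sum()
```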

  • @swalehomar3753
    @swalehomar3753 3 years ago +3

    This is amazing! I'm in love with your approach to handling this stuff. Very clear and concise.

    • @statquest
      @statquest  3 years ago

      Thank you! :)

  • @roni123467
    @roni123467 4 years ago +5

    Really detailed and nice lesson! I liked how thorough the explanations were; it is definitely DOUBLE BAM worthy!
    Thank you.

    • @statquest
      @statquest  4 years ago

      Glad you enjoyed it!

  • @aktasberk7
    @aktasberk7 1 year ago +1

    Thank you Josh, this taught me a good lesson on both PCA and SVM. Great work!

  • @liangke4276
    @liangke4276 3 years ago +1

    Your videos deserve to be translated into more languages so people who don't speak English can also learn from your amazing content.

  • @tolga1292
    @tolga1292 2 years ago +11

    You, Sir, are an outstanding educator.

    • @statquest
      @statquest  2 years ago +1

      Thank you!

  • @anikar1302
    @anikar1302 3 years ago +3

    I always love those musical intros.

  • @bosepukur
    @bosepukur 4 years ago +1

    Josh, you are an inspiration in teaching... Please keep it up!

    • @statquest
      @statquest  4 years ago

      Thank you! :)

  • @alkapandey1008
    @alkapandey1008 4 years ago +2

    You are amazing. Keep posting. Best wishes from India.

    • @statquest
      @statquest  4 years ago

      Thank you very much! :)

  • @cool_sword
    @cool_sword 3 years ago +1

    You already get a lot of love, but I have to add to it and tell you how great these are. No joke, I've had nights when I plan on watching some TV or some movies and I decide to check out some 'Quests instead!

    • @statquest
      @statquest  3 years ago

      BAM! Thank you very much! :)

  • @kappa7072
    @kappa7072 3 years ago +2

    Josh, you are wonderful! Thanks a million from Italy!

    • @statquest
      @statquest  3 years ago

      Thank you very much!!!

  • @md.nazrulislamsiddique7492
    @md.nazrulislamsiddique7492 2 years ago

    Your video is so awesome. Everything related to SVM in one video, BAM.

    • @statquest
      @statquest  2 years ago

      Glad you liked it!

  • @konstantinlevin8651
    @konstantinlevin8651 1 year ago +1

    I've reread "The Hitchhiker's Guide to the Galaxy" (the first time I read it I was 12) and now it makes a lot more sense why the random state is 42 :)))

  • @nick9198
    @nick9198 2 years ago +1

    Your dedication is unreal; you replied to all the comments. Wow!
    p.s. thanks for the video

  • @GaMiNGYT-dc2cf
    @GaMiNGYT-dc2cf 3 years ago +1

    This guy doesn't deserve to have the dislike button on his videos... what a clear explanation!!!

    • @statquest
      @statquest  3 years ago

      Awesome! Thank you very much! :)

  • @vram11
    @vram11 3 years ago +1

    Precise and to the point. Love this, and I am definitely going to extend my support to you.

    • @statquest
      @statquest  3 years ago

      Thank you! :)

  • @elvsrbad2
    @elvsrbad2 4 years ago +1

    This video came out the same week I decided to learn this. Get out of my head!

  • @Jannerparejagutierrez
    @Jannerparejagutierrez 6 months ago +1

    Thank you very much for the video!
    I have a question: in SVM, should the variables only be numeric, or does it also support text?
    Thank you!

    • @statquest
      @statquest  6 months ago

      Only numeric

    • @statquest
      @statquest  6 months ago

      Hooray! :)

  • @KomangWahyuTrisna
    @KomangWahyuTrisna 4 years ago +1

    I learned a lot from your channel. I am a big fan of yours. Looking forward to your deep learning and NLP tutorials with Python.

    • @statquest
      @statquest  4 years ago

      Awesome, thank you!

  • @cmpunk3367
    @cmpunk3367 2 years ago

    Thanks for the brilliant tutorial Josh! You are truly an inspiration.
    I just had two questions:
    1) You applied a regularization technique here by finding the right value for C. What kind of regularization is this? L1, L2, or L1&L2?
    2) Is it possible to apply L1, L2, and elastic net regularization to SVMs? If yes, how should I do it?

    • @statquest
      @statquest  2 years ago

      C controls an L2 penalty. I think that might be the only regularization you can use with the scikit-learn SVM.

    • @cmpunk3367
      @cmpunk3367 2 years ago +2

      @@statquest Yes I read the documentation of scikit-learn svm and the only other penalty allowed is L1.
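A sketch of the two penalties discussed above, on made-up data. Note that in scikit-learn the L1 option lives in LinearSVC (linear kernel only, with dual=False); SVC itself only exposes the C-controlled penalty:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Made-up classification data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# SVC: C is the inverse strength of the (L2-style) regularization.
l2_svm = SVC(kernel="rbf", C=1.0).fit(X, y)

# A linear SVM can instead use an L1 penalty, which drives some
# coefficients to exactly zero (implicit feature selection).
l1_svm = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=5000).fit(X, y)
```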

  • @joxa6119
    @joxa6119 2 years ago

    The most perfect guide to SVM on YouTube. I will donate after I get my first job! Thank you so much.
    By the way, I have a question: why don't you use PCA before doing the modelling part? Is PCA only used for visualization?

    • @statquest
      @statquest  2 years ago

      In this case, we only use PCA for visualization.

    • @joxa6119
      @joxa6119 2 years ago

      @@statquest I see, but as far as I know it can reduce accuracy while helping to avoid multicollinearity. But because we have used OneHotEncoder, multicollinearity will not occur. Am I right?

    • @statquest
      @statquest  2 years ago

      @@joxa6119 Using PCA first would definitely reduce multicollinearity if that was something we thought we needed to deal with. Multicollinearity usually means that we have 2 or more highly correlated features (also called variables), and thus, they are somewhat redundant. One-hot-encoding will not change the fact that those variables are redundant.
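The point about PCA and multicollinearity can be seen directly: the principal components are uncorrelated by construction. A small sketch with two hypothetical, highly correlated features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two hypothetical, highly correlated (multicollinear) features.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([a, a + rng.normal(scale=0.1, size=500)])

X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# After PCA, the components are uncorrelated (zero up to float error).
component_corr = np.corrcoef(X_pca, rowvar=False)[0, 1]
```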

  • @ramakdixit8648
    @ramakdixit8648 4 years ago +1

    Wow. Thanks Josh. Your videos are always a go-to resource.

  • @steelcitysi
    @steelcitysi 4 years ago +1

    You are awesome. I hope you do something on NLP (tf-idf, word2vec, etc.); for some reason your style was made for my brain.

  • @shwetaredkar734
    @shwetaredkar734 2 years ago +1

    Triple BAM!! Guess what?? You are the best teacher I've ever come across. My life is saved. Good to know you play tabla too.

    • @statquest
      @statquest  2 years ago

      Thank you very much!!! :)

  • @deojeetsarkar2006
    @deojeetsarkar2006 4 years ago +1

    Good to see no haters for the saintly man.

  • @pareshnavalakha7127
    @pareshnavalakha7127 4 years ago +5

    Hope to listen to the tablas behind you at the start of your training one day.

  • @irmaktekin3287
    @irmaktekin3287 4 years ago +2

    Thanks! I really like the way you explain things: calm and simple :)

    • @statquest
      @statquest  4 years ago

      Thank you! :)

  • @zahrasoltani8630
    @zahrasoltani8630 4 years ago +2

    Can you explain why you used 'X_test_pca = pca.transform(X_train_scaled)' when you wanted to transform the test data with PCA?

    • @statquest
      @statquest  4 years ago +2

      I decided it was interesting to draw two different PCA versions: 1) of the training data, so we can see the classifier with respect to the data it was trained on, and 2) of the testing data, so we can see the classifier with respect to the data it was tested with. The code has both versions, but the latter is commented out. You can swap which line is commented out and draw the latter instead.

    • @zahrasoltani8630
      @zahrasoltani8630 4 years ago +1

      @@statquest Thank you so much

  • @martinparidon9056
    @martinparidon9056 2 years ago +1

    Thanks a bunch. This is helping me a lot in getting started with my SVM. Regards

    • @statquest
      @statquest  2 years ago

      Happy to help!

  • @שניגולדפרב
    @שניגולדפרב 3 years ago +1

    Your videos are amazing!!!! I am so happy you clearly explain many of the topics I need!! :)
    (p.s. Do you take requests? I would really love a StatQuest on AR, MA, ARIMA, and SARIMA models.)

    • @statquest
      @statquest  3 years ago +1

      I'll keep those topics in mind.

  • @alexandremondaini
    @alexandremondaini 4 years ago +1

    Hi Josh,
    Thank you very much for your lessons! You explain very well, unlike many teachers. I just have one doubt: when you scale(X_train) and scale(X_test), you're actually scaling the encoded 'categorical' variables. Thus the sparse matrix of 0s and 1s encoding the features ['SEX','MARRIAGE',....] will be scaled as well, is that correct? Shouldn't only the numerical features get scaled? Thanks a lot for your lessons.

    • @statquest
      @statquest  4 years ago +2

      It doesn't really matter if you scale binary variables or not: stats.stackexchange.com/questions/59392/should-you-ever-standardise-binary-variables

    • @alexandremondaini
      @alexandremondaini 4 years ago +1

      @@statquest thanks for the reply! BAM

  • @Mustistics
    @Mustistics 2 years ago +1

    One final question (I swear!): at the final code segment, you type
    X_test_pca = pca.transform(X_train_scaled)
    Isn't that supposed to be X_test_scaled?

    • @statquest
      @statquest  2 years ago +1

      Hmm....I'm actually on vacation right now and can't dig through this code. Can you re-ask this question in a few weeks?

  • @engrmuhammadumar
    @engrmuhammadumar 6 months ago

    Those who are facing errors can update the code as follows:
    from sklearn.svm import SVC
    from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
    clf = SVC(random_state=0)
    clf.fit(X_train_scaled, y_train)
    predictions = clf.predict(X_test_scaled)
    cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
    disp.plot()

    • @statquest
      @statquest  6 months ago

      Yep. The notebook has been updated.

  • @bytesizebiotech
    @bytesizebiotech 4 years ago +1

    So, although the publishing company is Elsevier, they are not the ones who did the research. If you ever want to read a paper, you can send an email to the principal investigator (the last author of the paper), or really any of the first authors, and they will freely give you the article to read.

    • @statquest
      @statquest  4 years ago

      That's a great idea! :)

  • @drpkmath12345
    @drpkmath12345 4 years ago +4

    Love Python! Been using R a lot lately! Would love to have some R videos.

    • @statquest
      @statquest  4 years ago

      Yes, I'm going to cover all of these topics (and more) in R. For example, R does a much better job with Random Forests than Python.

    • @drpkmath12345
      @drpkmath12345 4 years ago +1

      StatQuest with Josh Starmer I totally agree! Expect videos to come~

  • @ilducedimas
    @ilducedimas 2 years ago +1

    What a lovable, smart man. Thanks for the great work!

  • @omidforoqi4163
    @omidforoqi4163 3 years ago +1

    I love StatQuest. Please continue to make videos with Python =)

    • @statquest
      @statquest  3 years ago

      Thank you! :)

  • @yeyuan4235
    @yeyuan4235 3 years ago +1

    Josh, thanks for the video; it is super helpful!! A couple of questions though:
    1. Under "Transform the test dataset with the PCA...", should we use the code that you commented out, i.e. X_test_pca = pca.transform(X_test_scaled), instead of X_test_pca = pca.transform(X_train_scaled)? I didn't get why we applied the PCA transformation to the training dataset to derive the testing data.
    2. I noticed that 1,000 defaults and 1,000 non-defaults were selected to construct the training sample. Do the numbers in the two classes have to be equal for SVM? If not, would this cause any bias, as the ratio seems a lot different from the original data?
    Thank you!

    • @statquest
      @statquest  3 years ago +1

      1) Because the SVM was fit to the training data, I wanted to show how it "looked" relative to the training data. However, you can also "see" how the boundary applies to the testing data. It's up to you.
      2) Typically it's a good idea to have "balanced" data - data with an equal number of both classes. However, this is not a requirement for SVM - and, whether or not you need it depends on how you want the SVM to perform. For more details, see: th-cam.com/video/iTxzRVLoTQ0/w-d-xo.html
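The downsampling step mentioned in point 2 can be sketched like this, with a hypothetical imbalanced dataset (column name and counts made up to match the discussion):

```python
import pandas as pd
from sklearn.utils import resample

# A hypothetical imbalanced dataset: far more non-defaults than defaults.
df = pd.DataFrame({"DEFAULT": [0] * 3500 + [1] * 1000})
df_no_default = df[df["DEFAULT"] == 0]
df_default = df[df["DEFAULT"] == 1]

# Downsample each class to 1,000 rows to balance the training data.
df_no_default_down = resample(df_no_default, replace=False,
                              n_samples=1000, random_state=42)
df_default_down = resample(df_default, replace=False,
                           n_samples=1000, random_state=42)
df_downsampled = pd.concat([df_no_default_down, df_default_down])
```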

  • @sebastioncornejo4440
    @sebastioncornejo4440 2 years ago +1

    Haha, the double bam at 31:22 had me dying lol. Great content! And love your channel!

    • @statquest
      @statquest  2 years ago

      Thank you so much! :)

  • @JoaoVictor-sw9go
    @JoaoVictor-sw9go 2 years ago

    Josh, this video has helped me out a lot in my studies, but I have a question. When we scale the data, should we also include the categorical variables? Shouldn't we scale all the data excluding the categorical ones?

    • @statquest
      @statquest  2 years ago +1

      Because the categorical variables are one-hot-encoded, we can scale them. All of the 0s will stay the same and the 1s will all turn into another constant value. In other words, when one-hot-encoding, 1 is arbitrarily chosen to begin with, so it doesn't hurt to turn it into another arbitrary number.

    • @JoaoVictor-sw9go
      @JoaoVictor-sw9go 2 years ago +1

      @@statquest Got it Josh, thanks for responding

  • @bjornlarsson1037
    @bjornlarsson1037 4 years ago +1

    Great tutorial Josh! You must truly have one of the highest thumbs-up to thumbs-down ratios on YouTube. Just two questions.
    1) Right now you are using StandardScaler on all of your variables, including the ones you have encoded. What is your reasoning for this instead of just scaling the continuous variables, or does it not affect the result?
    2) What are your thoughts on one-hot encoding before vs. after splitting the data? Right now, when you're doing get_dummies, you are doing it before splitting the data. From what I have understood, whether to do it before or after splitting is a pretty heated topic, and I have found several questions on Stack Exchange where half the people say do it before and the other half say that doing it before is absolutely wrong and that it should instead be done after. In this dataset it will have an effect, because using your random states will produce a training set that on some variables has fewer categories than the test data does, which would mean that those observations should be dropped if one-hot encoding is done after splitting. If I instead used one-hot encoding before splitting, they would not be dropped. I would love to hear your thoughts on that topic, because I have found no real consensus on what is the right approach.
    Thanks again Josh!

    • @statquest
      @statquest  4 years ago +2

      1) For support vector machines, I'm pretty sure it does not affect the result. However, I have not tried it both ways.
      2) I think there is a fear that if you one-hot-encode before splitting the data, then there will be data leakage. With most transformations this is a problem, but for one-hot-encoding it is not. If a value in one dataset does not occur in the other dataset, then the column representing that value will be full of zeros and not have an effect on classification. In fact, the preferred method for industrial pipelines is "ColumnTransformer()", which keeps track of the values during the initial one-hot-encoding, and when a testing set has new values, it throws an error.

    • @bjornlarsson1037
      @bjornlarsson1037 4 years ago +1

      @@statquest Thanks for your insights Josh! Really appreciate it

    • @causticmonster
      @causticmonster 1 year ago

      @@statquest Is it the same for k-means cluster analysis also?

    • @statquest
      @statquest  1 year ago

      @@causticmonster Presumably if you use ColumnTransformer().
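The all-zeros behavior described above can be sketched with scikit-learn's OneHotEncoder (hypothetical MARRIAGE codes; handle_unknown="ignore" gives the all-zeros encoding, while the default raises an error on unseen categories, as Josh notes for ColumnTransformer pipelines):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; code 4 never appears in training.
train = pd.DataFrame({"MARRIAGE": [1, 2, 1, 3]})
test = pd.DataFrame({"MARRIAGE": [2, 4]})

# By default, unseen test categories raise an error; "ignore" encodes
# them as a row of all zeros instead.
enc = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_test = enc.transform(test).toarray()
```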

  • @imdadood5705
    @imdadood5705 3 years ago +1

    How it started:
    df
    How it is going:
    df_23_without_missingdata_scaled_with_magic_powers

  • @zahrasoltani8630
    @zahrasoltani8630 4 years ago +2

    Hello Josh, do you have any lecture about support vector data description (SVDD) as well? Your way of describing problems is amazing.

  • @tanphan3970
    @tanphan3970 3 years ago

    Dear Josh,
    My understanding is that the n_components hyperparameter in PCA() is the number of dimensions we want to reduce down to. Therefore, I have some confusion:
    1. If we use PCA() without specifying n_components, what exactly is the number of components in this case?
    2. In other tutorials, n_components can be set to a float (0.0 to 1.0), which does not make sense if we understand it as a number of dimensions.
    Thanks, have a nice week!

    • @statquest
      @statquest  3 years ago +1

      The number of components is explained here: th-cam.com/video/oRvgq966yZg/w-d-xo.html

    • @tanphan3970
      @tanphan3970 3 years ago

      @@statquest Thanks for the recommended video. My understanding is this:
      When we use PCA() with no n_components hyperparameter, the program will calculate all the PCs of the data; n_components in this situation equals the number of dimensions of the data (pca.explained_variance_ratio_.shape[0]),
      and when we use PCA(n_components=2), we only keep the first 2 PCs.
      Sorry if this question causes you any inconvenience. I only want to be sure that my understanding is correct.

    • @statquest
      @statquest  3 years ago

      @@tanphan3970 Yes, that is correct.
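The three ways to set n_components, as summarized in the exchange above (a small sketch on made-up data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# No n_components: keep every component (one per feature here).
pca_all = PCA().fit(X)

# Integer: keep exactly that many components.
pca_two = PCA(n_components=2).fit(X)

# Float in (0, 1): keep the smallest number of components whose
# explained variance adds up to at least that fraction.
pca_90 = PCA(n_components=0.90).fit(X)
```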

  • @martinparidon9056
    @martinparidon9056 2 years ago

    I have a request. You explain brilliantly (also with the background info in your other videos) how to create and optimize an SVM.
    Could you also make a video about how to actually use your SVM in a target system? That would make sense, I think.
    I believe this would necessitate saving the scaler during creation of the SVM and loading it at runtime. Regards.

  • @midhileshmomidi2434
    @midhileshmomidi2434 4 years ago +1

    The man behind the voice

  • @jovanagluhovic3139
    @jovanagluhovic3139 3 years ago +1

    It helped a lot! Thank you for the shared time and knowledge.

    • @statquest
      @statquest  3 years ago

      Thank you! :)

  • @anuragsharma-os3vj
    @anuragsharma-os3vj 3 years ago +1

    Your videos are as informative as always. The way you explain the topics is on another level. But I see a tabla (twin hand drums) behind you. Do you play?
    I also love to play tabla. Double BAM!!!! :D

    • @statquest
      @statquest  3 years ago

      I used to play tabla a lot. I spent a lot of time in Chennai when I was a kid because my dad taught at the IIT there. When I was there I took lessons on tabla and veena.

  • @deepanjan1234
    @deepanjan1234 4 years ago +1

    You are just awesome. I love your videos, as they are really amazing. Stay safe.

    • @statquest
      @statquest  4 years ago

      Thank you! You too!

  • @andreabvtt
    @andreabvtt 3 years ago +1

    Amazing content! How do I know when you have a webinar planned? And where do you stream it? Thanks!!

    • @statquest
      @statquest  3 years ago

      If you subscribe, you can find out about webinars.

    • @andreabvtt
      @andreabvtt 3 years ago +1

      @@statquest excellent, will do!

  • @abir95571
    @abir95571 3 years ago +5

    929 likes and 0 dislikes ... that's a triple BAM !!

    • @statquest
      @statquest  3 years ago +1

      Hooray! :)

  • @samanvafadar7719
    @samanvafadar7719 3 years ago

    Again, great video, thanks.
    Just one question, I hope you answer:
    Is there anything like "model importance" in RStudio?
    I need the independent variables' influence.

    • @statquest
      @statquest  3 years ago +1

      I'm sure there is. See: cran.r-project.org/web/packages/shapr/vignettes/understanding_shapr.html

    • @samanvafadar7719
      @samanvafadar7719 3 years ago

      @@statquest Thank you so much, but I meant in Python.
      I am running an SVM and looking for that code in Python; I want to obtain the variables' importance after classification.

    • @statquest
      @statquest  3 years ago

      @@samanvafadar7719 See: shap.readthedocs.io/en/latest/
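shap is the SHAP-based route linked above; a simpler, model-agnostic alternative built into scikit-learn is permutation importance, which also works with SVMs. A minimal sketch on made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC

# Made-up data: only some of the 5 features are informative.
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=2, random_state=0)
clf = SVC().fit(X, y)

# Shuffle one feature at a time and measure how much accuracy drops;
# big drops mean the model relied on that feature.
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
importances = result.importances_mean
```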

  • @jamesvalencia3298
    @jamesvalencia3298 4 years ago +1

    I always see your videos! Please continue this series of videos, and surely I will purchase a notebook soon.

    • @statquest
      @statquest  4 years ago

      Thank you very much! :)

  • @LeafHandy
    @LeafHandy 8 months ago

    Hi, I have a question: aren't we supposed to split the data even more, and then use the validation dataset for hyperparameter tuning? We could pass it to grid_search, e.g. grid_search(x_validation, y_validation), instead of using the training dataset again.

    • @statquest
      @statquest  8 months ago

      You can definitely do that.
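For reference, GridSearchCV already carves validation folds out of the training data internally (cross-validation), which is why the video can pass the training set directly. A minimal sketch on made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Made-up data; the held-out test set stays out of the search entirely.
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# cv=5 means each candidate is scored on 5 internal validation folds.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
grid = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)
best = grid.best_params_
```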

  • @ivnesapple479
    @ivnesapple479 3 years ago

    Hi, Josh.
    Is SVM sensitive to correlation between the features?
    I think Marriage_1, Marriage_2, and Marriage_3 are correlated (their sum is 1).

    • @statquest
      @statquest  3 years ago

      No. We can use one-hot encoding with SVM.

    • @ivnesapple479
      @ivnesapple479 3 years ago +1

      @@statquest Thanks!!

  • @sahilpandita2964
    @sahilpandita2964 3 years ago +1

    When Josh said 'OH NO!!', I was waiting for the line 'Terminology Alert!!!'.

  • @annapeng88
    @annapeng88 4 years ago +6

    I feel a bit starstruck finally seeing your face... :p Love your videos as always!

    • @statquest
      @statquest  4 years ago

      😊 thank you

  • @lucaslai6782
    @lucaslai6782 4 years ago

    Hello Josh, from a statistical perspective, how do you deal with "weird data"? As an illustration, for this dataset:
    EDUCATION, Category
    1 = graduate school
    2 = university
    3 = high school
    4 = others
    However, df['EDUCATION'].unique() gives
    array([2, 1, 3, 5, 4, 6, 0], dtype=int64)
    How do you deal with "5 and 6"? They are not in the category list. Do you treat them as missing values?
    Also, what about data values which are out of range? They are definitely wrong.

    • @statquest
      @statquest  4 years ago

      I would treat them as missing data.
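That cleanup step might look like this (hypothetical EDUCATION column containing the undocumented codes 0, 5, and 6):

```python
import numpy as np
import pandas as pd

# Hypothetical EDUCATION column containing undocumented codes.
df = pd.DataFrame({"EDUCATION": [2, 1, 3, 5, 4, 6, 0]})

# Keep the documented codes 1-4 and treat everything else as missing.
valid = [1, 2, 3, 4]
df["EDUCATION"] = df["EDUCATION"].where(df["EDUCATION"].isin(valid), np.nan)
```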

  • @YanqingSong-f6b
    @YanqingSong-f6b 5 months ago

    Why do you use 1:1 resampling instead of stratified resampling? The dataset contains 3.5 no_default : 1 default. Does this affect the SVM results?

    • @statquest
      @statquest  5 months ago

      What time point, minutes and seconds, are you asking about?

  • @leonisaacs7231
    @leonisaacs7231 4 years ago

    Hi Josh, really great content; I'm learning a lot.
    Out of curiosity, when doing one-hot encoding, is there a reason why you did not set drop_first=True to avoid multicollinearity?

    • @statquest
      @statquest  4 years ago

      Yes, this is different from a linear model.

  • @leebradbury8879
    @leebradbury8879 2 years ago

    Another great video. I wish I had found this channel years ago!
    I am assuming the way you coded the optimization of parameters could be used as the basis for other models like random forests, with just the parameters changing depending on the model being optimized?

    • @statquest
      @statquest  2 years ago +1

      Yes. However, the scikit-learn implementation of random forests is terrible...

  • @RaviRajput-mq2ew
    @RaviRajput-mq2ew 3 years ago +1

    This is really great. Thank you, Sir, for this great effort!!

    • @statquest
      @statquest  3 years ago

      Glad you liked it!

  • @Mustistics
    @Mustistics 2 years ago

    Hey Josh, thanks for the video.
    One question: you drop the ID column right from the start. In real life, once you've made sure your model is valid and accurate, you would actually need to match those IDs to the probabilities of default. How would you do that? Put the IDs in a list before dropping the column, and then add the list as a column next to the predicted probabilities?

    • @statquest
      @statquest  2 years ago

      We don't really need to re-add the IDs for the training data. However, when we get new data, we can just keep track of the ID by hand.

  • @hrdyam865
    @hrdyam865 4 years ago

    Thank you very much. In the radial basis function video, only the hyperparameter gamma was involved; the regularization parameter C was not in the radial kernel function. Are we using a different radial kernel function here, or the same one shown in the radial kernel video? Thanks again, your videos are a great help.

    • @statquest
      @statquest  4 years ago

      We are using the same kernel - so the only kernel parameter that we are optimizing is gamma. However, most, if not all, machine learning implementations also include regularization in one form or another. So we'll be talking about that as well.

  • @causticmonster
    @causticmonster 1 year ago

    Are you supposed to scale the one-hot encoded variables as well?

    • @statquest
      @statquest  1 year ago

      I think it can go either way.

  • @TD-in5qe
    @TD-in5qe 3 years ago +1

    This is amazing. Thank you, Josh!

  • @tanphan3970
    @tanphan3970 3 years ago

    Dear Josh,
    I don't understand your decision about the confusion matrix.
    "Did not default": 79% and "defaulted": 61% --> not awesome.
    Why? These numbers seem small, don't they? Or do you have a threshold for your decision?

    • @statquest
      @statquest  3 years ago +1

      It's subjective.

  • @liuchen6870
    @liuchen6870 4 years ago

    Hi Josh!
    What if our dataset has "continuous" columns and "categorical number" columns at the same time? Should we start by getting dummies to convert the categorical columns, and then StandardScaler the remaining continuous columns to give the data 0 mean? Is there any relationship between "get_dummies" and "encoder"?
    I really appreciate any answers you would share with us, cheers!

    • @statquest
      @statquest  4 years ago

      XGBoost works well with sparse data (data with lots of zeros), so it is probably a good idea to only one-hot-encode the categorical data. Do not standardize them as well.

  • @tanphan3970
    @tanphan3970 3 years ago

    Hello Josh Starmer,
    Can you explain more about some of the hyperparameters in resample?
    replace=False --> we will not change any data in the original data (df_default), and if True, the original df_default will be changed?
    random_state --> helps others get the same result as you? So how many people can get the same result as you? 42???
    Thanks

    • @statquest
      @statquest  3 years ago +1

      1) Yes 2) We are setting the seed for the random generator to the number 42, which ensures that everyone will get the same results. In other words, the random number generator generates a sequence of random numbers based on a starting value. If we all set the starting value to the same number (in this case, 42), then we will all get the same sequence of random numbers.

  • @Mustistics
    @Mustistics 2 years ago

    One more question: when you're defining the param_grid, you have a comma after the last curly brackets. It actually works with or without that comma. I don't get why it isn't throwing an "error" in there, since that comma isn't supposed to be there. 🤔

    • @statquest
      @statquest  2 years ago

      Python is sometimes a mystery to me.... :)
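
For what it's worth, that comma is not an error by design: Python allows a trailing comma after the last element of a list, tuple, dict, or set literal (and in function calls), so both spellings parse to the same object. A tiny check (the grid values below are arbitrary examples, not the ones from the video):

```python
# A trailing comma after the last item of a collection literal is legal
# Python syntax, so these two lists are exactly the same object shape
with_comma = [
    {"C": [0.5, 1, 10, 100], "kernel": ["rbf"]},
]
without_comma = [
    {"C": [0.5, 1, 10, 100], "kernel": ["rbf"]}
]
assert with_comma == without_comma  # the trailing comma changes nothing
```

Trailing commas are handy when editing multi-line literals: you can add, remove, or reorder items without touching the neighboring lines.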

  • @iunknown563
    @iunknown563 2 years ago +1

    Very approachable!

  • @tanphan3970
    @tanphan3970 3 years ago

    Dear Josh,
    Something doesn't feel logical about how you set the C hyperparameter the two times you define the param_grid (when applying it to X_train_scaled and to pca_train_scaled). The first time, C=1000 is not in your list; the second time, C=1000 is added and it becomes the best parameter in the grid search.
    Any ideas about this step?
    Thanks and have a nice week!

    • @statquest
      @statquest  3 years ago

      It might have been originally but I forgot to add it back in.

  • @lprashanthi7298
    @lprashanthi7298 3 years ago

    How do we set values for C and gamma, especially the penalty parameter C? Is it only by trial and error?

    • @statquest
      @statquest  3 years ago +1

      Pretty much. You just test a bunch with values with cross validation and see which is best.
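
To make that concrete, here is a minimal cross-validation search over C and gamma; the toy data and candidate values are made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy classification data standing in for real features
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Candidate values to test; if the best value lands on the edge of a
# range, widen that range and search again
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.1, 1],
}

# Try every combination with 5-fold cross validation and keep the best
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
best_params = search.best_params_
```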

  • @DarkLemon4321
    @DarkLemon4321 4 years ago

    Great lecture ;) but anyway I have one question - is it correct to standardize X_train and X_test separately? I mean, shouldn't the standardization parameters be the same for both datasets?
    In the current approach, the data are not comparable, as if they came from completely different worlds. Am I correct?

    • @statquest
      @statquest  4 years ago +1

      In a pinned comment, I wrote "At 31:25 we should use the mean and standard deviation from the training dataset to center and scale the testing data. The updated jupyter notebook reflects this change."

    • @DarkLemon4321
      @DarkLemon4321 4 years ago +1

      @@statquest Thanks for the answer :)
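
To illustrate the fix mentioned in the pinned comment, here is a sketch (with made-up data) that learns the centering and scaling parameters from the training set only and then applies them to both sets:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(loc=10.0, scale=3.0, size=(100, 2))  # made-up data
X_test = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

# Learn the mean and standard deviation from the TRAINING data only...
scaler = StandardScaler().fit(X_train)

# ...then apply those same parameters to both sets, so train and test
# end up on one common scale
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

After this, the training columns have mean 0 and standard deviation 1, while the test columns are merely close to that, which is exactly what we want: the test data is treated as new data measured on the training scale.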

  • @will6403
    @will6403 3 years ago

    Why are you fitting your GridSearchCV on only the training data? Shouldn't you pass your entire X dataset when doing GridSearchCV with cross validation?

    • @statquest
      @statquest  3 years ago

      That would overfit the model.

  • @Bilal-sz8pk
    @Bilal-sz8pk 4 years ago

    Hi Josh,
    I have a question. At 32:30 we scale X_test and X_train, but I think they weren't scaled the same way, because they are not from the same sample and their standard deviations and means differ from each other.
    I tried these tiny sets to check whether I was right, and it looks like the scaling process is a little wrong:
    xxx = [1, 4, 400, 10000, 100000]
    yyy = [1, 4, 400, 10000, 11]
    scale(xxx)
    scale(yyy)
    Can you check and let me know whether my thinking is wrong?

    • @statquest
      @statquest  4 years ago

      In a pinned comment I wrote: At 31:25 we should use the mean and standard deviation from the training dataset to center and scale the testing data. The updated jupyter notebook reflects this change.

    • @Bilal-sz8pk
      @Bilal-sz8pk 4 years ago +1

      @@statquest I didn't realize that, sorry. Thank you for the reply.
      I want to thank you so much. There may be many informative people on the internet, but you are the best. Thank you for having fun while teaching!!

  • @amanuel2135
    @amanuel2135 5 months ago

    Is there a reason why you're not using LDA (multi-class) rather than PCA?

    • @statquest
      @statquest  5 months ago

      What time point, minutes and seconds, are you asking about?

  • @EvandroSegundo
    @EvandroSegundo 4 years ago

    Great tutorial! In fact, all your videos are great. I have just one question: when looking for the best value for C, the algorithm went for the upper limit of the range. Shouldn't we try again, suggesting higher values? I haven't tried it myself, so I really don't know what would happen.

    • @statquest
      @statquest  4 years ago

      Yes, we should probably try higher values.

  • @michal.tomczyk
    @michal.tomczyk 3 years ago

    Say, in the original data set, we had a ratio of 30:70 of defaulted to non-defaulted credit accounts. Is it obligatory to have a balanced down-sampled data frame before we proceed with the analysis?

    • @statquest
      @statquest  3 years ago +1

      It's not obligatory.
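
When you do choose to balance, one common pattern is to down-sample the majority class with resample; the 70:30 data below is made up for illustration:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 70 non-defaulted (0) vs 30 defaulted (1)
df = pd.DataFrame({"default": [0] * 70 + [1] * 30})

df_majority = df[df["default"] == 0]
df_minority = df[df["default"] == 1]

# Down-sample the majority class to the size of the minority class
df_majority_down = resample(df_majority, replace=False,
                            n_samples=len(df_minority), random_state=42)

df_balanced = pd.concat([df_majority_down, df_minority])
```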

  • @mainakray6452
    @mainakray6452 4 years ago +1

    Great experience, looking forward to ANN.

  • @vegaarcturus509
    @vegaarcturus509 1 year ago

    Correct me if I'm wrong, but when people think of machine learning they think of AI self-improvement, whereas SVM is just finding correlations between datasets?

    • @statquest
      @statquest  1 year ago

      When I think of machine learning, I think of classifying things and making predictions. SVM can be used to classify things.

  • @beautyisinmind2163
    @beautyisinmind2163 2 years ago

    Why do different splits, like 66:33, 70:30, and 80:20, give different accuracies?

    • @statquest
      @statquest  2 years ago

      What time point, minutes and seconds, are you asking about?

  • @jackjakie6076
    @jackjakie6076 2 years ago

    Hello, I am a student from China. When will I be able to support you by paying with Alipay or WeChat?

    • @statquest
      @statquest  2 years ago

      Thanks for your support. I'll look into it! :)

  • @DatascienceConcepts
    @DatascienceConcepts 4 years ago +1

    Awesome teaching! Very interesting lectures.

    • @statquest
      @statquest  4 years ago

      Thank you! :)

  • @arjunmallick4901
    @arjunmallick4901 4 years ago +1

    Aahhhh.... Something that I was stuck on... thanks a lot ❣