Machine Learning Classification: How to Deal with Imbalanced Data ❌ Practical ML Project with Python

  • Published 27 Oct 2024

Comments • 44

  • @amansamsonmogos9608
    @amansamsonmogos9608 2 years ago +6

    I am not sure this is the best way to deal with data imbalance, and it won't work in a real case. You used SMOTE to balance the dataset and then drew your test set from the oversampled data, which is synthetic. To make sure your model works well, you have to keep part of the original imbalanced dataset as your test set and then apply SMOTE only to the rest. That way your test set is a faithful representation of the original data. I am sure your F1-score will be much smaller. Some of the best methods are One-Class Support Vector Machine (OCSVM), Generalized One-class Discriminative Subspaces (GODS), One-Class CNN (OCCNN) and Deep SVDD (DSVDD).

  • @philwebb59
    @philwebb59 2 years ago +2

    You do realize that in your pipeline, once you run the oversample step, you have 6 perfectly balanced groups with 900 samples each. There's no real majority class to sample from. When you then undersample from a perfectly balanced dataset, it appears to leave one group intact and resample the others. If you plot the data, it will look essentially the same as before, when you only oversampled, with some samples missing and others duplicated. The scores will be similar as well.

    • @brandonbosire211
      @brandonbosire211 2 years ago

      What do you recommend?

    • @philwebb59
      @philwebb59 2 years ago

      @@brandonbosire211 Do one or the other, not both one after the other.

    • @thangtran145
      @thangtran145 2 years ago

      Hello Mr. Webb. The guy in the video does SMOTE on the whole dataset, which I think will cause data leakage, generating overly optimistic accuracy. However, he says that by using cross-validation, SMOTE on the whole dataset wouldn't be a problem. Do you concur?

    • @philwebb59
      @philwebb59 2 years ago +1

      @@thangtran145 Yes, anytime you do something to the entire dataset, you get data leakage. Anything, including SMOTE. Synthesizing data, then including some of that synthetic data in your test set, will lead to overly optimistic results. The usual recommendation is to use a pipeline: SMOTE -> scale -> regression on the training set only. Leave the test set alone. The problem with unbalanced data is getting it distributed evenly; you might need to bucket the data manually. Also, you may not have enough of the minority class in your test set to tell if your model is a good predictor, so you may want to score the two classes separately.

  • @ammarkamran4908
    @ammarkamran4908 3 years ago +8

    Hello, thanks for the video.
    However, I noticed that you did SMOTE before running the train/test split. I am afraid this might be causing the results to improve drastically, since the upsampled observations from the minority class might have entered the test dataset. So basically your model learned and was tested on pretty much the same data, which caused the results to improve.
    Let me know what you think.

    • @DecisionForest
      @DecisionForest  3 years ago +1

      Hi Ammar, yes, in a normal train/test split there is a risk that one of the synthetic samples generated from the original data points might be present in the test set. But since I generated synthetic data and used cross-validation, I decided to handle the imbalance on the whole set; it was faster for what I wanted to show. You can also split the data first, perform cross-validation on the train set, and test on the out-of-sample test set afterwards. Hope you found it useful.

    • @iliaskatsabalos
      @iliaskatsabalos 3 years ago +1

      I was confused at this point as well. I believe that altering the distribution of the test set is not a great practice, since the original distribution is the one we expect if we deploy our model to production. The real question is how this model performs on the original test set. However, great video on SMOTE. Thank you!

    • @mahmoudshobair9718
      @mahmoudshobair9718 3 years ago +1

      @@iliaskatsabalos Spot-on point. Hold-out set validation is needed.

    • @thangtran145
      @thangtran145 2 years ago

      @@DecisionForest So you're saying data leakage will not happen if we combine SMOTE of the whole set with cross-validation? In other words, it's okay to SMOTE the entire dataset as long as we use cross-validation? Thank you

  • @subhajit20111
    @subhajit20111 1 year ago

    Unfortunately, your website link and notebook link are not available here. Any suggestion?

  • @nickpgr10
    @nickpgr10 3 years ago +1

    Can you suggest any techniques for handling an imbalanced image dataset?
    Thank you.

  • @TrainingDay2001
    @TrainingDay2001 1 year ago

    In order to truly evaluate, you need to test on an IMBALANCED test set. :) You can train on a balanced train set, but the hold-out needs to be a true imbalanced set, because in the real world the data you encounter will have the same imbalance, and that's what your performance metric needs to measure: how well you score on unseen imbalanced data.

  • @fatimak6440
    @fatimak6440 3 years ago +1

    one of the best channels for ML!

  • @kar2194
    @kar2194 3 years ago

    Hi, thanks for the content!
    I am confused: instead of applying this method to the y variable, can I apply this technique to imbalanced predictors that have levels with large differences in sample size?
    For example, class A: 900, class B: 100, class C: 2.
    Thanks!

  • @rahuldey6369
    @rahuldey6369 3 years ago

    When should we use undersampling? As I see it, there's a potential risk of losing information.

  • @Mustistics
    @Mustistics 3 years ago

    I don't understand how you apply under- and oversampling at the same time. One of them will balance the data, and the other one has nothing left to do...

    • @DecisionForest
      @DecisionForest  3 years ago

      Good observation, it was just to show how you apply both.

  • @titow7417
    @titow7417 4 years ago

    You said this deals with 'multi-class classification problems'. But what if we have imbalanced data and binary classification?

    • @DecisionForest
      @DecisionForest  4 years ago

      For binary it's the same when it comes to imbalanced datasets. I just gave the multiclass example here.

    • @titow7417
      @titow7417 4 years ago +1

      @@DecisionForest Yeah.
      I thought so. Thanks!

  • @wliiliammitiku3996
    @wliiliammitiku3996 4 years ago

    It is not clear, sir, and I have a question: what technique did you use to sort out the class imbalance problem?

    • @DecisionForest
      @DecisionForest  4 years ago

      I used SMOTE for oversampling and RandomUnderSampler for undersampling.

  • @itstoufique47
    @itstoufique47 3 years ago

    Excuse me, is there any way to find the original notebook file? I can't open the one in the description. Thank you.

    • @DecisionForest
      @DecisionForest  3 years ago

      Just checked now, it should work.

  • @tahirullah4786
    @tahirullah4786 3 years ago

    Please, how can we get the Jupyter notebook code?

  • @flamboyantperson5936
    @flamboyantperson5936 4 years ago

    Great one

  • @flamboyantperson5936
    @flamboyantperson5936 4 years ago

    Where do you edit your videos? DaVinci?

    • @DecisionForest
      @DecisionForest  4 years ago +1

      As in DaVinci Resolve? I use Adobe Premiere for editing; I think their products are the best for creative work.

    • @flamboyantperson5936
      @flamboyantperson5936 4 years ago

      @@DecisionForest Yeah, DaVinci Resolve. Adobe Premiere must be paid, I guess.

    • @DecisionForest
      @DecisionForest  4 years ago

      @@flamboyantperson5936 Yes, it's part of Creative Cloud. If you use their other apps, then a subscription is worth it.

    • @flamboyantperson5936
      @flamboyantperson5936 4 years ago +1

      @@DecisionForest Will try. Thank you for useful information.

  • @oluwapelumiabimbola3280
    @oluwapelumiabimbola3280 4 years ago +1

    Can you apply SMOTE to text data?

    • @DecisionForest
      @DecisionForest  4 years ago +1

      For oversampling text data I'd suggest you use pre-trained word embeddings. For example, you can generate new samples by doing permutations of similar words.

    • @oluwapelumiabimbola3280
      @oluwapelumiabimbola3280 4 years ago

      Yes...Thanks

  • @mahdimed775
    @mahdimed775 3 years ago

    Thanks for sharing this.
    Could you please send me this code?
    I need it.

  • @farisocta7466
    @farisocta7466 3 years ago

    Are you sure the undersampling method works? The number is still the same: 900.

    • @DecisionForest
      @DecisionForest  3 years ago

      It should, yes. Could you give more detail, please?

    • @farisocta7466
      @farisocta7466 3 years ago +1

      I think the pipeline just balances the classes by oversampling automatically, and then, because the classes are already balanced, the undersampling doesn't do anything.

    • @DecisionForest
      @DecisionForest  3 years ago +1

      Oh yes, normally you either undersample or oversample; if we do both, only the first is actually applied. I did both, sorry for the confusion; I should have mentioned it.

    • @farisocta7466
      @farisocta7466 3 years ago

      @@DecisionForest It's okay, I'm still learning too. If you want to do both, you must specify the sampling strategy as a parameter; otherwise they just auto-balance the classes.