148 - 7 techniques to work with imbalanced data for machine learning in python

  • Published on 21 Aug 2024

Comments • 35

  • @mayukh_
    @mayukh_ 4 years ago +17

    Sorry if I am saying something wrong, but I found a logical error in the code. Upsampling generally produces a highly overfit model, especially when used with RandomForest. What you have here is called data leakage; I have seen it in the code of a lot of experienced developers. Let me explain.
    You are upsampling the full dataframe and then splitting it into train and test.
    What you should do is first split the data into train and test, then perform all the upsampling on the train set only, build the model, and then test it on the test data. You don't touch or modify the test data.
    This way you prevent the data leakage. Because you upsampled first and then split, your model will obviously look better for every class, but I don't think that is a true picture of how well it generalizes. You can use a stratified split so that all the class labels are represented in both sets.
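The ordering described in this comment (split first, then upsample only the training rows) can be sketched as follows. This is a minimal, dependency-free illustration on hypothetical toy labels, not the code from the video; the helper names `stratified_split` and `upsample_minority` are made up for this sketch. In practice, scikit-learn's `train_test_split(..., stratify=y)` and `sklearn.utils.resample` would likely play these roles.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved.

    Assumes every class has at least two samples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


def upsample_minority(train_idx, labels, seed=0):
    """Duplicate training rows at random until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[labels[i]].append(i)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        # Sample with replacement to fill the gap (k is 0 for the largest class).
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    return balanced


# Hypothetical toy labels: 90 rows of class 0, 10 rows of class 1.
labels = [0] * 90 + [1] * 10

# 1) Split first, so the test rows are set aside untouched.
train_idx, test_idx = stratified_split(labels)

# 2) Upsample only the training rows.
balanced_train_idx = upsample_minority(train_idx, labels)

train_counts = Counter(labels[i] for i in balanced_train_idx)
test_counts = Counter(labels[i] for i in test_idx)
overlap = set(balanced_train_idx) & set(test_idx)

print("balanced train counts:", dict(train_counts))
print("untouched test counts:", dict(test_counts))
print("rows shared by train and test:", len(overlap))
```

The training classes end up balanced, the test set keeps the original imbalance, and no row appears in both sets.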

    • @DigitalSreeni
      @DigitalSreeni 4 years ago +8

      Please don't feel sorry for helping people by correcting any mistakes in my videos. I do rely on viewers like you to shed some new perspective on my topics. Balancing classes after splitting does make sense. I should admit I never performed that experiment but it makes logical sense. Thanks for the tip.

  • @mingzhang4200
    @mingzhang4200 2 years ago

    Fantastic presentation on how to handle imbalanced data in modeling!

  • @huanwangyang4458
    @huanwangyang4458 2 years ago +3

    Hi, this is a good video, but I have to say the model with sampling data is wrong. You should resample the X_train & y_train, then train the model, and finally apply this model to the untouched original data (X_test).

  • @himanshu8006
    @himanshu8006 3 years ago

    Thanks a lot, you were quick and on the spot throughout the video.

  • @imadsaddik
    @imadsaddik 3 months ago

    Thank you

  • @hik381
    @hik381 3 years ago +2

    Hey, I have a multi-label classification task (images of different landscapes). I'm wondering how I can (or should) balance my dataset. Some of the labels are in general more common than others, and so are various combinations: e.g. (mountains and desert) appears more often than (forest and desert), and the same goes for combinations of three labels, e.g. (mountains, forest, snow). By the way, I want to use traditional machine learning algorithms (a kNN classifier, logistic regression with sklearn's MultiOutputClassifier). Do you have any advice?

  • @user-gb6py4re3o
    @user-gb6py4re3o 4 years ago

    Thanks for explaining easily.

  • @JesseThings
    @JesseThings 2 years ago +1

    Hello! Thanks for the good video! I have a question about the train-test split with the resampling. In this example you do the train_test_split after resampling. Wouldn't that create bias in testing, and the same in validation? Shouldn't I do the train-test split first and then resample only the training data? Thanks!

    • @GARUDA1992152
      @GARUDA1992152 2 years ago

      From my experience, you are right.. we should only resample the train data, and keep the validation data as is

    • @hEmZoRz
      @hEmZoRz 2 years ago

      This was bugging me as well, and you're 100% correct. Performing a simple upsampling before the train-test split is particularly problematic, as your test set will (most likely) contain the exact same samples that were used for training the model. Not surprisingly, this will inflate the measured model performance. A similar logic applies to using SMOTE prior to the train-test split: your test set will likely contain synthetic samples that were generated from samples in the train set. That is, information has leaked once again from the train set to the test set. A simple undersampling of the majority class will not have such issues, though.
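The overlap described in this reply can be checked on toy data. The sketch below deliberately uses the problematic order (upsample with replacement, then split) and counts how many test rows also appear in the training set. The row ids, the 90/10 class ratio and the 75/25 split are hypothetical choices for illustration only.

```python
import random

rng = random.Random(42)

# Hypothetical toy data: row ids 0..99, where ids 90..99 are the minority class.
majority = list(range(90))
minority = list(range(90, 100))

# Upsample the minority class to 90 rows by duplicating with replacement,
# BEFORE splitting (the ordering this thread warns against).
upsampled_minority = minority + rng.choices(minority, k=80)
rows = majority + upsampled_minority
rng.shuffle(rows)

# Plain 75/25 split of the already-upsampled rows.
cut = int(len(rows) * 0.75)
train_rows, test_rows = rows[:cut], rows[cut:]

# A test row counts as leaked if the same original row id is also in train.
train_ids = set(train_rows)
leaked = [r for r in test_rows if r in train_ids]
leaked_fraction = len(leaked) / len(test_rows)

print(f"test rows: {len(test_rows)}, leaked: {len(leaked)} ({leaked_fraction:.0%})")
```

Only minority rows can leak here, because the majority ids are unique; those are exactly the rows whose test scores become inflated.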

  • @zeeshankhanyousafzai5229
    @zeeshankhanyousafzai5229 1 year ago

    Please make a video on Data Augmentation with GANs

  • @yuepengliu664
    @yuepengliu664 2 years ago

    This helps a lot.

  • @smartitbyeng.nareman1573
    @smartitbyeng.nareman1573 3 years ago

    thank you so much

  • @junaidlatif2881
    @junaidlatif2881 1 year ago

    Amazing... Can we use this for regression?

  • @matancadeporco
    @matancadeporco 3 years ago

    Hi sir, thanks in advance for all these videos; they are helping me a lot.
    I'm trying to do semantic segmentation on leaf images whose labels are 0 for background, 1 for leaf and 2 for the plague.
    I'm trying to apply class_weight, but I get an error when fitting the model (ValueError: `class_weight` not supported for 3+ dimensional targets.)
    How can I overcome this? I already tried sample weights and defining the weights manually, but I still receive this error.
    Thanks

  • @stevenmr1215
    @stevenmr1215 4 years ago

    Thanks for sharing. As I understand it, we train first and then predict on a new sample.
    Here you train a classifier on all of an image's pixels using the segmentation labels. Do you then use this trained model to predict another image, or are you just comparing the accuracy of different classifiers here?

    • @DigitalSreeni
      @DigitalSreeni 4 years ago +1

      I use a few labels to train a model and predict a single image accurately. If it is not accurate, I will add more labels. Once I am happy with my segmented image then I use it to train another model and then segment a few more images. If I am happy with them then I use them as masks to segment even more images. Finally, I use all these segmented images as training masks for deep learning.
      If you want to annotate your labels: www.apeer.com/annotate
      (It is free)

  • @nafaszareee6871
    @nafaszareee6871 2 years ago

    Hi, thanks for the video.
    My image datasets are imbalanced, and I used the deep learning method, but I get this error: ValueError: `class_weight` not supported for 3+ dimensional targets.
    Help me solve the problem😓😓

  • @mihretdesta9153
    @mihretdesta9153 1 year ago

    Sir, is data augmentation the best technique for imbalanced data?

  • @user-sh5hn2gn1k
    @user-sh5hn2gn1k 9 months ago

    @DigitalSreeni
    Hi There!
    Do these libraries (SMOTE, etc.) work for image (computer vision) data?

    • @DigitalSreeni
      @DigitalSreeni 8 months ago +1

      SMOTE is for structured data, not for image data. For images, you can try image augmentation methods.

    • @user-sh5hn2gn1k
      @user-sh5hn2gn1k 8 months ago

      @DigitalSreeni Thanks!

  • @eventhatsme
    @eventhatsme 3 years ago

    I have seen that stratified k-fold cross validation is used for imbalanced datasets with traditional machine learning. How will this work out of the box compared to these techniques? Would you say that a combination of over- and undersampling will work the best also for deep learning models?

  • @muazimran
    @muazimran 3 years ago

    How can we do this with 3 RGB channels instead of grayscale?

  • @surajshah4317
    @surajshah4317 4 years ago

    Dear sir,
    I have a question.
    How do I classify a gray mask image into 5 classes and label them for multiclass segmentation?

    • @DigitalSreeni
      @DigitalSreeni 4 years ago

      I have videos on multiclass, but I'm not sure if you are referring to multiclass. I suspect you are interested in segmenting every pixel of an image into one of the 5 regions. If that is the case, look for U-Net segmentation for a deep learning approach. But if you do not have many labels, then check out my videos 67 and 67b.
      For labeling, you can use www.apeer.com/annotate. It is easy and free.

    • @surajshah4317
      @surajshah4317 4 years ago

      @DigitalSreeni Thank you so much, sir.
      Yes, you understood my concern.
      I have a mask image with a value from 0-4 in each pixel.
      Should I use one-hot encoding?

  • @arienugroho430
    @arienugroho430 2 years ago

    I would like to ask: why does n_samples equal 400000? Why not more? Thanks

    • @DigitalSreeni
      @DigitalSreeni 2 years ago +1

      Why not less?

    • @arienugroho430
      @arienugroho430 2 years ago

      @DigitalSreeni Is there a rule or a way to determine n_samples, or do you just match it to the average class count?