How to handle imbalanced datasets in Python

  • Published on 21 Aug 2024
  • In this video, you will learn how to handle imbalanced datasets, specifically the case where the class labels for your classification model are imbalanced (one class is significantly larger than the other, which gives rise to a majority class and a minority class). Here, we will use the imbalanced-learn Python library to perform random undersampling and random oversampling to address this issue.
    Code: github.com/dat...
    #python #data #datascience #dataprofessor

Comments • 97

  • @DataProfessor
    @DataProfessor  3 years ago +2

    🌟Check out my second YouTube channel (Coding Professor) th-cam.com/channels/JzlfIoF8nmWqJIv_iWQVRw.html
    🌟 Download Kite for FREE www.kite.com/get-kite/?

    • @tahabihaouline2333
      @tahabihaouline2333 3 years ago

      nice video, i just want to know, how can i train this to get training data and testing data.
      an example will be really good

    • @dadan.dahman.w
      @dadan.dahman.w 2 years ago

      Hello prof, how do I handle an imbalanced dataset in multilabel text classification?

  • @alexioannides3305
    @alexioannides3305 3 years ago +25

    It would have been nice to demonstrate the impact these resampling methods have on the test metrics of some benchmark model (especially one that can use class weights in the loss function). In my experience, resampling can sometimes make a model perform worse and it can be better to use models with class-weighted loss functions.
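
As a rough illustration of the class-weighted alternative mentioned in this comment, the sketch below compares a plain and a class-weighted logistic regression from scikit-learn. The synthetic data, the choice of model, and the recall metric are assumptions made for the example; no benchmark result is implied.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with about 5% minority class (illustrative only).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Same model twice: once unweighted, once with weights inversely
# proportional to class frequency ("balanced").
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print("minority recall, unweighted:    ", recall_plain)
print("minority recall, class-weighted:", recall_weighted)
```

No rows are added or removed here; the imbalance is handled inside the loss function, so the training data stays exactly as collected.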

  • @caioglech
    @caioglech 3 years ago +16

    Great example. Perhaps you could make another video showing the oversampling on training data. Lots of people (myself included) start doing the oversampling on the whole dataset, which leads to data leakage... which is a mistake.

    • @naveenkumarmangal9653
      @naveenkumarmangal9653 2 years ago

      Thanks very much for this comment.

    • @xin2668
      @xin2668 1 year ago

      Really helpful comment, thank you

  • @michellpayano5051
    @michellpayano5051 3 years ago +3

    This is a clear and simple guide to get started, thanks for sharing! About your last question, I am curious what would be your answer, which approach do you prefer from your experience?

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Hi, I prefer undersampling

    • @michellpayano5051
      @michellpayano5051 3 years ago

      @DataProfessor Could you please tell some reasons why?

    • @DataProfessor
      @DataProfessor  3 years ago +3

      @michellpayano5051 I prefer to use actual data and thus undersampling. Oversampling introduces artificial data upon balancing.

    • @michellpayano5051
      @michellpayano5051 3 years ago +2

      @DataProfessor I understand, thank you!!

  • @TinaHuang1
    @TinaHuang1 3 years ago +1

    Ooo awesome tutorial! Love how clear it is

  • @thinamG
    @thinamG 3 years ago +2

    It's helpful for me and many more. Great tutorial, Chanin. Thank you so much for sharing with us.

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Happy to hear that! Thanks Thinam!

  • @rattaponinsawangwong5482
    @rattaponinsawangwong5482 3 years ago +1

    Oh, I seem to be the first guy here. As a rookie DS, I have to deal with imbalanced datasets, too. My question is: should we perform undersampling or oversampling within the cross-validation pipeline (say, K-fold CV), or should we do it before cross-validation?

    • @DataProfessor
      @DataProfessor  3 years ago +2

      Hi, you can apply this prior to CV.

  • @samuelbaba5406
    @samuelbaba5406 3 years ago +2

    Very great job, professor! Thank you so much for this clear video. By the way, do you think that after applying oversampling, for example, and after training a model (like XGBoost) on the data, it would be interesting to use the Matthews Correlation Coefficient as a KPI to measure the efficiency of the model? Or do you think it is not necessary? Thank you 🙏🏽

    • @DataProfessor
      @DataProfessor  3 years ago

      Yes, definitely. MCC is a great way to measure the performance of classification models; the plus side is that it is also more robust to imbalanced data than accuracy.
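
A small worked example of the point about MCC, using scikit-learn's metric functions. The labels are made up: a "model" that always predicts the majority class on a 95:5 dataset.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# 95 negatives, 5 positives; every prediction is the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

print(acc)  # 0.95, looks strong although no positive is ever found
print(mcc)  # 0.0, no better than chance
```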

  • @eduardodimperio
    @eduardodimperio 2 years ago +1

    Why do undersampling instead of slicing the dataset to take the same number of samples?

  • @sericthueksuban9151
    @sericthueksuban9151 3 years ago

    I've been following your channel since the collab with Ken Jee without realizing your name. Now you're inspiring me to pursue Data science even more! Thank you krub Ajarn Chanin! 🙏😂

  • @akbaraliotakhanov1221
    @akbaraliotakhanov1221 3 years ago +1

    I came here through the notification, thanks Professor. We will wait for new and interesting videos.

    • @DataProfessor
      @DataProfessor  3 years ago

      Awesome, glad to hear and thanks for supporting the channel!

  • @aashishmalhotra
    @aashishmalhotra 2 years ago +1

    Can you explain how logistic regression behaves with an imbalanced dataset?

  • @user-zu9xf1cn9d
    @user-zu9xf1cn9d 3 years ago +1

    Thanks for the lesson, professor! I'd like to ask one question if you don't mind. Should we always over/undersample to 1:1 ratio? I guess, in case the initial ratio of majority and minority classes is 99:1, it can cause some problems while modelling.

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Hi, the practice of addressing data balancing for a wide range of scenarios is a topic for research and experimentation. It might be worthwhile to check out published papers on the topic for various use cases. Please feel free to share what you find.

    • @user-zu9xf1cn9d
      @user-zu9xf1cn9d 3 years ago +1

      Thank you for your response! I will definitely research on this topic :D

  • @gunjankumar2267
    @gunjankumar2267 2 years ago

    Thanks for this quick guide to overcoming the imbalance issue. I would like to know: before applying these oversampling or undersampling techniques, do I need to standardize my dataset, or can I go with the original form of the dataset?

  • @minicorefacility
    @minicorefacility 3 years ago

    Thank you so so much. This is something that I am looking for. I struggled with this step in R-language for many months. I understand that by randomly sampling the overweight samples to mix with the underweight samples, just one time and further do model developing -- would create a poor model. Thus, my question is 1. How many times should I randomly sample 2. Does the distribution of both overweight and underweight samples affect times that we have to sample? Could you please share your thoughts?

  • @Ibraheem_ElAnsari
    @Ibraheem_ElAnsari 3 years ago +2

    Great tutorial, Prof! I could see how someone would use this in a test dataset, does it have other use cases? Thanks a lot!

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Hi, thanks for watching Ibraheem. Actually, we could use it in the training set in order to obtain a balanced model.

    • @kvdsagar
      @kvdsagar 3 years ago

      Professor can you share your contact details

  • @Ghasforing2
    @Ghasforing2 3 years ago +1

    Great tutorial as usual. Thanks for sharing, Professor!

  • @amaransi4900
    @amaransi4900 3 years ago +3

    Thanks a lot. I am looking forward to your explanation of protein-ligand interaction through AI.

    • @muhammaddanial4549
      @muhammaddanial4549 3 years ago

      Hey, @amar, I am also working on machine learning-based virtual screening and have almost completed the ML models for VS. If you have any publications on it, I need some help, thanks.

    • @amaransi4900
      @amaransi4900 3 years ago +1

      @muhammaddanial4549 Hi, I am just at the beginning.

  • @sherifarafa90
    @sherifarafa90 3 years ago +1

    I want to thank you for the Bioinformatics Project from Scratch. I managed to apply it to AChE and I am planning to apply it to other targets. Thanks so much, and waiting for other models 😁

    • @DataProfessor
      @DataProfessor  3 years ago

      Fantastic! Glad to hear that.

    • @sherifarafa90
      @sherifarafa90 3 years ago

      @DataProfessor Can you do a tutorial on how to implement neural networks for drug discovery?

    • @muhammaddanial4549
      @muhammaddanial4549 3 years ago +1

      @sherif Arafa can I get the link of these scratches?

    • @muhammaddanial4549
      @muhammaddanial4549 3 years ago

      I am also working on AChE and BChE

    • @DataProfessor
      @DataProfessor  3 years ago

      @muhammaddanial4549 Awesome, sure, the link is here: th-cam.com/play/PLtqF5YXg7GLlQJUv9XJ3RWdd5VYGwBHrP.html

  • @ifeanyiedward2789
    @ifeanyiedward2789 1 year ago +1

    Thanks a lot. Very precise and easy to understand.

  • @ahmedjamel421
    @ahmedjamel421 1 year ago

    Great tutorial, Sir. When you split the data into X and Y and perform the resampling method, how can you concatenate them with each other again later?

  • @hubbiemid6209
    @hubbiemid6209 3 years ago +2

    in my data science course, we used the stratification parameter from train_test_split() from sklearn, how do they differ?

    • @DataProfessor
      @DataProfessor  3 years ago +3

      That's a great question! Thanks for bringing it up. Stratification maintains the ratio of the classes such that the train/test splits have roughly the same class ratio (it does nothing about the class imbalance). On the other hand, data balancing will either bring up the minority class or bring down the majority class in order to make both the same size.
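
The distinction drawn in this reply can be checked with a short example: with `stratify=y`, `train_test_split` preserves the 4:1 ratio in both splits and leaves the imbalance in place. The toy arrays are illustrative.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# 800 rows of class 0 and 200 rows of class 1 (a 4:1 ratio).
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 800 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print(train_counts)  # 600 of class 0, 150 of class 1
print(test_counts)   # 200 of class 0, 50 of class 1
```

Both splits are still 4:1, so a balancing step (or class weights) would still be needed on the training split if the imbalance is a concern.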

  • @negusuworkugebrmichael3856
    @negusuworkugebrmichael3856 5 months ago +1

    Thank you Prof. Very helpful

  • @allanmarzuki5534
    @allanmarzuki5534 3 years ago

    What are the side effects of using synthetic data to handle the imbalance when building models? And if we have a lot of data, should we oversample or undersample? Thank you prof

  • @robinsonflores6482
    @robinsonflores6482 2 years ago +1

    Great video. Thanks for sharing!!

    • @DataProfessor
      @DataProfessor  2 years ago

      It’s my pleasure, thank you 😊

  • @aashishmalhotra
    @aashishmalhotra 2 years ago

    Awesome, you explained every line of code. Very helpful for a novice trying to understand the ipynb.

  • @sebastiancastro4126
    @sebastiancastro4126 3 years ago +1

    I think that in this case oversampling would be the right approach due to the low number of compounds. Is this correct?

    • @DataProfessor
      @DataProfessor  3 years ago

      Both are valid approaches, it is subjective, depending on the practitioner. Personally, I like to use undersampling.

    • @nikhilwagle8466
      @nikhilwagle8466 2 years ago

      @DataProfessor Undersampling should only be done when the data is in the millions or thousands, or else the accuracy will be reduced.

  • @budisantosa9892
    @budisantosa9892 3 months ago +1

    do we not need to split the data into test and train before balancing?

  • @sanam6866
    @sanam6866 1 year ago

    Should we calculate the molecular descriptors and then balance the data?

  • @aryasarkar1692
    @aryasarkar1692 3 years ago +1

    Hi! I have a question: should we prefer undersampling or oversampling?

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Hi, both are valid approaches and it depends on the practitioner. Personally, I like to use undersampling.

  • @mukeshkund4465
    @mukeshkund4465 3 years ago

    I think there are some scenarios where we can use this technique differently. Can you tell us the different scenarios where we can perform oversampling, undersampling or random sampling?

  • @caiyu538
    @caiyu538 1 year ago

    I have a question. I have a lot of negative samples which means that the data are unlabeled. Their number is much bigger than labeled data. I must include them. In this situation, how to handle this kind of imbalance?

  • @anandodayil6081
    @anandodayil6081 1 year ago

    How to know if we should use oversampling or undersampling?

  • @cozyfootball
    @cozyfootball 1 year ago +1

    Helpful, thx

  • @muhammaddanial4549
    @muhammaddanial4549 3 years ago +1

    Hello sir, I calculated the ligand descriptors (10k), used recursive feature elimination, then SVM and KNN models, but my accuracy is low (0.82 and 0.83). How can I improve the accuracy? By low I mean that a paper published on the same enzyme reports an accuracy of 0.88. I tried correlation and dropped the negative columns, but it's not working. I need your help, please.

    • @DataProfessor
      @DataProfessor  3 years ago

      Hi, there's no sure path for achieving high model performance. Several factors come into play (descriptor type, feature selection approach, learning algorithm, parameter optimization, data splitting, etc.), which is a part of research. I would recommend experimenting with the different factors mentioned above. Hope this helps.

  • @donrachelteo9451
    @donrachelteo9451 3 years ago +1

    Thanks Data Professor; may I also know if this method is applicable to imbalanced datasets in text classification models? Thanks

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Yes, this is applicable to imbalanced classes in classification models.

    • @donrachelteo9451
      @donrachelteo9451 3 years ago +1

      @DataProfessor thanks for your reply professor 👍🏻

  • @hasankuluk685
    @hasankuluk685 5 months ago

    There is a flaw: we should apply the methods on the training set, not on all the data.

  • @user-cj4pf5qj4h
    @user-cj4pf5qj4h 3 years ago

    At which step should we fix the imbalance: before splitting the data, or after splitting, on the training set only?

  • @joeyng7366
    @joeyng7366 2 years ago

    Hi professor, I am trying to do binary classification on advertising conversions using Markov Chain but I'm not sure how should I implement it. Do you have any suggestions on this?

  • @kaustavdas6550
    @kaustavdas6550 2 years ago

    What do we do if there are more than 2 classes which are imbalanced?

  • @kl8801
    @kl8801 3 years ago +1

    Thanks for the video but where is the notebook?

    • @DataProfessor
      @DataProfessor  3 years ago

      Thanks for the reminder, the link is now in the video description.

  • @debatradas1597
    @debatradas1597 2 years ago +1

    Thank you so much

  • @farahilyana9964
    @farahilyana9964 2 years ago

    Prof, thank you for the nice video. I want to ask, how do I show the balanced data after doing SMOTE?

  • @tahabihaouline2333
    @tahabihaouline2333 3 years ago +1

    nice video, i just want to know, how can i train this to get training data and testing data

    • @DataProfessor
      @DataProfessor  3 years ago +1

      Hi, once the data is balanced, you can take the balanced data to perform data splitting to train and test data using the train_test_split function.

  • @juanmiranda4054
    @juanmiranda4054 1 year ago

    I love you, strange sir, my model took off

  • @rafael_l0321
    @rafael_l0321 3 years ago

    Thank you for the explanation! What is your opinion on creating decoys, that is, artificial data derived from the least represented class, for balancing? Do you know if this functionality is available in some library?

  • @karmanyakumar8295
    @karmanyakumar8295 2 years ago

    Data is missing. Link is not working for input

  • @KhadejaAl-nashad
    @KhadejaAl-nashad 1 year ago

    How can I achieve a balance between more than two categories? For example, diabetic retinopathy classification has 5 categories, not two.

  • @gamingdudes...7575
    @gamingdudes...7575 1 year ago +1

    hi, how should i save this in the form of csv file

    • @DataProfessor
      @DataProfessor  1 year ago +1

      You can use the to_csv function from pandas.

    • @gamingdudes...7575
      @gamingdudes...7575 1 year ago

      @DataProfessor When I handle my dataset using undersampling, my accuracy decreases by 20 percent. What should I do?
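
The `to_csv` suggestion earlier in this thread could look like the sketch below. The DataFrame stands in for an already balanced dataset, and the file name and temporary directory are arbitrary choices for the example.

```python
import os
import tempfile

import pandas as pd

# Stand-in for a balanced dataset (hypothetical column names).
balanced = pd.DataFrame(
    {
        "feature_a": [0.1, 0.4, 0.65, 0.2],
        "label": ["inactive", "inactive", "active", "active"],
    }
)

# index=False leaves the row index out of the file.
out_path = os.path.join(tempfile.gettempdir(), "balanced_dataset.csv")
balanced.to_csv(out_path, index=False)

# Reading the file back confirms what was written.
reloaded = pd.read_csv(out_path)
print(reloaded)
```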

  • @jairovillamizar6588
    @jairovillamizar6588 2 years ago +1

    You are a great professor!! Thanks a lot

  • @datasciencezj3303
    @datasciencezj3303 3 years ago +1

    It's not been talked about: why is imbalance an issue?

    • @DataProfessor
      @DataProfessor  3 years ago

      Yes, you're right. Here goes. Imagine we have a dataset consisting of 1000 samples: 800 belong to class A and 200 belong to class B. As class A has 4 times as many samples as class B, there is a high possibility that the model may be biased towards class A. To avoid such a scenario, we can either perform undersampling, where the 800 are reduced to 200, or oversampling, where the 200 are resampled to 800 samples. In both cases, the samples are balanced for both classes.

    • @datasciencezj3303
      @datasciencezj3303 3 years ago

      @DataProfessor Will it be "biased"? Only if you use accuracy as the measure.

    • @datasciencezj3303
      @datasciencezj3303 3 years ago

      use AUC to measure
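
The exchange in this thread about accuracy versus AUC can be illustrated with the same 800/200 split. The sketch uses scikit-learn's `DummyClassifier` as a stand-in for a model that always predicts the majority class; class A is encoded as 0 and class B as 1.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# 800 samples of class A (0) and 200 of class B (1); the features are
# irrelevant here because the baseline ignores them.
X = np.zeros((1000, 1))
y = np.array([0] * 800 + [1] * 200)

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

acc = accuracy_score(y, baseline.predict(X))
# Column 1 of predict_proba is the predicted probability of class B.
auc = roc_auc_score(y, baseline.predict_proba(X)[:, 1])

print(acc)  # 0.8, despite never predicting class B
print(auc)  # 0.5, equivalent to random ranking
```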

  • @samdaniazad3043
    @samdaniazad3043 2 years ago

    Over sampling
