5 ways to work with imbalanced data | Imbalanced dataset machine learning | Imbalanced data

  • Published on 13 Apr 2022
    #ImbalancedDataClassification #UnfoldDataScience
    Hello,
    My name is Aman and I am a Data Scientist.
    About this video:
    In this video, I explain how to work with imbalanced data in a machine learning classification use case. I cover multiple ways to take care of imbalanced data and train a better machine learning model.
    The following topics are explained in this video:
    1. 5 ways to work with imbalanced data
    2. Imbalanced dataset machine learning
    3. Imbalanced data in classification
    4. Undersample and oversample
    5. Undersample majority class
    6. SMOTE meaning
    7. SMOTE in Python
    imblearn page link - imbalanced-learn.org/stable/r...
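To make the SMOTE topics above concrete, here is a minimal hand-written sketch of the interpolation idea behind SMOTE: each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not the imbalanced-learn implementation; the function name `smote_sketch` and the toy data are made up for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Illustrative only: a simplified version of the SMOTE idea,
    not the imbalanced-learn code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy minority class with two features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points), "synthetic minority points created")
```

In practice, the imbalanced-learn library linked above provides a `SMOTE` class with a `fit_resample` method, which is usually a better choice than a hand-rolled version like this one.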
    About Unfold Data Science: This channel helps people understand the basics of data science through simple examples, in an easy way. Anybody, even without prior knowledge of computer programming, statistics, machine learning, or artificial intelligence, can get a high-level understanding of data science through this channel. The videos are not very technical in nature, so viewers from different backgrounds can grasp them easily.
    If you need Data Science training from scratch, please fill in this form (Please note: training is chargeable)
    docs.google.com/forms/d/1Acua...
    Book recommendation for Data Science:
    Category 1 - Must Read For Every Data Scientist:
    The Elements of Statistical Learning by Trevor Hastie - amzn.to/37wMo9H
    Python Data Science Handbook - amzn.to/31UCScm
    Business Statistics By Ken Black - amzn.to/2LObAA5
    Hands-On Machine Learning with Scikit Learn, Keras, and TensorFlow by Aurelien Geron - amzn.to/3gV8sO9
    Category 2 - Overall Data Science:
    The Art of Data Science By Roger D. Peng - amzn.to/2KD75aD
    Predictive Analytics By Eric Siegel - amzn.to/3nsQftV
    Data Science for Business By Foster Provost - amzn.to/3ajN8QZ
    Category 3 - Statistics and Mathematics:
    Naked Statistics By Charles Wheelan - amzn.to/3gXLdmp
    Practical Statistics for Data Scientists By Peter Bruce - amzn.to/37wL9Y5
    Category 4 - Machine Learning:
    Introduction to Machine Learning with Python by Andreas C. Müller - amzn.to/3oZ3X7T
    The Hundred Page Machine Learning Book by Andriy Burkov - amzn.to/3pdqCxJ
    Category 5 - Programming:
    The Pragmatic Programmer by David Thomas - amzn.to/2WqWXVj
    Clean Code by Robert C. Martin - amzn.to/3oYOdlt
    My Studio Setup:
    My Camera : amzn.to/3mwXI9I
    My Mic : amzn.to/34phfD0
    My Tripod : amzn.to/3r4HeJA
    My Ring Light : amzn.to/3gZz00F
    Join Facebook group :
    groups/41022...
    Follow on medium : / amanrai77
    Follow on quora: www.quora.com/profile/Aman-Ku...
    Follow on twitter : @unfoldds
    Get connected on LinkedIn : / aman-kumar-b4881440
    Follow on Instagram : unfolddatascience
    Watch Introduction to Data Science full playlist here : • Data Science In 15 Min...
    Watch python for data science playlist here:
    • Python Basics For Data...
    Watch statistics and mathematics playlist here :
    • Measures of Central Te...
    Watch End to End Implementation of a simple machine learning model in Python here:
    • How Does Machine Learn...
    Learn Ensemble Model, Bagging and Boosting here:
    • Introduction to Ensemb...
    Build Career in Data Science Playlist:
    • Channel updates - Unfo...
    Artificial Neural Network and Deep Learning Playlist:
    • Intuition behind neura...
    Natural Language Processing playlist:
    • Natural Language Proce...
    Understanding and building recommendation system:
    • Recommendation System ...
    Access all my codes here:
    drive.google.com/drive/folder...
    Have a different question for me? Ask me here : docs.google.com/forms/d/1ccgl...
    My Music: www.bensound.com/royalty-free...

Comments • 44

  • @enchanted_swiftie
    @enchanted_swiftie 2 years ago +8

    I ran into the same class-imbalance problem and researched the different ways to handle it. Here is the shortlist I put together, in case it helps.
    Possible solutions:
    1. Make changes at the algorithm level
    • Adjust the class weights so the model becomes more sensitive to the minority class
    • Adjust the decision threshold (the precision-recall curve helps in choosing it)
    • Penalize minority-class errors more heavily, for example by setting class_weight='balanced'
    2. Drop the minority labels and model the data as a single class
    • Here the minority examples are treated as anomalies, so the task becomes "anomaly detection" instead of classification
    For anomaly detection, Isolation Forest tends to give promising results
    3. Balance the dataset by sampling
    • Undersample
    • Oversample & SMOTE
    4. Ensemble learning with downsampling
    • Draw several bootstrap samples, balance the classes in each one by undersampling
    the majority class, and aggregate the models' predictions by voting
    5. Use other techniques
    • Tomek links (pairs of nearest neighbours from opposite classes; removing the majority-class member of each pair sharpens the class boundary)
    • Focal loss
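As an illustration of the Tomek links item in the list above, the sketch below finds pairs of points that are each other's nearest neighbour but carry different labels, and drops the majority-class member of each pair. It is a brute-force version written for clarity; the function names and toy data are hypothetical, and a library implementation would normally be preferable on real data.

```python
import math


def nearest_index(points, i):
    """Index of the point closest to points[i] (Euclidean distance)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def remove_tomek_majority(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link."""
    nearest = [nearest_index(points, i) for i in range(len(points))]
    to_drop = set()
    for i, j in enumerate(nearest):
        is_mutual = nearest[j] == i
        if is_mutual and labels[i] != labels[j]:
            # i and j form a Tomek link; remove only majority-class points.
            for idx in (i, j):
                if labels[idx] == majority_label:
                    to_drop.add(idx)
    keep = [i for i in range(len(points)) if i not in to_drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: label 0 is the majority class, label 1 the minority.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 1, 0]
clean_points, clean_labels = remove_tomek_majority(points, labels, majority_label=0)
print(clean_points)
print(clean_labels)
```

In this toy data, the majority point (1.0, 1.0) and the minority point (1.05, 1.0) are mutual nearest neighbours, so only (1.0, 1.0) is removed.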
    I also looked through Kaggle notebooks, where people have found that XGBoost slightly outperforms other algorithms, even though it may still need class weights to be set.
    That is my cheat sheet of the 5 ways. Share your thoughts!!
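For the class_weight='balanced' point in the cheat sheet, the weights can be reproduced by hand. The sketch below uses the heuristic that scikit-learn documents for 'balanced', namely n_samples / (n_classes * class_count); the helper name `balanced_class_weights` and the 90/10 toy labels are assumptions made for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in counts.items()
    }


# Hypothetical toy labels: 90 majority samples and 10 minority samples.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each class contributes the same total weight to the loss (90 × 0.556 ≈ 10 × 5.0 = 50), which is why the model becomes more sensitive to the minority class.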

    • @UnfoldDataScience
      @UnfoldDataScience  2 years ago +3

      Very good explanation, and thanks for putting your learning here. I will pin this comment on top for others' benefit.
      My view: Data Science is all about trying, experimenting, failing, and learning. Then something very good comes up.

    • @enchanted_swiftie
      @enchanted_swiftie 2 years ago +3

      @@UnfoldDataScience Won't lie, when I started watching your videos, your explanations made things much simpler. You know, I used to freak out (sorry for the words) on hearing about DBSCAN, Hierarchical Clustering and what not, but when I see those topics explained by you I feel comfortable that I will now understand them. You explain so simply yet accurately, without missing the important things.
      PS: I was introduced to the assumptions of linear regression by your channel. Before that I knew the model, and only then came to know that there is something called "assumptions" and how important they are!! The online courses I took missed them completely! Your channel is a huge contribution to the data science community on YT.

  • @ayushparihar5989
    @ayushparihar5989 a year ago

    Good explanation

  • @KastijitBabar
    @KastijitBabar months ago

    You are the best Data Science And Machine Learning Teacher I have ever seen. Thanks a lot!!

  • @dd3371
    @dd3371 2 years ago

    Thanks very much for sharing and explaining. What are your thoughts on logistic regression? Would imbalanced data still be a problem if you build the model as a GLM using logistic regression?

  • @karthebans248
    @karthebans248 2 years ago

    Learned new things about the balancing of data sets for Imbalanced data sets. Thanks.

  • @zahedinima732
    @zahedinima732 2 years ago

    Such a clear and concise explanation. Thank you, Aman!

  • @nivednambiar6845
    @nivednambiar6845 2 years ago

    An important concept when dealing with classification
    Thanks for sharing Aman 👍👍

  • @atod2572
    @atod2572 a year ago

    Awesome explanation. Can you please tell us when to use which technique? I mean with an example dataset and the choice of sampling technique for it.

  • @bijaynayak6473
    @bijaynayak6473 a year ago

    Very Nice explanation kudos

  • @younesgasmi8518
    @younesgasmi8518 6 months ago

    Can I use oversampling or undersampling before splitting the dataset into training and testing sets?

  • @avikdinda7827
    @avikdinda7827 10 days ago

    Does oversampling the full dataset cause data leakage? And when I apply SMOTE to the training data after the train-test split, precision for the minority class is poor even though recall is OK. What can I do to improve the precision of the minority class?
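A common precaution related to the two questions above is to split first and resample only the training portion, so that copies of test rows never end up in the training data and the test set keeps its real class ratio. The sketch below shows that order of operations with plain random oversampling; the helper names and toy data are hypothetical, and the split is a simple shuffle rather than a stratified one.

```python
import random


def train_test_split_indices(n, test_fraction, rng):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def oversample_minority(rows, labels, rng):
    """Duplicate random rows of smaller classes until all classes are equal in size."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


rng = random.Random(42)
# Hypothetical toy data: 100 unique rows, 10% of them in the minority class.
rows = [(float(i),) for i in range(100)]
labels = [1 if i % 10 == 0 else 0 for i in range(100)]

train_idx, test_idx = train_test_split_indices(len(rows), 0.2, rng)
train_rows = [rows[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]
test_rows = [rows[i] for i in test_idx]
test_labels = [labels[i] for i in test_idx]

# Resample the training portion only; the test set is left untouched.
bal_rows, bal_labels = oversample_minority(train_rows, train_labels, rng)
print(len(bal_rows), "balanced training rows;", len(test_rows), "untouched test rows")
```

Oversampling before the split would let duplicates of the same row land on both sides, which tends to inflate test scores.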

  • @NeeRaja_Sweet_Home
    @NeeRaja_Sweet_Home a year ago

    Hi Aman,
    In most videos we see imbalanced datasets for classification problems, but how do we check for and handle an imbalanced dataset in a regression problem?
    Thanks,

  • @dilshadmuhammed8224
    @dilshadmuhammed8224 7 months ago

    In my case I have more than 2 classes, and those classes are text labels, e.g. well being, business analytics, etc.
    How do I balance such classes?

  • @sadhnarai8757
    @sadhnarai8757 2 years ago

    Very nice Aman

  • @maasahebbiustad8514
    @maasahebbiustad8514 a year ago

    Hello sir, how do I solve a classification problem in which the training data has only one class? I get the error 'This solver needs samples of at least 2 classes in the data, but the data contains only one class: 1'. Please help me out.

  • @mamataparab9803
    @mamataparab9803 2 years ago

    Hello Aman, this is the third time I have watched this video, simply to learn your way of explaining things. Is it possible for you to create a video or give us some notes so we can find all the important questions for ensembling techniques?

    • @UnfoldDataScience
      @UnfoldDataScience  2 years ago

      Thanks Mamata, I do keep sharing on Instagram, please follow "unfolddatascience" On Instagram.

    • @mamataparab9803
      @mamataparab9803 2 years ago

      Sure, Aman. Thank you

  • @snehalvaidya5843
    @snehalvaidya5843 2 years ago

    Thanks for sharing knowledge 🙂, please share how to explain PCA to an interviewer.

  • @dhanushraj3697
    @dhanushraj3697 a year ago

    The video was good, but I request that you add some extra information and explanation for each method.

  • @riva.4484
    @riva.4484 a year ago

    Thank you so much! This video helped me a lot.
    I have a question: how can we decide which way is the best fit for our imbalanced dataset?

  • @tharindumadusanka3038
    @tharindumadusanka3038 2 years ago

    I am doing MBA using the Apriori algorithm in Google Colab. The problem is that when I use more than 20 rows of CSV transaction data, it displays an error. If the number of rows is less than 20, the expected result comes.

    • @UnfoldDataScience
      @UnfoldDataScience  2 years ago

      That's not a number-of-rows problem; there is probably some hidden issue with row number 21. I am just guessing.

  • @nagarajsundar7931
    @nagarajsundar7931 2 years ago

    Hi Aman, thanks for explaining the various methods. One question: when to use which method?

    • @UnfoldDataScience
      @UnfoldDataScience  2 years ago

      Thanks Naga, there can't be a one-to-one go-to rule. There are some pointers, which I can cover in a different video. Thanks for asking.

  • @mihretdesta9153
    @mihretdesta9153 a year ago

    Hey sir, how about imbalanced image data for deep learning?

  • @chalmerilexus2072
    @chalmerilexus2072 2 years ago +1

    Which method is preferable?

  • @ratnajyotibhowmick9801
    @ratnajyotibhowmick9801 2 years ago

    Please share the source of the notebook. Thanks.

    • @UnfoldDataScience
      @UnfoldDataScience  2 years ago +1

      drive.google.com/drive/u/0/folders/13pZrCIqk1XN6W4I95A07bK8YRHBB3btt

  • @hasantalib6254
    @hasantalib6254 10 months ago

    Hello
    I'm interested to know from you how to deal with unbalanced panel data. How can I transform the data when a year is missing?

  • @PalaSheshu111
    @PalaSheshu111 a year ago

    GitHub link?