How to deal with Imbalanced Datasets in PyTorch - Weighted Random Sampler Tutorial

  • Published on 21 Aug 2024
  • In this video we look at how to handle the very common problem of an imbalanced (skewed) dataset. Specifically, we cover two methods, oversampling and class weighting, and how to do both in PyTorch.
    Toy dataset used in video:
    www.kaggle.com...
    ✨ Free Resources that are great:
    NLP: web.stanford.e...
    CV: cs231n.stanford...
    Deployment: fullstackdeepl...
    FastAI: www.fast.ai/
    GitHub Repository:
    github.com/ala...

Comments • 66

  • @AladdinPersson 3 years ago +14

    A tip I didn't mention in the video: when you iterate through the dataset to create the sample weights, iterate through dataset.imgs rather than dataset itself. This runs much faster because it skips resizing, transformations and so on, none of which we need when we are only interested in the labels of the examples.

    • @amnesie148 3 years ago

      Hi Aladdin, first of all thank you for the super good video, but I couldn't figure out which part should "iterate through dataset.imgs". Can you point it out?
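
The tip above can be sketched roughly as follows. This is a minimal sketch, not the code from the video: `imgs` is a hypothetical stand-in for `dataset.imgs` from torchvision's `ImageFolder`, which is a list of `(path, class_index)` tuples, so the labels can be read without opening any image. The helper name `make_sample_weights` and the file paths are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for ImageFolder's `dataset.imgs`:
# a list of (file_path, class_index) tuples. No image is opened here.
imgs = [("a/1.jpg", 0), ("a/2.jpg", 0), ("a/3.jpg", 0), ("a/4.jpg", 0), ("b/1.jpg", 1)]


def make_sample_weights(imgs):
    """Return one weight per example: 1 / (number of examples in its class)."""
    class_counts = Counter(label for _, label in imgs)
    return [1.0 / class_counts[label] for _, label in imgs]


sample_weights = make_sample_weights(imgs)
print(sample_weights)  # [0.25, 0.25, 0.25, 0.25, 1.0]
```

A list like this can be passed as the `weights` argument of `torch.utils.data.WeightedRandomSampler`.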

  • @sahasamanecheppali547 3 years ago +14

    You sir, deserve way more subscribers for the consistency and the diversity of the topics you choose. Keep up the good work.

  • @wolfisraging 3 years ago +5

    Awesome video bro, this has been really helpful. I'd like to share one trick of mine: apply more random and stronger data augmentation to the classes with few examples and less to the classes that already have plenty, then make sure each batch the model receives contains an equal number of examples for each unique label. The only downside is that you have to write your own custom dataloader that does that 🥱, and to be honest, it's not easy 😂, but once you set it up it's just a matter of copying and pasting for the next projects :)
    Thanks again for the video.

    • @AladdinPersson 3 years ago +1

      Hey Wolf! :) That sounds interesting, I'd like to check it out. Could you share an example of such a custom dataloader?
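
No example was posted in this thread, so the following is only a rough sketch of the "equal number of examples per label in each batch" part of the idea, under stated assumptions. The class name `BalancedBatchSampler` and its parameters are hypothetical, classes with few examples are drawn with replacement, and the per-class augmentation strength is not covered.

```python
import random
from collections import defaultdict


class BalancedBatchSampler:
    """Yield batches of indices with the same number of examples per class.

    Hypothetical helper: classes with few examples are drawn with
    replacement, so their examples repeat within an epoch.
    """

    def __init__(self, labels, per_class, num_batches, seed=0):
        self.per_class = per_class
        self.num_batches = num_batches
        self.rng = random.Random(seed)
        # Group example indices by their class label.
        self.by_class = defaultdict(list)
        for index, label in enumerate(labels):
            self.by_class[label].append(index)

    def __len__(self):
        return self.num_batches

    def __iter__(self):
        for _ in range(self.num_batches):
            batch = []
            for indices in self.by_class.values():
                # Draw `per_class` indices from this class, with replacement.
                batch.extend(self.rng.choices(indices, k=self.per_class))
            self.rng.shuffle(batch)
            yield batch


labels = [0] * 8 + [1] * 2  # toy, imbalanced labels
batches = list(BalancedBatchSampler(labels, per_class=2, num_batches=3))
print(batches)
```

An object like this can be handed to `torch.utils.data.DataLoader` through its `batch_sampler` argument, in which case `batch_size`, `shuffle`, and `sampler` are left at their defaults.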

  • @imveryhungry112 1 year ago +1

    The PyTorch WeightedRandomSampler is an amazing feature. Thanks for talking about it here.

  • @jacoblee6246 3 years ago +3

    Great video! A small bug: os.walk and datasets.ImageFolder traverse the class folders in different orders, so in the GitHub code we cannot guarantee that the class with fewer samples gets the larger sampling weight.

  • @mochametmachmout4467 3 years ago

    To tackle the imbalance one can also use the Focal Loss function. The kornia library has it.
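
For readers who have not seen focal loss, here is a minimal sketch of the multi-class form written directly in PyTorch. It is not kornia's implementation; the function name `focal_loss` and the default `gamma=2.0` are assumptions made for illustration, and the optional per-class `alpha` weighting is left out.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, targets, gamma=2.0):
    """Cross entropy scaled by (1 - p_t) ** gamma, averaged over the batch.

    p_t is the predicted probability of the true class, so examples the
    model already classifies confidently contribute less to the loss.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # equals -log(p_t)
    p_t = torch.exp(-ce)
    return ((1.0 - p_t) ** gamma * ce).mean()


torch.manual_seed(0)
logits = torch.randn(8, 3)            # toy batch: 8 examples, 3 classes
targets = torch.randint(0, 3, (8,))
loss = focal_loss(logits, targets)
```

With `gamma=0` the scaling factor is 1 and the function reduces to ordinary cross entropy, which is a quick sanity check.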

  • @kirankharel929 3 years ago

    Thanks Aladdin for all your videos, they are really awesome and informative.

  • @sebastianamaruescalantecco7916 2 years ago +1

    There is a bug in the program that can cause you a lot of trouble. After line 22, add this line of code: subdir.sort()
    os.walk() is not guaranteed to traverse your folders in alphabetical order, so you have to sort the subdirectory list in place before appending the calculated weights. The code should look something like this:
    for root, subdir, files in os.walk(root_dir):
        subdir.sort()
        if len(files) > 0:
            ....

    • @bryanyan8346 2 years ago

      I ran into the same problem, thanks for pointing it out!
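
One way to sidestep the ordering issue discussed above is to build the class weights from the sorted class-folder names, since `ImageFolder` assigns class indices in sorted order of the folder names. The sketch below is an illustration under stated assumptions, not the repository's code: `class_weights_from_folders` and the folder names are hypothetical, each class folder is assumed to be flat, and every class is assumed to contain at least one file.

```python
import os
import tempfile


def class_weights_from_folders(root_dir):
    """Return (class_names, weights) with weights[i] = 1 / #files in class i.

    Class names are sorted, which is how ImageFolder assigns class indices.
    Assumes every class folder is flat and contains at least one file.
    """
    class_names = sorted(
        entry.name for entry in os.scandir(root_dir) if entry.is_dir()
    )
    weights = []
    for name in class_names:
        num_files = len(os.listdir(os.path.join(root_dir, name)))
        weights.append(1.0 / num_files)
    return class_names, weights


# Toy folder layout (hypothetical names): 4 files in one class, 1 in the other.
with tempfile.TemporaryDirectory() as root_dir:
    for name, count in [("b_rare", 1), ("a_common", 4)]:
        os.makedirs(os.path.join(root_dir, name))
        for i in range(count):
            open(os.path.join(root_dir, name, f"{i}.jpg"), "w").close()
    class_names, class_weights = class_weights_from_folders(root_dir)

print(class_names, class_weights)
```

Because the weights are indexed the same way as the labels, `class_weights[label]` always refers to the class that `label` denotes, regardless of the order in which the file system lists the folders.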

  • @0730pleomax 3 years ago +1

    Thanks Aladdin! Would you mind recommending your learning resources? Most of what you teach is rarely covered in ML/DL books.

    • @AladdinPersson 3 years ago +2

      I'm just sharing solutions to problems I face. It's Googling, reading docs and GitHub; no good learning resource there, unfortunately.

  • @yashrunwal5111 3 years ago +1

    Hi, how can we use the WeightedRandomSampler for an object detection task?

  • @yanfeng5519 1 year ago

    Great lecture.

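  • @clariolee 1 year ago

    The video is much more comfortable to watch now that your face is on screen!! When I don't know where to look, I just look at the face, hahaha.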

  • @frankrobert9199 2 years ago

    Great lectures.

  • @sahil-7473 3 years ago

    Superb!!
    I implemented AugMix data augmentation myself to increase the number of minority-label samples. My question is: should we stick with one state-of-the-art data augmentation technique, or should we try all of the other techniques as well?
    Thanks

  • @mustafabuyuk6425 3 years ago

    RandAugment is one of the best augmentation methods; it will improve your model performance. I was using weight normalization like this:
    import torch
    import torch.nn as nn
    device = "cuda" if torch.cuda.is_available() else "cpu"
    nSamples = [346, 168, 106]  # number of samples per class
    normedWeights = [1 - (x / sum(nSamples)) for x in nSamples]  # rarer class -> larger weight
    normedWeights = torch.FloatTensor(normedWeights).to(device)
    print(normedWeights)
    criterion = nn.CrossEntropyLoss(weight=normedWeights)
    Is your second method different from this?

    • @AladdinPersson 3 years ago

      I prefer oversampling (and you can still use RandAugment) rather than class weighting, which is what you seem to be doing in the example.
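
For a sense of scale, the class weights from the comment above can be worked out by hand. The sketch below uses only the three class counts given there; the inverse-frequency weights are added purely for comparison and are not something either commenter proposed.

```python
n_samples = [346, 168, 106]  # class counts from the comment above
total = sum(n_samples)

# Scheme from the comment: 1 - class frequency.
one_minus_freq = [1 - n / total for n in n_samples]

# Alternative for comparison: inverse class frequency.
inverse_freq = [total / n for n in n_samples]

# How much more the rarest class counts than the most common one.
ratio_one_minus = one_minus_freq[2] / one_minus_freq[0]
ratio_inverse = inverse_freq[2] / inverse_freq[0]

print([round(w, 3) for w in one_minus_freq])  # [0.442, 0.729, 0.829]
print(round(ratio_one_minus, 2), round(ratio_inverse, 2))  # 1.88 3.26
```

With the "1 - frequency" scheme the rarest class is weighted about 1.9 times as much as the most common one, even though it is about 3.3 times rarer, so this scheme only partly compensates for the imbalance.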

  • @rosacanina674 2 months ago

    Thank you so much for your great content! I was wondering if the loader in the video is the train_loader. Do you also apply oversampling to the dev_train and test dataloaders?

  • @nishantyadav6341 3 years ago

    Aladdin, you deserve more subscribers. And you need to charge more :) Just joined as a member.

  • @kirtipandya4618 3 years ago

    How can you use PyTorch and TensorFlow so well? You must have invested so many hours of practice. How do you do that? Which API do you like more for ML? In Keras you handle images one way and in PyTorch another. It's amazing to see that you have mastery of both. Few people can do that; usually people like to work with only one API.

    • @AladdinPersson 3 years ago

      I like PyTorch more. I think you overestimate me for sure, I'm way worse than what you think and need to google everything. You just see the refined version in the video and not the horrible mistakes I make :)

    • @jimmychen4796 3 years ago +1

      @AladdinPersson You are too humble bro! A nice lesson learned :), good job, and keep going!

    • @shaharweksler1203 3 years ago

      @AladdinPersson Maybe you can make a video about how you get stuck, and the train of thought and Google queries you use to solve it.

  • @thantyarzarhein5459 3 years ago

    Awesome video, and this channel is so underrated in the DL community. I would like to know if there will be paper implementation tutorials in the future?

    • @AladdinPersson 3 years ago

      Yeah for sure. Any paper in particular you'd wanna see?

    • @aishwaryaagarwal3540 3 years ago

      @AladdinPersson arxiv.org/pdf/1905.05908.pdf This one, if possible. Could you make some tutorials on compositional zero-shot learning and implement some papers in that field?

  • @TheAcujlGamer 3 years ago

    Great video!

  • @rafaelmahammadli667 2 years ago

    Hello, how can we apply undersampling in your case? Can oversampling and undersampling be applied together?
    Thank you

  • @ZulkaifAhmed1 2 years ago

    Dude, you are awesome. I like your PyTorch tutorials. But I would love it if you could use Google Colab for the next ones.

  • @kelixoderamirez 1 year ago

    Permission to learn, sir. Thank you.

  • @hassanrevel 1 year ago

    Thanks man

  • @robinranabhat3125 1 year ago

    Recap at 9:00

  • @jijie133 3 years ago

    Great video!

  • @user-co6pu8zv3v 3 years ago

    Thanks!

  • @Zulle863 1 year ago

    Do we have to call get data loader once for the train set and once for the test set, or just a single time for the entire dataset?

  • @fasolya99 3 years ago

    Why did you multiply the sample_weights (which is zero) by len(dataset)?
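
A note on the question above, assuming the line in question has the form `sample_weights = [0] * len(dataset)` (the exact line from the video is not reproduced here): in Python, multiplying a list by an integer repeats the list, so this is not arithmetic on a zero. It builds a list of placeholder zeros with one slot per example, which a later loop can overwrite. A small sketch with made-up toy labels:

```python
labels = [0, 0, 0, 1]           # toy labels standing in for the dataset
class_weights = [1 / 3, 1 / 1]  # 1 / class count, for classes 0 and 1

# List repetition, not arithmetic: one placeholder zero per example.
sample_weights = [0] * len(labels)

# Overwrite each placeholder with the weight of that example's class.
for idx, label in enumerate(labels):
    sample_weights[idx] = class_weights[label]

print(sample_weights)
```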

  • @Tripdin 3 years ago

    Thank you

  • @erdi749 2 years ago

    Another amazing tutorial! I wonder if you have a Patreon page where we can support you; it is well deserved. I have a question as well. I have a large imbalanced dataset. I need to call the getloader function to get train and test loaders for hyperparameter tuning. Scanning the entire train set in each function call makes the code slower. Can you suggest a workaround? Thank you!

  • @mattiagatti1200 2 years ago

    Hello, I still don't understand how replacement=True differs from replacement=False. Could you please explain it? Thank you :)
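
The difference asked about above can be seen on a toy example. This is a minimal sketch with made-up weights: nine majority examples (indices 0 to 8) and one minority example (index 9), each weighted by 1 / class count. The fixed seed is only there to make the run repeatable.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Toy setup: indices 0-8 are the majority class, index 9 is the minority.
weights = [1 / 9] * 9 + [1.0]
generator = torch.Generator().manual_seed(0)

# Without replacement, each index can be drawn at most once, so drawing
# len(weights) samples just returns every index once in a random order.
no_repl = list(
    WeightedRandomSampler(weights, num_samples=10, replacement=False, generator=generator)
)

# With replacement, the same index can be drawn many times, so the
# minority example is drawn about as often as the whole majority class.
with_repl = list(
    WeightedRandomSampler(weights, num_samples=1000, replacement=True, generator=generator)
)
minority_share = with_repl.count(9) / len(with_repl)
print(sorted(no_repl), minority_share)
```

So with `replacement=False` and `num_samples=len(dataset)` no oversampling happens at all; oversampling a minority class needs `replacement=True`.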

  • @hamidmahmoodpour3659 2 years ago

    Why can't we use shuffle while using a sampler?
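
On the question above: `DataLoader` treats `shuffle` and `sampler` as two alternative ways of choosing the order of indices and raises a `ValueError` when both are given. A `WeightedRandomSampler` already draws indices at random, so nothing is lost by leaving `shuffle` off. A minimal sketch with a made-up six-example dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy dataset: five examples of class 0 and one example of class 1.
features = torch.arange(6).float().unsqueeze(1)
labels = torch.tensor([0, 0, 0, 0, 0, 1])
dataset = TensorDataset(features, labels)

weights = [1 / 5] * 5 + [1.0]  # 1 / class count per example
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

# Passing both options is rejected by DataLoader.
try:
    DataLoader(dataset, batch_size=2, sampler=sampler, shuffle=True)
    error = None
except ValueError as exc:
    error = exc

# Passing only the sampler works; the sampler decides the (random) order.
loader = DataLoader(dataset, batch_size=2, sampler=sampler)
num_batches = len(list(loader))
```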

  • @user-yw6wf3uu1o 2 years ago

    dataset link over!

  • @mariaanson6537 3 years ago

    What editor or IDE are you using?

  • @amirhosseindaraie5622 2 years ago

    Hi Aladdin, Your code does not work and the result is still imbalanced. What should we do?

  • @sakib.9419 3 years ago

    I've been trying to do this in TensorFlow with ImageDataGens for over 5 hrs now. Could you help out?

    • @AladdinPersson 3 years ago

      I'm not sure how you do it in TensorFlow.

    • @sakib.9419 3 years ago

      @AladdinPersson That's fine. Do you have a Discord, btw?

  • @helimehuseynova6631 2 years ago

    Hi, if we have 3 classes, how can we do it?

  • @seanbenhur 3 years ago +1

    Bro, are you on LinkedIn!?

    • @AladdinPersson 3 years ago +1

      Yeyup, same name :)

    • @seanbenhur 3 years ago

      @AladdinPersson Request sent 😌

  • @apurbasarkar6918 3 years ago

    How do I do it in TensorFlow? I'm stuck.

  • @tirthadatta2072 1 year ago

    Can you give me a solution for the computational cost or memory allocation problems that come with oversampling an imbalanced dataset? It is too hard for me to buy a GPU because of the high cost, and that is a problem for many of us. Can anyone here suggest a solution to this issue?

  • @wolfisraging 3 years ago

    I've been using it in almost every project... but never understood what's so random about it. Why "random" in its name? 🙂

  • @ArunKumar-sg6jf 3 years ago

    Do it in TensorFlow also, bro.

    • @AladdinPersson 3 years ago

      I don't know a good way to do it in TensorFlow... :\ Maybe someone else knows and can help out?