This is why you should care about unbalanced data .. as a data scientist

  • Published on 10 Dec 2024

Comments • 25

  • @jessibenzel243 3 years ago +10

    We just talked about this in my machine learning course this week!! Great timing! This video is very helpful.

  • @JessWLStuart 1 year ago +1

    Well presented!

  • @haneulkim4902 3 years ago +1

    Great content, this kind of practical content is gold. Thank you :)

  • @pgbpro20 3 years ago

    ritvikmath coming with a video of one of my favorite topics - instant like!

  • @tech-n-data 2 years ago

    Thank you so much for all you do.

  • @igorbreeze3734 2 years ago +2

    Hi! Great video. Is there any chance you would create a full in-depth CatBoost tutorial on some random data? It would be super useful.

  • @chenxiaodu2557 6 months ago +2

    It should be "imbalanced data" instead of "unbalanced data"

  • @joelrubinson9973 3 years ago

    Very interesting. AdTech modeling of conversions caused by advertising always suffers from imbalance. (Conversion rates are usually in the low-to-mid single digits.)

  • @aghazi94 3 years ago

    you are seriously so underrated

  • @danielwiczew 3 years ago +4

    Okay, but with oversampling, how do you use cross-validation? If you run it on the oversampled dataset, you'll have a data leak.

    • @ritvikmath 3 years ago +8

      I think you'd want to define the folds on the original data and then oversample only the training folds, leaving the validation fold untouched. Example: 3-fold CV.
      - split the original data into 3 folds (A, B, C)
      - treat (A, B) as training data -> oversample that data -> validate on C
      - repeat with A, and then B, as the validation fold
      - note that there is no data leak in this case, because copies of a validation row never end up in the training data
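
The fold-then-oversample recipe in the reply above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the video's code: the helper names (`make_folds`, `oversample`, `cv_splits`) and the toy labels are assumptions made up for this example, and the oversampling is plain random duplication of minority rows.

```python
import random

def make_folds(n_rows, k, seed=0):
    """Shuffle row indices of the ORIGINAL data and deal them into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def oversample(indices, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:  # nothing to duplicate in this training set
        return list(indices)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(indices) + extra

def cv_splits(labels, k=3, seed=0):
    """Yield (oversampled training indices, untouched validation indices) per fold."""
    folds = make_folds(len(labels), k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield oversample(train, labels, seed), folds[held_out]

# Hypothetical toy labels: 27 majority rows (0) and 3 minority rows (1).
labels = [0] * 27 + [1] * 3
splits = list(cv_splits(labels, k=3))
for train_idx, val_idx in splits:
    # Duplicates are drawn only from training rows, so nothing leaks into validation.
    assert not set(train_idx) & set(val_idx)
```

The sketch uses plain random folds for brevity; in practice the folds are often stratified so that every fold contains some minority rows.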

  • @d.a.k.o.s9163 1 year ago

    Great video!
    But don’t you think that, with such an unbalanced dataset, it would be better to use an anomaly detection algorithm instead of a classification algorithm?

  • @Sameerahmed373 3 years ago

    Can we customise the loss function? For example, give more weight to misclassifying a true minority-class example and less weight to the other kind of error?
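
One common way to express the idea in this question is a class-weighted cross-entropy, where errors on the positive (minority) class are multiplied by a larger weight. The sketch below is only an illustration under assumptions: the function name `weighted_log_loss`, the weights `w_pos` and `w_neg`, and the toy predictions are all made up here and are not taken from the video.

```python
import math

def weighted_log_loss(y_true, p_pred, w_pos=1.0, w_neg=1.0, eps=1e-12):
    """Binary cross-entropy where each class has its own error weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -w_pos * math.log(p)        # penalty for missing a positive
        else:
            total += -w_neg * math.log(1.0 - p)  # penalty for a false alarm
    return total / len(y_true)

y_true = [1, 0, 0, 0]
missed_positive = [0.1, 0.1, 0.1, 0.1]  # confidently misses the single positive
false_alarm = [0.9, 0.9, 0.1, 0.1]      # catches the positive, raises one false alarm

# With equal weights the two mistakes cost the same.
plain_miss = weighted_log_loss(y_true, missed_positive)
plain_alarm = weighted_log_loss(y_true, false_alarm)

# With a heavier positive weight, missing the positive becomes the costlier mistake.
heavy_miss = weighted_log_loss(y_true, missed_positive, w_pos=10.0)
heavy_alarm = weighted_log_loss(y_true, false_alarm, w_pos=10.0)
```

How large `w_pos` should be is a modelling choice; a frequent starting point is the ratio of majority to minority examples, tuned afterwards on validation data.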

  • @davidzhang4825 2 years ago

    Great video. For other ML algorithms like logistic regression, SVM, KNN, etc., can we apply the first method (upweighting the minority class)? Or is it only applicable to decision trees?

  • @zahrashekarchi6139 2 years ago

    Great demo!
    Just one thought: why did you not talk about downsampling the majority class and looking at its impact?

    • @douwe7493 9 months ago

      This is something I am wondering about too!
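
For readers unfamiliar with the downsampling idea raised in this thread, here is a minimal sketch of random majority-class undersampling. It is an illustration only: `downsample_majority` and the toy data are hypothetical names and values, and the sketch says nothing about how downsampling would have changed the results in the video.

```python
import random

def downsample_majority(rows, labels, seed=0):
    """Keep every minority row and an equally sized random sample of majority rows."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Hypothetical toy data: 5 positives among 100 rows.
rows = [[float(i)] for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]
small_rows, small_labels = downsample_majority(rows, labels)
```

The obvious trade-off is that the discarded majority rows carry information the model never sees, which is one reason upweighting or oversampling is sometimes preferred when data is scarce.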

  • @bmebri1 3 years ago +1

    Excellent video!
    One question though: are certain classification models immune from class imbalance? Thanks!

    • @LanNguyen-eq6lf 3 years ago +4

      To my knowledge, no classification model is immune to imbalanced datasets, because they are all data-driven. However, you can still get very good accuracy from an imbalanced dataset. This happens when inter-class separability is very high; for example, detection of water bodies (often a minority class) over a large area is often quite accurate.

  • @bernardfinucane2061 3 years ago

    You could predict that aircraft engines NEVER fail and almost always be right.
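
The point of this comment can be made concrete with a tiny calculation. The sketch below assumes a made-up failure rate of 1 in 1000 (not a real aircraft statistic) and compares accuracy with recall for a model that always predicts "no failure".

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives (failures) that the model actually caught."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)

# Hypothetical data: 1 failure among 1000 engines.
y_true = [1] + [0] * 999
y_pred = [0] * 1000  # the "engines never fail" model

acc = accuracy(y_true, y_pred)  # very high, because almost every engine is fine
rec = recall(y_true, y_pred)    # zero, because the one failure is missed
```

This is why accuracy alone is a misleading metric on imbalanced data; recall, precision, or similar class-aware metrics show what the model does on the rare class.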

  • @mrirror2277 3 years ago

    Hi, just wondering whether SMOTE is applicable to image data? I saw only one article on it online, so I am not sure it even works, since generating synthetic images is likely much harder.

    • @shahrinnakkhatra2857 10 months ago

      That's where image augmentation comes into play. You can create different variations of an image by applying transformations such as rotating and flipping.
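
A minimal sketch of the flip-and-rotate augmentation mentioned in the reply above, using a tiny image stored as a nested list of pixel values. The function names and the 2x3 "image" are assumptions for illustration; real pipelines typically use an image library rather than hand-written list operations.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]

def rotate_90_clockwise(img):
    """Rotate the image a quarter turn clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def augment(img):
    """Return the original image plus three transformed variants."""
    return [img, flip_horizontal(img), flip_vertical(img), rotate_90_clockwise(img)]

# Hypothetical 2x3 grayscale image of a minority-class example.
img = [[1, 2, 3],
       [4, 5, 6]]
variants = augment(img)
```

Unlike SMOTE, which interpolates between neighbouring minority examples, augmentation transforms one existing image at a time, so every variant is still a plausible image of the same class.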

  • @Septumsempra8818 3 years ago

    Are you familiar with Latent vectors in network analysis?
    s/o from South Africa

  • @junkbingo4482 3 years ago +2

    Hi.
    When people have problems with unbalanced data, it just proves they did not understand what they were doing.
    When I was young (a long time ago), our teachers wanted us to do things 'step by step' so that we could be (nearly) sure we knew what we were calculating.
    As that's not the case anymore, yes, people don't get the methodology and the maths but practice data science anyway, which is sad.

    • @junkbingo4482 3 years ago

      Oops, Nuance wrote 'yes'!! Thanks to LSTM, I did not check my post, sorry! ;-)