Statistics-Finding Outliers in Dataset using Z- score and IQR

  • Published on 5 Sep 2024
  • In this video we will understand how to find outliers in a dataset using Python.
    #Z-Score
    github url: github.com/kri...
    Support me in Patreon: / 2340909
    Connect with me here:
    Twitter: / krishnaik06
    Facebook: / krishnaik06
    instagram: / krishnaik06
    If you like music support my brother's channel
    / @ultralifeproject
    Buy the Best book of Machine Learning, Deep Learning with python sklearn and tensorflow from below
    amazon url:
    www.amazon.in/...
    You can buy my book on Finance with Machine Learning and Deep Learning from the below url
    amazon url: www.amazon.in/...
    Subscribe my unboxing Channel
    / @krishnaikhindi
    Below are the various playlists created on ML, Data Science and Deep Learning. Please subscribe and support the channel. Happy Learning!
    Deep Learning Playlist: • Tutorial 1- Introducti...
    Data Science Projects playlist: • Generative Adversarial...
    NLP playlist: • Natural Language Proce...
    Statistics Playlist: • Population vs Sample i...
    Feature Engineering playlist: • Feature Engineering in...
    Computer Vision playlist: • OpenCV Installation | ...
    Data Science Interview Question playlist: • Complete Life Cycle of...
    🙏🙏🙏🙏🙏🙏🙏🙏
    YOU JUST NEED TO DO
    3 THINGS to support my channel
    LIKE
    SHARE
    &
    SUBSCRIBE
    TO MY YouTube CHANNEL

Comments • 79

  • @satyaprakash5905
    @satyaprakash5905 5 years ago +18

    Z-Score & Outliers - explained in a very easy way. Thank you.

  • @prathameshjoshi1408
    @prathameshjoshi1408 5 years ago +6

    Thank you so much for explaining most of the concepts in the simplest way. When you explain anything, anyone can tell that you really want us (your students) to learn more. One of the best teachers for sure. Thank you and best of luck for further success on YouTube.

  • @shivambhayre5056
    @shivambhayre5056 5 years ago +1

    Thanks bro, I already knew all this, I mean the pure theory and formulas, but you explained how to implement it in Python in a very easy way. Thanks man, I really appreciate that, and I will wait for the next video where you do this on real data. Thanks for that, because I can do this in R but I am not good at Python, and you explain things really well. Thanks for that🙏

  • @justfun6409
    @justfun6409 2 years ago +3

    Could you please arrange the playlist videos in order from first to last?
    I am getting confused about which video I should watch after a particular video.
    By the way, you explain very well👍🏻💥

  • @safwanmansuri792
    @safwanmansuri792 5 years ago +6

    Continue this hard work; you will surely find success on YouTube one day.

  • @vsabinat
    @vsabinat 2 years ago

    Wonderfully explained. Won't forget now. Thank you

  • @ayushijmusic
    @ayushijmusic 3 years ago

    Best!!! Thank you! This channel is a saviour for DS doubts

  • @pawansonu41
    @pawansonu41 2 years ago

    At 14:13, point 4 and point 5 are wrong; however, in the code you have implemented them correctly. A suggestion: please try to keep formulas and notes clean, because if those are ruined the whole plot goes off. You have done a good job so far, though.

  • @sudiptachakraborty745
    @sudiptachakraborty745 4 years ago +6

    Sir, I appreciate your effort in making all these beautiful videos. It makes life easy for all of us. But I need to make one comment on this video: at the beginning you said that when the weight (x) is increasing, height (y) is also increasing, but in the real world that is not how it works, right?
    I think you may have wanted to say that when the height (x) is increasing, weight (y) also tends to increase.
    In most cases, when a person's height increases their weight also increases, but not the reverse.
    I hope I am making sense here.
    Thank you once again :)

  • @balajivarma83
    @balajivarma83 5 years ago +6

    Sir, please upload your Kaggle house price competition video with feature selection. Waiting for your video.

  • @ty_b_63_prajwalwaykos86
    @ty_b_63_prajwalwaykos86 2 years ago +1

    Hi Krish, it's always great to learn from you... Can anyone please tell me when to use which method?
    Or which among the methods is best? Or is it case by case, and if yes, what are the cases?

  • @Daily_Dose_Of_Life
    @Daily_Dose_Of_Life 3 years ago +1

    Thank you, sir
    Finding outliers with the help of the z-score using pandas:
    Just some modifications
    def detect_outliers(df):
        threshold = 3  # 3 standard deviations
        col = df.iloc[:, 0]  # the values are assumed to be in the first column
        z_score = (col - col.mean()) / col.std()
        # compare |z| so that low outliers are caught too; df itself is not modified
        return col[z_score.abs() > threshold].tolist()

    detect_outliers(d)  # d: the DataFrame holding the data

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    Thanks Krish

  • @janardanpandey6942
    @janardanpandey6942 2 years ago

    Here the outliers show up with the z-score method when I set threshold=1, but not when I set threshold=3. WHY?
    By the way, thanks a lot, a very outstanding playlist for STATISTICS. Great work.
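
One possible explanation, offered as an assumption because the commenter's dataset is not shown: with only a few data points, no z-score can reach 3 at all. When the population standard deviation is used (the default of `np.std`), the largest possible |z| among n values is sqrt(n - 1), so a sample of 10 or fewer values can never exceed a threshold of 3. The sketch below uses a made-up sample and the hypothetical helper `max_abs_z` to illustrate this.

```python
import math
import statistics

def max_abs_z(data):
    """Largest |z| in data, using the population standard deviation."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    return max(abs(x - mean) / std for x in data)

# Hypothetical sample: nine ordinary values and one extreme value.
sample = [10, 11, 12, 11, 10, 12, 11, 10, 12, 1000]
n = len(sample)
bound = math.sqrt(n - 1)  # theoretical ceiling for |z| with n points

# Even the value 1000 stays just below the ceiling of 3.
print(max_abs_z(sample), bound)
```

If this is the cause, a larger sample or a rule based on quartiles may behave better than lowering the threshold.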

  • @sheelstera
    @sheelstera 1 year ago

    Standardization to a standard normal distribution (SND) only makes sense for a normally distributed column. Standardizing a non-normally distributed column may give you a mean of 0 and an SD of 1, but the shape of the distribution is still non-normal. In such a case the z-score cannot be tied directly to the 99.7% rule of the normal distribution. The interpretation of a z-score therefore depends on the distribution of the column, which is the same before and after standardization.
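
The point above can be checked numerically. The following is a minimal sketch, assuming a made-up right-skewed column drawn from an exponential distribution; `skewness` is a hypothetical helper, not something from the video. It shows that standardizing moves the mean to 0 and the standard deviation to 1 while the skewness stays the same.

```python
import random
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])

random.seed(0)
# Hypothetical right-skewed column: exponentially distributed values.
column = [random.expovariate(1.0) for _ in range(5000)]

mean = statistics.fmean(column)
std = statistics.pstdev(column)
standardized = [(v - mean) / std for v in column]

# Mean and standard deviation change to roughly 0 and 1 ...
print(statistics.fmean(standardized), statistics.pstdev(standardized))
# ... but the skewness is the same before and after.
print(skewness(column), skewness(standardized))
```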

  • @shashankverma4044
    @shashankverma4044 5 years ago +1

    Thank you so much ...Concept cleared perfectly:)

  • @harshtamkiya8505
    @harshtamkiya8505 5 years ago +2

    Nice explanation..

  • @shivambhayre5056
    @shivambhayre5056 5 years ago +3

    If we have many outliers, is it better to calculate the mode (detected by frequency) rather than the mean or median, and then replace the outliers with the mode (as we mostly do for categorical variables)? Is this a good way?

  • @AkshaySharma-vm9sp
    @AkshaySharma-vm9sp 5 years ago +2

    Sir, please make a video on p-values.

  • @mannudhapola1210
    @mannudhapola1210 4 years ago

    Smoothly explained

  • @rahulgarg6363
    @rahulgarg6363 3 years ago

    Perfectly explained, Krish, but how do we detect outliers in multidimensional data?

  • @sandipansarkar9211
    @sandipansarkar9211 4 years ago

    Awesome explanation, Krish. Thanks

  • @umesh789s
    @umesh789s 4 years ago +1

    This was explained really well. I have a doubt regarding the multiplication by 1.5: can you please elaborate on where this 1.5 comes from, or how we can get this value for any other dataset?

    • @umesh789s
      @umesh789s 4 years ago +1

      Now I can briefly explain why it is 1.5, and the reason is really amazing and informative.

    • @KumarGolu2001
      @KumarGolu2001 4 years ago

      Can you explain it to me?

    • @lokeshkaturi4040
      @lokeshkaturi4040 3 years ago

      @@KumarGolu2001 It is a convention chosen by statisticians; 1.5 gave suitable bounds for detecting outliers.

  • @meenadalvi9743
    @meenadalvi9743 4 years ago +2

    In the detect_outliers function, "y" is not defined anywhere; please change it to i.

  • @rambaldotra2221
    @rambaldotra2221 3 years ago

    Thanks a Lot Sir, Really helpful.

  • @rohitsharma-kr9gk
    @rohitsharma-kr9gk 5 years ago +2

    Sir, can you make a video on a fitness-based recommendation system?
    And can you tell me how to collect a dataset of different people and recommend food and exercise to them according to that dataset?
    Please reply, sir....

    • @samitaadhikari3182
      @samitaadhikari3182 3 years ago

      Actually, that is a very good idea.
      I hope you learned that.
      I'm going to try this once I have learned it.

  • @TheTimtimtimtam
    @TheTimtimtimtam 5 years ago +2

    Nice

  • @idanamayank
    @idanamayank 5 years ago

    Simply awesome,great work

  • @shahzan525
    @shahzan525 3 years ago

    I couldn't find the next video which you talk about at the end....

  • @yendamurikrishnavamsi3188
    @yendamurikrishnavamsi3188 3 years ago

    super!!!
    well explained sir!!!!!

  • @akshatabm4491
    @akshatabm4491 10 months ago

    Is the z-score detect_outlier program working fine for everyone? It is not returning the correct output when a different dataset is used.

  • @baharehghanbarikondori1965
    @baharehghanbarikondori1965 3 years ago

    amazing tutorial, thank you

  • @pranjalmittal7475
    @pranjalmittal7475 4 years ago

    Sir, please upload a video on preprocessing data using the sklearn library.

  • @vicky-do5th
    @vicky-do5th 3 years ago

    Sir, why have you taken the threshold as 3? As you said, values should fall within the 3rd standard deviation, so shouldn't it be (3*std)?

  • @ukquaratine1019
    @ukquaratine1019 3 years ago

    Can the value of the interquartile range be negative (-)?

  • @prateshtamhankar3568
    @prateshtamhankar3568 4 years ago +1

    Can we use either of the techniques, 1. z-score or 2. IQR, to find outliers?

    • @niveditaparab6772
      @niveditaparab6772 3 years ago

      Generally, when the data is skewed IQR is better; if it is roughly normal then the z-score is better.

  • @903vishnu
    @903vishnu 3 years ago

    If there are more features (like X1, X2, X3 and so on), how can we find outliers using a scatter plot?

  • @abishekkachroo938
    @abishekkachroo938 4 years ago

    I am facing a unique issue following this methodology for IQR:
    The lower bound for my column is coming out wrong. Below is the pseudocode:
    Q1 = df[' shares'].quantile(0.25)
    Q3 = df[' shares'].quantile(0.75)
    IQR = Q3 - Q1
    low = Q1 - 1.5 * IQR  # this value comes out extremely negative, below anything present in the column
    The problem may be that the IQR value > Q1.
    How do I solve this? Shall I change the percentage?
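
A negative lower fence is expected for right-skewed, non-negative data and is not necessarily a problem: it just means no value can be flagged as a low outlier. The sketch below uses a made-up `shares` list (an assumption, not the commenter's data) and the standard-library `statistics.quantiles` with `method="inclusive"`, which uses the same linear interpolation as the pandas default.

```python
import statistics

# Hypothetical right-skewed, non-negative column (for example share counts).
shares = [5, 8, 12, 15, 20, 25, 40, 60, 90, 150, 400, 2500]

q1, _, q3 = statistics.quantiles(shares, n=4, method="inclusive")
iqr = q3 - q1
low = q1 - 1.5 * iqr   # negative here, because 1.5 * IQR is larger than Q1
high = q3 + 1.5 * iqr

low_outliers = [x for x in shares if x < low]    # empty: nothing is below a negative fence
high_outliers = [x for x in shares if x > high]  # the extreme large values
print(low, high, low_outliers, high_outliers)
```

The fences are thresholds, not data values, so they do not need to appear in the column. For heavily skewed data, applying the rule to log-transformed values is one common alternative.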

  • @umesh789s
    @umesh789s 4 years ago +1

    I am not able to find the link to the next video you mentioned, about outlier handling in a dataset from Kaggle.

    • @KumarGolu2001
      @KumarGolu2001 4 years ago

      I am also not finding the video

  • @namansharma9697
    @namansharma9697 3 years ago

    How can the mean be equal to zero in a real-world scenario, as weight cannot take a negative value? Please help.
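
The zero mean refers to the z-scores, not to the raw weights. A minimal sketch with made-up weights (an assumption for illustration) shows that the raw values stay positive while their z-scores are centred on 0, with negative z-scores simply meaning "below the average weight".

```python
import statistics

# Hypothetical weights in kilograms; every raw value is positive.
weights = [52.0, 58.5, 61.0, 64.0, 70.5, 75.0, 83.0]

mean = statistics.fmean(weights)
std = statistics.pstdev(weights)
z_scores = [(w - mean) / std for w in weights]

print(mean)                         # mean of the raw weights, in kg
print(statistics.fmean(z_scores))   # approximately 0
print(statistics.pstdev(z_scores))  # approximately 1
```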

  • @ele_wings7521
    @ele_wings7521 4 years ago

    Thanks a lot... sir..

  • @ash_engineering
    @ash_engineering 5 years ago +1

    Sir, how does this method scale when we have 50s or 100s of dimensions/features in a dataset?

    • @anirudhyadav-mq8nd
      @anirudhyadav-mq8nd 5 years ago +2

      IQR method can be used when the data is huge

  • @tamilanaroundtheworld
    @tamilanaroundtheworld 3 years ago

    Hi Krish,
    I have a doubt regarding the parameter of the function: when defining the function, it is given as data. It should be the dataset, right?

    • @shantomatt
      @shantomatt 3 years ago

      data is just the parameter name; it can be any label

  • @indirajithkv7793
    @indirajithkv7793 2 years ago

    ❤💫

  • @kadhirn4792
    @kadhirn4792 5 years ago

    Damn good dude. Thank you so much.

  • @akankshagupta5067
    @akankshagupta5067 4 years ago

    In your code, after np.abs(z_score) > threshold, why are you appending y? Shouldn't you append i? Please correct it, as it is misleading.
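
Based only on the details quoted in the comments (a `data` parameter, a threshold of 3, the `np.abs(z_score) > threshold` check, and the undefined `y`), a corrected version might look like the sketch below. It is a reconstruction under those assumptions rather than the exact code from the video, and the dataset is made up.

```python
import numpy as np

def detect_outliers(data, threshold=3):
    """Return the values whose absolute z-score exceeds `threshold`."""
    outliers = []                 # local list, so repeated calls do not accumulate results
    mean = np.mean(data)
    std = np.std(data)            # population standard deviation (NumPy's default)
    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)    # append the data point itself, not an undefined `y`
    return outliers

# Hypothetical dataset: thirty values between 10 and 19 plus two extreme values.
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 12,
           14, 17, 19, 10, 13, 12, 14, 12, 12, 11,
           14, 13, 15, 10, 15, 12, 10, 14, 13, 15,
           102, 107]
print(detect_outliers(dataset))
```

Keeping `outliers` inside the function is a design choice: with a module-level list, calling the function twice would return the results of both calls.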

  • @lifeisfun9
    @lifeisfun9 5 years ago

    Sir, firstly, you are a brilliant teacher and mentor. I have a doubt about the upper and lower bound calculation you showed. You say anything beyond 1.5 IQR will be an outlier, so why do we compute low = q1 - 1.5iqr and high = q3 + 1.5iqr? Can't we get the outliers straight away by checking whether a number is less than or greater than 1.5iqr? I did not understand the meaning of q1 - 1.5iqr as the lower bound, and similarly for the upper bound. How?

    • @lifeisfun9
      @lifeisfun9 5 years ago

      Oh, I understood. I read the text later. Thank you so much for sharing your knowledge.

    • @Joshua75623
      @Joshua75623 4 years ago

      Akansha, did you get the answer to your question? I have the same doubt.

    • @BehindTheLogics
      @BehindTheLogics 4 years ago +1

      @@Joshua75623 First the values are sorted in increasing order. Q1 is the 25th percentile and Q3 is the 75th percentile. IQR is the length of the box. So values > Q3 + 1.5 × IQR and values < Q1 - 1.5 × IQR are flagged as outliers.

  • @maithreshpalemkota
    @maithreshpalemkota 3 years ago

    It is more appropriate to call them quartiles than quantiles.

  • @anchitbhushan6172
    @anchitbhushan6172 4 years ago

    So which one should we use for removing the outliers?

    • @lokeshkaturi4040
      @lokeshkaturi4040 3 years ago

      Exploring all the ways is the best thing.

  • @knavk1
    @knavk1 3 years ago

    Hi, how do I work with outliers? I don't want to remove outliers from my data. Is there a way I can do that?
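
One option that keeps every row is to cap (winsorize) extreme values at the IQR fences instead of dropping them. This is a hedged sketch of that idea, not something shown in the video; `cap_at_iqr_fences` is a hypothetical helper and the data is made up.

```python
import statistics

def cap_at_iqr_fences(values, k=1.5):
    """Clip each value into [Q1 - k*IQR, Q3 + k*IQR] instead of dropping it."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

# Hypothetical data with one extreme value.
data = [10, 12, 11, 13, 12, 14, 11, 13, 95]
capped = cap_at_iqr_fences(data)
print(capped)  # same length as data; only the extreme value is pulled in to the upper fence
```

Whether capping is appropriate depends on the problem; other choices include transforming the column or using models that are less sensitive to extreme values.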

  • @cvb6931
    @cvb6931 3 years ago

    I have a few doubts...
    Why is the threshold value taken as 3? I have seen other examples that also use 3. Is it always 3?
    Why multiply by exactly 1.5 for the lower and upper bounds? Is there any specific reason?
    Does IQR also work on categorical columns? I know that the z-score doesn't work for categorical values.

    • @ty_b_63_prajwalwaykos86
      @ty_b_63_prajwalwaykos86 2 years ago

      Answering your doubts:
      1. It is not really 3; it is 3*sd, where sd is 1 (since by using the z-score we rescale the data to mean = 0 and sd = 1). Taking 3 sd is the usual convention: values beyond 3 sd are treated as outliers because, for a normal distribution, 99.7% of values lie within 3 sd, which is usually enough.
      2. As Krish stated, 1.5 is a conventional statistical figure arrived at empirically, so we generally use it.
      3. It is fairly obvious: using such statistics on categorical data doesn't make sense. If you still don't get it, just tell me how exactly you are going to sort categorical values in the first place 😅😂😂🤣
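
The 99.7% figure in point 1 can be verified with the standard library. This short sketch assumes an exactly normal distribution, which real data rarely follows perfectly.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of a normal distribution lying within k standard deviations of the mean.
for k in (1, 2, 3):
    coverage = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {coverage:.4%}")
```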

  • @nayazithousifkhan1696
    @nayazithousifkhan1696 4 years ago

    Please post the video on the box plot technique

    • @BehindTheLogics
      @BehindTheLogics 4 years ago +1

      @nayazi
      th-cam.com/video/CVdcr_MC2KU/w-d-xo.html Watch this video from 6.25

    • @shantomatt
      @shantomatt 3 years ago +1

      plt.boxplot(x=dataset)

  • @forammodi5850
    @forammodi5850 3 years ago

    Sir, I got a NameError:
    NameError: name 'mean' is not defined

  • @rajprajapati888
    @rajprajapati888 2 years ago

    The z-score method doesn't give output for more than 3 outliers

  • @tammy4994
    @tammy4994 4 years ago

    How has he set the threshold as 3?

  • @srinathganesh6985
    @srinathganesh6985 3 years ago

    Can categorical data (encoded) have outliers?

    • @lokeshkaturi4040
      @lokeshkaturi4040 3 years ago

      Here is a nice explanation about how to detect outliers on categorical data.

    • @dreamday4810
      @dreamday4810 2 years ago

      @@lokeshkaturi4040 Can you please share the timestamp in this video?

  • @TechBinod
    @TechBinod 4 years ago

    Your videos are great but not systematically organized

  • @sandipanpaul1994
    @sandipanpaul1994 4 years ago

    @krish Why is it only 1.5? Why not 1 or 2?

    • @amalsunil4722
      @amalsunil4722 4 years ago +1

      It comes from comparing with the standard normal distribution: fences at 1.5 × IQR sit at about ±2.7 standard deviations, and about 1.7 × IQR would be needed to reach ±3. 1.5 is used as the conventional round value.
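
The relationship between the IQR multiplier and standard deviations can be worked out for a standard normal distribution. The sketch below assumes exact normality; `upper_fence` is a hypothetical helper that expresses the upper fence in units of the standard deviation.

```python
from statistics import NormalDist

standard_normal = NormalDist()        # mean 0, standard deviation 1
q1 = standard_normal.inv_cdf(0.25)    # about -0.6745
q3 = standard_normal.inv_cdf(0.75)    # about +0.6745
iqr = q3 - q1                         # about 1.349

def upper_fence(k):
    """Upper fence Q3 + k*IQR, in units of the standard deviation."""
    return q3 + k * iqr

print(upper_fence(1.5))   # roughly 2.7 standard deviations
print(upper_fence(1.7))   # close to 3 standard deviations
```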

  • @jiteshmishra6949
    @jiteshmishra6949 3 years ago

    How do we detect and remove outliers from each column in one shot?
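
One way to handle every numeric column at once is to compute the quartiles per column with pandas and build a boolean mask. This is a sketch under assumptions: the DataFrame and its column names are made up, and the 1.5 × IQR rule is applied to all columns alike.

```python
import pandas as pd

# Hypothetical DataFrame with two numeric columns.
df = pd.DataFrame({
    "height": [150, 152, 155, 158, 160, 162, 165, 168, 170, 250],
    "weight": [50, 52, 55, 58, 60, 62, 65, 68, 5, 72],
})

q1 = df.quantile(0.25)   # one Q1 per column
q3 = df.quantile(0.75)   # one Q3 per column
iqr = q3 - q1

# True where a cell lies inside its own column's fences.
inside = (df >= q1 - 1.5 * iqr) & (df <= q3 + 1.5 * iqr)

# Keep only the rows that are inside the fences in every column.
cleaned = df[inside.all(axis=1)]
print(cleaned)
```

Dropping a row when any single column is out of range can remove a lot of data when there are many features, so the rule may need to be relaxed for wide datasets.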

  • @ashwinimandani2829
    @ashwinimandani2829 4 years ago

    Sir, I tried finding the outliers in a dataset using both methods (z-score and IQR) and the answers are different. Can you please explain why?

    • @amalsunil4722
      @amalsunil4722 4 years ago

      Yes, that happens, because the two rules use different cut-offs: for normally distributed data the 1.5 × IQR whiskers end at about ±2.7 standard deviations, while about 1.7 × IQR corresponds to ±3.
      So you may try 1.7 instead of 1.5 and check whether the z-score and IQR results then agree.
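
The two methods can also disagree for a reason unrelated to the multiplier: extreme values inflate the mean and standard deviation used by the z-score, while quartiles are barely affected. The sketch below uses made-up data and two hypothetical helpers to show one dataset where the rules flag different points.

```python
import statistics

def z_score_outliers(values, threshold=3):
    """Values whose |z| (population standard deviation) exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

# Hypothetical data: twelve ordinary values, one moderate and one extreme value.
data = [10, 12, 11, 13, 12, 14, 11, 13, 12, 11, 13, 12, 30, 95]

print(z_score_outliers(data))  # only the most extreme value is flagged
print(iqr_outliers(data))      # the moderate value is flagged as well
```

Here the value 95 raises the standard deviation so much that 30 looks unremarkable to the z-score rule, which is one reason the IQR rule is often preferred for skewed data.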