Finding an outlier in a dataset using Python

  • Published on 14 Oct 2024
  • In this video we will understand how to find an outlier in a dataset using Python.
    ref: #medium articles
    #Outlierdetection
    github url: github.com/kri...
    Data Science Projects playlist: • Generative Adversarial...
    NLP playlist: • Natural Language Proce...
    Statistics Playlist: • Population vs Sample i...
    Feature Engineering playlist: • Feature Engineering in...
    Computer Vision playlist: • OpenCV Installation | ...
    Data Science Interview Question playlist: • Complete Life Cycle of...
    You can buy my book on Finance with Machine Learning and Deep Learning from the below url
    amazon url: www.amazon.in/...

Comments • 119

  • @yourkarma7012
    @yourkarma7012 3 years ago +6

    Clustering techniques are also widely used in industry to detect outliers, especially the Isolation Forest algorithm.

  • @shujashakir9952
    @shujashakir9952 1 year ago +4

    The tutorial offers a lucid explanation of the complex problem of outliers. It is well presented, with examples that make it easier to follow. However, threshold = 3 isn't working for me; I modified it to threshold = 3+std to make it work properly. Moreover, declaring outliers = [ ] outside the function causes problems if you want to use the function on another dataset in the same notebook. So declaring the outlier list inside the function would be a better approach, I think.
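
A minimal sketch of the point about the list, assuming a z-score detector along the lines discussed in the comments (the function name `detect_outliers_zscore` and the sample values are hypothetical, not taken from the video's notebook). Creating `outliers` inside the function means repeated calls on different datasets do not accumulate old results:

```python
import statistics

def detect_outliers_zscore(data, threshold=3.0):
    """Return the values in `data` whose absolute z-score exceeds `threshold`."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing can be flagged
    outliers = []  # created on every call, so results never leak between datasets
    for x in data:
        z_score = (x - mean) / std
        if abs(z_score) > threshold:
            outliers.append(x)  # append the data point itself
    return outliers

# Hypothetical sample: 19 ordinary values and one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 10, 12, 13, 107]
print(detect_outliers_zscore(sample))
```

Calling the function a second time on the same or another list returns a fresh result each time.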

  • @vamsinadh100
    @vamsinadh100 3 years ago +12

    13:57 Correction
    Lower bound = Q1 - IQR*1.5
    Upper bound = Q3 + IQR*1.5

    • @aggreykip2006
      @aggreykip2006 1 year ago

      Can you use the upper bound as the max value in a histogram?
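
The corrected bounds above can be sketched as follows. This is an illustration only: the helper name `iqr_bounds` and the sample values are assumptions, and quartiles are computed with the standard library's `statistics.quantiles` using `method="inclusive"`, which may give slightly different quartiles than other tools depending on their interpolation rule.

```python
import statistics

def iqr_bounds(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # inter-quartile range
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with two obviously extreme values.
sample = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102,
          12, 14, 17, 19, 107, 10, 13, 12, 14, 12]

lower, upper = iqr_bounds(sample)
outliers = [x for x in sample if x < lower or x > upper]
print(lower, upper, outliers)
```

Note that the lower fence is always at or below Q1 and the upper fence at or above Q3, because the IQR is never negative.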

  • @doubando
    @doubando 9 months ago +1

    Amazing Krish, now I understand the concept of outliers, thanks

  • @mridulagarwal5881
    @mridulagarwal5881 4 years ago +27

    You have explained things well. Just one correction - it's inter-quartile range and not inter-quantile range.

    • @FaraazKhanfz
      @FaraazKhanfz 3 years ago

      It's Inter Quartile Range

    • @nosseibagacem9014
      @nosseibagacem9014 2 years ago

      Hello sir, I hope you are doing well. I was hoping you could help me with OD: I'm doing a thesis on the subject and I'm very new to Python and programming. I hope to hear from you, and thank you in advance.

  • @AmitSharma-po1zb
    @AmitSharma-po1zb 4 years ago +1

    Superb explanation...in very simple way..

  • @smalirizvi8026
    @smalirizvi8026 2 years ago +2

    I have a couple of questions.
    1. Is it always better to remove the outliers, or could that be a big mistake as well? You gave the example of a fraudulent transaction. An outlier is indeed a hint that the transaction was fraudulent. If I remove all such transactions in the first place, how am I going to achieve my results?
    2. You did not explain how to perform outlier checks on a multivariate dataset, for example the IRIS dataset. I have seen a couple of videos here and there, but no proper method emerges. What is the proper way to identify outliers in multivariate datasets?
    Thanks

  • @shadrul2783
    @shadrul2783 4 years ago +14

    Here is the correction lower bound = q1 - 1.5*IQR and upper bound = q3 + 1.5*IQR

    • @rohankupate5917
      @rohankupate5917 1 year ago +1

      You mean there's a mistake in the video?

    • @Kishor_D7
      @Kishor_D7 8 months ago

      Yes bro, check the statistics playlist by Krish Naik.

  • @ryando4556
    @ryando4556 7 months ago

    Well explained. It would be great if you could add some plots for visualization.

  • @srijeetful
    @srijeetful 3 years ago +1

    Very clear and crisp explanation, loved it

  • @ksoftqatutorials9251
    @ksoftqatutorials9251 5 years ago +1

    I have been following your videos and have learnt many things, Krish Naik. Could you please tell me whether you have written any data science and machine learning books? I would like to buy your books and follow your videos to clinch a data science job as soon as possible.

    • @krishnaik06
      @krishnaik06 5 years ago +2

      Hi Kiran,
      I have written a book on finance with ML and DL

    • @ksoftqatutorials9251
      @ksoftqatutorials9251 5 years ago +1

      @krishnaik06 Could you please share the link so that I can buy that book? Looking forward to more videos.

  • @adityapradhan8474
    @adityapradhan8474 4 months ago

    Thank you so much sir, I understood everything

  • @kaka83185
    @kaka83185 3 years ago +3

    Just a correction: when calculating the z-score, you are subtracting an array from i. You should enumerate over the datasets and then subtract from i the mean and std at the current index.

    • @karimdandachi9200
      @karimdandachi9200 2 years ago

      mean and std are not arrays. The mean of a list of values is a single value, and so is the standard deviation.

  • @gyapti-fctfinder3336
    @gyapti-fctfinder3336 3 years ago

    Nice content, and you explained it very well. Thank you so much.

  • @dhivya_animal_lover
    @dhivya_animal_lover 4 years ago +1

    Hi sir, a small doubt about the part of the video where you talk about the standard normal distribution. You said the graph shows a standard normal distribution, but then you said that when data falls below or beyond the 3rd standard deviation, you will not consider it. Kindly clarify.

  • @thedatascientist_me
    @thedatascientist_me 1 year ago

    Nice work mate. I also tried something similar but with Upper and Lower Bound on the Return

  • @otroleonarbe
    @otroleonarbe 3 years ago

    Thanks for sharing this video.
    One correction: in the loop it should be outliers.append(i)
    not
    outliers.append(y)

  • @dikshadhiman2474
    @dikshadhiman2474 3 years ago

    Thank you, sir, for this content.

  • @cliffkwok
    @cliffkwok 5 years ago +3

    Hi Krish, I just ordered your finance book on Amazon, which is the newest one on all of Amazon about Python in finance. Will you do more videos on finance?

    • @krishnaik06
      @krishnaik06 5 years ago +3

      Thanks Kwok for buying my book...yes I will be uploading more videos on finance.

    • @varunchandrappa5123
      @varunchandrappa5123 3 years ago

      @krishnaik06 Hands-On Python for Finance is out of stock. Please let us know when it will be available for sale.

  • @mohanadjibory2191
    @mohanadjibory2191 2 years ago

    Thanks. I wonder how to detect outliers in a NumPy ndarray, I mean an array of shape n by m. You explained it for a 1D array; what about 2D?

  • @dineshlakshitha7309
    @dineshlakshitha7309 3 years ago

    Amazing video
    Super explanation

  • @niveshtayal979
    @niveshtayal979 5 years ago +1

    Hi Krish
    Thanks for the excellent explanation. But if we find outliers in a feature, should we remove the records containing them (in which case we lose some data)? If not, how can we handle outliers? Please cover this portion also :)

    • @amanpreetsinghgulati2475
      @amanpreetsinghgulati2475 2 years ago

      Capping (winsorization) is another way to deal with outliers: the extreme values are replaced with values within the range, so no data is lost.
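
A minimal sketch of capping, assuming the lower and upper limits have already been chosen (for example from IQR fences or from percentiles). The helper name `cap_outliers`, the limits and the values are hypothetical; strict winsorization usually caps at chosen percentiles rather than at arbitrary limits.

```python
def cap_outliers(data, lower, upper):
    """Clip every value into [lower, upper]; the number of points is unchanged."""
    return [min(max(x, lower), upper) for x in data]

# Hypothetical values and limits, for illustration only.
values = [12, 15, 102, 3, 14]
capped = cap_outliers(values, lower=7.5, upper=19.5)
print(capped)  # [12, 15, 19.5, 7.5, 14]
```

Every row is kept; only the extreme values are pulled back to the nearest limit.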

  • @jakekiddall5108
    @jakekiddall5108 3 years ago +1

    Are there any anomaly detection videos that don't use credit card fraud as an example?

  • @Ashokkumar-sc3vt
    @Ashokkumar-sc3vt 5 years ago +3

    Hi Krish, well explained. can you please post a video on how to equate the outliers using any dataset. Thanks in advance.

  • @satheeshswaminathan2328
    @satheeshswaminathan2328 4 years ago +1

    Hi Krish, Thank you so much for the tutorial, Very clear and crisp explanation, loved it :)

  • @amitsawant4961
    @amitsawant4961 2 years ago

    insightful for me

  • @magicmushroom9670
    @magicmushroom9670 3 years ago +1

    Every single YouTube channel explains this from a univariate perspective. Can you please explain it for the multivariate case? There is very little material about that on the internet.

  • @subhamasthan7294
    @subhamasthan7294 4 years ago

    Hi Krish, thank you so much for a nice video. Can you please share the link to the next video, where you applied these techniques to a Kaggle dataset?

  • @ahmedbaheeg
    @ahmedbaheeg 1 year ago

    Thanks

  • @sanathdas4071
    @sanathdas4071 4 years ago +2

    Sir, can you please tell me the difference between an anomaly and an outlier?
    I am confused about these two.
    Please answer me, sir.

  • @parikshitgupta343
    @parikshitgupta343 3 years ago

    How can the lower bound, which you said is q1*1.5, be greater than the lower quartile, which you said is q1?
    The lower bound seems like something that should be less than the lower quartile.

  • @AbhishekMishra-mq4jw
    @AbhishekMishra-mq4jw 3 years ago +1

    what to do with natural outliers?
    the outliers which are expected to be there which are not because of any artificial errors

  • @sakhawathossain3812
    @sakhawathossain3812 2 years ago

    Very helpful...

  • @yuktikhantwal2342
    @yuktikhantwal2342 5 years ago +1

    great video sir. great content, and explained in the cleanest way possible. thanks

  • @sheetalyoutub
    @sheetalyoutub 2 years ago

    Very helpful !

  • @arjyabasu1311
    @arjyabasu1311 4 years ago +2

    Sir, shouldn't the threshold value be 3*std and not just 3? The rule is that a data point is considered an outlier if it falls outside the 3rd standard deviation, not just beyond the value 3.

    • @jondoe3693
      @jondoe3693 4 years ago +1

      Do you mean when z score = 3? Then it is correct to use threshold of 3 because you have standardized the data and standard deviation of z scored values is 1 and its mean is 0.
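
A short check of the reply above, using hypothetical sample values. After standardizing, the z-scores have mean 0 and standard deviation 1 (up to floating-point error), so comparing |z| with 3 flags the same points as comparing the raw distance from the mean with 3*std:

```python
import statistics

# Hypothetical sample: 19 ordinary values and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 13, 107]

mean = statistics.fmean(data)
std = statistics.pstdev(data)  # population standard deviation
z_scores = [(x - mean) / std for x in data]

# Mean and standard deviation of the standardized values.
z_mean = statistics.fmean(z_scores)
z_std = statistics.pstdev(z_scores)

# The two ways of writing the rule flag the same points.
flag_by_z = [abs(z) > 3 for z in z_scores]
flag_by_raw = [abs(x - mean) > 3 * std for x in data]
print(z_mean, z_std, flag_by_z == flag_by_raw)
```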

  • @rizkamilandgamilenio9806
    @rizkamilandgamilenio9806 1 year ago

    Are there conditions under which it is better to use one method over the other?

  • @sekharpink
    @sekharpink 5 years ago +1

    Hi Krish
    I like your videos a lot; they are very informative. Could you please post videos on word2vec models such as skip-gram and CBOW, and on gensim and GloVe?
    Thanks in advance.

  • @deeptijoshi377
    @deeptijoshi377 3 years ago +1

    What do we do when the outliers do not follow a Gaussian distribution and an outlier lies within the data distribution rather than at the extremes?

  • @nabilahhannani2326
    @nabilahhannani2326 4 years ago

    I've applied both methods to my dataset, but they gave different results. Which one should I choose? Is it possible for them to give different results?

  • @jayantdikshit4181
    @jayantdikshit4181 4 years ago

    Hi Krish thanks for making such an amazing content. I have a query at 09:35.
    As you have mentioned that we can find outliers using scatter plots. But how can we find outliers if we do have multiple features(more than 2 features)? Your views/response on this would be much appreciated.
    Thanks in advance.

    • @rachittoshniwal
      @rachittoshniwal 4 years ago

      You can try with any two random features from your data
      You'll either see most values following a trend with a few outliers, or you'll see most values cluster at a place with a few outliers. Or maybe something else too!

    • @sanjaysanjay862
      @sanjaysanjay862 2 years ago

      yes, you can do it by plotting each feature with the target.

  • @yomeshyadav3407
    @yomeshyadav3407 3 years ago +1

    Sir, I have a doubt. The threshold is nothing but the 3rd standard deviation, as you said, so it should be 3 * sigma, but here you have taken the threshold as 3. Can you please clarify this?

    • @somomitachattopadhyay2846
      @somomitachattopadhyay2846 1 year ago

      Yes, that's because in the standard normal distribution the standard deviation is 1 (sigma = 1).

  • @bhagyaraj5506
    @bhagyaraj5506 5 years ago +2

    In the z-score method the threshold value is given as 3. Is the threshold simply the 3rd standard deviation?

  • @muhammadmuneebkhanafridi154
    @muhammadmuneebkhanafridi154 4 years ago

    Very well explained.

  • @mdazizulislam9653
    @mdazizulislam9653 4 years ago +2

    Any suggestions for multivariate outliers having mixed variables (continuous & Categorical)?

    • @bonishagarwal9315
      @bonishagarwal9315 4 years ago

      In case of categorical data, it will be better to find the outlier using a scatter plot as sir explained.

  • @chandrasekharpoluboyina8865
    @chandrasekharpoluboyina8865 4 years ago

    Generally we remove this noise, but for fraud detection and for identifying a rare disease, outliers are helpful. In such cases, how do we handle or use them instead of removing them?

  • @satyanarayanajammala5129
    @satyanarayanajammala5129 5 years ago +1

    excellent

  • @rushikeshbulbule8120
    @rushikeshbulbule8120 4 years ago

    Excellent👍👏😆

  • @dhirendrajha9667
    @dhirendrajha9667 5 years ago

    Hi, Krish, well explained, can you build one video on rasa chatbot.

  • @raghavgirigiri1
    @raghavgirigiri1 3 years ago

    Krish i just wanna make a small correction, while saying "less than 2" OR "less than 3" say "10% of the data (or whatever the data is) fall below 2 or 3"....otherwise it's great, Good job !!

  • @BAIBHAVPATHYBEE
    @BAIBHAVPATHYBEE 1 year ago

    For the z-score, how did you know the threshold value?

  • @jatingupta4026
    @jatingupta4026 3 years ago

    how to remove those values that are more than the upper bound and lower than the lower bound values respectively? Please tell that too sir

  • @muditmathur465
    @muditmathur465 2 years ago

    Why do we use 1.5 times IQR? Can we take any other number?

  • @aashaygoel7338
    @aashaygoel7338 3 years ago

    During an ML project I ran into a scenario where, after splitting the dataset with train_test_split, the test set contained some categories in a categorical column that were not present in the train set when label encoding it. Can you please explain what to do in this type of scenario, and also whether outliers should be detected before or after the train-test split? I have seen that you explain each topic in detail. Please help me with this scenario.

  • @meghnasingh9941
    @meghnasingh9941 4 years ago

    great explanation, kudos !

  • @aayushijain2160
    @aayushijain2160 4 years ago

    Sir, I understood how to identify outliers using the z-score and IQR, but can you tell us how to fix them? Should we drop that column, or what else should we do to remove the outliers from the dataset?

    • @farazmev3430
      @farazmev3430 4 years ago

      Drop the rows or replace the values (with the mean, mode, or median).

  • @adarshrai22
    @adarshrai22 3 years ago

    @krish naik how to remove outliers from non-normal distributed dataset?

  • @karishmaqweera3869
    @karishmaqweera3869 4 years ago

    Sir, Are you having handwritten notes of whatever you taught in ML course videos?Please share them Sir.

  • @vishalb1204
    @vishalb1204 5 years ago +3

    Can you please enable English subtitle?

  • @prateeksmithpatra5796
    @prateeksmithpatra5796 3 years ago

    outliers.append(y)
    y is not defined, so how did you get it to run?

  • @samarendrapradhan5067
    @samarendrapradhan5067 4 years ago

    Sir, please help. I have a dataset with 10 features, each with a date for a particular index. How can I detect and view the outliers that occur for an index in one or more features? I have 4000 fixed indexes, and the feature values are updated for each date. Thanks.

  • @aws384
    @aws384 4 years ago

    great video and really it is inspiring

  • @PratapO7O1
    @PratapO7O1 3 years ago

    14:06 Here it is a single-dimension df; how do we sort a multidimensional df? We can't sort all rows at once, we need to specify one or two rows. How do we do it with a multi-dimensional df?
    Thank you

  • @manavagarwal9763
    @manavagarwal9763 11 months ago

    where can i get this jupyter notebook for revision

  • @ga43ga54
    @ga43ga54 5 years ago

    Please talk about data strategy

  • @saniyamanchekar9978
    @saniyamanchekar9978 4 years ago +1

    How can I find outliers when there are many columns in a large dataset?

  • @RahulKumar-hj8qk
    @RahulKumar-hj8qk 4 years ago +1

    If we have more than one feature and we remove the outliers, doesn't that affect the other features?

    • @bonishagarwal9315
      @bonishagarwal9315 4 years ago +3

      You need to remove the whole sample of that outlier because if you remove only the outlier from one feature, it results in an empty space leading to inaccurate predictions.
      Eg. if you have Age, Height, and Weight as your input features and u find an outlier in your Age column, you need to remove the whole sample of that particular outlier i.e. remove the complete row of that outlier. Hope I have answered your question.
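
The row-wise removal described above can be sketched as follows. The records, the column names and the Age limits are hypothetical; in practice the limits would come from whichever detection method was applied to the Age column.

```python
# Hypothetical records with three input features.
rows = [
    {"Age": 25, "Height": 170, "Weight": 68},
    {"Age": 31, "Height": 165, "Weight": 72},
    {"Age": 240, "Height": 180, "Weight": 80},  # implausible Age value
    {"Age": 28, "Height": 175, "Weight": 75},
]

# Assumed limits for the Age column (illustrative only).
age_lower, age_upper = 0, 120

# Keep or drop each record as a whole, so no row is left with a missing cell.
cleaned = [row for row in rows if age_lower <= row["Age"] <= age_upper]
print(len(rows), len(cleaned))
```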

  • @NickolayGrin
    @NickolayGrin 5 years ago +1

    Using the mean is OK, but it is not the best idea for outlier detection. Median-based methods are usually more robust.
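
One common median-based approach is the modified z-score built on the median absolute deviation (MAD). The sketch below is an illustration, not something shown in the video: the function name and sample are hypothetical, and the constant 0.6745 and cutoff 3.5 are the conventional choices for this score.

```python
import statistics

def detect_outliers_mad(data, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds `threshold`."""
    median = statistics.median(data)
    abs_devs = [abs(x - median) for x in data]
    mad = statistics.median(abs_devs)  # median absolute deviation
    if mad == 0:
        return []  # more than half the values are identical; score is undefined
    return [x for x in data if 0.6745 * abs(x - median) / mad > threshold]

# Hypothetical small sample with one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 107]
print(detect_outliers_mad(sample))
```

On this small sample the ordinary z-score of 107 comes out just under 3 (roughly 2.998), because the extreme value inflates both the mean and the standard deviation, whereas the median and MAD barely move.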

  • @Getrocknete_Kotze_Schlabbern
    @Getrocknete_Kotze_Schlabbern 3 months ago

    I don't understand why we compute 1.5 * IQR. What does this 1.5 mean, and where does this number come from?

  • @mithunkumar7063
    @mithunkumar7063 5 years ago +1

    Thank you

  • @ganeshkumarpatel
    @ganeshkumarpatel 4 years ago

    Why do such calculations and looping to find outliers? Just apply standard scaling and create a new conditional dataframe of the scaled data containing the values beyond 3 std. Those are the outliers, aren't they?

  • @deepquest
    @deepquest 3 years ago

    Hi Krish, How can we identify root cause of an outlier?

    • @newbie8051
      @newbie8051 2 years ago

      Due to human error in data entry/recording or maybe due to some error/bug in the Data Pipeline

  • @terwasevictorsesugh3902
    @terwasevictorsesugh3902 1 year ago

    What if the data does not follow a normal distribution?

  • @chandrasekharpoluboyina8865
    @chandrasekharpoluboyina8865 4 years ago

    tell us about robust outlier

  • @pratikramteke3274
    @pratikramteke3274 3 years ago

    How to find outliers in multiple linear regression?

  • @hritwijkamble9988
    @hritwijkamble9988 1 year ago +2

    Why threshold = 3?

    • @Blodia1990
      @Blodia1990 3 months ago

      It represents the quartile

  • @mashirnizami134
    @mashirnizami134 4 years ago

    Gr8

  • @shishirdixit5996
    @shishirdixit5996 4 years ago

    Sir, once we have detected these outliers using the z-score method, and if there are too many of them, how can we drop them?

    • @SkipperPlaysYT
      @SkipperPlaysYT 4 years ago +1

      you can use .difference() method to do that
      If A and B are two sets then you can calculate the difference as :
      A.difference(B) , equivalent to (A-B) of the set.
      Similarly (B-A) = B.difference(A)
      Hope this helps
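
A small sketch of the set-difference idea from the reply above, with hypothetical values. One caveat worth keeping in mind: converting to a set discards duplicates and ordering, so a list comprehension is shown alongside it for cases where repeated values and row order matter.

```python
# Hypothetical data and previously detected outliers.
data = [10, 12, 12, 107, 11, 102]
outliers = [107, 102]

# Set difference, as suggested: unordered, and the repeated 12 collapses to one.
remaining_set = set(data).difference(outliers)

# Order- and duplicate-preserving alternative.
outlier_set = set(outliers)
remaining_list = [x for x in data if x not in outlier_set]

print(remaining_set, remaining_list)
```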

  • @aakashsinghrawat3313
    @aakashsinghrawat3313 4 years ago

    Sir, in a dataset such as bank loan prediction, what if a credit score is beyond its range (300-850)? Will such values be considered outliers? If yes, how do we handle them?
    Fellow viewers are welcome to help, please.

    • @rachittoshniwal
      @rachittoshniwal 4 years ago

      If the range itself is 300-850 and you are having values above or below that range, then that is a data error, and you can drop them unless you can devise a way to find the real value
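
A minimal sketch of that kind of range check, assuming the valid range of 300 to 850 quoted in the question. The scores themselves are made up for illustration; values outside the range are separated out as data errors rather than treated as statistical outliers.

```python
# Valid range quoted in the question above.
VALID_MIN, VALID_MAX = 300, 850

# Hypothetical credit scores, two of which are impossible values.
scores = [720, 910, 655, 120, 850, 300]

valid = [s for s in scores if VALID_MIN <= s <= VALID_MAX]
invalid = [s for s in scores if not (VALID_MIN <= s <= VALID_MAX)]
print(valid, invalid)
```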

  • @LailahaillahChannel
    @LailahaillahChannel 4 years ago

    Can you do a video on RANSAC?

  • @muhammadyazidbaihaqi1479
    @muhammadyazidbaihaqi1479 2 years ago

    Why does your video have no subtitles? Please add them, thanks.

  • @jorgeeg2668
    @jorgeeg2668 2 years ago

    How do I detect outliers as a function of datetime?

  • @abdulaziz-lh3nb
    @abdulaziz-lh3nb 2 years ago

    what if I have a lot of outliers in the dataset (around 27%), how to handle that?

    • @newbie8051
      @newbie8051 2 years ago

      If I were you, I would go for missing value treatment first, then try to go with outlier treatment, also if I had to deal with such high % of outliers, my first thought would be treat them like normal data points, as deleting outliers would lead to loss of too-many data points.
      Can you share how you solved the problem ?

  • @iliyasn2760
    @iliyasn2760 5 years ago +1

    we need to append 'i' value not 'y'

  • @econdoc3000
    @econdoc3000 4 years ago +1

    Hi Krish, your definition of quantiles is wrong! If you have 0.1=F(x) with F() being the cumulative density, then its 0.1 = F(x)=P(X

    • @sanjaysanjay862
      @sanjaysanjay862 2 years ago

      yes, and your definition is nice.

  • @julieohn
    @julieohn 4 years ago

    What to do after detecting outliers? How do we treat them?

  • @AmeerulIslam
    @AmeerulIslam 4 years ago

    should be i instead of y in outlier.append(i)

    • @AmeerulIslam
      @AmeerulIslam 4 years ago

      i can see you have fixed it in the video but not in github.

  • @KNfarming882
    @KNfarming882 3 years ago

    It's not a dataset; it's a data point that is >= 3 standard deviations away.

  • @zehraup4722
    @zehraup4722 4 years ago

    codes:
    www.kaggle.com/c0derr/outlier-detection

  • @aparnashrivastava5837
    @aparnashrivastava5837 4 years ago

    Thanks