13:57 Correction
Lower bound=Q1-IQR*1.5
Upper bound = Q3 + IQR*1.5
can you use Upper bound in a histogram as a max value?
Amazing Krish, now I understand the concept of outliers, thanks
Clustering techniques are also widely used in industry to detect outliers. Specially isolation forest algo
Superb explanation...in very simple way..
The tutorial offers a lucid explanation of the complex problem of outliers. It is well presented, with examples that make it easier to follow. However, threshold = 3 isn't working for me. I modified it to threshold = 3+std to make it work properly. Moreover, declaring outliers = [ ] outside the function causes problems if you want to use this function on another dataset in the same notebook. So declaring the outlier list inside the function would be a better approach, I think.
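A minimal sketch of the suggestion to keep the list inside the function, so that results from one dataset do not leak into the next call. The function name `detect_outliers_zscore` and the sample data are hypothetical, and the video's exact code is not reproduced here; NumPy is assumed for the mean and standard deviation.

```python
import numpy as np

def detect_outliers_zscore(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    values = np.asarray(data, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        return []  # all values identical, nothing can be flagged
    outliers = []  # created fresh on every call, so calls stay independent
    for x in values:
        z = (x - mean) / std
        if abs(z) > threshold:
            outliers.append(float(x))
    return outliers

# Hypothetical sample: 19 ordinary values and one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 14, 13, 12,
          11, 12, 13, 10, 11, 12, 14, 13, 12, 100]
print(detect_outliers_zscore(sample))
print(detect_outliers_zscore(sample))  # a second call does not accumulate results
```

Note that with very few data points no z-score can reach 3 (with the population standard deviation the largest possible z-score is sqrt(n - 1)), which may be one reason a threshold of 3 appears "not to work" on a small list.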
Very clear and crisp explanation, loved it
Here is the correction lower bound = q1 - 1.5*IQR and upper bound = q3 + 1.5*IQR
You mean it's a mistake in the video?
Yes bro, check statistics playlist by krish naik.
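The corrected bounds from this thread can be sketched as follows. The helper name `iqr_bounds` and the sample list are illustrative assumptions; `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

def iqr_bounds(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1  # inter-quartile range
    return q1 - k * iqr, q3 + k * iqr

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lower, upper = iqr_bounds(values)
print(lower, upper)
# Anything below `lower` or above `upper` would be treated as an outlier.
```

With this form the lower bound always sits at or below Q1 and the upper bound at or above Q3, which is what the comments expect.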
Nice content, and you explained it very well. Thank you so much.
amazing video
Super explanation
Thank you so much sir, I understood everything
You have explained things well. Just one correction - it's inter-quartile range and not inter-quantile range.
It's Inter Quartile Range
Hello sir, i hope you are doing well, i was hoping if you can help me with OD, I'm doing a thesis on the subject and i'm very new to python and programming, i hope to hear from you and thank you in advance.
Thank you, sir, for this content.
I have a couple of questions.
1. Is it always better to remove the outliers, or could that be a big mistake as well? You gave the example of a fraudulent transaction. An outlier is indeed a hint that the transaction was fraudulent. If I remove all such transactions in the first place, how am I going to achieve my results?
2. You did not explain how to perform outlier checks on a multivariate dataset, for example the IRIS dataset. I have seen a couple of videos here and there, but no proper way is coming out. What is the proper way to identify outliers in multivariate datasets?
Thanks
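On the multivariate question, one commonly used option (an assumption on my part, not something the video covers) is the Mahalanobis distance, which measures how far a row is from the centre of the data while accounting for correlation between features. The sketch below uses only NumPy; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of every row of X from the column means, scaled by the covariance."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)      # feature-by-feature covariance matrix
    inv_cov = np.linalg.pinv(cov)      # pseudo-inverse tolerates near-singular cov
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared distance per row
    return np.sqrt(np.maximum(d2, 0.0))

# Synthetic data: y is almost exactly 2*x, plus one row that breaks the pattern.
x = np.linspace(0, 10, 50)
y = 2 * x + 0.1 * np.sin(np.arange(50))
X = np.column_stack([x, y])
X = np.vstack([X, [2.0, 18.0]])        # x and y are each in range, the pair is not

distances = mahalanobis_distances(X)
suspect = int(np.argmax(distances))
print(suspect, X[suspect])
```

The added row looks unremarkable in each column on its own, so a per-column z-score would not flag it; it stands out only when both features are considered together. How large a distance counts as an outlier is a separate choice that this sketch leaves open.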
Nice work mate. I also tried something similar but with Upper and Lower Bound on the Return
Well explained, would be great if you can add some plot for visualization.
Just a correction: when calculating the z-score, you are subtracting an array from i. You should enumerate over the dataset and then use the mean and std at the current index for i.
mean and std are not arrays. The mean of a list of values is a single value, and so is the standard deviation.
Hi Krish, Thank you so much for the tutorial, Very clear and crisp explanation, loved it :)
14:06 Here it is a single-dimension df. How do we sort a multidimensional df? We can't sort all rows at once; we need to specify one row or two. How do we do it with a multi-dimensional df?
Thank you
Excellent👍👏😆
great video sir. great content, and explained in the cleanest way possible. thanks
Hi Sir, a small doubt about the part of the video where you talk about the standard normal distribution. You said the graph shows a standard normal distribution, but then you said that when data falls below or beyond the 3rd standard deviation, you will not consider it. Kindly clarify.
insightful for me
Very helpful...
Very well explained.
Are there any anomaly detection videos that don't use credit card fraud as an example?
Very helpful !
What should we do with natural outliers?
That is, the outliers that are expected to be there and are not caused by any artificial errors.
great explanation, kudos !
Sir, shouldn't the threshold value be 3*std and not just 3? The rule is that a data point is considered an outlier if it falls outside the 3rd standard deviation, not just outside the value 3.
Do you mean when the z-score = 3? Then it is correct to use a threshold of 3, because you have standardized the data: the standard deviation of the z-scored values is 1 and their mean is 0.
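A small check of this point, on hypothetical sample data: comparing the raw distance from the mean against 3*std flags exactly the same points as comparing the z-score against 3.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 14, 13, 12,
                 11, 12, 13, 10, 11, 12, 14, 13, 12, 100], dtype=float)
mean, std = data.mean(), data.std()

# Rule on the raw values: more than 3 standard deviations from the mean.
raw_rule = np.abs(data - mean) > 3 * std

# Same rule after standardizing: z-scores have mean 0 and std 1,
# so "3 standard deviations" becomes the plain number 3.
z = (data - mean) / std
z_rule = np.abs(z) > 3

print(data[raw_rule], data[z_rule])
```

So 3*std is the right threshold for the raw values, and 3 is the right threshold once the values have been converted to z-scores; the two are the same test written in different units.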
I have been following your videos and I have learnt many things, Krish Naik. Could you please tell me whether you have written any data science and machine learning books? I would like to buy your books and follow your videos to clinch a data science job as soon as possible.
Hi Kiran,
I have written a book on finance with ML and DL
@@krishnaik06 could you please share the link, so that I can buy that book? Looking forward to more videos.
Thanks. I wonder how to detect outliers in a NumPy ndarray, I mean an n-by-m array. You explained it for a 1D array; what about 2D?
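For a 2D array, one straightforward approach (a sketch, assuming each column is a feature and should be standardized on its own) is to compute the z-scores along `axis=0`. The function name and sample array are hypothetical.

```python
import numpy as np

def zscore_outlier_mask(arr, threshold=3.0):
    """Return a boolean array, True where a cell's column-wise |z| exceeds threshold."""
    arr = np.asarray(arr, dtype=float)
    mean = arr.mean(axis=0)             # one mean per column
    std = arr.std(axis=0)               # one std per column
    std = np.where(std == 0, 1.0, std)  # constant columns get z = 0 everywhere
    z = (arr - mean) / std              # broadcasting applies per column
    return np.abs(z) > threshold

col0 = [10, 12, 11, 13, 12, 11, 10, 14, 13, 12,
        11, 12, 13, 10, 11, 12, 14, 13, 12, 100]
col1 = [5.0] * 20
arr = np.column_stack([col0, col1])     # shape (20, 2)

mask = zscore_outlier_mask(arr)
rows_with_outlier = np.where(mask.any(axis=1))[0]
print(rows_with_outlier)
```

`mask` has the same n-by-m shape as the input, so it can be used to inspect individual cells, while `mask.any(axis=1)` gives the rows that contain at least one flagged value.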
In the z-score method the threshold value is given as 3. Is the threshold nothing but the 3rd standard deviation?
yes you're correct
Is there any condition under which it is better to use one method over the other?
great video and really it is inspiring
Hi Krish thanks for making such an amazing content. I have a query at 09:35.
As you have mentioned that we can find outliers using scatter plots. But how can we find outliers if we do have multiple features(more than 2 features)? Your views/response on this would be much appreciated.
Thanks in advance.
You can try with any two random features from your data
You'll either see most values following a trend with a few outliers, or you'll see most values cluster at a place with a few outliers. Or maybe something else too!
yes, you can do it by plotting each feature with the target.
Every single YouTube channel explains this from a univariate perspective. Can you please explain it for the multivariate case? There is very little material about that on the internet.
Hi Krish, thank you so much for a nice video. Can you please share the link to the next video, where you applied these techniques to a Kaggle dataset?
Sir, I have a doubt. As you said, the threshold is nothing but the 3rd standard deviation, so it must be 3 * sigma, but here you have taken the threshold as 3. Can you please clarify this?
Yes, that's because in the standard normal distribution the standard deviation has the value 1, i.e. sigma = 1.
thanks for sharing this video.
One correction: in the loop it should be outliers.append(i), not outliers.append(y).
How is the lower bound, which you said is q1*1.5, greater than the lower quartile, which you said is q1?
The lower bound seems like something that should be less than the lower quartile.
What do we do when the outliers do not follow a Gaussian distribution and an outlier lies inside the data distribution rather than at the extremes?
Firstly, if a data point is in between the data distribution then IT IS NOT AN OUTLIER, so your question ends there. An outlier is any value that lies far away from the majority of the data distribution.
Hi Krish, I just ordered your finance book on Amazon, which is the newest one on all of Amazon about Python in finance. Will you do more videos on finance?
Thanks Kwok for buying my book...yes I will be uploading more videos on finance.
@@krishnaik06 Hands-On Python for Finance is out of stock. Please let us know when it will be available for sale.
Any suggestions for multivariate outliers having mixed variables (continuous & Categorical)?
In case of categorical data, it will be better to find the outlier using a scatter plot as sir explained.
@krish naik how do we remove outliers from a dataset that is not normally distributed?
Sir, please can you tell me the difference between anomalies and outliers?
I am confused about these two.
Please, sir, answer me.
Anomalies and outliers are the same; they just have different names, like how we humans have an original name and a pet name.
How do we remove the values that are above the upper bound or below the lower bound? Please explain that too, sir.
For the z-score, how did you know the threshold value?
Generally we remove this noise. But for fraud detection and for identifying a rare disease, outliers are helpful. In such cases, how do we handle or use them instead of removing them?
During an ML project I came across a scenario where, after I split the dataset with train_test_split, the test set contained some categorical values that were not present in the train set when label encoding it. Can you please explain what to do in this type of scenario, and also whether outliers should be detected before or after the train-test split? I have seen that you explain each topic in detail. Please help me with this scenario.
Hi Krish
Thanks for the excellent explanation. But if we get some outliers in a feature, should we remove the records containing them (in which case we lose some data)? If not, how can we handle outliers? Please cover this portion also :)
Capping (winsorization) is another way to deal with outliers: the extreme values are replaced with values within the range, so no data is lost.
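A minimal sketch of capping, assuming the IQR fences are used as the limits (classic winsorization usually caps at chosen percentiles instead, so treat this as one variant). The function name `cap_with_iqr` is hypothetical.

```python
import numpy as np

def cap_with_iqr(data, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] instead of dropping them."""
    values = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return np.clip(values, q1 - k * iqr, q3 + k * iqr)

raw = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
capped = cap_with_iqr(raw)
print(capped)  # same length as `raw`; only the extreme value is pulled in
```

Every row is kept, which is the point made above; the trade-off is that the capped value is no longer the one that was actually recorded.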
outliers.append(y)
y is not defined, so how did it compile?
Sir, please help. I have a dataset that contains 10 features, each with a date, for a particular index. How can I detect and see the outliers when they occur for an index in one or more features? I have 4000 fixed indexes, and the feature values are updated for each date. Thanks.
excellent
I've applied both methods to my dataset, but I got different results from them. Which one should I choose? Is it possible for them to give different results?
Where can I get this Jupyter notebook for revision?
Hi Krish, well explained. can you please post a video on how to equate the outliers using any dataset. Thanks in advance.
Sir, I understood how to identify outliers using the z-score and IQR, but can you tell us how to fix them? Should we drop that column, or what else should we do to remove the outliers from the dataset?
Drop the rows or replace the values (mean, mode, median).
Why do we use 1.5 times IQR? Can we take any other number?
How can I find outliers when there are many columns in a large dataset?
If we have more than one feature and we then remove the outliers, won't that affect the other features?
You need to remove the whole sample containing that outlier, because if you remove only the outlier value from one feature, it leaves an empty cell, leading to inaccurate predictions.
E.g. if you have Age, Height, and Weight as your input features and you find an outlier in your Age column, you need to remove the whole sample for that outlier, i.e. remove the complete row. Hope I have answered your question.
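The row-removal idea in this reply can be sketched with pandas. The column names follow the Age/Height/Weight example above; the values, the function name `drop_outlier_rows`, and the choice of the IQR rule for flagging are assumptions made for illustration.

```python
import pandas as pd

def drop_outlier_rows(df, column, k=1.5):
    """Return df without the rows whose `column` value lies outside the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)  # inclusive bounds
    return df[keep]  # the whole row goes, not just the Age cell

people = pd.DataFrame({
    "Age":    [25, 30, 28, 35, 32, 29, 31, 27, 33, 150],
    "Height": [170, 165, 180, 175, 160, 172, 168, 177, 169, 171],
    "Weight": [70, 60, 85, 78, 55, 72, 66, 80, 68, 74],
})

cleaned = drop_outlier_rows(people, "Age")
print(cleaned)
```

The row with Age 150 is dropped together with its Height and Weight, so no empty cell is left behind in the other columns.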
Hi Krish
I like your videos a lot. Very informative. Could you please post videos on word2vec models such as skip-gram and CBOW, and on gensim and GloVe?
Thanks in advance.
I don't understand why we compute 1.5 * IQR. What does this 1.5 mean, and where do you get this number?
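One way to get a feel for the 1.5 multiplier is to see where the fences land if the data were perfectly normal. This is only a sketch under that normality assumption, using the standard library's `statistics.NormalDist`; the 1.5 itself is a convention and other multipliers can be used.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = std_normal.inv_cdf(0.25)   # lower quartile of a standard normal
q3 = std_normal.inv_cdf(0.75)   # upper quartile
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Share of a normal distribution that lies inside the two fences.
coverage = std_normal.cdf(upper_fence) - std_normal.cdf(lower_fence)
print(upper_fence, coverage)
```

For normal data the fences sit at roughly 2.7 standard deviations from the mean and enclose about 99.3% of the values, so the 1.5 * IQR rule behaves much like a slightly stricter version of the 3-standard-deviation rule, without needing the mean or standard deviation.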
Why did you choose threshold = 3?
Because any value more than 3 standard deviations from the mean is considered an outlier. So the threshold of 3 here basically means 3 standard deviations.
Why threshold = 3?
It represents the quartile
What if the data does not follow a normal distribution?
Sir, do you have handwritten notes for what you taught in the ML course videos? Please share them, sir.
Hi Krish, well explained. Can you make a video on the Rasa chatbot?
Krish, I just want to make a small correction: instead of saying "less than 2" or "less than 3", say "10% of the data (or whatever the data is) falls below 2 or 3". Otherwise it's great. Good job!
Sir, once we have detected these outliers using the z-score method, and if there are too many of them, how can we drop them?
you can use .difference() method to do that
If A and B are two sets then you can calculate the difference as :
A.difference(B) , equivalent to (A-B) of the set.
Similarly (B-A) = B.difference(A)
Hope this helps
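A short sketch of the set-difference suggestion, next to a list-based alternative. The sample values are hypothetical. One caveat worth knowing: converting to a set discards duplicates and ordering, which may or may not matter for your data.

```python
data = [10, 12, 12, 11, 100, 13]
outliers = [100]

# Set difference, as suggested above: unique values only, order not preserved.
unique_kept = set(data).difference(outliers)

# List comprehension: keeps duplicates and the original order.
outlier_set = set(outliers)
kept = [x for x in data if x not in outlier_set]

print(unique_kept)
print(kept)
```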
How to find outliers in multiple linear regression?
Hi Krish, How can we identify root cause of an outlier?
Due to human error in data entry/recording or maybe due to some error/bug in the Data Pipeline
Sir, in a dataset such as bank loan prediction, what if a credit score is beyond its range (300-850)? Will such values be considered outliers? If yes, how do we handle them?
Great fellows are welcome to help, please.
If the range itself is 300-850 and you are having values above or below that range, then that is a data error, and you can drop them unless you can devise a way to find the real value
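A range check like the one described in this reply needs no statistics at all. A minimal sketch, assuming the 300-850 limits quoted in the question and a hypothetical helper name:

```python
VALID_MIN, VALID_MAX = 300, 850  # limits quoted in the question above

def split_by_valid_range(scores, low=VALID_MIN, high=VALID_MAX):
    """Separate scores inside [low, high] from those that must be data errors."""
    valid = [s for s in scores if low <= s <= high]
    invalid = [s for s in scores if not (low <= s <= high)]
    return valid, invalid

valid, invalid = split_by_valid_range([720, 650, 900, 300, 120, 850])
print(valid)    # scores that can be kept
print(invalid)  # scores to drop or to correct if the real value can be recovered
```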
tell us about robust outlier
What if I have a lot of outliers in the dataset (around 27%)? How do I handle that?
If I were you, I would do missing value treatment first and then outlier treatment. Also, if I had to deal with such a high percentage of outliers, my first thought would be to treat them like normal data points, as deleting them would lead to the loss of too many data points.
Can you share how you solved the problem ?
Please talk about data strategy
Can you please enable English subtitle?
Can you do a video on RANSAC?
Gr8
Thank you
Why do such calculations and looping to find outliers? Just apply standard scaling and create a new conditional dataframe from the scaled data containing the values more than 3 std from the mean. Those are the outliers, aren't they?
How do we detect outliers as a function of datetime?
Using the mean is OK, but it is not the best idea for outlier detection. Median-based methods are usually more robust.
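One median-based option is the modified z-score, which replaces the mean with the median and the standard deviation with the median absolute deviation (MAD). This is a sketch of that technique, not something shown in the video; the scaling constant 0.6745 and the cutoff of 3.5 are the commonly cited convention, and the function name is hypothetical.

```python
import numpy as np

def detect_outliers_mad(data, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    values = np.asarray(data, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    modified_z = 0.6745 * (values - median) / mad
    return values[np.abs(modified_z) > threshold].tolist()

small = [10, 12, 11, 13, 12, 100]
print(detect_outliers_mad(small))

# For comparison, the ordinary z-scores of the same six values.
z = (np.array(small) - np.mean(small)) / np.std(small)
print(np.abs(z).max())
```

On this small list the extreme value inflates both the mean and the standard deviation, so its ordinary z-score stays below 3, while the median and MAD are barely affected and the modified score flags it.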
Thanks
Why does your video have no subtitles? Please add them, thanks.
should be i instead of y in outlier.append(i)
i can see you have fixed it in the video but not in github.
we need to append 'i' value not 'y'
Hi Krish, your definition of quantiles is wrong! If you have 0.1=F(x) with F() being the cumulative density, then its 0.1 = F(x)=P(X
yes, and your definition is nice.
What to do after detecting outliers? How do we treat them?
codes:
www.kaggle.com/c0derr/outlier-detection
It's not a data set; it's a data point that is >= 3 standard deviations away.
Thanks