Today I started working on the Titanic data. I tried to predict the missing age values but failed and was very tense. So I started watching your video in the hope of finding a way. When you opened the notebook I felt such relief - 'now it will surely get done'. Thank you for making this video.
Regarding what you did for the Cabin column: we can't simply remove it just by saying there are many missing values.
I think there is a quantitative justification for why we should fill the NaN values in 'Age' with the median grouped by 'Sex' and 'Pclass'. During the EDA step, we can print or visualize a heatmap of the correlations between the columns (dataset.corr().abs()). We can see that the 'Age' column has a relatively high correlation with the 'Sex' and 'Pclass' columns.
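For anyone wanting to try this, here is a minimal sketch of that EDA check, assuming the standard Titanic train.csv and a `dataset` DataFrame; note that corr() only works on numeric columns, so 'Sex' has to be encoded first:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical path; point this at the Titanic train.csv.
dataset = pd.read_csv('train.csv')

# corr() only considers numeric columns, so encode 'Sex' before computing it.
dataset['Sex'] = dataset['Sex'].map({'male': 0, 'female': 1})

# Absolute correlations between the numeric columns, shown as a heatmap.
corr = dataset.select_dtypes('number').corr().abs()
sns.heatmap(corr, annot=True, cmap='viridis')
plt.show()
```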
Hi Krish,
Your videos are quite useful and simple to understand. My request: if you could create a video on how we can deploy an ML model with Flask, that would be very useful.
Sure, I will do that.
Starting data science after a career gap of so many years... your videos are god-level.
Thank you Krish sir. I was following the Kaggle Learn course on machine learning but couldn't understand this topic even after so much hard work - now it's all clear. Keep it up.
Your channel is awesome, please keep going! Can't tell you how valuable your videos are when starting to learn!
Thank you Krish, you have explained the second option very well. I'm wondering how we do this for categorical columns, and when values are missing from multiple fields.
Thank you for making life so much easier for us!
Thanks a lot for sharing your knowledge with us. Kindly address one confusion: do we need to impute missing values in the test dataset the same way you taught in the video?
Cleared all my doubts! Great... Thank you so much!!
Well, I appreciate the video that Mr. Krish Naik made, and I love watching his videos, but I really want to discuss how we handle missing values. Using a separate model to learn the relation between variables from the complete rows isn't great, because the imputed value is generated by a machine-learning model rather than being real data, and it may land statistically far from the center of the population distribution since it comes from another equation. I would rather use a statistical method like the mean, median, or mode and, though I don't know whether this works, check the range of the population mean to make sure the imputed value doesn't stray far from it.
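A minimal sketch of the sanity check described above, assuming a pandas Series of ages and a simple two-standard-error band around the sample mean (the acceptance range is my assumption, not something from the video):

```python
import pandas as pd
import numpy as np

def check_imputed_value(series: pd.Series, imputed: float, n_se: float = 2.0) -> bool:
    """Return True if `imputed` lies within n_se standard errors of the sample mean."""
    observed = series.dropna()
    mean = observed.mean()
    se = observed.std(ddof=1) / np.sqrt(len(observed))
    return (mean - n_se * se) <= imputed <= (mean + n_se * se)

# Example: is the median a 'safe' fill value for Age?
ages = pd.Series([22, 38, 26, 35, 35, np.nan, 54, 2, 27, 14])
print(check_imputed_value(ages, ages.median()))
```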
Nice explanation. Conclusion: it depends on your end goal, and on whether dropping rows or replacing with the mean will affect your analysis. In his example he needed the age, but he didn't need the cabin.
Beautifully explained, with great attention to detail!
Great way of explaining things. I like it very much.
Hi Krish, could you please tell us what to do when there are missing values in the dependent variable?
Your explanation is pretty amazing, and you're perfect as usual.
Hi Krish, I want to understand why you chose Pclass to replace the null values for Age.
Why not any of the other attributes?
I thought you would also implement a regression model for synthetic imputation. But the content is great!!
Thanks Krish. I can't think of an easier explanation of a tricky topic!!! Simply superb!!!👍
Krish, I have one doubt. You are saying that we need to impute the null values by considering the other related columns. Then how can we implement the same as a pipeline (sklearn.pipeline.Pipeline) so that the pipeline can be used to impute the missing values of the test dataset?
Please clear my doubt if anybody knows! It would be helpful for me.
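For anyone with the same doubt, here is a minimal sketch of one way to do it with scikit-learn's ColumnTransformer and SimpleImputer (the column names are the Titanic ones; the classifier choice is arbitrary). Note this uses per-column statistics rather than the group-wise median from the video, which would need a custom transformer:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric = ['Age', 'Fare']
categorical = ['Sex', 'Pclass', 'Embarked']

preprocess = ColumnTransformer([
    # Median imputation for the numeric columns.
    ('num', SimpleImputer(strategy='median'), numeric),
    # Most-frequent imputation plus one-hot encoding for the categoricals.
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical),
])

model = Pipeline([('prep', preprocess), ('clf', LogisticRegression(max_iter=1000))])

# model.fit(X_train, y_train) learns the medians/modes from the training data;
# model.predict(X_test) then reuses those same statistics on the test set.
```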
Could you please make a video on missing-value imputation using decision trees?
Hi Krish,
I think the age column in the distplot is right skewed. I do not think that it has a normal distribution.
A really good idea, creating a separate model. Thanks for sharing.
Thanks for the video. You said that option 2 (model-based imputation) is less preferred for huge datasets. Does that mean that, in general, it is better to go with statistics-based imputation over model-based imputation on real-world datasets, since we get a lot of data in the real world? I am working on the Home-Credit-Default-Risk Kaggle competition dataset and would request your comment on which imputation method to use.
Honestly, I really love your videos: simple and easy to understand, and they always answer my machine learning and data science questions! I do have one question, though. I watched your video on standardisation and normalisation. I am trying to build a benchmark/index; would it be okay to standardize the data before creating it?
QUESTION: why did you choose the imputed value of age with respect to Pclass and not with respect to male/female?
Very helpful video
Thanks Krish
Hi Krish,
I have a doubt. How would you treat a variable that has around 30% missing values but is important to keep? The overall record count is around 550K.
Thanks Krish, very helpful.
What's the recommended rule for deciding whether to apply imputation techniques or simply drop the rows with missing values? Missing values can follow different patterns, like Missing at Random, Not Missing at Random, and so on. What should we do in those cases?
Thanks sir, I was confused about exactly this part: NaN values and why we take the sum of those NaN values.
If you're using the isnull() function, it will turn all your missing values into True (or 1) and the non-null ones into False (or 0). After that you can just sum() all of the 1's to find out how many NaN values are in your dataset.
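For example, on a small made-up DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22, np.nan, 35], 'Cabin': [np.nan, np.nan, 'C85']})

# isnull() marks missing entries as True; sum() counts the Trues per column.
print(df.isnull().sum())
# Age      1
# Cabin    2
```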
@BretskoD: Thank you, sir.
Great! Thanks for the explanation.
All good with the imputation of null values for Age. However, for the Cabin feature, instead of deleting roughly 70% of the records, if we aren't able to find any way to impute via domain knowledge, why can't we tag them as "Undetected" and keep the records for model training?
Deleting 70% of the records for just one feature will surely not be the best solution if we want to improve model performance.
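A one-line sketch of that idea in pandas, assuming a `dataset` DataFrame (the "Undetected" label is this commenter's suggestion, not something from the video):

```python
# Keep every row; give missing cabins an explicit category instead.
dataset['Cabin'] = dataset['Cabin'].fillna('Undetected')
```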
Could you please do some more deployment videos using Flask, Mr. Krish?
But anyway, thank you for sharing. It helped me a lot in learning how to handle missing values. Nice work!
Are these only for numerical data? What methods can I use for characters/names or years? Please suggest. Thanks!
How can we decide whether to use the mean, median or mode to replace a missing value?
You have to decide based on your data.
Our first priority is the mean. If we have large outliers, we go for either the mode or the median depending on the situation, as these two statistics are least affected by outliers.
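A quick illustration of why the median resists outliers while the mean does not (toy numbers, purely for demonstration):

```python
import numpy as np

ages = np.array([20, 22, 25, 27, 30])
with_outlier = np.append(ages, 400)  # one extreme value

print(ages.mean(), np.median(ages))                  # 24.8 25.0
print(with_outlier.mean(), np.median(with_outlier))  # ~87.3 26.0
```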
Hi Krish, I find your videos very useful for beginners like me. Here you have shown how to handle missing values for number and string fields. We also need to handle date and time columns. Please guide us through this.
Hi, how do we deal with a year column that mixes values like 2006 and 0 in the same column?
Thank you so much.
Thanks a lot for detailed explanation. It really helps
Sir, can you please tell us about the role of ROC and CAP curve analysis in improving model performance?
At 3:45 you said that we delete the record, but what if that variable/feature is significant?
We can replace the null values with the mean of each column.
Hi Krish, I have one doubt on this case study.
Why did you impute on the basis of the class column? We could also do it on the basis of the Gender column: the median/mean of male passengers and the median/mean of female passengers.
Also, since we have normally distributed age data, can we apply the mean instead of the median?
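A minimal sketch of that gender-based alternative, assuming the standard Titanic columns and a `dataset` DataFrame:

```python
# Fill each missing age with the median age of that passenger's gender group.
dataset['Age'] = dataset['Age'].fillna(
    dataset.groupby('Sex')['Age'].transform('median')
)
```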
Sir, if we have missing values in the output column, then how can the separate-model approach be used?
Sir, please give a suggestion regarding the Cabin feature: if it has a low number of missing values, how do we deal with that type? It is a combination of categorical and numerical data.
Same doubt, bro. Do you know the answer?
Sir, can you please make a video on named entity recognition using TensorFlow Keras?
Hi, I want to know: you used the box-plot median to replace missing values in the Age column, but why not the mean or mode? Can you please tell me the reason?
How do we handle missing (NaN) values in a column with binary values, i.e. just 0 or 1?
Thank you so much
Please, sir, what do I do if I have 80% missing values in my target variable?
I'm trying to predict the gross of movies, but the target variable I need to train my model on has 80% missing values.
I am thinking of a Netflix-prize style solution for filling in the missing values, something like minimizing a cost function.
In which scenario can we delete the data with missing values?
When there is a very large data set.
Nice video.
First, I want to thank Krish for all your content. I have a NumPy array of continuous values obtained from a regression model, but I don't know how to fill the null values using that continuous array. Can anyone help me out?
Sir, I have a doubt about this. What if we have 50 Pclass values? It becomes really tedious to write all of them out. Is there a way we can use a list of such Pclass values alongside the list of potential ages while defining the function? For example: if Pclass == list1, return age == list2.
Just use a dictionary where the key is the Pclass and the value is the mean.
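A minimal sketch of that dictionary approach, assuming the standard Titanic columns and a `dataset` DataFrame (using the median per class, as in the video):

```python
# Build {Pclass: median age} once, then map it onto only the missing rows.
age_by_class = dataset.groupby('Pclass')['Age'].median().to_dict()
dataset['Age'] = dataset['Age'].fillna(dataset['Pclass'].map(age_by_class))
```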
Why write by hand and not just make the text appear (as in, pre-typed so people can read it, with a transition)?
Loved it
Awesome
Great!
Sir, I was unable to understand the programming part on Udemy. That is why I searched on YouTube, but here I can see both of them are exactly the same. You should at least change the digits. With all due respect, you just copied it from your Udemy course.
Hi sir, how do we fill missing values using linear regression?
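For reference, here is a minimal sketch of regression-based imputation with scikit-learn on the numeric Titanic columns; this is one common approach, not necessarily what the video intends, and the feature list is an assumption:

```python
from sklearn.linear_model import LinearRegression

# Predict Age from other numeric columns; this assumes those feature
# columns themselves contain no missing values.
features = ['Pclass', 'SibSp', 'Parch', 'Fare']
known = dataset[dataset['Age'].notna()]
missing = dataset[dataset['Age'].isna()]

reg = LinearRegression().fit(known[features], known['Age'])
dataset.loc[dataset['Age'].isna(), 'Age'] = reg.predict(missing[features])
```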
👍👍👍
Terrible presentation; I can't really understand the red scribble. Maybe try typing.