Wow, you have a natural ability to make complicated concepts become simple.
I was struggling with the concept, but your video made it crystal clear to me, thanks
This was so clear and easy to understand! Thank you!
This explanation is awesome! Congratulations!
Glad you think so!
Amazing sir. It's really helpful.
VERY clear explanation. Thank you!
Thank you for producing this high-quality video.
A thousand thanks, your explanation is very easy to understand, it's really helpful.
You are welcome!
Very very clear. Very helpful. Thank you!
This is an outstanding explanation. Thank you so much for making this.
Great explanation! Thanks a lot.
Thanks, very clear and useful!
Thank you very much! I love your videos; they are always clearly explained.
Thanks !
Very informative! Thank you, good sir :)
clearly explained! thanks a lot!
Very helpful, thank you!
Great job!
Thank you for the interesting and helpful series about missing data. Also, great video quality.
Good job! Great video!
Thanks a lot for this very clear video. Do you know if we can combine multiple imputation with variable selection (with lasso for example) for prediction purposes?
This is an amazing video. Thank you so much. Do we have to check the assumptions for linear regression for each model for each imputed variable?
This is so clearly explained. Thank you very much for this concise and informative video! I have a question. I believe the purpose of step 2 - calculating the standard deviation - is to confirm that the mean is a reliable one. What if the standard deviation is too large? Does it imply that the imputation method is not a reliable one and should not be adopted? Thank you!
Thanks! That was a really nice explanation!!
Great explanation, and an excellent description of how multiple imputation works! But I have a question: how do I choose the final value for the imputation if there are 5 values? Should I just average the 5 values, or is there a better approach? Thank you!
Yes. As explained in the video, you calculate the mean of the 5 values, and you calculate the standard deviation as well.
@@qwertyuiop-qy6hb I am confused: do we take the mean of the predicted values as the final chosen value, or the mean of the sample means? Using the mean of the predicted values makes more sense to me, but why do we want to look at the standard deviation of the sample means? What will that actually change in our decision?
@@tinghuachen7844 I am not a statistician, so what I will tell you is the way I understand it. Multiple imputation gives you an estimate of the missing data points of a specific variable in a data set. This estimate is based on the values of the same variable (one column) for the other individuals, or rows, in the data set. Every time you perform the imputation, the resulting value depends on the selected individuals, which should be selected randomly. So if you do the imputation, say, 5 times, you end up with five estimated values. The mean of these values gives the estimate of the missing data point. Calculating the standard deviation (from here on this is my own understanding) gives you an idea of how variable these estimates are; the same goes if the standard error is calculated. If the estimated values are spread out rather than close in value, the SD will be high, and I'd be careful in my assumptions and in the interpretation of the final analysis. I have not done multiple imputation in my domain (medicine), and I would be very careful using it, but it is certainly a great method to avoid discarding data and to use the full sample size.
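To make the pooling step concrete, here is a minimal sketch in Python (numpy only; the numbers and variable names are made up for illustration, they are not from the video): take the five imputed values for one missing cell, use their mean as the filled-in value, and use their standard deviation to judge how much the runs disagree.

import numpy as np

# Five imputed values for the same missing cell, one per imputation run
# (hypothetical numbers, just for illustration).
imputed_values = np.array([12.4, 11.8, 13.1, 12.0, 12.6])

pooled_estimate = imputed_values.mean()        # the value you would fill in / report
between_run_sd = imputed_values.std(ddof=1)    # how much the imputation runs disagree

print(f"pooled estimate: {pooled_estimate:.2f}")
print(f"between-imputation SD: {between_run_sd:.2f}")

If that SD is large relative to the estimate, the imputations disagree a lot, which is exactly the "be careful" signal described above.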
Thanks for the practical example. It is not clear to me which value we ultimately use to fill in the missing value with the multiple imputation method. Could you please clarify?
Thanks for the clear explanation. One thing I'm struggling to understand is when you are running multiple iterations, say 5, how are the different sets of data points generated? In your example, you fit lines among 50 data points. Do you randomly select 50 data points among those that have non-missing value in the raw dataset?
Data point selection when sampling is almost always done randomly, in order to avoid bias. A similar sample-and-test approach is taken when you perform cross-validation, and the same logic applies when choosing the right sample size.
Very clear!
Very clear and easy to follow, thanks. But would we not get as good a result by taking one regression sample of 250 data items, as opposed to five sets of fifty and then taking the mean of the means?
Thank you very much.
What do we do with the 5 imputations that have been calculated? Which of them is finally considered the imputed value if we just want to show this in a graph?
Yeah, I am missing that final piece of information too. We have the overall final mean and the standard deviation of the means around that final mean, but what are we supposed to do with these values? How do we decide which values to impute for the missing data?
That was very helpful, thank you!
This was my aaahaaa moment. Thank you!
Great explanation, thanks.
I have done many retrospective clinical research projects and I have never dealt with missing data. I always left these fields blank, knowing that they would automatically be excluded from analysis.
I believe leaving these missing data unfilled is better, to avoid any chance of bias being introduced by the data of other patients in the study cohort.
What do you think?
Now, after watching your clear video, I am considering this approach for future projects as well.
I am not a statistician, and I did all of this while in training.
You have to think about what you are doing when you simply omit the data with missing values: you effectively give the remaining entries more weight. Why are the values missing? When the data is missing for a reason, you introduce bias into your data set and thereby reduce variance, and unless you can prove that the data is missing completely at random, it is best practice to act as if it is missing not at random. With methods like this you try your best to preserve the variance of the data set. It is important to use a method that tries to model the most plausible value, otherwise you would reduce variance.
I recommend you look into the types of missing data: "Missing Completely at Random", "Missing at Random", and "Missing Not at Random" (quick sketch of the difference below).
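As a quick illustration of why the distinction matters, here is a small Python sketch (numpy only; all numbers are made up, none of this is from the video): under MCAR the observed mean stays close to the true mean, while under MNAR, where larger values are more likely to be missing, the observed mean is biased.

import numpy as np

rng = np.random.default_rng(2)
fine = rng.normal(50, 15, size=10_000)          # hypothetical "true" fines

# MCAR: every value has the same 30% chance of being missing.
mcar_observed = fine[rng.random(fine.size) > 0.3]

# MNAR: the larger the fine, the more likely it is to be missing ("missing for a reason").
p_missing = (fine - fine.min()) / (fine.max() - fine.min())
mnar_observed = fine[rng.random(fine.size) > p_missing]

# The MCAR mean stays close to the true mean; the MNAR mean is biased downwards.
print(round(fine.mean(), 1), round(mcar_observed.mean(), 1), round(mnar_observed.mean(), 1))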
It would be great if you can share links to some of the papers or books that you refer here.
Thanks for the video! If the subsets are random, all the estimators are unbiased, right? The aggregated estimator would just have lower variability.
Thanks for the great video! Question: suppose I have 5 different random samples with which I can get 5 regressions, and then \mu_1, ..., \mu_5, to find an aggregate mean \mu_A. Why not just pool those 5 data sets into one large data set and compute the grand mean \mu_B that way? Wouldn't my answer \mu_B be more precise (less variable) than just taking the average of the 5 means to get \mu_A?
A few things to note:
> The 5 random samples that you have taken from the existing data may have common elements, so straight up combining them might bias the result towards the repeated data points.
> Let's say you avoid repeated data points; then combining the 5 samples only creates a larger subset of the original dataset, and you would be better off just running a single regression imputation.
> In my opinion, the whole point here is to have multiple versions of the estimated values, so that you can better understand how well the estimation fits the data. Usually, if the variance or spread of the final values is quite high, we might not want to go ahead with imputation, or we may want to use something other than regression to estimate the missing values.
> Therefore, multiple imputation gives you a clearer and broader picture of what and how many compromises you are making by using imputation to replace the missing values (rough sketch of the loop below).
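To make that loop concrete, here is a rough sketch of it in Python (numpy only; the dataset, sample sizes, and numbers are made up for illustration and are not taken from the video): draw 5 random subsets of the complete cases, fit a least-squares line on each, predict the missing fine from each line, and then look at how much the 5 predictions spread out.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete cases: distance to the library (mi) and fine amount.
distance = rng.uniform(0.1, 3.0, size=2000)
fine = 5 + 10 * distance + rng.normal(0, 2, size=2000)

missing_distance = 1.7          # the row whose fine is missing
predictions = []

for _ in range(5):                                            # 5 imputation runs
    idx = rng.choice(distance.size, size=50, replace=False)   # random subset of 50 complete rows
    a, b = np.polyfit(distance[idx], fine[idx], deg=1)        # least-squares line: fine = a*distance + b
    predictions.append(a * missing_distance + b)

predictions = np.array(predictions)
print("per-run predictions:", np.round(predictions, 2))
print("pooled value:", round(predictions.mean(), 2))
print("spread (SD):", round(predictions.std(ddof=1), 2))

If the spread is small, the 5 runs largely agree and the pooled value is a reasonable fill-in; if it is large, that is the warning sign mentioned above.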
@@vishnumohank1299 thanks for the thoughtful reply. I think that I was missing the point of multiple imputation, but your last two points clear that up for me. Thanks again!
Can you do regression imputation next? I really loved this vid
Thank you very much :)
How does PMM identify nearby candidates when there are a mixture of numeric and categorical variables? Thanks :)
Can you actually compute the standard deviation afterwards? Won't that just reduce the SD for each regression by adding a bunch of points that perfectly fit the regression?
oh my gosh thank you so much!
Isn't it actually even more complicated than that? Isn't it the case that for each regression, instead of imputing the missing fine value with the value predicted by the regression, we actually randomly sample from the distribution of fine values around that predicted value (the distribution of fine conditional on distance)? This adds even more of the uncertainty involved in the guess we are making into our imputation process.
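As a rough sketch of what I mean (Python/numpy, standard stochastic regression imputation with normal residuals; the data here is made up, it is not the video's example): fit the line, estimate the residual spread around it, and then draw the imputed value from that conditional distribution instead of using the point on the line.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observed pairs (distance, fine) used to fit one regression.
x = rng.uniform(0.1, 3.0, size=50)
y = 5 + 10 * x + rng.normal(0, 2, size=50)

a, b = np.polyfit(x, y, deg=1)                 # fitted line: y = a*x + b
resid_sd = np.std(y - (a * x + b), ddof=2)     # spread of fines around the line

x_missing = 1.7
point_prediction = a * x_missing + b
# Stochastic imputation: draw from the conditional distribution instead of using the line itself.
imputed = rng.normal(point_prediction, resid_sd)

print(round(point_prediction, 2), round(imputed, 2))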
It wasn't apparent to me why this estimator would be less biased than a single imputation. You mentioned that doing multiple regressions and then aggregating "washes away the noise", but each of your individual regressions would also be noisier than a single regression that uses the whole dataset - so how do I know that, in aggregate, they are less noisy than a single regression?
Awesome!
Sir, you said we need from 5 to 10 models. How do we calculate the exact number needed? Thank you.
Thank you!
thank you so much
Thank you very much. Quick question: which imputed values do you end up leaving in the dataset for further analysis? Say I now want to impute values to be used later for a variety of machine learning applications. Surely I can't run multiple imputation every time I want to implement a new machine learning model and measure a metric?
You would use the grand mean that you calculated at the end after investigating whether it is a valid metric.
Thanks for the clear explanation. One thing I'm struggling to understand is when you are running multiple iterations, say 5, how are the different sets of data points generated? In your example, you fit lines among 50 data points. Do you randomly select 50 data points among those that have non-missing value in the raw dataset?
I guess he randomly picks 50 data points; you might want to listen at 3:27.
Thanks!
I don't understand why it is less biased to run 5 OLS regressions with only 50 out of the 2000 rows.
Why not just run a single OLS regression on all the rows and use that as the predictor for the missing values?
Wouldn't this be problematic if your objective with the dataset is precisely to demonstrate whether there is any relationship (like a linear relationship) between those 2 variables? Filling in a missing value with a method that assumes the very same linear relationship you are trying to demonstrate would actually be begging the question, wouldn't it?
Great explanation! But one that also seems at odds with what I'm reading in other sources, which make it sound like it is the parameters of the model estimating the outcome that get randomly drawn for each iteration, not the observations used to make the prediction.
Is what I'm describing an alternative approach to the same thing, or am I misunderstanding the approach?
What do you mean by randomly selecting parameters? His choice of single imputation method is least squares regression and the parameters (the a and b in your "ax + b" regression line) have a closed-form solution. If you use the same dataset, the parameters of least squares don't have any variability in and of themselves. Maybe you can elaborate more on what you mean?
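(For reference, and not something taken from this video: for a line $y = ax + b$ fitted by least squares on a fixed dataset, the closed form I mean is

\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\,\bar{x},

so on a fixed dataset these are deterministic, and the variability in multiple imputation has to come from somewhere else, for example from which rows go into each fit, or from random draws of the parameters in Bayesian-style implementations.)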
Is this the PMM approach?
thank you
Here you know that the fine amount is a dependent variable that depends on the distance from the library. But what if you have a data set with missing values in a particular column, and that column is actually an independent variable? How would you use multiple imputation in that case? Can you do something like using the distribution to find the values?
Imputation is for treating missing data, regardless of whether the variable is dependent or independent.
Nice!
Thank you very, very much!
The data for 1.7 & 2.1 mi is not, prima facie, true.
I found it confusing! Especially when you move the paper up and down while you talk!
I just wish you were neater, instead of writing everything on that one sheet of paper. You keep moving it around, and it isn't clear what you are referring to when you point your finger at the paper, since you've written something in every nook and corner of it.