Dealing with MISSING Data! Data Imputation in R (Mean, Median, MICE!)

  • Published Jun 12, 2024
  • Likes: 307 👍 · Dislikes: 2 👎 · 99.353% · Updated on 01-21-2023 11:57:17 EST
    Annoyed by empty, NULL, or NA values? Confused about what imputation is? Look no further! This is a comprehensive guide to understanding what imputation is and how to apply it!
    Questions? Let me know down in the comments below!
    R "mice" package documentation
    cran.r-project.org/web/packag...
    Additional instructional material
    datascienceplus.com/imputing-...
    Forgot what KNNs are?
    • Applying and Understan...
    What are Neural Networks again?
    • Understanding and Appl...
    Github:
    github.com/SpencerPao/Data_Sc...
    0:00 - What is Imputation?
    0:29 - Mean and Median Imputation Pros and Cons
    2:02 - Additional imputation methods
    3:13 - Steps for MICE
    5:05 - Code setup, understanding data
    7:05 - Mean and median imputation
    7:42 - MICE Implementation
    13:40 - Extracting Imputed data (by specific feature)
    15:32 - Using ALL Imputed data + Interpretation
    17:37 - Additional Steps
  • Science & Technology
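The mean and median imputation covered at 7:05 can be sketched in a few lines of base R (an editor's sketch; the data frame and column names are hypothetical, not the video's dataset):

```r
# Toy data frame with missing ages (hypothetical values)
df <- data.frame(age = c(25, NA, 31, 40, NA, 22))

# Mean imputation: replace NAs with the column mean
df$age_mean <- ifelse(is.na(df$age), mean(df$age, na.rm = TRUE), df$age)

# Median imputation: replace NAs with the column median
df$age_median <- ifelse(is.na(df$age), median(df$age, na.rm = TRUE), df$age)
```

Both are single-value imputations: fast and simple, but they shrink the variance of the column, which is one of the cons discussed in the video.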

Comments • 101

  • @sudeyaren4221
    @sudeyaren4221 21 days ago

    I would like to say, you are doing a good job. You explain the topics in detail, show the code, and more, all in a short time. Thank you!

  • @MrMareczek111
    @MrMareczek111 2 years ago +1

    I really appreciate your videos - I hope your channel will grow. Keep making this great content!!!

  • @mukhtarabdi824
    @mukhtarabdi824 3 years ago

    Very useful, thanks Spencer. Waiting to hear more from you!

  • @khushisrivastava835
    @khushisrivastava835 3 months ago

    I am writing my undergrad thesis, and you saved me from the multiple breakdowns I have had over the entire last week!!!

  • @ej9432
    @ej9432 1 year ago

    This video is great. Keep up the great work buddy!

  • @HangwHwang
    @HangwHwang 3 years ago

    Thank you for your video. Extremely helpful. I just subscribed and look forward to learning more from you.

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      Hi! Thank you for your support! :)

  • @principia1372
    @principia1372 2 years ago

    Fantastic video man, thank you so much

  • @justinarends5871
    @justinarends5871 3 years ago +1

    Loving the content, my dude! In this context, I would love to see you take it further and validate your imputing method as discussed at the end.
    Looking forward to more videos!!

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago +1

      I'm glad you are liking the content :)
      That is a good point you've made. I'll probably validate future imputations with future models that I'd like to go through and compare and contrast down the road.

    • @larrygoodnews
      @larrygoodnews 3 years ago

      @@SpencerPaoHere Hi, thanks for the video. May I know where I can get the dataset and code you used?

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      @@larrygoodnews As requested, here is a link to the code and data github.com/SpencerPao/DataImputation

  • @ericpenichet7489
    @ericpenichet7489 1 year ago

    Excellent video!

  • @AndyA86
    @AndyA86 10 months ago

    Excellent video thank you.

  • @ShivSutradhar
    @ShivSutradhar 1 year ago

    Thanks, brother, for solving my doubt.

  • @yidong7706
    @yidong7706 2 years ago

    This is exactly what I'm looking for, thank you! I've been searching forever to find ways to impute categorical variables and this video really helps to clarify the confusion. I was wondering if this method would still be effective if 1/3 of the dataset contains NAs or if you have any recommended methods for treating datasets with numerous NAs.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +2

      The more NAs in the dataset, the less effective the method is, unfortunately. If you have lots of NAs, perhaps eliminate the observations that don't make the most sense, thereby decreasing the total number of observations. However, at the end of the day, go ahead and try it out! Compare your imputed observation accuracy against whichever target variable is in question.
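The pre-imputation trimming suggested in the reply above can be sketched in base R (an editor's sketch; the toy data frame and the 50% threshold are hypothetical, not from the video):

```r
# Toy data frame where column b is mostly missing (hypothetical)
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(NA, NA, NA, 1),
                 c = 1:4)

na_share <- colMeans(is.na(df))      # fraction of NAs per column
df_trim  <- df[, na_share <= 0.5]    # keep columns with <= 50% missing

# Then drop any rows that still contain an NA
df_done <- df_trim[complete.cases(df_trim), ]
```

The remaining, mostly complete data frame is a more reasonable input for mice than one dominated by NAs.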

  • @Philantrope
    @Philantrope 3 months ago

    This is very instructive - thank you! One question I have: how would you proceed when you have several thousand variables in the dataset? Would you do some prior feature selection in that case? Only with complete variables? All the best to you.

  • @simwaneh1685
    @simwaneh1685 1 year ago

    Thanks dear

  • @lnsyrae
    @lnsyrae 1 year ago

    Super helpful! Are there strategies that you recommend for evaluating the model fit after MI with MICE?

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      The typical train/val/test split will go a long way when swapping the imputed datasets. You can get a general sense of how well a model performs on all of the datasets.

  • @hoanggiangpham9312
    @hoanggiangpham9312 3 months ago

    Thanks for sharing. I have a question: why don't you use the finished_data_imputed after completing to do the logistic regression, instead of doing that for each imputed dataset?

  • @irinavalsova3268
    @irinavalsova3268 2 years ago

    Perfect explanation. Do you tutor? I need a specific task to complete on a dataset.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      I'm flattered :3
      I don't tutor at the moment; however, if you have any "quick" questions, I do regularly answer comments on all my videos for free.

  • @renmarbalana1448
    @renmarbalana1448 1 year ago

    Thank you so much! This is of great help for my data mining course!
    One question though: how do I set a limit for the imputed values if I don't want negative values to be generated?

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      That depends. Are you expecting negative values? If not, then you may have to do some data cleaning in your original dataset. There are some unorthodox ways of imputation where you can perhaps set a boundary for imputations. But the method behind it is sort of a mystery to me. I attached a link which may be more of use to you.
      stats.stackexchange.com/questions/116587/multiple-imputation-introduces-negative-values-dataset-still-valid

  • @ThomasMesnard
    @ThomasMesnard 1 year ago

    Thank you so much for your video! Question: how do you create a new dataset with the estimated values of the missing data from the regression? It seems like the last step is missing at the end of your video. It would be very useful to know how you deal with that. Thanks!!

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      Are you referring to 14:00 in the video? Line 80 is doing just that. (new dataset with the estimated values of missing data)
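The video's line 80 isn't reproduced here, but the general pattern with mice's complete() function looks like this (an editor's sketch using R's built-in airquality data, not the video's dataset):

```r
library(mice)

# Impute the built-in airquality data (NAs in Ozone and Solar.R)
imp <- mice(airquality, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# complete(imp, 1) returns the original data with its NAs filled in
# using the first of the five imputed datasets
finished <- complete(imp, 1)
```

`finished` is an ordinary data frame, so every downstream step (modeling, export, plotting) works on it as usual.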

  • @briabrowne2928
    @briabrowne2928 10 months ago

    Thank you for a great video! I wanted to ask if there is a rule of thumb for choosing which cycle to use when running the MICE imputation? You used the first cycle, but was there a particular reason or not?

    • @SpencerPaoHere
      @SpencerPaoHere  10 months ago

      No rule of thumb in particular!

  • @kingraidi1578
    @kingraidi1578 1 year ago

    Hey Spencer,
    really appreciated the video and I am not sure if you will see this comment since the video is kind of old.
    However, I have two questions:
    1. Regarding the way we use the created imputations: in the first variant you just selected one of the datasets with imputed values. This means that basically, even though we are using mice, it is a single imputation method at the end of the day, right? So is this a viable method to use?
    2. In the second variant you pooled all the imputations, creating a mix of all the imputed datasets (so a real multiple imputation method, I guess). I would like to replicate that, but I am not sure how. I conducted a survey within a business (for my thesis) and I have various variables that are either interval or categorical. However, basically all of them are independent variables.

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      No worries! I am still active on this channel. It might just be a matter of comment volume, but I'll probably get around to it :)
      1) Yes. I used a single dataframe that was imputed via the MICE method.
      2) You can also copy and paste my code from github ! github.com/SpencerPao/Data_Science/tree/main/Data%20Imputations
      Other than that, I am unsure what the question is.

  • @felixangulo4677
    @felixangulo4677 2 years ago

    Hey Spencer, awesome video on multiple imputation.
    Question: is it possible to run a MANOVA on the pooled imputed datasets and obtain a pooled parameter estimate? I followed your video all the way through to the 15:40 mark, and then attempted to run a MANOVA on the pooled data set but I'm running into some difficulties/errors. I'm basically trying to impute missing data and then run a MANOVA (or a repeated measures ANOVA) on the imputed datasets in order to obtain a pooled parameter estimate. I'm using two categorical (binary) predictor variables and two continuous dependent variables for my model. I normally use SPSS, but unfortunately SPSS doesn't allow to run general linear model tests on imputed data (or at least doesn't provide a parameter estimate of the pooled datasets).

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      Thanks!
      Yes. You should be able to run the MANOVA on the imputed datasets. However you will have to run the algorithm on each set individually. Then aggregate the model results thereafter.

    • @felixangulo4677
      @felixangulo4677 2 years ago

      @@SpencerPaoHere I was successfully able to run a Manova for each of my imputed datasets (m= 10). As an example, here’s the code I used to run the Manova for the 10th imputed dataset: model.10 = manova(Anxiety ~ Treatment + DepStatus + Treatment_X_DepStatus + BLanx, data = finished_imputed_data10). I’m still however having trouble aggregating the results, and I’m not sure if the code I’m using is correct. Here’s the code I’m using for this: pooled_model = with(imputed_data, manova(Anxiety ~ Treatment + DepStatus + Treatment_X_DepStatus + BLanx)). Does this seem correct? Here’s the error message I’m receiving: Error in (function (cond) : error in evaluating the argument 'object' in selecting a method for function 'summary': Problem with `summarise()` column `qbar`. ℹ `qbar mean(.data$estimate)`.✖ Column `estimate` not found in `.data`. ℹ The error occurred in group 1: term = Treatment. In addition: Warning message: In get.dfcom(object, dfcom) : Infinite sample size assumed.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      ​@@felixangulo4677 Hi! Yes. Try saving the model weights! (a model for each imputed dataset)
      And, once you have the model weights, you can ideally aggregate or do some form of model aggregation. You can also choose which model to go with based on best model performance.
      I did a video on just this:
      th-cam.com/video/6pw9IDFxWFM/w-d-xo.html

  • @nithidetail7187
    @nithidetail7187 1 year ago

    @Spencer: Appreciate it! Good video for kickstarting MICE concepts. The Git path URL is not working; please update.

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      Hmm. The link github.com/SpencerPao/Data_Science/tree/main/Data%20Imputations
      works for me.

  • @asrarmostofa818
    @asrarmostofa818 2 years ago

    Helpful for me and o......

  • @elissamsallem688
    @elissamsallem688 1 year ago

    Thank you for this video, really helpful! What if I want to impute only one categorical variable in the dataset? How can we do it? And based on what do we choose the column for the final dataset? Can we pool the columns? Thank you

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      1) You could just extract that 1 specific imputed column from the imputed dataset and replace it to the raw dataset.
      2) depending on your use case, you can use whichever column you so desire to be imputed (typically when you want a filled column)
      3) Pooling columns? As in merge 2 different columns together? I mean you could.....
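Point 1 in the reply above (extract one imputed column and put it back into the raw data) can be sketched like so, using R's built-in airquality data and its Ozone column as stand-ins for your dataset and variable:

```r
library(mice)

# Impute everything, but copy only one imputed column back into the raw data
raw <- airquality
imp <- mice(raw, m = 5, seed = 42, printFlag = FALSE)

# Replace only the Ozone column; other columns keep their original NAs
raw$Ozone <- complete(imp, 1)$Ozone
```

Note that mice still uses all columns as predictors during imputation; you are only choosing which imputed column to keep.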

  • @mustafa_sakalli
    @mustafa_sakalli 2 years ago +2

    So, for categorical values "rf" is good, but what about numerical ones? I have a dataset with a mixture of categorical and numerical values. I want to apply rf for the categorical ones and something different for the numerical ones; is that okay too?

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      You can still use Random forest for a dataset that has numerical and categorical features!
      It would typically come down to comparing your model output to see which one performs the best and decide from there.

  • @kalemguy
    @kalemguy 1 year ago

    Thank you, this is what I need. Is it possible for you to also discuss model-based treatment of missing data in R using the mdmb package? How do you think this method compares to MICE?

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      The mdmb package seems brand new (10/13/2022) -- I am unfamiliar with it. Though from taking a peek at the documentation, it seems that mdmb uses maximum likelihood and/or Bayesian estimation. This is definitely more "narrow" than MICE, which has many more models to choose from. However, you may get better results? Not sure. You'd have to try it on your data.

    • @kalemguy
      @kalemguy 1 year ago

      @@SpencerPaoHere Thank you for your explanation. May I know what percentage of missing data is acceptable for MICE? Or for other imputation methods?

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      @@kalemguy I think if ~20% of your rows have a missing value, then you can probably impute. However, if you are missing 80% of your data, it might be advisable to just drop the feature altogether.

    • @kalemguy
      @kalemguy 1 year ago

      @@SpencerPaoHere Thank you very much for your insight...

  • @atthoriqpp
    @atthoriqpp 1 year ago

    Hi, thanks for the video. It was helpful to further reinforce my learning on missing values!
    But I have a question. You said that if the missing values in a variable are more than 20%, it's better to drop them. I recently experienced this particular scenario, and if I were to MICE-impute the values, is that better than dropping them? Or is dropping the missing values better because it crosses the 20% threshold of missing-value tolerance?
    Oh, and one other thing: when choosing the imputed data result (from 1 to 5), what is the basis for determining which is better?

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago +1

      When you have more than 20% of the rows missing for a column, it might be best to drop them, since the imputations are largely based off of the rest of the rows. So you may be getting garbage imputations (it would be interesting to see, on a simulated smaller chunk of data, if costs were a factor).
      The basis on which imputed dataset to choose is somewhat arbitrary. I'd just do batch inferencing to see which gets better predictive results. Though the resulting differences can be more or less negligible.

    • @atthoriqpp
      @atthoriqpp 1 year ago

      ​@@SpencerPaoHere Well said. But what if the column is essential for the analysis?

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago +1

      @@atthoriqpp Haha, yeah. You're going to need more data. Or, you can do a 'Hail Mary' and see how the imputations fare. (A lot of testing will be needed to ensure you are confident about the results.)

    • @atthoriqpp
      @atthoriqpp 1 year ago

      @@SpencerPaoHere Thanks for the answer, Spencer!

  • @PoetenfranLevanten
    @PoetenfranLevanten 3 years ago

    Very helpful, Spencer, thank you! I have a question regarding the difference between the complete function you used to finish the dataset and the pool function. Does the finished dataset combine all 5 imputations, and can I use the finished dataset in other statistical programs like SPSS and Jamovi and still claim that the missing values have been imputed by multiple imputation? Or do I always have to use the pool function?

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      Hi! I'm glad you liked it! :)
      The pool function averages the estimates of the complete-data model (and outputs a variety of statistics related to those features).
      The complete function fills in the missing values with the imputed values. AND yes, this function imputes ALL of the features based ONLY on the 1 imputed dataset that you have chosen. See complete(data, m), where m represents which imputed dataset you want to plug into your original dataset.
      You can use the finished dataset with other programs. All you need to do is write the file to a CSV and load the file into a different program.
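The complete()/pool() distinction described in the reply above, plus the write-to-CSV step for other programs, can be sketched on the built-in airquality data (an editor's sketch, not the video's code; the lm formula is illustrative):

```r
library(mice)

imp <- mice(airquality, m = 5, seed = 7, printFlag = FALSE)

# complete(): one filled-in dataset (here the 2nd of the 5 imputations)
finished <- complete(imp, 2)

# Export for SPSS, Jamovi, etc. -- anything that reads CSV
write.csv(finished, "imputed_airquality.csv", row.names = FALSE)

# pool(): averages estimates across models fit on all 5 imputed datasets
fit    <- with(imp, lm(Ozone ~ Wind + Temp))
pooled <- pool(fit)
summary(pooled)
```

So complete() gives you data, while pool() gives you combined model estimates; which one you need depends on whether the downstream tool works on data or on fitted models.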

    • @PoetenfranLevanten
      @PoetenfranLevanten 3 years ago

      @@SpencerPaoHere thanks for the clear answer. Is it possible to carry out an EFA (exploratory factor analysis) utilising the pooled function? If yes, is it possible for you to illuminate us with a tutorial on how to carry it out? :).
      Thanks in advance

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      ​@@PoetenfranLevanten Hmm. I have done a video on Factor Analysis. You can check that out!
      The only thing different you would do is to apply the pooling function on whatever data you might have. THEN, go ahead and utilize the FA function as noted.

    • @PoetenfranLevanten
      @PoetenfranLevanten 3 years ago

      @@SpencerPaoHere Thanks. What would the code look like if I apply the pool function? I tried to do that on a dataset I have but received an error message in R.

    • @SpencerPaoHere
      @SpencerPaoHere  3 years ago

      @@PoetenfranLevanten Hi! You'd want to apply the pool function on a model object, i.e. pool(fit).
      It won't work if you plug in a dataset.

  • @mohammedabdulkhaliq8746
    @mohammedabdulkhaliq8746 2 years ago

    Hello Spencer, thank you for the tutorial. By any chance, do you know what package includes least squares imputation? Thanks in advance.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      Least squares? That's built into base R. Check out lm()
      EDIT: For least squares imputation, check out the pcaMethods package

    • @mohammedabdulkhaliq2644
      @mohammedabdulkhaliq2644 2 years ago

      @@SpencerPaoHere Thank you for the reply. Is there an equivalent Python package?

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      @@mohammedabdulkhaliq2644 Pandas has an interpolate function per column.
      pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html

  • @abhijitjantre8427
    @abhijitjantre8427 2 years ago

    When I executed the md.pattern command, I got the table but not the plot that you showed in the YouTube video. Please share how to get the plot.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      My code is located here:
      github.com/SpencerPao/Data_Science/tree/main/Data%20Imputations

  • @markelov
    @markelov 1 year ago

    Hello! Loved this video! I am interested in using MICE’s defaultMethod approach. I recognize that doing so requires that objects be specified as the appropriate data type. I recently received a CSV of variable names and only the numerical values assigned. As a result, R is reading everything in as numeric. I understand that I could change these variables one by one when importing in the preview pane, but I’d rather do it through code. Is there a more efficient way to specify variable types than on an individual basis (example below)?
    dataframe$object1

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      I believe you don't even need to cast your objects to a certain type. Try using the tidyr and dplyr packages. You can define the features as a specific data type so you don't have to cast them after reading the data.
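One way to do the bulk re-typing suggested above with dplyr (an editor's sketch; the toy data frame and the `cat_vars` list are hypothetical, you would list your own categorical variables):

```r
library(dplyr)

# Hypothetical data read in as all-numeric; `group` is really categorical
df <- data.frame(id = 1:4, group = c(1, 2, 1, 2), score = c(10, 12, 9, 14))

cat_vars <- c("group")   # every variable that should be a factor

# Recode them in bulk instead of one column at a time
df <- df %>% mutate(across(all_of(cat_vars), as.factor))
```

Once the columns are factors, mice's defaultMethod will pick a categorical method (e.g. logreg/polyreg) for them instead of pmm.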

    • @markelov
      @markelov 1 year ago

      OK, will do! Thank you! I just have two follow-up questions if possible:
      1. I got an error about logged events. My reading thus far tells me that these essentially get at perfect prediction and that R will not impute those values when this occurs. What I am confused about, though, is that (a) the number of logged events reported is greater than the number of missing values and (b) looking at my imputed data set shows no NA values. Do you have any read on these two pieces?
      2. As opposed to pooling estimates and whatnot across all the imputed data sets, is it permissible from a statistical standpoint to select one of the imputed data sets to use for all analyses, rather than the pooling procedure that you did here for the regression?

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      @@markelov 1) There can be a multitude of factors that cause the issues of logged events. Try printing out the loggedEvents of your MICE object. (Variable$loggedEvents) -- That might give a hint. Also, sometimes in a dataframe, NA values can be "null" values as well. So, you may need to run a few different checks to see if the null values actually do exist in the data.
      2) You could just use one imputed dataset but it might not be a representative of all data. Pooling gives you a wider band and a concentration of outcomes in an area.
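Printing loggedEvents as suggested in point 1 looks like this; a perfectly collinear column is added on purpose so mice has something to log (an editor's sketch on the built-in airquality data):

```r
library(mice)

# Add a column that is perfectly collinear with Temp to trigger a logged event
df <- airquality
df$Temp2 <- df$Temp * 2

imp <- mice(df, m = 2, seed = 3, printFlag = FALSE)
imp$loggedEvents   # data frame describing what mice dropped or flagged
```

Each row of loggedEvents records the iteration, the affected variable, and the reason (e.g. collinearity), which is usually enough to track down the offending column.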

    • @markelov
      @markelov 1 year ago

      Thank you!

  • @kaili3477
    @kaili3477 2 years ago

    Hi Spencer,
    Could you explain the arguments in the mice command?
    1. You said 'm' is the number of cycles? I'm a bit confused about that. I googled it, and it said it's the number of imputations; I don't understand what that means, since imputation is simply replacing the missing values. What happens if you increase 'm' versus decreasing 'm'? Why do we generally use 5?
    2. What is 'maxit' in the mice command? I always thought that was the number of cycles, since it's the number of iterations.
    3. Do you also understand the 'seed' in the mice command? It affects the random number generator, I believe, but what happens if you increase or decrease it?
    Sorry, I know you didn't use 'maxit' or 'seed', but I'm having trouble understanding the R documentation's explanation.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      1) The 'm' term refers to the number of times you want to impute your dataset. I was using the term "cycles" colloquially. So, if m = 5, you are expecting 5 different datasets with different imputed values. (more or less)
      2) maxit : (in the mice package) just refers to the number of iterations taken to impute missing values. This is related to whichever objective function you utilize and thus uses the maxit as the upper ceiling for its iterations.
      3) The "seed" makes the random number generation deterministic. It doesn't matter what value you use for the seed as long as that value is consistent across all your experiments.
      Hope that helps!
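The three arguments explained in the reply above, in a single mice call (an editor's sketch on the built-in airquality data; the values are illustrative, not a recommendation):

```r
library(mice)

imp <- mice(airquality,
            m         = 5,    # number of imputed datasets produced
            maxit     = 10,   # iterations of the chained-equations sweep
            seed      = 123,  # fixes the RNG so runs are reproducible
            printFlag = FALSE)
```

Larger `m` gives more imputed datasets to pool over (at more compute cost); `maxit` controls how long each imputation's chained-equations algorithm runs; `seed` only affects reproducibility, not quality.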

    • @kaili3477
      @kaili3477 2 years ago

      That helps a lot, I understand the video a lot more now!
      Thank you!
      I know you learned a lot at post-secondary, but I'd love a video explaining your background, and tips/tools to self-learn to other people (if you have any)! We all want to be experts like you one day.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      @@kaili3477 haha thanks I appreciate that. Maybe when this channel gets bigger, I can do a video autobiography of some sorts.

  • @TheFabricioosousaa
    @TheFabricioosousaa 2 years ago

    Hi! I am using MICE to deal with my missing values. Instead of using the complete() function and choosing one number (between 1 and 5), I would like to combine these 5 options. I think that is what you have done using the with() function, but I couldn't understand the arguments there. Or does the step at 15:23 using the with() function have nothing to do with MICE imputation, and is it just another (and new) way of imputing? Thanks!

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      Yep! In layman's terms, with() inserts each of the imputed dataframes into the glm() model. As a result, at 15:23 or so, you will have 5 glm models, where each model builds off each individual dataframe. You can then average the weights (or report standard errors with the model weights) and evaluate predictions, etc.
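The with() pattern described in the reply above, sketched on the built-in airquality data (an editor's sketch; the glm formula is hypothetical, not the video's model):

```r
library(mice)

imp <- mice(airquality, m = 5, seed = 11, printFlag = FALSE)

# with() fits one glm per imputed dataset (5 fitted models in total)
fit <- with(imp, glm(I(Ozone > 30) ~ Wind + Temp, family = binomial))

summary(pool(fit))   # pooled coefficients across the 5 models
```

The individual fits live in `fit$analyses` if you want to inspect or predict from each model separately instead of pooling.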

    • @TheFabricioosousaa
      @TheFabricioosousaa 2 years ago

      @@SpencerPaoHere Thanks for the answer! :) One last question: after using the with() function you just used plot() to get some information and analyse it. But wasn't I supposed to be able to extract the "new" (combined) values from with() to substitute my NAs too?

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      @@TheFabricioosousaa Can you provide a timestamp (or a line of code)? But you can think of it as having different models for similar but unique datasets.

    • @TheFabricioosousaa
      @TheFabricioosousaa 2 years ago

      @@SpencerPaoHere For example, if I have this:
      #INSTALL AND LOAD MICE
      install.packages("mice")
      library(mice)
      #IMPORT DATASET:
      library(readxl)
      data_md

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      @@TheFabricioosousaa Yep! And that should occur after your data_imp variable -- imputation should occur. (Try printing it out to see if that is what you are looking for from an imputation POV.)
      Then, you will use the imputed data set(s) for your training/testing process.

  • @mahmoudmoustafamohammed5896
    @mahmoudmoustafamohammed5896 2 years ago

    Hello Spencer, thank you so much for your video and explanation :) I have a small question: I am using this code: data_imp

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      Hmm. Yeah. It must be related to your input data. Try to one hot encode your categorical variable and see what happens.

    • @mahmoudmoustafamohammed5896
      @mahmoudmoustafamohammed5896 2 years ago

      @@SpencerPaoHere I tried it and got the same problem, unfortunately. I tried creating a new dataset of just these categorical variables and imputing them, and that works. But when I impute them together with the main dataset I get this problem :(

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      @@mahmoudmoustafamohammed5896 What's the stack trace? There might be an issue with your other variables, perhaps. It's strange that one set of features doesn't work but the others do? And combined?

  • @tsehayenegash8394
    @tsehayenegash8394 2 years ago

    If you know of one, please point me to MATLAB code for MICE.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      Perhaps this might help you?
      www.researchgate.net/post/I-am-looking-for-a-Matlab-code-for-Multiple-imputation-method-for-missing-data-analysis-can-anybody-help-me

    • @tsehayenegash8394
      @tsehayenegash8394 2 years ago

      @@SpencerPaoHere I appreciate your help.

  • @dgeFPS
    @dgeFPS 2 years ago

    Everything's helpful, but the audio quality killed me.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      In my later videos, I handled the background noise. Sorry for that!

  • @raihana3376
    @raihana3376 1 year ago

    The mice package does not work: Warning in install.packages :
    unable to access index for repository YOUR FAVORITE MIRROR/src/contrib:
    unable to open URL 'YOUR FAVORITE MIRROR/src/contrib/PACKAGES'

    • @SpencerPaoHere
      @SpencerPaoHere  1 year ago

      Strange. I've run install.packages("mice") and the library installed fine. Then run library(mice) to load the package into your environment. Perhaps updating your RStudio might do the trick?