R: Regression With Multiple Imputation (missing data handling)

  • Published on Aug 22, 2024
  • How should missing data be handled in linear regression analysis? The current view is that multiple imputation by chained equations (MICE) is one of the best approaches to missing data in regression. This tutorial shows how to use the mice package in R to analyze datasets with missing data (MCAR, MAR) in a regression framework.
    Here is a current journal article giving theoretical background and specific recommendations regarding the use of multiple imputation for missing data:
    Austin, P. C., White, I. R., Lee, D. S., & van Buuren, S. (2020). Missing data in clinical research: a tutorial on multiple imputation. Canadian Journal of Cardiology.
    www.sciencedir...
    Companion webpage with the R code:
    www.regorz-stat...
    Tutorial for checking regression assumptions with multiple imputation:
    • Multiple Imputation an...

Comments • 37

  • @nacentdatanerd 15 days ago +1

    This is a great video! Thanks for going over the details with such clarity.
    Thanks so much!

  • @Dr-Lex 1 year ago +2

    THANK YOU for this video with clear audio! I have been searching all over for a reference example for handling simple regressions with mice(), and so many of the videos out there sound like they were recorded via laptop mics while standing right under an air conditioner. Clear and helpful, thank you again!

  • @kamarularifinkasim3138 1 year ago +1

    Thank you so much for making this video. Your explanation and code are simple and clear, which makes them easy to understand and very helpful for my dissertation analysis, where I used the simulacrum dataset.

  • @malithapatabendige6541 1 year ago +2

    Thanks for this! It is crystal clear up to pooling. However, I have 3 questions.
    1. How can we get a final dataset with pooled results? The combine function gives a dataset with 10 or 20 cycles. Do we need to get one final pooled dataset?
    2. If we have more than one variable with missing data, do we need to run the regression model for each of these?
    3. Do we need to upload the full dataset, including the variables without missing values, for the MICE process?

    • @RegorzStatistik 1 year ago +1

      1. With multiple imputation there is no pooled dataset. The results are pooled, not the datasets.
      2. During imputation more than one variable can be imputed.
      3. If you want to use other variables to help with imputation then you have to upload them.

    • @malithapatabendige6541 1 year ago +1

      @RegorzStatistik Thanks very much for your prompt reply.
      1. It means we can select one of 5 (if m = 5) datasets with imputed values for the final analysis. Am I right?
      2. What is the aim of 'pooling the results'?
      Is it to decide whether our assumptions are correct? (MNAR or MAR)
      3. What if the pooled results contain statistically significant estimates?
      4. Can we use Random forest for this?
      Many thanks

    • @RegorzStatistik 1 year ago +1

      @malithapatabendige6541
      1.-3.
      No.
      MI has 3 steps:
      Step 1: Imputing m datasets
      Step 2: Running your analysis in each of your datasets - you don't choose one dataset but you use all of them. So you get m different regression results.
      Step 3: Pooling the results - here you get one result from your m results - and this pooled result (and its p-values) is what counts.
      I recommend reading an introductory journal article about MI to get a theoretical understanding of the procedure.
      I don't know if MI works with random forests.
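
The three steps above can be sketched in R as follows. This is a minimal illustration, not the code from the video: it assumes the `nhanes` example data that ships with mice, a small m = 5, and an arbitrary model formula. The last lines show that each pooled coefficient is simply the average of the m per-dataset coefficients, which is why no single dataset is chosen.

```r
library(mice)

# Step 1: impute m datasets (nhanes is example data bundled with mice;
# m = 5 and the seed are arbitrary choices for this sketch)
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Step 2: fit the same regression in every completed dataset
fit <- with(imp, lm(bmi ~ hyp + chl))

# Step 3: pool the m sets of results (Rubin's rules)
pooled <- pool(fit)
summary(pooled)

# Each pooled coefficient is the mean of the m per-dataset coefficients
per_dataset <- sapply(fit$analyses, coef)
rowMeans(per_dataset)
```

The pooled standard errors are not simple averages: they also include the variation between the m results, which is the part that reflects the uncertainty due to the missing data.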

    • @malithapatabendige6541 1 year ago +2

      @RegorzStatistik Thanks. These 3 steps are clear. But nobody has mentioned how to 'interpret' pooled results and how to get the 'final imputed data' for the analysis in the original research. Basically, once the results are pooled, which imputed dataset is to be selected out of the m sets?
      "Step 3: Pooling the results - here you get one result from your m results - and this pooled result (and its p-values) is what counts" - the next step has not been mentioned anywhere. It is unclear what we are supposed to do with the pooled result and where we can get one single dataset with imputed data to 'start' the original analysis.

    • @malithapatabendige6541 1 year ago

      @RegorzStatistik I think I have to compare the pooled estimates, p-values, F-statistic, etc., with each of the m datasets and get the BEST GUESS of the imputed dataset out of it. Thanks.

  • @bornaloncar2458 10 months ago

    Thank you, this is very informative. Could you point me to a source or clarify: 1. How is the regression meant to be set up if more than one item/variable is missing and you want to impute? Is the dependent variable in the regression model the only variable that gets imputed? 2. How do you obtain a table that combines imputed data and original data? Thank you!!

    • @RegorzStatistik 10 months ago +1

      1. I don't have a source available. But MI does not change depending on whether 1 item is missing or more (in my example, there are rows with more than 1 item missing - so the dependent variable is not the only variable that gets imputed).
      2. Only by combining those tables by hand (e.g. with tidyverse). However, that rarely makes sense because you don't have one imputed dataset! In my example you have 50 imputed datasets, so combining those 50 datasets with the original dataset would lead to something quite large and difficult to interpret.
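
As a possible alternative to combining the tables by hand, the `complete()` function in mice can stack the original data and all imputed datasets in long format. The sketch below assumes the bundled `nhanes` data and m = 5 rather than the 50 imputations from the video; the `.imp` column marks the imputation number, with 0 for the original data.

```r
library(mice)

# Example data bundled with mice; m = 5 keeps the stacked table small
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Long format: original data (.imp == 0) followed by the m completed datasets
stacked <- complete(imp, action = "long", include = TRUE)

# One block of rows per dataset: original plus m imputed copies
table(stacked$.imp)
```

With m = 50 the same call returns 51 copies of the data, which illustrates why such a table quickly becomes large.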

  • @DariaKoksal 1 year ago +1

    Thank you very much for the video! Could you please explain how to save the complete file?

    • @RegorzStatistik 1 year ago +1

      In my code example the dataframe with the completed data is called imp.datasets. You can save that as you would any other dataframe in R, e.g. with the write.csv() function.
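
A minimal sketch of that saving step. The name `imp.datasets` is taken from the reply above, but how it is created here (via `complete()` in long format on the bundled `nhanes` data) is an assumption, and the output path is a hypothetical temporary file.

```r
library(mice)

# Assumption: imp.datasets holds the completed data in long format
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)
imp.datasets <- complete(imp, action = "long")

# Hypothetical output location; replace with your own path
out_file <- file.path(tempdir(), "imputed_long.csv")
write.csv(imp.datasets, out_file, row.names = FALSE)
```

The saved file contains all m completed datasets stacked on top of each other, distinguished by the `.imp` column, not one single "final" dataset.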

  • @gallinule6213 2 months ago

    Is this the same approach that you'd use for multiple imputation in logistic regression, or just linear regression?

    • @RegorzStatistik 2 months ago +1

      I haven't used it for logistic regression yet, so I don't know whether the pooling function of mice works for that as well.

    • @gallinule6213 2 months ago

      @RegorzStatistik Good to know, thanks for the response!
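
For readers with the same question, here is a hedged sketch of what the workflow might look like with a logistic model. It is not from the video: it assumes the bundled `nhanes2` data, in which `hyp` is a two-level factor, and swaps `lm()` for `glm()` with `family = binomial` in the analysis step. The dataset has only 25 rows, so the estimates are unstable and R may emit fitting warnings; the point is only the shape of the calls.

```r
library(mice)

# nhanes2 is bundled with mice; hyp is a factor with levels "no"/"yes"
imp <- mice(nhanes2, m = 5, seed = 123, printFlag = FALSE)

# Same three steps, with a logistic regression as the analysis model
fit <- with(imp, glm(hyp ~ bmi + chl, family = binomial))
res <- summary(pool(fit))
res
```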

  • @andreapatrignani2026 5 months ago

    Thank you very much. I have a question: why do you do the pooling on a model of the imputed values instead of the complete dataset? Wouldn't it be better to also have the information from the non-imputed data in the model before pooling, so that you have better data for modelling and then pooling?

    • @RegorzStatistik 5 months ago

      Pooling is the 3rd step, after running the model in all imputed datasets (2nd step). "Imputed datasets" does not mean that they contain only the cases with missing values; they are completed datasets. You can see that at 0:10:09 in the video - the regression result is based on the df of a regression with all cases.
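
This point can be checked directly. The sketch below assumes the bundled `nhanes` data rather than the dataset from the video: each completed dataset has as many rows as the original, contains no missing values, and a regression on it has the residual degrees of freedom of a regression with all cases.

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Extract the first completed dataset (observed values plus imputed values)
first <- complete(imp, 1)

nrow(first) == nrow(nhanes)  # same number of cases as the original data
anyNA(first)                 # no missing values remain

# Residual df = number of cases minus number of estimated coefficients
model_first <- lm(bmi ~ hyp + chl, data = first)
df.residual(model_first)
```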

  • @EHJ599 1 year ago

    Thank you very much for this clear and helpful tutorial!
    Interestingly, my imputed datasets consisted of fewer rows per variable than I expected (9 to be exact). Do you have any idea what happened and how to get R to impute all missingness? Thank you in advance :).

    • @EHJ599 1 year ago

      PS: I checked whether the number of imputations (m) or iterations made a difference. It did not, and neither did the seed or a change of methods.

    • @RegorzStatistik 1 year ago +1

      Based on that information I don't know why that happened.

  • @shadens98 6 months ago

    Super interesting video. Do you have any videos or tips on how to get the pooled results of an MLR after MI using SPSS? I tried to do it, but for the important values I get either no pooled values or many missing entries in the pooled values, so I can't report them properly.

    • @RegorzStatistik 6 months ago +1

      Unfortunately, I don't know how to do it in SPSS.

    • @shadens98 6 months ago

      @RegorzStatistik Thanks a lot for getting back to me so quickly! I will try it out with R. Is there something extra one must do when importing an already imputed data file from SPSS before running the regression and pooled regression code there?

    • @RegorzStatistik 6 months ago

      @shadens98 I only know how to do imputation completely in R, unfortunately.
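
One possible route for imputations created outside of mice, offered as a sketch rather than a tested SPSS workflow: mice provides `as.mids()`, which converts a long-format data frame (original data plus stacked imputations) back into an imputation object that `with()` and `pool()` accept. The example below only demonstrates the round trip on the bundled `nhanes` data. A file imported from another program would need its imputation-number column named in the `.imp` argument; how SPSS names that column is not covered here.

```r
library(mice)

# Create a long-format table that mimics externally imputed data:
# .imp == 0 is the original data, .imp 1..m are the completed datasets
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)
long <- complete(imp, action = "long", include = TRUE)

# Convert the long table back into a mids object
imp2 <- as.mids(long)

# From here the usual analysis and pooling steps apply
fit <- with(imp2, lm(bmi ~ hyp + chl))
summary(pool(fit))
```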

  • @elissamsallem688 1 year ago

    Thank you for this video! If I want to impute missing values for only 1 categorical variable in a large dataset, what should I do?

    • @RegorzStatistik 1 year ago

      The key question is which other variables to include in order to impute the categorical variable. You should at least include all variables you are going to use in your regression model.
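
A small sketch of that situation, using simulated data with hypothetical variable names (`y`, `x1`, `x2`, `group`), since the commenter's dataset is unknown. Only the factor `group` has missing values; the data frame passed to `mice()` contains it together with the variables intended for the regression model. With default settings, mice assigns an imputation method only to the incomplete variable and uses the complete ones as predictors.

```r
library(mice)

# Simulated data: only the two-level factor `group` has missing values
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
group <- factor(ifelse(x1 + rnorm(n) > 0, "a", "b"))
group[sample(n, 30)] <- NA
dat <- data.frame(y = 1 + x1 + rnorm(n), x1 = x1, x2 = x2, group = group)

# Include the incomplete variable plus the variables used in the later model
imp <- mice(dat, m = 5, seed = 123, printFlag = FALSE)

# Complete variables get an empty method; the binary factor gets "logreg"
imp$method
```

For an unordered factor with more than two levels, the default method differs (`"polyreg"`), so it is worth inspecting `imp$method` before relying on the imputations.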

  • @666dazai 2 months ago

    Hello, thank you for this video but I get this error and I could not figure out how to solve it:
    > imp.data

    • @RegorzStatistik 2 months ago

      This looks to me that for some of the models the regression did not converge. However, I am somewhat astonished about "glm.fit" - I would expect that message in, e.g., a logistic regression, not in a linear regression.

    • @666dazai 2 months ago

      @RegorzStatistik I used logreg as the imputation method for my variables as they are dichotomous. I suspect that is the reason.

    • @RegorzStatistik 2 months ago

      @666dazai That could be the case - I am not sure whether that package works with logistic regression or not (haven't tried it yet).

    • @666dazai 2 months ago

      @RegorzStatistik Alright, thank you for your answer!

  • @solomonwafula311 1 year ago

    What if I want to impute variables before using them in a PCA? Regressions may not work. Kindly suggest how to handle that.

    • @RegorzStatistik 1 year ago

      Maybe you could look into the package missMDA. There seems to be a function you can use for imputing a PCA (but I haven't used it yet).
      search.r-project.org/CRAN/refmans/missMDA/html/MIPCA.html

  • @christoph3933 8 months ago

    How about auxiliary variables? Are they not needed here?

    • @RegorzStatistik 8 months ago +1

      I think in this case age is an auxiliary variable since it is not used in the regression model (but during imputation).
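
To illustrate the idea of an auxiliary variable, here is a sketch that assumes the bundled `nhanes` data, which happens to contain an `age` column; the video's dataset and model may differ. By default, every other variable is used as a predictor during imputation, so `age` helps to impute `bmi` even though it does not appear in the analysis model.

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Row "bmi" of the predictor matrix: a 1 means the column variable
# is used to impute bmi, so age takes part in the imputation model
imp$predictorMatrix["bmi", ]

# The analysis model leaves age out, which makes it an auxiliary variable
fit <- with(imp, lm(bmi ~ hyp + chl))
res <- summary(pool(fit))
res
```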