Handling missing values in PCA

Comments • 30

  • @vincentsmith8339 4 years ago +1

    Great presentation

  • @DanielMorenoSoto 6 years ago +2

    Professor Husson,
    I want to thank you and acknowledge the great usefulness of this tool. For me it has always been a great issue how to deal with missing values, especially with some molecular biology techniques (such as quantitative PCR) that will eventually throw some NAs just due to the detection threshold of the machine not being reached.
    I've just used the package and it worked (it imputed values and ran the PCA), but I wanted to know if it's possible for the number of dimensions estimated by estim_ncpPCA to be equal to zero (I used zero when imputing), which was the case with my data.
    Thanks in advance.
    Best regards.

  • @cindyconlin2124 2 years ago

    Professor Husson,
    Thank you very much for the helpful video and packages. Is it possible to impute data with a large number of variables, and then use only a subset of the variables in the plot for MIPCA and also in the PCA? (I have "total" variables that are a sum of variables within that subcategory. If I use only the total variables to impute, estim_ncpPCA tells me to use 0 dimensions, but if I impute with the subvariables also, estim_ncpPCA tells me to use 1 dimension. I would like to use the full dataset to impute, but then only use the "total" variables for the rest of the analysis).
    If this is possible, how would I specify the subset of variables I would like to see plotted in the plot(mi) command?
    If this isn't a wise strategy, could you kindly advise me on better approaches? Many thanks.

  • @3Mus-cat-tears 1 year ago

    Professor Husson,
    Thank you for the video and the explanation of the concept.
    I have one question though. I have run through your code and generated all these graphs, but how do I put those imputed values back into the original orange dataset?

    • @HussonFrancois 1 year ago

      The imputed values are in the object res$completeObs.

  • @javierhernando5063 2 years ago

    Nice video! So helpful! Do you think this could be applied to gene expression? A bunch of genes after treatment to see if they form clusters?

    • @HussonFrancois 2 years ago +1

      Yes sure!

    • @javierhernando5063 2 years ago

      @HussonFrancois Would you recommend scaling and centering the variables? Keep in mind that the gene expression values are already normalized using a control gene, and that every sample (and consequently every gene) is expressed after treatment relative to before treatment, for each patient. In other words, we are already carrying out a normalization.

  • @doctorwhyphi 5 years ago +1

    Hello! Could I use the imputed data for an exploratory factor analysis of a Likert questionnaire? Thank you

    • @HussonFrancois 5 years ago +1

      Yes, you can use the imputed dataset for any statistical method. But don't forget that the links between variables are reinforced when you complete the data, especially if you have a lot of missing values.

    • @doctorwhyphi 5 years ago

      @HussonFrancois Thank you very much!
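
The point made in the reply above, that completing the data reinforces the links between variables, can be illustrated with a small simulation. This is a minimal NumPy sketch, not missMDA code: `iterative_pca_impute` is a hypothetical helper implementing plain (unregularised) iterative PCA imputation, whereas missMDA's `imputePCA` uses a regularised variant by default, and the simulated data are made up for the illustration.

```python
import numpy as np

def iterative_pca_impute(X, ncp, n_iter=200):
    """Fill NaNs by alternating a rank-`ncp` PCA fit and re-imputation
    (hypothetical helper; unregularised sketch, not the missMDA algorithm)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from the column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        means = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - means, full_matrices=False)
        # Rank-ncp reconstruction of the centred data, shifted back by the means.
        fitted = means + (U[:, :ncp] * s[:ncp]) @ Vt[:ncp]
        # Observed cells are kept; only the missing cells are updated.
        filled = np.where(missing, fitted, X)
    return filled

def mean_offdiag_corr(data):
    """Average pairwise correlation between the columns of `data`."""
    c = np.corrcoef(data, rowvar=False)
    return c[np.triu_indices_from(c, k=1)].mean()

# Simulated data (assumption): three variables sharing one latent factor,
# so each pair has a true correlation of about 0.5.
rng = np.random.default_rng(1)
n = 1000
latent = rng.normal(size=(n, 1))
complete = latent + rng.normal(size=(n, 3))

# Remove about 30% of the cells completely at random.
X = complete.copy()
X[rng.random(X.shape) < 0.3] = np.nan

imputed = iterative_pca_impute(X, ncp=1)

corr_complete = mean_offdiag_corr(complete)
corr_imputed = mean_offdiag_corr(imputed)
print(f"mean correlation, complete data: {corr_complete:.2f}")
print(f"mean correlation, imputed data:  {corr_imputed:.2f}")
```

The imputed cells lie exactly on the fitted one-dimensional structure and carry no noise, so the correlations computed on the completed table come out higher than on the fully observed data. This is one reason to treat downstream analyses of a single imputed dataset with some caution.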

  • @aboubakeraden9116 5 years ago

    Hello Professor,
    I have a few questions for you:
    - Is multicollinearity between the input features a problem for neural nets?
    - After all these years of hype around neural networks (or deep learning), why do neural networks fail, in 95% of cases in the literature or in Kaggle competitions INVOLVING STRUCTURED OR TABULAR DATA, to beat non-linear models such as tree-based ones (Xgboost, Adaboost...)?
    - Is there a scientific explanation for why neural networks do not work as well on structured data as they have on unstructured data (images, videos, audio, text, sequences...)?

  • @mab963 5 years ago +4

    Thank you very much, very useful!

  • @valabreu5870 6 years ago

    Dear Dr. Husson, I'm doing the ncpMCA function on a data frame of categorical variables, but keep getting this error 'Error in tab.disj.comp - vrai.tab : non-conformable arrays'. Would you please advise me? Thank you so much

  • @EricSmith9000 6 years ago +1

    Well done. Many thanks!

  • @meghanshirleybezerra7079 3 years ago

    Hello! I am getting a value of '0' for the estimated ncps. Can you advise what to do in this situation? Thanks so much.

    • @HussonFrancois 3 years ago

      A value of 0 means that imputation by the mean of each variable is the best. It also means that there are not many links between your variables.

    • @meghanshirleybezerra7079 3 years ago

      @HussonFrancois Thank you for the quick reply! I don't have very many missing variables - is it possible this is also a potential reason for mean imputation being the best option?
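
The statement in the reply above, that 0 dimensions amounts to imputing by the mean of each variable, can be checked on a toy example. This is a minimal NumPy sketch under stated assumptions, not the missMDA implementation: `iterative_pca_impute` is a hypothetical helper doing plain (unregularised) iterative PCA imputation, and the random data are made up for the illustration.

```python
import numpy as np

def iterative_pca_impute(X, ncp, n_iter=200):
    """Fill NaNs by alternating a rank-`ncp` PCA fit and re-imputation
    (hypothetical helper; unregularised sketch, not the missMDA algorithm)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from the column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        means = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - means, full_matrices=False)
        # Rank-ncp reconstruction; with ncp = 0 the low-rank part is all
        # zeros, so the fitted values are just the column means.
        fitted = means + (U[:, :ncp] * s[:ncp]) @ Vt[:ncp]
        # Observed cells are kept; only the missing cells are updated.
        filled = np.where(missing, fitted, X)
    return filled

# Toy data (assumption): 50 rows, 4 unrelated variables, about 20% missing cells.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[rng.random(X.shape) < 0.2] = np.nan

# Plain mean imputation, computed directly for comparison.
mean_imputed = np.where(np.isnan(X), np.nanmean(X, axis=0), X)

zero_dim = iterative_pca_impute(X, ncp=0)
print(np.allclose(zero_dim, mean_imputed))
```

With no components kept, the fitted value for every missing cell is the column mean, so the iterations never move away from the starting point.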

  • @fabricen26 7 years ago +1

    Great! Thanks for your help

  • @kirkgeier417 2 years ago

    Big thank you!

  • @liwenzhao8785 6 years ago

    Got error when I tried to compute:
    nb

    • @noereyna2553 2 years ago

      Were you able to solve this? I'm wondering the same thing

  • @kevintschirhart412 7 years ago +1

    Great video! Thank you

  • @machheydt176 7 years ago

    I'm currently working on my MSc degree and I'm in need of a package that can impute my missing data. So I've found this video and I started reading your paper, but you mention quite early that the aim of missMDA is more to try and visualise the PCA despite missing data, whereas other approaches might be better suited to impute the missing data.
    I don't know if this is the right place for such a discussion, but what is the difference between performing 'your' PCA on a set including missing data, and performing 'regular' PCA after you've imputed the data with a better suited package/approach? It seems that performing PCA on a set for which the data have been imputed more correctly is inherently more correct? Or are there more intricacies that I'm not (yet) aware of?
    I'm just curious. Anyway, this video does help a lot in understanding your paper, as I'm not by any means a good statistician or mathematician or anything. Thank you!

    • @HussonFrancois 7 years ago +2

      In fact, we first designed missMDA to handle missing values in PCA, but we then studied the quality of the imputations obtained with missMDA, and the results were better than or equivalent to those of competing methods such as random forests, for instance.

    • @machheydt176 7 years ago +1

      Could you then please elaborate why on page 3 of your article you mention that
      'The main difference between the two R packages missMDA and pcaMethods is that the primary aim of missMDA is to estimate PCA parameters and obtain the associated graphical representations in spite of missing values, whereas pcaMethods focuses more on imputation aspects.' ?
      What does this then really mean? Is missMDA 'better' at imputing values than pcaMethods? Even though pcaMethods 'focuses more on imputation aspects'?
      Thank you for your reply!

    • @mattbeets 3 years ago

      @machheydt176 I know this is a very late answer, but maybe this quote from the missMDA::MIPCA documentation gives a clue to the difference between methods? (at least the Bayesian vs. bootstrap methods in missMDA):
      "The methods differ by the way in which the variability due to missing values is reflected. The method used is controlled by the method.mi argument. By default, MIPCA uses the parametric bootstrap method.mi="Boot". This bootstrap method is more recommended to evaluate uncertainty in PCA (through confidence ellipses). Otherwise, the Bayesian method can be used by specifying the argument method.mi="Bayes". It is based on an iterative algorithm which alternates imputation of the data set and draw of the PCA parameters in a posterior distribution. These steps are repeated Lstart times to reach a convergence. Then, one imputed data set is kept each L iterations to ensure independence between imputed values from a data set to another. The Bayesian method is more recomm[e]nded to apply a statistical method on an incomplete data set."

  • @nayldev9185 4 years ago

    Your English is very good!

  • @HouDa-fi9nq 2 years ago

    3:52

  • @Nicolas-mp4oq 3 years ago

    thx