StatQuest: PCA - Practical Tips

  • Published Jun 1, 2024
  • This is a follow-up video for StatQuest: Principal Component Analysis (PCA), Step-by-Step • StatQuest: Principal C...
    In it, I give practical advice about the need to scale your data, the need to center your data, and how many principal components you should expect to get.
    If you are interested in doing PCA in R see: • StatQuest: PCA in R
    For a complete index of all the StatQuest videos, check out:
    statquest.org/video-index/
    If you'd like to support StatQuest, please consider...
    Buying The StatQuest Illustrated Guide to Machine Learning!!!
    PDF - statquest.gumroad.com/l/wvtmc
    Paperback - www.amazon.com/dp/B09ZCKR4H6
    Kindle eBook - www.amazon.com/dp/B09ZG79HXC
    Patreon: / statquest
    ...or...
    TH-cam Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshirt.com/statques...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer.bandcamp.com/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    0:47 Make sure the data are on the same scale
    2:53 Make sure the data are centered
    3:30 How to determine the number of principal components
    #statquest #PCA #ML
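The three tips covered in the video (put the variables on the same scale, center them, and expect a bounded number of principal components) can be sketched in a few lines. The toy data below are hypothetical, and PCA is done via SVD, the approach the step-by-step video describes:

```python
import numpy as np

# Hypothetical data: 4 students, Math scored 0-100, Reading scored 0-10.
X = np.array([[90.0, 9.0],
              [60.0, 4.0],
              [80.0, 8.0],
              [50.0, 3.0]])

# Tip 2: center each variable by subtracting its column mean.
# Tip 1: put the variables on the same scale by dividing by each column's SD.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via SVD on the centered, scaled matrix: rows of Vt are the PC
# directions, and squared singular values are proportional to the
# variance each PC explains.
U, s, Vt = np.linalg.svd(X_std, full_matrices=False)
var_explained = s**2 / (s**2).sum()

# Tip 3: the number of PCs with eigenvalue > 0 is bounded by
# min(number of variables, number of samples - 1).
max_pcs = min(X.shape[1], X.shape[0] - 1)
```

After standardizing, every column has standard deviation 1 and mean 0, so no variable dominates PC1 just because it was measured on a bigger scale.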

Comments • 174

  • @statquest
    @statquest  2 years ago +4

    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @buihung3704
    @buihung3704 7 months ago +8

    This is a gold mine for Data Scientists, Data Engineers, and ML/DL engineers. I can hardly think of anyone else who can teach the same concepts more clearly.

    • @statquest
      @statquest  7 months ago

      Thank you very much! :)

  • @geethanjalikannan5527
    @geethanjalikannan5527 4 years ago +51

    Dear Josh, I had so many issues with stats as I am from a totally different background. Watching your videos helped me overcome my insecurities. Thank you so much.

    • @statquest
      @statquest  4 years ago +5

      Hooray! I'm glad the videos are helpful.

  • @iloveno3
    @iloveno3 6 years ago +7

    The intro with you singing is so cute, made me smile...

  • @Jason-xe4tt
    @Jason-xe4tt 5 years ago +3

    All profs in the world need to learn how to teach from you! Thanks!

    • @statquest
      @statquest  5 years ago

      You're welcome!!! :)

  • @shwetankagrawal4253
    @shwetankagrawal4253 4 years ago +15

    Your intro music always makes me smile😂😂

    • @statquest
      @statquest  4 years ago +1

      Thanks! :)

  • @caperucito5
    @caperucito5 4 years ago +10

    Josh's videos are so cool that I usually like them before watching.

    • @statquest
      @statquest  4 years ago

      That's awesome! :)

    • @alecvan7143
      @alecvan7143 4 years ago +1

      I concur :P

    • @statquest
      @statquest  4 years ago

      @@alecvan7143 :)

  • @jesusfranciscoquevedoosegu4933
    @jesusfranciscoquevedoosegu4933 5 years ago +4

    Thank you so much for, basically, all your videos on PCA

    • @statquest
      @statquest  5 years ago

      You're welcome!!! I'm glad that you like them. :)

  • @bendiknyheim6936
    @bendiknyheim6936 3 years ago +4

    Thank you for all the amazing videos. I would be having a really hard time without them

    • @statquest
      @statquest  3 years ago +1

      Glad you like them!

  • @stevenmugishamizero8471
    @stevenmugishamizero8471 7 months ago +2

    The best on this platform, hands down🙌

    • @statquest
      @statquest  7 months ago

      Thank you!

  • @yurobert3007
    @yurobert3007 1 year ago

    This PCA series (step-by-step, practical tips, then R) is brilliant! I found them very helpful. Thank you for these great videos!
    Would you consider doing a series on factor analysis?

    • @statquest
      @statquest  1 year ago +2

      Thanks! One day I hope to do a series on factor analysis.

  • @ylazerson
    @ylazerson 5 years ago +3

    Fantastic video once again!

    • @statquest
      @statquest  5 years ago

      Hooray!!!! I'm glad you're enjoying them. :)

  • @jonatanottinogonzalez2965
    @jonatanottinogonzalez2965 6 years ago

    Great video! Quick question: would converting raw scores into z-scores both center and scale my data? Thanks!

  • @AakashOnKeys
    @AakashOnKeys 1 year ago +1

    Thanks for the heads-up! Very helpful!

    • @statquest
      @statquest  1 year ago

      Happy to help!

  • @urd4651
    @urd4651 3 years ago +1

    Well explained!!!!! Thank you very much!

    • @statquest
      @statquest  3 years ago

      Glad you liked it!

  • @arifahafdila5531
    @arifahafdila5531 3 years ago +1

    Thank you so much for the videos 👍

    • @statquest
      @statquest  3 years ago

      Glad you like them!

  • @urjaswitayadav3188
    @urjaswitayadav3188 6 years ago

    Thanks for the video Joshua! Would you please consider doing a video on the hypergeometric distribution and the hypergeometric test? I have seen that it is often used to check the significance of overlaps between lists generated by high-throughput analyses, but I am always confused about how to set it up when I have to do one myself. Thanks a lot!

    • @urjaswitayadav3188
      @urjaswitayadav3188 6 years ago

      Yes! That's exactly what I wanted. Thanks a lot :)

  • @mjifri2000
    @mjifri2000 5 years ago +1

    Man, you are the best.

  • @lingxinhe4627
    @lingxinhe4627 3 years ago +1

    Hi Josh,
    Thank you for the amazing videos; the content on this channel on stats is so much better than everything else I've found online. I have a quest(ion): once you get PC1 and PC2 as the main components that explain variation, how can we get back to the variables that compose them?
    Thank you!

    • @statquest
      @statquest  3 years ago

      I show how to do this exact thing in my PCA in R th-cam.com/video/0Jp4gsfOLMs/w-d-xo.html and PCA in Python th-cam.com/video/Lsue2gEM9D0/w-d-xo.html videos.

  • @joyousmomentscollection
    @joyousmomentscollection 4 years ago +1

    Thanks Josh... If your data contains one-hot encoded data (transformed from categorical data) and discrete data along with continuous data types, what kind of scaling would be preferred before applying the PCA technique?

    • @statquest
      @statquest  4 years ago

      It may be better to use lasso or elastic-net regularization to select the variables that are most important than to use PCA. Regularization can remove variables that are not useful for making predictions. If you're interested in this subject, I have several videos on it. Just look for "regularization" on my video index page: statquest.org/video-index/

  • @samirsivan8134
    @samirsivan8134 22 days ago +1

    I love StatQuest❤

    • @statquest
      @statquest  22 days ago

      Thank you! :)

  • @wolfisraging
    @wolfisraging 6 years ago

    Thank you sooooooooo much, that's damn awesome.

  • @reytns1
    @reytns1 6 years ago

    I have a question: could I enter a percentage value in order to obtain a PCA?

  • @survivio8937
    @survivio8937 4 years ago

    Thank you so much for these amazing videos. With my new-found free time, I am trying to learn about PCA in preparation for upcoming RNA-seq experiments. I have yet to do this and will probably understand more once I have practical experience, but one thing struck me as odd in your video. When scaling data, you state that the typical method is to divide by the standard deviation, assuming large values will have larger SD. But intuitively it would make sense to scale data based on the mean rather than the SD. For example, if I had one gene which is highly expressed but not variable, then it would not be scaled down appropriately and would have an oversized contribution to PC1. Am I thinking about this wrong, or is there some reason why the mean is a bad choice? Next, it seems that with scaling, small changes in rare transcripts (that might just be error and not true transcripts) would contribute a lot to the variability and thus PC1; does this not present a problem?
    Also, another comment: I find this and the prior video on PCA from 2018 much more intuitive than the one you produced previously, in which you discuss generating a PC axis by looking at the spread of data for 2 cells with multiple transcripts and coming up with weights or "loading scores" for each transcript based on high and low expression.
    Thank you

    • @statquest
      @statquest  4 years ago +1

      In the PCA step-by-step video ( th-cam.com/video/FgakZw6K1QQ/w-d-xo.html ) one of the first things we do is center the data. This is the equivalent of subtracting the mean value from each dimension in the data. So, for your example, if you have a gene with high expression but no variation, we will subtract the mean of that gene from each replicate. So that part of the data standardization is already taken care of.
      The original PCA video is based on the old way of doing PCA, which is still taught as if the new way does not exist. The old way is based on creating a variance/covariance matrix of all the observations. I agree that it is not as intuitive to understand as the new way, which is to use Singular Value Decomposition.

    • @gspb4
      @gspb4 4 years ago

      ​@@statquest Hi Josh. You mention the "old way" of performing PCA using the variance/covariance matrix versus the new way of using SVD. Do both techniques produce identical results?
      Further, have you considered producing videos on non-linear PCA?
      Anyway, thanks so much for what you do. I'm currently taking a computational biology course in grad school and wouldn't be able to get through it without your videos!!
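On the question of whether the two techniques produce identical results: the eigenvectors of the variance/covariance matrix and the right singular vectors from SVD of the centered data are the same directions, up to sign. A quick numerical check on hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # hypothetical data: 20 samples, 3 variables
Xc = X - X.mean(axis=0)             # both methods require centered data

# "Old way": eigendecomposition of the variance/covariance matrix.
V = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(V)
order = np.argsort(eigvals)[::-1]   # eigh returns eigenvalues in ascending order
pcs_old = eigvecs[:, order]

# "New way": SVD of the centered data; rows of Vt are the PC directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs_new = Vt.T

# The directions agree up to sign, and the eigenvalues equal s^2 / (n - 1).
agree = all(np.allclose(pcs_old[:, i], pcs_new[:, i]) or
            np.allclose(pcs_old[:, i], -pcs_new[:, i]) for i in range(3))
```

The sign ambiguity is expected: flipping a principal component's direction does not change the axis it defines.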

  • @johnfinn9495
    @johnfinn9495 1 year ago +1

    Very nice videos. Have you considered a segment on kernel PCA?

    • @statquest
      @statquest  1 year ago

      I'll keep that in mind.

  • @boultifnidhal2600
    @boultifnidhal2600 2 years ago

    Thank you so much for switching to math and reading, because the genes-and-cells examples were giving me headaches. Nevertheless, thank you so much for your efforts ♥♥

  • @reytns1
    @reytns1 6 years ago +1

    Another question, regarding PLS: as I understand it, PLS is a regression on top of PCA. Is that right? And can PLS regress on only one variable (I mean the Y variable)? Hmm, another question: if you have a lot of traits in a PCA, is there a statistic that shows which traits are the most and second-most important for that PCA? I mean, other than the eigenvalue? Thanks

    • @statquest
      @statquest  6 years ago +1

      Partial Least Squares (PLS) and Principal Component Regression (PCR) are both ways to combine regression with PCA, as a way to avoid overfitting the model (if there are more variables than samples, you'll overfit your model and your future predictions will be bad). PCR does PCA on just the variables (the measurements used to predict something). As a result, it focuses on the variables responsible for most of the variation in the data. In contrast, PLS does PCA on both the variables and the thing you want to predict. This makes PLS focus on variation in the variables as well as variables that correlate with the thing you want to predict.
      As for statistics on which variable is the most important for PCA (other than just looking at the loading scores), you could probably use bootstrapping, but, at least at this time, I don't have a lot of experience with this.

  • @MyKornflake
    @MyKornflake 4 years ago

    Great explanation. I made a PCA plot with 96 genes from 6 different samples using SPSS, but I am having a hard time trying to interpret what PC1 and PC2 represent. Could you please give me some idea on this? Thanks in advance.

    • @statquest
      @statquest  4 years ago +1

      Look at the magnitude of the loading scores for PC1 and PC2.

  • @paulotarso4483
    @paulotarso4483 3 years ago +1

    Hey Josh, thanks so much for your videos... 3 quick questions:
    1. 7:54 says "if there are fewer samples than variables, the number of samples puts an upper bound on the number of PCs with eigenvalues greater than 0", but in the example there, the number of samples is equal to the number of variables, not less. Should the statement be "if # of samples

    • @statquest
      @statquest  3 years ago +1

      1) What matters is that there is an upper bound and it depends on the number of variables and the number of samples, and that means we can actually write it both ways: "if # of samples

    • @ptflecha
      @ptflecha 3 years ago +1

      Thanks so much!!
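The upper bound discussed in this thread, min(# of variables, # of samples − 1), can be checked numerically: centering uses up one degree of freedom, so with 3 samples in 5 dimensions at most 2 eigenvalues can be greater than 0. A sketch with hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 5))      # hypothetical: 3 samples, 5 variables
Xc = X - X.mean(axis=0)          # centering makes the rows sum to zero

# Singular values of the centered matrix; squared (and divided by n - 1),
# they are the PC eigenvalues.
s = np.linalg.svd(Xc, compute_uv=False)

# Only min(#variables, #samples - 1) = 2 of them are nonzero; the rest
# vanish to numerical noise.
n_nonzero = int((s > 1e-10).sum())
```

Because the centered rows sum to zero, 3 points can span at most a 2-dimensional plane, which is exactly the "2 points define a line, 3 points define a plane" argument from the video.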

  • @shubhamgupta6567
    @shubhamgupta6567 4 years ago

    Can you make a video on partial least squares regression, please?

  • @thuyduongnguyen1231
    @thuyduongnguyen1231 4 years ago

    Dear Josh, I have watched PCA (step-by-step) and this video of yours. It really helps me get over the fear of math and try to understand these terminologies. However, I wonder: what if our data has 20 attributes (not 2 or 3 attributes like in the video)? Does it mean we will have 20 PCs, or is there another approach to determine the maximum number of PCs? Thank you very much

    • @statquest
      @statquest  4 years ago

      I answer this question at 3:30

  • @user-ib9lp8zx6x
    @user-ib9lp8zx6x 6 years ago +1

    Two quick questions, Joshua. When we deal with RNA-seq data, we should log-transform the data before running PCA, right? Can I say it is a way to minimize the effect of outliers when determining PCs? My second question: in R, there is a built-in function called prcomp; also, in many other packages there are functions like runPCA and plotPCA. How do I know whether these functions will center the data before calculating variation and doing projections? Thanks!

    • @statquest
      @statquest  6 years ago

      Log transforming RNA-seq data before PCA is a good idea and I generally do it. For prcomp(), there is a parameter "scale" that you can set to TRUE. When you do this, prcomp() will center and scale your data for you. In general, you can always look up the documentation for the PCA function you are using. In R you can get the documentation for prcomp() with the call "?prcomp()".
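A minimal Python sketch of the preprocessing described in this reply: log-transform, then center and scale (the centering and scaling are what prcomp(..., scale = TRUE) does in R). The counts and the +1 pseudocount are assumptions for illustration:

```python
import math

# Hypothetical RNA-seq counts for one gene across 4 samples. A pseudocount
# of +1 avoids log(0); log-transforming tames large outliers before PCA.
counts = [0, 10, 100, 1000]
logged = [math.log2(c + 1) for c in counts]

# Center (subtract the mean) and scale (divide by the sample SD).
n = len(logged)
mean = sum(logged) / n
sd = (sum((x - mean) ** 2 for x in logged) / (n - 1)) ** 0.5
standardized = [(x - mean) / sd for x in logged]
```

After the log transform, the thousand-fold spread in raw counts shrinks to roughly a ten-fold spread, so one extreme sample no longer dominates the variance.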

  • @Asia25Asia
    @Asia25Asia 2 years ago

    Hi Josh! Thank you for your videos. Could you please give some hints on what to do with NA (not obtained) values in PCA? How to deal with them? Additionally, which is better as input to PCA: raw data (abundance) or relative abundance (percentage)?

    • @statquest
      @statquest  2 years ago +1

      You can try to impute the missing values. And depending on what you want to show, it can be better to use raw data or some sort of transformed version.

    • @Asia25Asia
      @Asia25Asia 2 years ago

      @@statquest Thanks a lot for the quick response. Can you recommend some easy and friendly function for imputing biological data?

    • @statquest
      @statquest  2 years ago +1

      @@Asia25Asia Not off the top of my head.

    • @Asia25Asia
      @Asia25Asia 2 years ago

      @@statquest OK, no problem :)

  • @alexisvivoli8963
    @alexisvivoli8963 4 years ago +1

    Hi Josh! Thanks a lot for your videos. I visualize very well how it works for 3 variables thanks to your animation, but I'm struggling to understand what happens if you add more variables: how do you project/center and calculate PCs, since you can't have more than 3 dimensions? So how does it work if you have, let's say, 4 or even more, like 100 variables? Thanks!

    • @statquest
      @statquest  4 years ago +1

      If I have one variable, called var1, then I can center it by calculating the mean for var1 and subtracting that from each value.
      If I have two variables, var1 and var2, I can center the data by calculating the mean for var1 and subtracting that from all of the var1 values, and calculating the mean for var2 and subtracting that from all of the var2 values.
      The same pattern continues for 3, 4, or any number of variables. In general, if I have N variables, var1, var2, var3 ... varN, then I can center the data by calculating the mean for var_i and subtracting that from all of the var_i values, where i is a value from 1 to N. Does that make sense?

    • @jasperli7794
      @jasperli7794 2 years ago

      @@statquest Thanks, I understand this idea of centering the data for all variables. But then how do you draw the principal components for all the variables beyond 3? After you draw principal component 1 through the origin (so that it best fits the data, using SVD etc.), place principal component 2 through the origin perpendicular to it, and principal component 3 perpendicular to both 1 and 2, how do you continue placing principal components perpendicular to the first 3? Is there an explanation for further principal components which does not rely on the restrictions of the physical 3D world? Thank you very much!

    • @statquest
      @statquest  2 years ago

      @@jasperli7794 It's just relatively abstract math, which isn't limited to 3 dimensions. However, the concepts are the same, regardless of the number of dimensions.

    • @jasperli7794
      @jasperli7794 2 years ago +1

      @@statquest Okay, so if I understand correctly, the principal components capture various axes which are related to each other by position, and which explain (decreasing amounts of) variance within the data and the relative contributions of each feature/variable at each principal component. Thanks!

    • @statquest
      @statquest  2 years ago

      @@jasperli7794 Yep!
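The column-by-column centering described in this reply works for any number of variables; here is a plain-Python sketch on made-up data with four variables:

```python
# Hypothetical data: 3 samples (rows), 4 variables (columns).
data = [
    [10.0, 1.0, 100.0, 0.5],
    [20.0, 3.0, 300.0, 1.5],
    [30.0, 5.0, 500.0, 2.5],
]

n_rows = len(data)
n_cols = len(data[0])

# For each variable i, compute its mean and subtract it from every value
# of that variable - the same recipe regardless of how many variables exist.
means = [sum(row[j] for row in data) / n_rows for j in range(n_cols)]
centered = [[row[j] - means[j] for j in range(n_cols)] for row in data]
```

After centering, every column sums to zero, which is what lets the PCs all pass through the origin.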

  • @samggfr
    @samggfr 1 year ago +1

    Hi Josh. Thanks for your videos, especially when you dive into details and tips.
    In tip #2 concerning centering, you show 2 sets of 3 points and you present centering on the mean. Let's imagine an experiment with 3 patients on drug A and 3 patients on drugs A and B. Say the lower/left set is the reference, drug A, and the upper/right set is the test, drug A+B. What about centering on A (so set A sits at the origin)? This centering should show the total effect of adding drug B to drug A, whereas mean centering shows half the effect. In the same vein, the variables plot should show the variables that change from the drug A set to the drug A+B set, instead of showing variables that change from the mean experiment, i.e. (drugA + drugAB)/2. What's your view?

    • @statquest
      @statquest  1 year ago +1

      Centering using all of the data does not change the relationship between the two groups of points - they are still the same distance apart from each other, and the eigenvalue will reflect this and give you a sense of how different A is from A+B.

    • @samggfr
      @samggfr 1 year ago

      @@statquest Thanks for your reply concerning the distance, which I might interpret as the effect size. Could you tell me your view concerning the plot of variables?

    • @statquest
      @statquest  1 year ago +1

      @@samggfr I'm not 100% certain I understand your question about the variables plot, but the loadings for the variables on PC1 will tell you which variables have the largest influence in causing variation in that direction.

  • @paulohmarco
    @paulohmarco 3 years ago

    Hi Professor Josh Starmer,
    Thanks a lot for your videos. This is a very joyful way to teach these methods!
    Please, let me ask you: I am giving an online lecture about PCA in Brazil, in Portuguese, and I would like to ask your permission to use some of your examples to teach PCA. Of course, I will reference your StatQuest channel.
    Thanks in advance!

    • @statquest
      @statquest  3 years ago

      Feel free to use the examples and cite the video.

  • @kushaltm6325
    @kushaltm6325 5 years ago +1

    Josh, thank you very much for helping us out with stats. When I get a job, I will surely contribute towards your efforts.
    I am struggling to understand things @3:10.
    Why should it be a problem if we do NOT center the data?
    Can you please explain with respect to your "PCA - Clearly Explained" video? My prof wouldn't answer it, so I'm asking a Cool-Stat-Guru about it :)
    If it requires too much elaboration, please point me to other resources... Thanks again.
    Best wishes from India... :)

    • @statquest
      @statquest  5 years ago

      Thanks!! Do you mean try to explain it in terms of "PCA - Clearly Explained" or "PCA Step-By-Step"? The former shows the "old" or "original" method of PCA, which was to find the eigenvectors of the covariance matrix. The latter, "Step-by-Step", shows how PCA is done using the more modern technique of Singular Value Decomposition. I think it is easier to understand centering in terms of SVD.

  • @sane7263
    @sane7263 1 year ago

    Great video, Josh!
    I am wondering @ 7:32, "Find the line perpendicular to PC1 that fits best" - what does this mean?
    I mean, you can either have a perpendicular line or a best-fit line.

    • @statquest
      @statquest  1 year ago +1

      When you have more than 2 dimensions, the first perpendicular line can rotate around PC1 and still be perpendicular. Thus, any line in that plane will be perpendicular. For more details, see: th-cam.com/video/FgakZw6K1QQ/w-d-xo.html

    • @sane7263
      @sane7263 1 year ago

      @@statquest Thanks for the lightning-fast reply, Josh!
      I have already seen that video, and after watching this one I had the same question.
      If PC2 (a line) passes through PC1 (another line) perpendicularly, i.e., at 90 degrees, how can it rotate and still maintain that angle?

    • @statquest
      @statquest  1 year ago +1

      @@sane7263 If we have 3 dimensions, PC1 can go anywhere. PC2, however, can go anywhere in a plane that is perpendicular to PC1, and PC3 has no choice but to be perpendicular to both PC1 and PC2. I try to illustrate this here: th-cam.com/video/FgakZw6K1QQ/w-d-xo.html

    • @sane7263
      @sane7263 1 year ago

      @@statquest Ahh! I see!
      So if we have a 2D plane, PC1 can go anywhere, but in this case PC2 will have no choice but to be perpendicular. Right?
      I think now I've got it.

    • @statquest
      @statquest  1 year ago +1

      @@sane7263 That's right. When we only have 2 dimensions, the first line can go anywhere, but once that is determined, the second line has no choice. When we have 3 dimensions, things are a little more interesting for the second line.

  • @misseghe3239
    @misseghe3239 4 years ago

    Can PCA be used for regression problems, or only classification problems? Thanks.

    • @statquest
      @statquest  4 years ago

      There are actually several types of regression that use PCA. PCA reduces the number of variables in your model.

  • @mostafael-tager8908
    @mostafael-tager8908 4 years ago +1

    Thanks for the video, but I think there is a simple mistake at @2:08 when you say to mix 0.77 Math with 0.77 Reading. I thought that both must add up to 1, or did I get something wrong?

    • @statquest
      @statquest  4 years ago +3

      The loadings for math and reading represent 2 sides of a triangle that has been normalized so that the hypotenuse = 1. In other words, by the Pythagorean theorem, it is the squared loadings that sum to 1, not the loadings themselves. For more details about this, see minute 11 and second 16 in this video: th-cam.com/video/FgakZw6K1QQ/w-d-xo.html

  • @AyatUllah-zr6ij
    @AyatUllah-zr6ij 2 months ago +1

    Good ❤

    • @statquest
      @statquest  2 months ago

      Bam! :)

  • @Patrick881199
    @Patrick881199 4 years ago +1

    Hi Josh, I am a little confused: at 2:37 you mentioned using the standard deviation. Well, if we have math scores (0-100) with a standard deviation of 5 and, at the same time, the reading scores (0-10) also have an SD of 5, then by dividing by the SD, math and reading are still NOT on the same scale.

    • @statquest
      @statquest  4 years ago

      Regardless of the original scale, if you divide each value in a set of measurements by the standard deviation of that set, the standard deviation of the new values will be 1. And that puts all variables on the same scale.

    • @statquest
      @statquest  4 years ago

      For more details, see: stats.idre.ucla.edu/stata/faq/how-do-i-standardize-variables-in-stata/

    • @Patrick881199
      @Patrick881199 4 years ago +1

      @@statquest Thanks, Josh
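The claim in the reply above is easy to verify: dividing a set of measurements by its own standard deviation leaves a set whose standard deviation is 1, whatever the original scale. A minimal sketch with hypothetical scores:

```python
import statistics

# Two hypothetical variables on very different scales.
math_scores = [55.0, 60.0, 65.0, 70.0]    # roughly a 0-100 scale
reading_scores = [2.0, 4.0, 6.0, 8.0]     # roughly a 0-10 scale

def scale(values):
    """Divide each value by the sample standard deviation of the set."""
    sd = statistics.stdev(values)
    return [v / sd for v in values]

# After scaling, both variables have standard deviation exactly 1,
# so neither dominates the PCA just because of its units.
scaled_math = scale(math_scores)
scaled_reading = scale(reading_scores)
```

This works because dividing every value by a constant c divides the SD by the same c, so dividing by the SD itself always yields an SD of 1.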

  • @lucaliberato1457
    @lucaliberato1457 2 years ago

    Hello Josh, I have a question. You say that, to find a 3rd PC, we should find a line perpendicular to PC1 and PC2, and that it's not possible. But in the first video you say we can find a PC3 that goes through the origin and is perpendicular to PC1 and PC2. I lost something in the video for sure; can you help me please?

    • @statquest
      @statquest  2 years ago +1

      In the first video, we have enough data points on the graph that we can meaningfully create 3 axes. However, in this example, we don't have enough data to do that. The point being made in this video is that the maximum number of PCs can be limited by the number of data points. So, even if you have 3-D data, if you only have 2 points, then you will only have 1 PC, because 2 points only define a specific line. We need 3 points to define a specific plane (for 2 PCs), and we'd need 4 points to define 3 PCs, etc.

    • @lucaliberato1457
      @lucaliberato1457 2 years ago +1

      @@statquest Thank you so much Josh, you're super😎

  • @marianaferreiracruz5398
    @marianaferreiracruz5398 6 months ago

    love the music

    • @statquest
      @statquest  6 months ago +1

      Thanks!

  • @doubletoned5772
    @doubletoned5772 4 years ago

    I have a trivial question at 1:39. If the recipe to make PC1 uses approximately 10 parts Math and only 1 part Reading, why does that mean that Math is '10' times more important than Reading to explain the variation in the data? I mean, I understand that it will be more important, but is that specific number (10) correct?

    • @statquest
      @statquest  4 years ago

      I think my wording may have been sloppy here.

  • @raisaoliveira7
    @raisaoliveira7 1 year ago +1

    • @statquest
      @statquest  1 year ago

      You're welcome again! :)

  • @namithacherian1743
    @namithacherian1743 1 year ago

    DOUBT: When there are only 2 points, you mentioned that you can fit only one line through them (correct). However, there is no guarantee that it will pass through the origin. In other words, when there are only 2 points, you can draw a line that goes through the origin and fits one data point for sure, but having it pass through both data points is a matter of chance. Right?

    • @statquest
      @statquest  1 year ago

      What time point, minutes and seconds, are you asking about?

  • @mojojojo890
    @mojojojo890 2 years ago

    If you could explain why the first PC is an eigenvector, that would be nice.
    I know that eigenvectors are the vectors whose span doesn't change after a transformation, even if they are scaled... so what exactly is the transformation applied here?

    • @statquest
      @statquest  2 years ago

      I explain eigenvectors in this video: th-cam.com/video/FgakZw6K1QQ/w-d-xo.html

    • @mojojojo890
      @mojojojo890 2 years ago

      @@statquest I watched that video but it does not explain why the first PC is an eigenvector

    • @statquest
      @statquest  2 years ago

      @@mojojojo890 Ah, I see. First, that video focuses on Singular Value Decomposition, which is the modern way to do PCA and doesn't actually involve calculating eigenvectors. However, the old method does, by applying eigendecomposition to the variance/covariance matrix of the raw data. And in the old method, PC1 was an eigenvector of the variance/covariance matrix. In other words, if the variance/covariance matrix is V, then V x PC1 = eigenvalue * PC1, which makes PC1 an eigenvector.

    • @mojojojo890
      @mojojojo890 2 years ago +1

      @@statquest That sends me somewhere to find my answer... Thanks a lot!!

  • @Amf313
    @Amf313 3 years ago

    How should we scale variables which don't have clear upper or lower bounds?
    For example, if our 2 variables are human height and weight...
    Is it rational to scale them based on the maximum height and weight existing in the whole sample?
    What if we have just one person weighing above 100 kg, and his weight is 160 kg?
    If we drop only this sample from the data, the scale and all the PCs will differ significantly. So is it rational to base the variable scale on the max and min of the values existing in the samples (for such variables without intrinsic upper and lower bounds)?
    🤔

    • @statquest
      @statquest  3 years ago

      You scale the data based on the data itself, not theoretical bounds.

  • @majidkh2695
    @majidkh2695 5 years ago

    @ 7:01, for the case where we have 2 points and 3 features, shouldn't the number of PCs be 2?! With 3 features we don't have a line anymore, but a hyperplane!

  • @etornamtsyawo6407
    @etornamtsyawo6407 1 year ago +1

    Let me like the video before I even start watching.

  • @addisonmcghee9190
    @addisonmcghee9190 3 years ago

    So Josh, would the upper bound for principal components be: minimum{ # of variables, (# of samples - 1) }?

    • @statquest
      @statquest  3 years ago

      I answer this question at 3:30

    • @addisonmcghee9190
      @addisonmcghee9190 3 years ago

      @@statquest
      Ok, so if we had 2 students and 5 variables, wouldn't we only have 1 principal component? These are two points in a 5-dimensional space, so it would be a line, right?
      So, (# of samples - 1)?
      I'm just trying to find a pattern...

    • @statquest
      @statquest  3 years ago

      @@addisonmcghee9190 That is correct.

    • @addisonmcghee9190
      @addisonmcghee9190 2 years ago +1

      @@statquest Revisiting this comment because I'm learning about PCA in G-school... old StatQuest for the win!

  • @danielasanabria3242
    @danielasanabria3242 4 ปีที่แล้ว

    Did you already make a video about Partial Least Squares?

  • @seazink5357
    @seazink5357 5 ปีที่แล้ว

    love you

  • @mrweisu
    @mrweisu 4 ปีที่แล้ว

    At 6:19, even the two points are on a line, but does the line necessarily go through (0,0)? If not, there still can be two PCs. Can you help clarify? Thanks.

    • @statquest
      @statquest  4 ปีที่แล้ว

      PC1 always goes through the origin. That's why we center the data to begin with.

    • @mrweisu
      @mrweisu 4 ปีที่แล้ว

      @@statquest Yes, but the line connecting the two points might not.

    • @statquest
      @statquest  4 ปีที่แล้ว

      If the data are centered, then the line connecting the two points will go through the origin. If they are not centered, then, technically, you are correct, we will have 2 PCs - but neither PC will do a good job reflecting the relationship of the data as well as the PC derived from the centered data.

    • @mrweisu
      @mrweisu 4 ปีที่แล้ว +1

      @@statquest does centering data make the connecting line go through (0,0)?

    • @statquest
      @statquest  4 years ago

      @@mrweisu Yes
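
To see concretely why centering puts the connecting line through (0,0), here is a minimal sketch (my own example, assuming numpy): after subtracting the mean, two points become exact negatives of each other, so the segment between them passes through the origin.

```python
import numpy as np

# Two 2-D points whose connecting line does NOT pass through the origin
a = np.array([1.0, 5.0])
b = np.array([3.0, 2.0])

# Center the data by subtracting the mean of the two points
mean = (a + b) / 2.0
a_c, b_c = a - mean, b - mean

# The centered points are exact negatives of each other, so the line
# connecting them goes through (0, 0), and PC1 can lie on that line
print(a_c, b_c)  # [-1.   1.5] [ 1.  -1.5]
```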

  • @giosang1111
    @giosang1111 4 years ago

    Hi, is it always true that to scale the values of the variables we divide the values by their SDs?

    • @statquest
      @statquest  4 years ago +1

      For PCA, yes.

    • @giosang1111
      @giosang1111 4 years ago

      Thanks! Can you make a video that summarizes which statistical methods are used in which cases? There are so many methods out there, and I am really confused about which to use and when. Thanks a lot.

    • @statquest
      @statquest  4 years ago +1

      @@giosang1111 Since there are so many methods, this would probably be a series of videos, rather than a single video, but either way, it's on the to-do list. However, it will probably be a while before I can get to it.

    • @giosang1111
      @giosang1111 4 years ago +1

      @@statquest Hi! I am looking forward to it. All the bests!
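
For readers who want the scaling step from this thread as code, here is a sketch (my own, assuming numpy): each variable is centered and then divided by its own standard deviation, so variables measured in big units can't dominate the PCs.

```python
import numpy as np

# Toy data: 4 samples x 2 variables on wildly different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [4.0, 4000.0]])

# Center each variable, then divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Every variable now has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0))
```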

  • @rezkyilhamsaputra8472
    @rezkyilhamsaputra8472 1 year ago

    Do these tips also apply to principal component regression (PCR)?

    • @statquest
      @statquest  1 year ago +1

      They apply to any time you want to use PCA, so yes, they would also apply to PCR.

    • @rezkyilhamsaputra8472
      @rezkyilhamsaputra8472 1 year ago

      @@statquest And if the software gives us an option whether or not to center and/or scale the data, is there a condition where we shouldn't center/scale the data, or must we always do it?

    • @statquest
      @statquest  1 year ago

      @@rezkyilhamsaputra8472 I can't think of a reason you wouldn't want to center your data. Scaling depends on the data itself. If it's already on the same scale, you might not want to do it.

    • @rezkyilhamsaputra8472
      @rezkyilhamsaputra8472 1 year ago +1

      @@statquest alright, thank you so much for the crystal clear explanation!

  • @k_a_shah
    @k_a_shah 10 months ago

    Which application is used to plot this graph?
    Or any software?

    • @statquest
      @statquest  10 months ago

      I give all my secrets away in this video: th-cam.com/video/crLXJG-EAhk/w-d-xo.html

  • @basharabdulrazeq4349
    @basharabdulrazeq4349 10 days ago

    Hello Josh. @ 7:57, you explained that if there are fewer samples than variables, then the number of samples puts an upper bound on the number of PCs. In the last example, there are 3 samples and 3 variables (so the number of samples isn't fewer than the number of variables), and the number of PCs should be 3 (not 2). Could you explain why you decided that the number of PCs should be 2? (BTW, I watched all of your videos about PCA, but I don't understand this specific example.)

    • @statquest
      @statquest  10 days ago

      The answer to your question starts at 5:09. The key is that we don't include PCs that have an eigenvalue = 0. If there's no variation in a direction, then there is no need for an axis in that direction. Thus, 3 data points can only define a 2-dimensional plane, and thus PC3 will have an eigenvalue = 0 and we can exclude it.

    • @basharabdulrazeq4349
      @basharabdulrazeq4349 9 days ago

      @@statquest I agree, but I can still see variation in a third direction. I just can't comprehend the idea that there isn't, because there are three variables for three students and all of them change with each other. I'd be so grateful if you could prove, or point me to a source that proves, that there shouldn't be a PC3, because I really need to comprehend the idea.

    • @statquest
      @statquest  9 days ago +1

      @@basharabdulrazeq4349 For each student, we have 3 values for the 3 variables that represent a single point in the 3-dimensional space. Thus, we have 3 points in the 3-dimensional space, one per student. 3 points define a plane, which is only a 2-dimensional space. Thus, only 2 PCs can possibly have eigenvalues > 0.

    • @basharabdulrazeq4349
      @basharabdulrazeq4349 9 days ago +1

      @@statquest Thanks a lot. I understand now that no matter which direction you arrange any three points, they will always lie on the same plane.
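
The "three points always lie on a plane" argument from this thread can also be verified numerically. The sketch below uses my own made-up scores (assuming numpy): with 3 students and 3 variables, the centered rows sum to the zero vector, so the third singular value is 0 and PC3 gets dropped.

```python
import numpy as np

# 3 students (rows) x 3 variables (columns)
X = np.array([[10.0, 6.0, 12.0],
              [11.0, 4.0,  9.0],
              [ 8.0, 8.0, 16.0]])

# After centering, the 3 rows sum to the zero vector, so rank <= 2
Xc = X - X.mean(axis=0)

s = np.linalg.svd(Xc, compute_uv=False)
print(np.sum(s > 1e-10))  # 2: PC3 has eigenvalue 0, so only 2 PCs remain
```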

  • @vigneshvicky6720
    @vigneshvicky6720 2 years ago +1

    You're using data points to get the PCs, but in general we use the covariance matrix to get the PCs. Why?

    • @statquest
      @statquest  2 years ago +1

      The old way to do PCA was to use a covariance matrix. However, no one does that anymore. Instead, we apply Singular Value Decomposition directly to the data.

    • @vigneshvicky6720
      @vigneshvicky6720 2 years ago +2

      @@statquest Thank you! Love from India 💖
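
For the curious, the two routes mentioned above agree, which the following sketch illustrates (my own, assuming numpy): the eigenvalues of the covariance matrix equal the squared singular values of the centered data divided by n - 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))      # 6 samples, 3 variables
Xc = X - X.mean(axis=0)          # center each variable
n = Xc.shape[0]

# Old route: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Modern route: SVD applied directly to the centered data
s = np.linalg.svd(Xc, compute_uv=False)

print(np.allclose(eigvals, s**2 / (n - 1)))  # True
```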

  • @kartikmalladi1918
    @kartikmalladi1918 1 year ago

    What is the need for PCA if you can use average scores as the contribution?

    • @statquest
      @statquest  1 year ago

      My main PCA video gives a reason to use it: th-cam.com/video/FgakZw6K1QQ/w-d-xo.html

    • @kartikmalladi1918
      @kartikmalladi1918 1 year ago

      @@statquest I mean, I've gone through your videos. Great work, by the way. The main goal of PCA is to understand the contribution of each variable to a sample. However, finding the average of each variable and its percentage contribution still gives some idea. So how is this average contribution different from the PCA loading scores?

    • @statquest
      @statquest  1 year ago

      @@kartikmalladi1918 What time point in the video, minutes and seconds are you asking about?

    • @kartikmalladi1918
      @kartikmalladi1918 1 year ago

      @@statquest It's from the main PCA video, in the discussion of the loading score contribution.

    • @statquest
      @statquest  1 year ago

      @@kartikmalladi1918 What time point?

  • @haydergfg6702
    @haydergfg6702 1 year ago

    Which programming language are you using?

    • @statquest
      @statquest  1 year ago

      I'm not sure I understand your question. Are you asking how I created the video? If so, see: th-cam.com/video/crLXJG-EAhk/w-d-xo.html

  • @mojo9Y7OsXKT
    @mojo9Y7OsXKT 4 years ago

    How come this video has gone "Private"? The screen says: "Video Unavailable" "This video is private"!!

    • @statquest
      @statquest  4 years ago

      This specific video? Or are you asking about another video?

    • @mojo9Y7OsXKT
      @mojo9Y7OsXKT 4 years ago

      @@statquest This video was showing as private yesterday. It's come back up today! Could've been a glitch. Thanks for all your vids.

    • @statquest
      @statquest  4 years ago

      @@mojo9Y7OsXKT Yeah, something strange must have happened. I'm glad it's back. :)

  • @shwetankagrawal4253
    @shwetankagrawal4253 4 years ago

    Hey John, I am not able to understand kernel PCA. Can you explain it, or tell me the name of a book that can give me a clear understanding of it?

  • @sarahjamal86
    @sarahjamal86 5 years ago +1

    OK... since PCA uses the SVD and the covariance matrix, not centering the data at the origin means the data is not mean-free, and removing the mean is part of constructing the covariance matrix. So not having zero-mean data means that our eigenvectors will not be derived 100% correctly.

    • @statquest
      @statquest  5 years ago +3

      PCA uses SVD or the covariance matrix. It doesn't use both. Older PCA methods use a covariance matrix and a covariance matrix is automatically centered, so you don't need to worry about this. However, newer PCA methods use SVD because it is more likely to give you the correct result (SVD is more "numerically stable"), and when using SVD, you need to center your data (or make sure that the program you are using will center it for you). Otherwise you get the errors illustrated at 2:55.
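
The error from skipping the centering step can be demonstrated directly. In the sketch below (my own example, assuming numpy), the data varies mostly along the x-axis but sits far from the origin in y; without centering, the first singular vector chases the offset instead of the variation.

```python
import numpy as np

rng = np.random.default_rng(1)
# 50 points: lots of variation in x, almost none in y, but offset to y = 100
X = np.column_stack([rng.normal(0, 5, 50), rng.normal(100, 0.5, 50)])

# SVD without centering: the top direction points at the big y-offset
_, _, Vt_raw = np.linalg.svd(X)
# SVD after centering: the top direction follows the real variation (x)
_, _, Vt = np.linalg.svd(X - X.mean(axis=0))

print(np.abs(Vt_raw[0]).round(2))  # approximately [0, 1]: wrong direction
print(np.abs(Vt[0]).round(2))      # approximately [1, 0]: correct direction
```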

  • @JaspreetSingh-eh1vy
    @JaspreetSingh-eh1vy 3 years ago

    So, technically speaking, the number of PCs = the number of features, but if the number of samples < the number of features, then the number of PCs = the number of samples - 1. Am I right?

  • @pattiknuth4822
    @pattiknuth4822 3 years ago

    Drop the song.

  • @1989ENM
    @1989ENM 5 years ago +1

    ...for me?