Principal Component Analysis (PCA) - easy and practical explanation
- Published on 1 Jun 2024
- In this video, I will give you an easy and practical explanation of Principal Component Analysis (PCA) and how to use it to visualise biological datasets.
You can also find a step by step explanation here: biostatsquid.com/pca-simply-e...
Hope you like it!
--------------------------------------------------------------------------------------------------------------------
Watched it already?
If you liked this video or found it useful, please let me know! Your comments and feedback are very much appreciated😊
If you have questions, don't hesitate to leave me a comment down below, I will answer as soon as I can:)
--------------------------------------------------------------------------------------------------------------------
Visit biostatsquid.com/ for more biostatistics tools and resources:
• simple and clear explanations of biostatistics methods
• computational biology tools
• easy step-by-step tutorials in R and Python
To analyse and visualise your biological data!
Or follow me on Instagram at @biostatsquid
Don’t forget to subscribe if you don’t want to miss another video from me!
--------------------------------------------------------------------------------------------------------------------
More PCA resources:
A deeper explanation of the math behind PCA, without math!
towardsdatascience.com/princi...
I also really love this explanation from StatQuest!
• StatQuest: Principal C...
I am now convinced that there are no tough subjects, only ineffective tutors. I have been struggling to understand this concept for over 3 years, and here I am, within 11 minutes things have fallen into place.
An expert is not necessarily a great teacher. There may be great experts assigned to teach such concepts in educational institutions.
But someone like you is what we need in our schools and colleges (expert and well articulated).
Simplicity is the utmost form of sophistication.
Thanking you from the bottom of my heart.
Keep on helping people like us.
Perhaps another video on how to do it in R would be a great hit.
Thank you so much for your kind words, I'm really flattered! I'm glad it was useful and it cleared up concepts for you:) Great idea about an R tutorial, will definitely add it to my todo list!
This video just solved half of my problem in understanding PCA stats. To solve the other half, I need to translate the info to my actual research.
You are for sure principal component #1! You're the best at describing information ;)
I'm currently watching without logging into my Google account. 😊 However, halfway through, I made the decision to log in, hit the like button, and subscribe to your channel. 🎉 Thank you for your valuable content; it's truly helpful, and I encourage you to keep up the great work! 👍
I have watched so many videos trying to understand pca ..and this by far is the most interesting with fundamentals fully explained
Amazingly well explained
Hi thank you so much for explaining PCA in such a clear way. I've been really stressed about understanding it for my uni stats exam, but now I feel much more confident :)
Best explanation on YouTube, awesome.
Very well explained, thank you!
wow!!! that was explained so nicely by you..... thank you!
So well explained. Thanks a bunch!
Best video to understand PCA plot 😊
Fantastic presentation.
This was excellent. Some people just know how to explain things
Thank you. Well explained!
You really understand what you are talking about, big up
Wow! best PCA video on youtube.
Thank you very much for your clear explanation.
Just on point! Loved it!
super elegant and clear explanations, thank you!
Thank you, I'm happy you found it useful:)
the best explanation. easy to understand.
Loved it! It's a really comprehensive explanation!😍
Simply excellent !
It would be great to have PCA explained conceptually, mathematically, as well as programmatically. When push comes to shove, we'll need to do it in a computer, running an algorithm that either we have to put in, or call from a Python library.
Thank you for all the work you do educating us!
Well explained. Thank You.
Thank you very much, very well explained
Great example, this is exactly what I'm doing too!
Thanks for making this amazing video. It helped me explain things I have been researching for days.
Very good high level video!
Amazing explanation
Amazing content, clearly explained! :)
Good explanation, that's great!
Was going insane looking for an understandable explanation of "what" a PCA is, until I found this video! Thank you very much!
Thank you for your kind words!! Glad it helped:)
Very well explained
Nice, video, thanks!
Great I benefited a lot!
Very good explanation mam
Woow.. this is so helpful
Nice video😊
Lovely video, thank you for explaining!
Glad it was helpful! You're very welcome:)
very nice
i wish i could hug you, thank you so much
really nice, congratulations for your video! I follow you now :)
Nice, very nice
It's well explained for a beginner to understand the plot, but if you want to know how to do it, this video can't help you
Very well explained. Two questions:
How do they know which dimensions they have to combine into a PCA to explain most of the variance? The combinations are limitless especially for single cell sequencing analysis.
Can combining dimensions also reduce variance explanation? Like dimension 1 + dimension 2 explains 50% but dimension 1 + 3 explains 30%? How do you make sure this doesn't happen?
I'd love to have a tutorial on how to perform this on R. This was very well explained.
Great suggestion! I cover it a bit in the preprocessing video but maybe a specific video for PCA in R would be good - I'll keep it in mind! Thanks!
How do you explain which factors contribute to PC1 and PC2? With a biplot?
I have one question if I have 60(A1-A60) variables with a 2k sample size,
A1 is the first and A60 is last, in between these A10, A20, A30, A40, A50 and the confirmed output but for some of the samples the A19, A29 output doesn't exist, as A20 reached earlier, the data is of this type for some reasons.
Will PCA work in the same way as explained?
Please can you tell me how to calculate the principal component loadings? I am a bit confused about this part.
Hi
Good presentation on PCA. Can we apply PCA to a dataset that has both numeric and categorical data? Also, do we need to ensure that each variable follows a normal distribution, and if it does not, what should we do? Do we also need to normalise each of the variables? I'd appreciate your comments.
Hi, great questions. PCA is not recommended for categorical data - even if you one-hot encode it. For mixed data types, there are better alternatives like Factor Analysis of Mixed Data, available in the FactoMineR R package (FAMD()); Multiple Factor Analysis (MFA()) is also an option. I haven't got experience with either but you can check the thread here: stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont
Yes, it is necessary to standardise data before performing PCA because PCA basically maximises the variance. So if you have some variables with a very large variance and some with little variance, it will give more importance to the variables with large variance. If you change the scale of one of your variables, e.g., weight of mice, from kg to g, the variance increases, and the variable 'weight' will go from having little impact to being the main feature that explains variance in your dataset. Standardising will do the trick since it makes the SD of all the variables the same (normalization does not make all variables have the same variance). Hope this was clear!
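The unit-change point in the reply above can be checked with a few lines of Python. This is only a minimal sketch using the standard library: the mouse weights and tail lengths are made-up numbers, and `standardise` is a hypothetical helper name, not something from the video. It shows that converting kg to g multiplies the variance by 1,000,000, while the standardised values come out the same whichever unit you start from.

```python
import statistics

# Made-up measurements for five mice (illustrative only, not from the video).
weight_kg = [0.020, 0.025, 0.030, 0.022, 0.028]
tail_cm = [7.5, 8.0, 9.1, 7.8, 8.6]

# Same weights expressed in grams.
weight_g = [w * 1000 for w in weight_kg]


def standardise(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


var_kg = statistics.pvariance(weight_kg)
var_g = statistics.pvariance(weight_g)
var_tail = statistics.pvariance(tail_cm)

# In kg, weight has far less variance than tail length; in g, far more.
print(f"variance of weight (kg): {var_kg:.8f}")
print(f"variance of weight (g):  {var_g:.4f}")
print(f"variance of tail (cm):   {var_tail:.4f}")

# After standardising, the unit no longer matters and every variable has SD 1.
z_kg = standardise(weight_kg)
z_g = standardise(weight_g)
z_tail = standardise(tail_cm)
print("SD after standardising:", statistics.pstdev(z_kg), statistics.pstdev(z_tail))
```

Without standardising, a PCA on the gram version would be dominated by weight simply because of the unit it was recorded in.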
Hi! I really love your explanation! Would it be possible to get a copy of the dataset? I need to teach PCA and i think this a nice example cause the relationships between the factors are easy to understand! Would definitely point them to this video!
Hi Liz, thanks for your feedback! Unfortunately I cannot share my dataset, not because I don't want to, but because there is no dataset! I just made up the categories and figures for illustration purposes, just cause it is easier to understand when the factors are 'obvious'. So sorry to disappoint you...
However, you can check out my post here in case it is helpful: biostatsquid.com/pca-simply-explained/
@biostatsquid no worries, thanks so much!
How do I obtain the loadings? Are they the same as the eigenvectors or the scaled coordinates? In my geochemical software ioGAS, the PCA report contains these items: Correlation - Eigenvectors - Eigenvector Plots - Eigenvalues - Scree Plot - Scaled Coordinates - PC1 vs PC2 - PC1 vs PC3 - PC1 vs PC4 and so on... (the last is PC3 vs PC4). My input was 32 chemical elements previously transformed with CLR.
Here is the ioGAS description for Scaled Coordinates:
"Created by scaling the length of the eigenvector to the eigenvalue. All eigenvectors have a length of 1 so scaling by the eigenvalue changes the lengths so that the length is proportional to the variance (eigenvalue) accounted for by that eigenvector.
Click on a PC header column to sort the scaled coordinates from lowest to highest or vice versa."
And for Eigenvectors:
"Eigenvectors are PCA coordinate values that correspond to the projected location of the original input variables onto the calculated PCA axes. PC1, or the first eigenvector, is a calculated line of best fit through the maximum direction of variation for the selected variables. The PC1 eigenvectors represent the value of each input put along this line. PC2 is a line of best fit through the maximum variation at right angles to PC1 so the PC2 eigenvalues are the original input variable values projected onto this axis, and so on for each of the number of principal components.
An eigenvector may be in either of two opposite directions. ioGAS will always choose the eigenvector whose first element is positive. Click on a PC header column to sort the eigenvectors from lowest to highest or vice versa."
Ahhh, I think the Loadings are equal to the Scaled Coordinates 😅
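Different tools use the word "loadings" for different quantities, so it may help to see the two usual candidates side by side. The sketch below uses NumPy on simulated data (not the commenter's geochemical data) and assumes one common convention: loadings are the unit-length eigenvectors scaled by the square root of the eigenvalue. Whether ioGAS's Scaled Coordinates use exactly this scaling is not confirmed here; its documentation is the place to check.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is reproducible

# Simulated data: 200 samples x 3 variables, with variables 0 and 2 correlated.
X = rng.normal(size=(200, 3))
X[:, 2] += X[:, 0]

# Standardise, then do PCA on the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.corrcoef(Z, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are PCs, each of length 1

# Assumed convention: loadings = eigenvector * sqrt(eigenvalue).
loadings = eigenvectors * np.sqrt(eigenvalues)

# Under this convention each loading equals the correlation between
# an original variable and the scores of a PC.
scores = Z @ eigenvectors
var_pc_corr = np.array(
    [[np.corrcoef(Z[:, j], scores[:, k])[0, 1] for k in range(3)] for j in range(3)]
)

print("eigenvectors (unit length):\n", eigenvectors.round(3))
print("loadings (scaled):\n", loadings.round(3))
```

So the eigenvectors give the direction of each PC, while the scaled version also carries how much variance that PC accounts for.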
Thank you for your video. After you have assigned PC1 to PC5 ..., you show the PC matrix in order reflecting the amount of variation explained, where there are a variety of values listed under each PC from - 6 to +6. What do these values represent?
Hi! Thanks for your question! So the values are just an example, they don't necessarily go from -6 to +6. Basically, the values represent the 'contribution' of that variable to a specific PC. Since PCs are ranked by the variation of the dataset they explain (PC1 explains more than PC2, which in turn explains more than PC3...), variables with higher (more positive) or lower (more negative) values for the lower-numbered PCs (e.g., PC1) are 'more important', in other words, they explain more variability in the dataset. Hope this helped!
Thank you very much for your rapid reply and explanation. I thought that this was the case, but was not certain. As an extension of my question, do these + or - values under each PC align with a tick mark on the x:y and -x:-y axes? (for reference the axes you use to demonstrate these concepts around 5:10 to 5:30 minutes into your presentation). If "yes", and by way of feedback, having a scale on these axes would be helpful. I have watched 3 separate presentations on PCA today, and I have found yours the most useful. Thank you again, and in particular for responding to my question so quickly. Best wishes.
Hi thanks so much for your feedback! No, they're not! The tick marks represent increments of 1 (so 1, 2, 3, 4...) and I think my intention was to make them match the PC scores, but I must have changed the labels around to make it make sense with the biology and forgot to update the table. But they should match, so thanks for pointing that out! Will correct it if I ever do a part 2 on this:) Cheers @brettlidbury4110
@biostatsquid My pleasure and looking forward to the next installment (o:
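For readers following this thread, here is a rough sketch of the two kinds of numbers that come out of a PCA: the per-variable coefficients for each PC (the 'contributions' discussed above) and the per-sample scores, which are the coordinates drawn on a PCA plot. It uses NumPy's SVD on simulated data; the data and variable names are assumptions for illustration, not the table from the video.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Simulated data: 50 samples x 4 variables (hypothetical, not the video's table).
X = rng.normal(size=(50, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardise each variable

# SVD of the standardised data: each row of Vt holds the coefficients of one PC.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
coefficients = Vt.T  # shape (variables, PCs): one column per PC

# Scores: each sample's position along each PC (what a PCA plot shows).
scores = Z @ coefficients  # shape (samples, PCs)

# Variance along each PC, already sorted from PC1 downwards.
variances = S**2 / (Z.shape[0] - 1)

print("coefficients of each variable on PC1:", coefficients[:, 0].round(3))
print("first sample's (PC1, PC2) position:", scores[0, :2].round(3))
print("variance along each PC:", variances.round(3))
```

With this setup, the axis ticks on a PC1 vs PC2 plot are in the units of `scores`, not of `coefficients`, which is one reason the two sets of numbers need not line up.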
Looking for a response from the Author - What is the significance of a low PCA for a large biological data set? - Does a PC1 of
If PC1 is 20% it means it explains 20% of the variability of the dataset. You can then check which are the top contributing variables of PC1 to figure out what are the features of your dataset that explain most variability. In complex scenarios you might be happy with 20% of variability. For example, you are studying height in the human population, and want to figure out which genes contribute to height. You 'take' a sample of people with different heights, do RNAseq to figure out gene expression (this is a very simple example, but let's go with it). You do PCA on the gene expression counts of all genes in the human genome. PC1 explains 20% of variability (i.e., differences in height in the sample you took). Then you check, and the top PC1-contributing genes are X, Y, Z. So you know that X, Y, Z genes most probably play an important role in height. But of course this is only 20% of the variability of your data. What about the other 80%? Well, you forgot about other important factors that contribute to height, like diet, gender, genomic variability (not only transcriptomics, but also epigenetics, genomics might play an important role!) ... etc. Hope this made it a bit easier to understand!
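To make the "PC1 explains X%" figure concrete, here is a hedged sketch of how that percentage is usually computed: each PC's eigenvalue divided by the sum of all eigenvalues. Note that the percentage refers to variance across the input variables that went into the PCA. The data are simulated with NumPy and stand in for a real expression matrix; nothing here comes from an actual dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Simulated data: 100 samples x 5 variables (hypothetical stand-in for real data).
X = rng.normal(size=(100, 5))
X[:, 1] += 2.0 * X[:, 0]  # make variable 1 strongly correlated with variable 0

# Standardise each variable, then take the covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder so PC1 comes first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Fraction of the total variance captured by each PC.
explained = eigenvalues / eigenvalues.sum()
for i, frac in enumerate(explained, start=1):
    print(f"PC{i}: {frac:.1%} of total variance")

# Rank variables by the size of their PC1 coefficient (the 'top contributors').
top = np.argsort(np.abs(eigenvectors[:, 0]))[::-1]
print("variables ranked by |PC1 coefficient|:", top)
```

In this simulated case the two correlated variables come out as the top contributors to PC1, which mirrors the "genes X, Y, Z" step in the reply above.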
Very helpful video, but I'm not sure I understand: to use PCA, do the variables need to be correlated or not?
Hi Niki, not sure if I understand your question, could you rephrase it, please?
@biostatsquid Sorry it was not clear... I just wonder if PCA is limited to data where there is some correlation among the factors, or among some of them, for example height and weight being correlated, etc.
@nikitrianta9896 Oh I see! No, not at all, actually PCA allows you to gather insights about the features describing your data - by looking at the coefficients of the features/variables for each PC you can find out if they are positively, negatively or not correlated.
If you want to visualise this you can draw a plot of the coefficients for PC1 vs PC2 (for example) for all features. For each feature, imagine (or draw) a vector with origin in (0, 0) to the point (coefficient PC1, coefficient PC2). Features that are positively correlated with each other have an angle between their vectors close to 0 degrees, if they are negatively correlated the angle between them is close to 180 degrees, and if they are not correlated the angle is close to 90 degrees.
Does this answer your question? :)
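The angle rule described in this reply can be tried out on simulated data. The sketch below is an illustration under stated assumptions, not the author's method: it builds four made-up features with NumPy (two positively correlated, one negatively correlated, one unrelated), scales each PC's coefficients by the square root of its eigenvalue (a common biplot convention, assumed here), and measures the angles between the feature vectors in the PC1/PC2 plane. The rule works well here because PC1 and PC2 together capture almost all the variance; with less variance captured, the angles are only approximate.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed so the sketch is reproducible
n = 500
a = rng.normal(size=n)  # hidden factor shared by features 0, 1 and 2
b = rng.normal(size=n)  # independent hidden factor behind feature 3

# Made-up features (hypothetical, for illustration only).
X = np.column_stack([
    a + 0.1 * rng.normal(size=n),   # feature 0
    a + 0.1 * rng.normal(size=n),   # feature 1: positively correlated with 0
    -a + 0.1 * rng.normal(size=n),  # feature 2: negatively correlated with 0
    b + 0.1 * rng.normal(size=n),   # feature 3: unrelated to feature 0
])

# PCA on the correlation matrix, PCs sorted from largest eigenvalue down.
corr = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# One vector per feature in the PC1/PC2 plane (coefficients scaled by sqrt(eigenvalue)).
vectors = (eigenvectors * np.sqrt(eigenvalues))[:, :2]


def angle_deg(u, v):
    """Angle in degrees between two 2D vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


angle_pos = angle_deg(vectors[0], vectors[1])
angle_neg = angle_deg(vectors[0], vectors[2])
angle_none = angle_deg(vectors[0], vectors[3])

print(f"features 0 and 1 (positively correlated): {angle_pos:.1f} degrees")
print(f"features 0 and 2 (negatively correlated): {angle_neg:.1f} degrees")
print(f"features 0 and 3 (uncorrelated):          {angle_none:.1f} degrees")
```

Flipping the sign of a PC (which software is free to do) mirrors every vector at once, so the angles between features stay the same.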
Do you by chance make time for appointments? I would be grateful. Thanks
Hi Ruth! Just send me an email describing your issue and I'll tell you if I can help:)
At ~3:58 you say the principal components explain 85% of the variance in life expectancy. I don't think that's right. I think it's 85% of the variance in the predictor variables. Or am I totally confused?
Whole world creator's godfather bless you all always and you all love and remember godfather with your pure hearts.