StatQuest: PCA in Python

  • Published on 21 Jul 2024
  • You asked for it, you got it! Now I walk you through how to do PCA in Python, step-by-step. It's not too bad, and I'll show you how to generate test data, do the analysis, draw fancy graphs and interpret the results. If you want to download the code, here's the link to the StatQuest GitHub:
    github.com/StatQuest/pca_demo...
    For a complete index of all the StatQuest videos, check out:
    statquest.org/video-index/
    If you'd like to support StatQuest, please consider...
    Buying The StatQuest Illustrated Guide to Machine Learning!!!
    PDF - statquest.gumroad.com/l/wvtmc
    Paperback - www.amazon.com/dp/B09ZCKR4H6
    Kindle eBook - www.amazon.com/dp/B09ZG79HXC
    Patreon: / statquest
    ...or...
    YouTube Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshirt.com/statques...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer.bandcamp.com/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    1:06 Load modules and generate data
    5:03 Scaling and centering data
    7:31 Use scikit for PCA
    8:34 Draw a scree plot
    9:18 Draw a PCA plot
    10:18 Examine the loading scores
    Correction:
    3:23 The array should only have wt1 through wt5, ko1 through ko5.
    #statquest #PCA

Comments • 399

  • @statquest
    @statquest  3 years ago +14

    Correction:
    3:23 The array should only have wt1 through wt5, ko1 through ko5.
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @GoktugAsc123
      @GoktugAsc123 3 years ago

      Thank you, I was mentioning 3:23. Your videos are great.
      I am a medical doctor from Turkey and currently, I am planning a career change to data science and I have been watching your videos to get prepared for a data scientist position. Could you create a few videos regarding data science interviews if it is relevant for your channel content? Best Regards, Göktuğ Aşcı, MD.

    • @statquest
      @statquest  3 years ago +1

      @@GoktugAsc123 I'll keep that in mind.

    • @statquest
      @statquest  3 years ago +3

      @@keerthik3791 Unfortunately the random forest implementations for Python are really bad and they don't have all of the features. If you're going to use a random forest, I would highly recommend that you do it in R instead.

    • @keerthik2168
      @keerthik2168 3 years ago

      @@statquest Thank you for the suggestion. I am good at Python and MATLAB. Can I do random forest in MATLAB? Or is learning R necessary here?

    • @statquest
      @statquest  3 years ago

      @@keerthik2168 I have no idea. I've never tried to do random forests in Matlab.

  • @pressiyamu8976
    @pressiyamu8976 4 years ago +111

    Dude you deserve a humanitarian award.

  • @mohammedghouse235
    @mohammedghouse235 3 years ago +16

    Not only the best PCA demonstration but also THE BEST introduction to Python. Hats off to you man!!

    • @statquest
      @statquest  3 years ago +1

      Thank you! :)

  • @advaitshirvaikar4751
    @advaitshirvaikar4751 3 years ago +47

    Whenever I search for some machine learning based explanation, I add 'by statquest' in it ^_^. Keep up the great work :')

    • @statquest
      @statquest  3 years ago +5

      Thank you very much!

    • @shaktishivalingam3880
      @shaktishivalingam3880 3 years ago +5

      @@statquest It's true, I do the same thing. Thank you for your hard work.

  • @LittleScience
    @LittleScience 3 years ago +4

    I have been dabbling in data science for a while now, and only now learned that pandas stands for "panel data" xd
    This channel never ceases to amaze

  • @mattheckel2609
    @mattheckel2609 3 years ago +13

    "Note: We use samples as columns in this example because... but there is no requirement to do so."
    "Alternatively, we could have used..."
    "One last note about scaling with sklearn vs scale() in R"
    This is some of the gold that sets StatQuest apart. Thank you! ❤

    • @statquest
      @statquest  3 years ago

      Thank you! :)

  • @hayskapoy
    @hayskapoy 4 years ago +35

    Finally! You explain in the language I understand much better than English haha Thanks !!!

  • @vedparulekar478
    @vedparulekar478 10 months ago +2

    One of the best videos ever made on this topic. This channel has helped me a lot in understanding machine learning in greater detail. Keep up the good work !!

    • @statquest
      @statquest  10 months ago

      Thank you!

  • @x11y22z33me
    @x11y22z33me 2 years ago +1

    Simply loving StatQuest. Concise, clear and fun videos. One point I noted while watching this video is that the latest version of sklearn PCA() will center the data for you, but not scale it. So if you just need centering for doing pca, you don't need to worry about preprocessing.

    • @statquest
      @statquest  2 years ago

      Thanks for the update!
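
The point about centering versus scaling can be checked with a small experiment. This is only a sketch on made-up data (the array `X` and its scales are assumptions, not the video's data): centering by hand should change nothing, while scaling should change how the variation is split among the PCs.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 10 samples (rows) x 3 variables (columns) on very different scales
X = rng.normal(loc=[5.0, 50.0, 500.0], scale=[1.0, 10.0, 100.0], size=(10, 3))

# Centering by hand makes no difference, because PCA subtracts the column means itself
centered = X - X.mean(axis=0)
pc_raw = PCA().fit_transform(X)
pc_centered = PCA().fit_transform(centered)
same_after_centering = np.allclose(pc_raw, pc_centered)

# Scaling does make a difference: PCA does not divide by the standard deviation
scaled = centered / X.std(axis=0)
ratio_raw = PCA().fit(X).explained_variance_ratio_
ratio_scaled = PCA().fit(scaled).explained_variance_ratio_
```

Without scaling, the variable with the largest spread dominates PC1, which is why the video scales the data before calling PCA.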

  • @reneeliu6676
    @reneeliu6676 5 years ago +5

    I am watching the 1st minute and I'm already super excited. Thanks!!

    • @statquest
      @statquest  5 years ago

      Hooray!!!!!! :)

  • @DATABOI
    @DATABOI 6 years ago +74

    Python. Now you're speaking my language :)

    • @HK-sw3vi
      @HK-sw3vi 4 years ago

      want me to take out my python?

    • @joicejoseph2176
      @joicejoseph2176 4 years ago +2

      @@HK-sw3vi ...weirdo

  • @raphael3835
    @raphael3835 5 years ago +3

    The only good step by step explanation I found on the web. Thank you so much!

    • @statquest
      @statquest  5 years ago +1

      Hooray!!! Thank you so much! :)

  • @tl-lay
    @tl-lay 3 years ago +2

    YOU ARE SAVING MY DEGREE I LOVE YOU SO MUCH I CANT EVEN BELIEVE THIS IS THE SAME MATERIAL IM LEARNING IN MY MACHINE LEARNING CLASS RIGHT NOW.

    • @statquest
      @statquest  3 years ago

      Happy to help!

  • @oswaldocastro9600
    @oswaldocastro9600 5 years ago +7

    Hi Josh... Simply incredible all StatQuest videos... Triple Bam!!!

    • @statquest
      @statquest  5 years ago

      Thank you! :)

  • @shanmugapriyak7269
    @shanmugapriyak7269 4 years ago +1

    Always can find a new and detailed explanation of steps from your videos! Thank you!

    • @statquest
      @statquest  4 years ago

      Thank you! :)

  • @BrunetteViking
    @BrunetteViking 2 years ago +1

    This channel is the best YouTube channel that I discovered. Thank you, sir!

  • @jiangxu3895
    @jiangxu3895 3 years ago +3

    Thank you Josh. Such practice is important and valuable!! And you also taught some Python tricks that I didn't know.

    • @statquest
      @statquest  3 years ago

      Thank you! :)

  • @amribrahim7850
    @amribrahim7850 3 years ago +3

    Awesome. Please create more videos about how to implement the machine learning as well as data science concepts explained here into Python. That would be super helpful for us, in particular beginners.

    • @statquest
      @statquest  3 years ago +2

      Thanks, will do!

  • @LincolnFrias
    @LincolnFrias 5 years ago +2

    It's awesome to have the explanation based on python code. Thanks a lot!

    • @statquest
      @statquest  5 years ago +1

      No problem. I'm doing a lot more Python coding these days, so hopefully I'll make more of these "in python" videos.

  • @christopheryogodzinski6860
    @christopheryogodzinski6860 6 years ago +2

    Another great StatQuest in the books!

  • @antomartanto
    @antomartanto 3 years ago +1

    You are one of the best teachers that I've ever found. Thank you very much!

    • @statquest
      @statquest  3 years ago +1

      Thank you! :)

  • @einemailadressenbesitzerei8816
    @einemailadressenbesitzerei8816 6 years ago

    Cool music and the best introduction video series I have seen on PCA on YouTube.

  • @kamogelomaila3904
    @kamogelomaila3904 6 years ago +1

    Hi Joshua, thanks for that. Really helpful. I'm quite new to Python myself, and I'm trying to compile a PCA across a range of macro-economic factors (inflation, GDP, FX, policy rate, etc.). In all that you've done above, where is the display of the PCA, i.e. the newly uncorrelated data set? Is it the loading scores you printed, or the wt and ko variables you plotted? Thanks

  • @neptunesbounty1786
    @neptunesbounty1786 4 years ago +2

    I learn so much better in Python for some reason, I think it's because it's more interactive and you can play around with the data! Good one. Stattttquueeeeeest.

    • @statquest
      @statquest  4 years ago +2

      Thanks! There should be a lot more Python videos and learning material out soon.

    • @godsperson5571
      @godsperson5571 3 years ago +1

      @@statquest looking forward to it :).

  • @RimaHandewi
    @RimaHandewi 3 months ago +1

    Wow, your explanation is so clear!!

    • @statquest
      @statquest  3 months ago

      Thank you! 😃

  • @merrimac1
    @merrimac1 5 years ago +8

    Thanks for the tutorial! One thing I don't understand is why PC1 can separate the wt and ko samples. Their gene expression values are generated in the same way.

    • @3stepsahead704
      @3stepsahead704 2 years ago

      Just stating I have the same question 2 years later.

  • @rohitrajora9832
    @rohitrajora9832 3 years ago +2

    Really appreciate this and would love to see more concepts implemented in python.

  • @spag5296
    @spag5296 3 years ago +2

    You've got the right formula for simple explanations. Teach me dawg

    • @statquest
      @statquest  3 years ago

      Thank you! :)

  • @KikiBah
    @KikiBah 3 years ago +1

    This was so clear, thanks! Finally I can do PCA in python, BAM 😊 You DA BEST!

  • @KnightPapa
    @KnightPapa 5 years ago +2

    Thank you! This video helped a lot with what I'm trying to do.

  • @khastehshodam
    @khastehshodam 5 years ago

    Hi Josh
    Thank you for the video. It was a great tutorial. Just one question: isn't what you called loading_score in the Python code in fact the component score? It was a score for each record (gene). Please correct me if I am wrong, but isn't the loading score the correlation between the original fields (wt1, wt2, etc.) and the components?
    Thank you

  • @timharris72
    @timharris72 6 years ago

    This was a really good explanation using Python

  • @saiakhil4751
    @saiakhil4751 3 years ago +1

    Wow Josh.. Thanks for that unpacking concept. I never knew that my whole life...

  • @samirsaci6723
    @samirsaci6723 2 years ago +1

    I push the like button even before I play the video. Because Josh never fails to amaze me.

  • @geraldopontes37
    @geraldopontes37 3 years ago +1

    Your videos are great! Thanks

  • @rabiabibi8634
    @rabiabibi8634 5 years ago +5

    Hi Josh. The best PCA explanation. Thanks a lot :-) May GOD bless you 😊

    • @statquest
      @statquest  5 years ago +1

      Thank you! :)

    • @pressiyamu8976
      @pressiyamu8976 4 years ago

      Yes, May god bless you 100 times. May the troubles of today’s world not reach your doorstep. You’re a great person.

  • @nonalcoho
    @nonalcoho 3 years ago +1

    I like the way you plot the ratio of each PC~~
    It is really easy to read!
    BAM~~~~~~~~~~

  • @epsilonprincipia9012
    @epsilonprincipia9012 5 years ago +1

    Good explanation. Thank you so much.

  • @danielvmartins4635
    @danielvmartins4635 10 months ago +1

    Excellent work!!! 👏👏

    • @statquest
      @statquest  10 months ago

      Thanks a lot!

  • @IntegralDeLinha
    @IntegralDeLinha 2 years ago +1

    Woww! That was absolutely awesome!!! Thank you so much!

    • @statquest
      @statquest  2 years ago +1

      Glad you liked it!

  • @jack.1.
    @jack.1. 3 years ago +2

    Wish there were more statquest coding in python videos, they are the best! Much prefer to regular content although that is still really high quality

  • @liranzaidman1610
    @liranzaidman1610 4 years ago +2

    Amazing! this is so important, thanks a lot.

  • @revolution77N
    @revolution77N 5 years ago

    Thank you very much! Super helpful!

  • @vipulsonawane7508
    @vipulsonawane7508 2 years ago +1

    What a playlist, I simply loved it 😘

  • @tymothylim6550
    @tymothylim6550 2 years ago +1

    Thanks for the great video! :)

  • @henkhbit5748
    @henkhbit5748 3 years ago +1

    As always a great presentation, and the Python code just gives the extra bite...

  • @fvviz409
    @fvviz409 4 years ago +5

    MAKE MORE PYTHON CONTENT PLEASE I LOVE IT

    • @statquest
      @statquest  4 years ago +2

      I'm working on it. :)

  • @danielcozetto421
    @danielcozetto421 2 years ago +2

    Hello Josh, thank you for the amazing video! Quick question: at 9:18, how can I adapt "index=[*wt, *ko]" for an Excel input? Let's say that we have the same variables (genes vs wt/ko) but in an Excel file. How can I add these labels to the final plot (9:47)? Thank you again!!

    • @statquest
      @statquest  2 years ago

      I'm not sure I understand your question. You can export your data from excel and import it into python (or R or whatever). Or are you asking about something else?

  • @wuyanyun
    @wuyanyun 4 years ago +3

    Thank you! I’ve been struggling with this problem for so long !

    • @statquest
      @statquest  4 years ago +1

      Hooray! I'm glad the video was helpful. :)

  • @guohanzhao7813
    @guohanzhao7813 6 years ago

    COOOOOL, so easy to understand!

  • @zishanahmedshaikh
    @zishanahmedshaikh 6 years ago

    Hi Joshua, Great Videos!

  • @jeremylv3029
    @jeremylv3029 3 years ago +1

    Man, u r a gem. I will pay for the knowledge later after my graduation bro. lol

    • @statquest
      @statquest  3 years ago

      Wow! Thank you! :)

  • @angeloperera2022
    @angeloperera2022 5 months ago +1

    Amazing video! I initially watched the video explaining PCA and i was mind-blown, thank you so much! I was hoping to ask if anyone on the comment section or even StatQuest if possible, would know how to implement PCA in a multivariate timeseries dataset and also "examine the loading scores" in such a dataset. Thanks in advance! :)
    P.S - extremely clueless on anything coding or ML, but Ive got to use PCA (and other dimensionality reduction methods) on my timeseries dataset. so would greatly appreciate any direction on how to proceed.

    • @statquest
      @statquest  5 months ago +1

      See: stats.stackexchange.com/questions/158281/can-pca-be-applied-for-time-series-data

  • @einemailadressenbesitzerei8816
    @einemailadressenbesitzerei8816 6 years ago +4

    I am in the last semester of my bachelor in computer science and i want to work in datascience business, i know the basics about datawarehousing/information engineering and artificial intelligence. Maybe you have a tip for me?

  • @mramadan2009
    @mramadan2009 4 years ago

    Hi Josh Thank you for your efforts,
    really statquest is a magnificent channel ,
    Could you please make video for Singular Value decomposition SVD.
    thanks

  • @ColeKillian
    @ColeKillian 4 years ago +1

    Amazing video thank you very much

  • @trustmebaz
    @trustmebaz 6 years ago +1

    Thank you Joshua for this wonderful explanation. Thanks a lot.
    I am using your code for generating a scree plot in the same way and I obtain this error: bar() missing 1 required positional argument: 'left'

    • @trustmebaz
      @trustmebaz 6 years ago +1

      Yes, I was using the original code given. I am using Python3, could that be the issue?

  • @olehsorokin7963
    @olehsorokin7963 3 years ago

    That's a cool one. The fact that observations are columns makes it so confusing though. I'm really used to the tidy data notation

  • @vasanthakumar1991
    @vasanthakumar1991 6 years ago +1

    BAM!!! I understood what you said. I show my gratitude. But I have a query.
    I am confused with my dataset regarding which to consider as rows and which as columns.
    My dataset is regarding phasor measurement units (PMUs) used in the electrical grid, or the sort of distribution lines we see around.
    One single PMU measures 21 electrical parameters for a timestamp.
    We use around four PMUs, each measuring the 21 parameters at different locations at the same time, continuously over a period of time.
    How can I arrange the above data for performing PCA, sir?

    • @vasanthakumar1991
      @vasanthakumar1991 6 years ago

      Sir, those two cases you mentioned where PCA would work are what I am also interested in calculating, apart from the combination of all of the PMUs' timestamps.
      Can you mention how to arrange the data (rows and columns) for both of the mentioned viable cases?
      Thanking you so much!! You are really awesome, sir

  • @matthsant
    @matthsant 5 years ago +1

    Excellent tutorial!!

    • @statquest
      @statquest  5 years ago

      Thank you! :)

  • @prakhars962
    @prakhars962 2 years ago +2

    this is so good

    • @statquest
      @statquest  2 years ago +1

      Thank you!

  • @fmetaller
    @fmetaller 6 years ago +2

    First I want to thank you for all the awesome videos on PCA. I wanted to experiment with the demo code you published but I'm having problems in the data generation. The asterisk method used to stack wt and ko series is not working.

    • @statquest
      @statquest  6 years ago

      Which version of Python are you using? The code was written for Python 3 and I'm not sure the asterisk method works in Python 2.x.

    • @fmetaller
      @fmetaller 6 years ago +2

      Oh, you are right, I was accidentally using Python 2. Now it works well in Python 3

    • @statquest
      @statquest  6 years ago +1

      Hooray! I'm glad you got it working :)
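
For readers hitting the same problem, here is a minimal sketch of what the asterisk does when the wt and ko labels are stacked. The label lists follow the naming mentioned in the video's correction note (wt1 through wt5, ko1 through ko5); the variable names `columns` and `columns_concat` are assumptions for illustration.

```python
# Sample labels: wt1..wt5 and ko1..ko5
wt = ['wt' + str(i) for i in range(1, 6)]
ko = ['ko' + str(i) for i in range(1, 6)]

# Python 3.5+: the asterisk unpacks each list into the new list
columns = [*wt, *ko]

# Plain concatenation builds the same list and also works in Python 2
columns_concat = wt + ko
```

Both forms give one flat list of ten labels, so either can be passed wherever the column names are needed.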

  • @sayanbhattacharya3233
    @sayanbhattacharya3233 4 years ago

    Please post some intuitions on sparse deconvolution and compressive sensing..Would love to understand your approach..❤️

  • @harryliu1005
    @harryliu1005 2 years ago

    Hi Josh!, this is a very excellent video that helped me a lot!!!
    I have a question, what if PC3 PC4 is also essential? Do I need to draw 2 2-D graphs, or what do I need to do?

    • @statquest
      @statquest  2 years ago

      If you want to draw the PCs and the data, then you'll have to draw multiple graphs. Or you can use the projections from the first 4 PCs and input to a dimension reduction algorithm like t-SNE: th-cam.com/video/NEaUSP4YerM/w-d-xo.html

  • @KomangWahyuTrisna
    @KomangWahyuTrisna 4 years ago +1

    i really like your clear explanation. please do some videos about deep learning and NLP.

    • @statquest
      @statquest  4 years ago +1

      I'm working on them.

    • @KomangWahyuTrisna
      @KomangWahyuTrisna 4 years ago

      @@statquest yeah! I am waiting for that

  • @metvava
    @metvava 4 months ago

    great video! thanks for these!!! have you done a redundancy analysis and dbRDA plot video? thank you for contributing to our education

    • @statquest
      @statquest  4 months ago

      I haven't done that yet.

    • @metvava
      @metvava 4 months ago +1

      @@statquest let us know if you ever do! It would be a double bam from me. It just clicks the way you explain! Thank you again for your content!!!

  • @richardlin6993
    @richardlin6993 4 years ago

    Looking forward to Kernel PCA in Python or explanation!

  • @jiayoongchong2606
    @jiayoongchong2606 4 years ago +2

    6:31 using scikit PCA
    8:35 plotting scree plot
    10:37 loading scores for each principal component

    • @statquest
      @statquest  4 years ago +1

      Thanks for the time points! I'll add those to the description to divide the video into chapters.

  • @dnuyc
    @dnuyc 3 years ago

    Great tutorial. Sorry if my question is amateur, but how were WT and KO told apart in the final PCA? I thought the data set was randomly generated.

    • @statquest
      @statquest  3 years ago

      Early on we gave the rows and columns names and kept track of them.

  • @shendrew42
    @shendrew42 6 years ago +1

    This was awesome and easy to follow.
    I'm fairly new to coding in Python 3, but when I try to run your code: plt.bar(x=range(1, len(per_var)+1), height=per_var, tick_label=labels)
    I get the error "TypeError: bar() missing 1 required positional argument: 'left'". Everything else works :)

    • @shendrew42
      @shendrew42 6 years ago +3

      Actually just got it to work with: plt.bar(left=range(1, len(per_var)+1), height=per_var, tick_label=labels)
      Awesome video again. :)
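
The 'left' error in this thread is consistent with an older Matplotlib release, where the first argument of `bar()` was named `left`; current releases name it `x`. One way to sidestep the difference, sketched here with made-up percentages (the `per_var` values and the helper `draw_scree_plot` are assumptions, not the video's output), is to pass the positions and heights positionally.

```python
import numpy as np

# Hypothetical percentages of variation, standing in for
# np.round(pca.explained_variance_ratio_ * 100, decimals=1)
per_var = np.array([89.3, 2.6, 2.1, 1.8, 1.4, 1.1, 0.8, 0.6, 0.3, 0.0])
labels = ['PC' + str(i) for i in range(1, len(per_var) + 1)]
positions = range(1, len(per_var) + 1)


def draw_scree_plot(path='scree.png'):
    # Imported here so the lines above run even where Matplotlib is not installed
    import matplotlib
    matplotlib.use('Agg')  # non-interactive backend: write a file, open no window
    import matplotlib.pyplot as plt

    # Positional arguments avoid the left/x keyword difference altogether
    plt.bar(positions, per_var, tick_label=labels)
    plt.ylabel('Percentage of Explained Variance')
    plt.xlabel('Principal Component')
    plt.title('Scree Plot')
    plt.savefig(path)
    plt.close()
```

Calling `draw_scree_plot()` writes the chart to `scree.png`; in a notebook, `plt.show()` could replace the `savefig` call.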

  • @naviddavanikabir
    @naviddavanikabir 3 years ago

    fantastic, like always.
    I wonder how Poisson distribution caused each wt samples and ko samples to be correlated with each other?

    • @statquest
      @statquest  3 years ago

      Because we generated the data, I selected different lambda values for the wt samples than for the ko samples.
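
A rough sketch of that idea follows. It is not the video's exact code: the random number generator, the seed, and the range of lambda values are assumptions. Each gene gets one lambda for the wt samples and a separate lambda for the ko samples.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen arbitrarily for this sketch

genes = ['gene' + str(i) for i in range(1, 101)]
wt = ['wt' + str(i) for i in range(1, 6)]
ko = ['ko' + str(i) for i in range(1, 6)]

rows = []
for gene in genes:
    # One lambda per gene for the wt samples and a separate one for the ko samples
    wt_lambda = rng.integers(10, 1000)
    ko_lambda = rng.integers(10, 1000)
    wt_counts = rng.poisson(lam=wt_lambda, size=5)
    ko_counts = rng.poisson(lam=ko_lambda, size=5)
    rows.append(np.concatenate([wt_counts, ko_counts]))

data = pd.DataFrame(rows, index=genes, columns=[*wt, *ko])

# Correlation between samples, computed across the 100 genes
corr = data.corr()
```

Within a group, the five samples share each gene's lambda, so their counts rise and fall together across genes. Between groups the lambdas are unrelated, which is what lets PC1 separate wt from ko.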

  • @vaggeliskyrilas3525
    @vaggeliskyrilas3525 3 years ago

    Firstly, very good video. Secondly, I am running the code on some spectra I have and I get a good PCA plot, but the loading scores seem to be wrong. Do you have any idea why?

  • @joyousmomentscollection
    @joyousmomentscollection 4 years ago

    Superb video... How does PCA perform if it has one-hot encoded data (transformed from categorical data) and discrete data? I suppose Z-scaling can't be done for such data types.

    • @madhu1987ful
      @madhu1987ful 4 years ago

      Scaling should be done only for numeric features.
      Categorical features should be one-hot encoded before they are used in PCA.

  • @sarathkareti8993
    @sarathkareti8993 3 years ago

    This video is really awesome! I am just confused on one thing, what are your predictors and what is your target?

    • @statquest
      @statquest  3 years ago

      PCA does not have predictors and targets. All variables are just...variables. For more details about PCA, see: th-cam.com/video/FgakZw6K1QQ/w-d-xo.html

  • @vannanuon6077
    @vannanuon6077 6 years ago

    Hi Joshua, thanks a lot for the clear explanation; you walked me through each step well.
    I have one question relating to PCA in scikit-learn. Actually, you said it in your clip, but I would like to ask to get a bit clearer. When using PCA from scikit-learn, we must do train and test, is that right? The one in your clip is just part of it, right? I ask because the result I get following your steps is different from the result from another program (CANOCO). Many thanks and look forward to hearing from you soon.

    • @vannanuon6077
      @vannanuon6077 6 years ago

      Thanks a lot Joshua for your clear explanation. I hope to see a new clip about train and test with PCA.

    • @vannanuon6077
      @vannanuon6077 6 years ago

      I apologise for one more question.
      I used the script from your video on the data from this link ("archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"). It ran well except for the last step (chart). It shows the following error. Do you know what happened with the data or the script?
      TypeError: cannot convert the series to
      Thanks, and sorry again for another question.

    • @vannanuon6077
      @vannanuon6077 6 years ago

      Thanks a lot Joshua for the link. It is really useful.

  • @dafran500
    @dafran500 3 years ago

    Hello, thanks for the videos. I think you explain great. I have a question. How can we make rotations with this package?

    • @statquest
      @statquest  3 years ago

      You can multiply data by the loading values.

    • @dafran500
      @dafran500 3 years ago

      @@statquest Thanks for the response. That will allow me to make rotations such as varimax, quartimax, etc? :(
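
To make "multiply data by the loading values" concrete, here is a small sketch on made-up data (the array `X` is an assumption). It checks that multiplying the centered data by the loading scores reproduces what scikit-learn's `transform` returns. It only covers the PCA projection itself; rotations such as varimax are a different operation and are not shown.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))   # hypothetical data: 10 samples x 4 variables

pca = PCA()
pca.fit(X)
projected = pca.transform(X)

# Multiply the centered data by the loading scores (the rows of components_)
by_hand = (X - pca.mean_) @ pca.components_.T
matches = np.allclose(projected, by_hand)
```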

  • @airthang
    @airthang 4 years ago

    Hi, by any chance, do you have a video about the theory of PLS and how to implement it in machine learning?

    • @statquest
      @statquest  4 years ago +1

      I'll keep it in mind.

  • @weiqingwang1202
    @weiqingwang1202 4 years ago

    Are loading scores eigenvalues? Wish to see a more linear-algebra method of explaining PCA!

    • @statquest
      @statquest  4 years ago

      For more details on how PCA works, see: th-cam.com/video/FgakZw6K1QQ/w-d-xo.html
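
As a short linear-algebra illustration of the question, the sketch below uses made-up data (the array `X` is an assumption) and treats loading scores the way the video's code does, as the entries of a principal component. Under that reading the loading scores come from the eigenvectors of the covariance matrix, while the eigenvalues measure the variation along each PC.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * [3.0, 2.0, 1.0]   # hypothetical: 50 samples x 3 variables

# Eigendecomposition of the covariance matrix of the variables
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending order; PCA lists the largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Eigenvalues: variation along each PC. Columns of eigenvectors: loading scores.
percent_variation = eigenvalues / eigenvalues.sum() * 100
pc1_loading_scores = eigenvectors[:, 0]
```

Some textbooks instead define loadings as eigenvectors scaled by the square root of the eigenvalue, so it is worth checking which convention a given tool follows.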

  • @yareniyigun7236
    @yareniyigun7236 4 years ago

    Cool video, thanks! How can I generate numbers based on the Poisson distribution that won't change when I rerun the code for the dataframe?

    • @statquest
      @statquest  4 years ago +1

      np.random.seed(0). For more details, see: stackoverflow.com/questions/21494489/what-does-numpy-random-seed0-do
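
A minimal sketch of the seeding idea, with arbitrary example values for `lam` and `size`. It also seeds Python's built-in `random` module, on the assumption that some of the random choices (for example the lambda values) might come from there; that module keeps its own state and is not affected by `np.random.seed`.

```python
import random

import numpy as np

np.random.seed(0)
first = np.random.poisson(lam=100, size=5)

np.random.seed(0)   # resetting the seed replays the same sequence
second = np.random.poisson(lam=100, size=5)

same = np.array_equal(first, second)

# Python's random module has a separate seed
random.seed(0)
lam_a = random.randrange(10, 1000)
random.seed(0)
lam_b = random.randrange(10, 1000)
```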

  • @kristinscott94
    @kristinscott94 5 years ago

    I am using a pd dataframe with gene expression data instead of creating a fake dataset. I am having a hell of a time generating a loading matrix with the index I need (gene_symbol), which is stored in a column of my original dataframe (df). The line of code I am referring to is: loading_score = pd.Series(pca.components_[0], index = genes). I tried setting the index like so: index = df.gene_symbol, index = df['gene_symbol'], index = df.loc['gene_symbol'], but no luck. I get errors that say DataFrame object has no attribute "gene_symbol". Anyone have an idea of how to get it to work?
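
One possible cause of that error is that `gene_symbol` is not an ordinary column at the point where the Series is built, for example because it became the index when the file was loaded. The sketch below uses a tiny made-up DataFrame and made-up loading scores (every name and number here is an assumption) to show the column case; if the names live in the index instead, `df.index` would be the thing to pass.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the commenter's data: genes as rows, with the
# gene names stored in an ordinary column called gene_symbol
df = pd.DataFrame({
    'gene_symbol': ['geneA', 'geneB', 'geneC'],
    'wt1': [10, 200, 35],
    'ko1': [90, 20, 40],
})

# Hypothetical loading scores for PC1, standing in for pca.components_[0]
pc1_loadings = np.array([0.7, -0.6, 0.1])

# Check that the column really exists before using it as the index
has_column = 'gene_symbol' in df.columns

# Use the column's values as labels, one per loading score
loading_scores = pd.Series(pc1_loadings, index=df['gene_symbol'].to_numpy())
top_genes = loading_scores.abs().sort_values(ascending=False).index.tolist()
```

The labels must be in the same order, and have the same length, as the variables that went into the PCA.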

  • @vizz2328
    @vizz2328 3 years ago

    what happens if we give the n_components=d, d being the no of dimensions? Does PCA denoise the data because there won't be any reduction in dimensions?

    • @statquest
      @statquest  3 years ago

      I don't think it will. However, there is still value because you can draw the scree plot and see how many PCs are really useful (it might only be a few, or it could be all of them).

  • @sevicore
    @sevicore 2 years ago

    If you end up using the PCA data, wouldn't it cause data leakage in our predictive model, since scaling should be done after the train/test split?

    • @statquest
      @statquest  2 years ago

      If you're using it for machine learning, presumably you can come up with a standard scaling and centering procedure.
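
One common way to set up such a procedure, sketched here on made-up data (the array `X`, the split size, and the number of components are assumptions), is to learn the centering, scaling, and PCs from the training rows only and then reuse them on the held-out rows.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(40, 5))   # hypothetical: 40 samples x 5 variables

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Learn the centering/scaling and the PCs from the training rows only
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# Reuse the training parameters on the held-out rows
train_pcs = pca.transform(scaler.transform(X_train))
test_pcs = pca.transform(scaler.transform(X_test))
```

Because the test rows never influence the means, standard deviations, or components, nothing leaks from the test set into the model inputs.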

  • @yareniyigun7236
    @yareniyigun7236 4 years ago

    Thanks for all the explanations, but I have something in mind. I'm trying to reduce the number of variables and I guess I should do that according to the loading scores. As the loading scores are super similar, how can I do that? Is it going to be a meaningful move?

    • @statquest
      @statquest  4 years ago +1

      If the loading scores are very similar, then removing variables might not work very well. You can also look at the scree plot to determine if PCA is doing a good job reducing dimensions in the first place.

  • @arufuredo
    @arufuredo 4 years ago

    Hi Josh, thanks a lot for this video, it hit the point! I was wondering how to apply PCA in Python and came to this marvelous video.
    I made my own Jupyter Notebook following your instructions and came neatly. One minor problem (I really dunno if it is): my data came all the way around at the last steps. My WildType cluster was at the right, while the KO one was on the left. I tried several times because I thought it was due to randomness, but it always had the same shape. Any ideas on this?
    In other news, I'm from Argentina (I speak Spanish), so I was wondering if my Notebook was of any use to your Spanish-speaking viewers. If so, I would gladly share it!
    Cheers from Argentina, you've got a new Follower :)

    • @statquest
      @statquest  4 years ago

      If the shape is the same, it's OK. The orientation is somewhat arbitrary. I'm sure your notebook would be helpful. You can share it on GitHub, or send it to me and I'll add it to this repository. You can contact me through my website: statquest.org/contact/
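
A small sketch of why the orientation is arbitrary, using made-up data and a plain SVD rather than the video's code (both are assumptions). Flipping the sign of a principal axis mirrors the plot along that axis but describes the data equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # hypothetical data: 10 samples x 3 variables
centered = X - X.mean(axis=0)

# PCA via SVD: the rows of vt are the principal axes
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T

# Flip the direction of PC1: the axis and every sample's PC1 coordinate change sign
vt_flipped = vt.copy()
vt_flipped[0] *= -1
scores_flipped = centered @ vt_flipped.T

# Both versions rebuild the centered data equally well
same_reconstruction = np.allclose(scores @ vt, scores_flipped @ vt_flipped)
mirrored = np.allclose(scores[:, 0], -scores_flipped[:, 0])
```

So a plot with the wt cluster on the right and the ko cluster on the left carries the same information as its mirror image.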

  • @azrahasan3796
    @azrahasan3796 3 years ago

    Hi, you are a lifesaver. I am trying to do PCA on my own data, but every demo video either uses existing databases or, like yours, creates its own data. I am missing some crucial steps, especially in defining the index when I am doing it with my data. Would it be too much to ask for a few more videos on machine learning where you use Excel sheet data from your laptop?

    • @azrahasan3796
      @azrahasan3796 3 years ago

      I am a newbie in data science and programming. I am a Molecular Biologist who would love to learn machine learning.

    • @statquest
      @statquest  3 years ago

      I'll keep that in mind for a future video.

  • @thatbackbenchdude
    @thatbackbenchdude 6 years ago

    How to embed python in web application for machine learning and data mining concepts

    • @Jonasmelonas
      @Jonasmelonas 5 years ago

      lmgtfy.com/?iie=1&q=How+to+embed+python+in+web+application+for+machine+learning+and+data+mining+concepts

  • @sagartomar4210
    @sagartomar4210 4 years ago +1

    you are awesome bro

  • @ayeshaali6462
    @ayeshaali6462 2 years ago

    loading_scores=pd.Series(pca.components_[0],index=genes)
    what should i write in place of genes

    • @statquest
      @statquest  2 years ago

      If you changed the index, as described at 3:17, you should probably use the same thing you changed it to.

  • @bedegong08
    @bedegong08 6 years ago

    At 3:31: what is the equivalent of unpacking (stars *) in Python 2.7?

    • @pfreire27
      @pfreire27 6 years ago

      Not sure if it still helps, but in Python 2.7 you can replace [*wt, *ko] with wt + ko, which concatenates the two lists.

  • @janebond3263
    @janebond3263 3 years ago

    It works! I did it! Finally. However, the dots in my plot don't have the wt or ko labels, so I couldn't analyze my data. Any suggestions about how I can fix it?

    • @statquest
      @statquest  3 years ago

      Congratulations! Unfortunately, debugging code in YouTube comments is awkward and hard to do. The best I can do is ask if you used my code or typed in your own. Try my code first and see if it works for you.

  • @marioaguilar8735
    @marioaguilar8735 3 years ago

    Hey, great video. Pretty clear explanation, I appreciate it.
    I have a question for anyone willing and able to answer:
    If the data is not transposed when scaling it at the beginning, why does the program throw an error later when creating the pca_df DataFrame? The error is the following:
    ValueError: Shape of passed values is (100, 10), indices imply (10, 10)
    Additionally, as a curiosity, when the data is not transposed, PC2 has almost as much explanatory power as PC1. Otherwise, only PC1 captures variation. Does anyone know the reason?
    Thanks

    • @statquest
      @statquest  3 years ago

      Why are you not transposing the data? However, if you swap the roles for the genes and samples, each gene has a different mean value, so, across the genes, we just have a bunch of random values and there is no correlation. Thus, none of the PCs will stand out and account for significantly more variation than the others.
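
The shape error in the question can be reproduced with a sketch like the one below. The counts are made up and the layout (100 genes as rows, 10 samples as columns) is taken from the error message; the variable names are assumptions. PCA returns one row per input row, so the ten sample names only fit when the samples are the rows.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
genes = ['gene' + str(i) for i in range(1, 101)]
samples = ['wt' + str(i) for i in range(1, 6)] + ['ko' + str(i) for i in range(1, 6)]

# Hypothetical counts: 100 genes (rows) x 10 samples (columns)
data = pd.DataFrame(rng.poisson(lam=100, size=(100, 10)), index=genes, columns=samples)
labels = ['PC' + str(i) for i in range(1, 11)]

# Transposed: each row is a sample, so the PCA output has one row per sample
with_transpose = PCA().fit_transform(preprocessing.scale(data.T))
pca_df = pd.DataFrame(with_transpose, index=samples, columns=labels)

# Not transposed: each row is a gene, so there are 100 rows
# and the 10 sample names cannot label them
without_transpose = PCA().fit_transform(preprocessing.scale(data))
```

Passing `without_transpose` together with `index=samples` is what triggers "Shape of passed values is (100, 10), indices imply (10, 10)".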

  • @maurolarrat
    @maurolarrat 5 years ago +1

    Thank you!

    • @statquest
      @statquest  5 years ago

      You're welcome! :)

  • @aranburu4716
    @aranburu4716 4 years ago

    Hello, your lessons are wonderful. Can you make videos on computer vision, deep learning, convolutional neural networks, etc., with the Python programming language?

    • @statquest
      @statquest  4 years ago +1

      I'm working on all of those.

  • @gbchrs
    @gbchrs 2 years ago

    is centering included in sklearn's pca model and that's why there is no extra step to center?

    • @statquest
      @statquest  2 years ago

      I believe so.

  • @3stepsahead704
    @3stepsahead704 2 years ago

    Very concise, I will surely be coming back to this video, however I would like to know why PCA is able to group these two categories (wt and ko), when it's shown they are generated from the same random method. If all indexes were generated at the same time, I would get it, but as they are generated index by index, I seem not to be able to grasp it.

    • @statquest
      @statquest  2 years ago

      The trick is at 3:48. For each group, wt and ko, we select a different parameter for the poisson distribution and generate 5 measurements from each of those two different distributions. One set is for wt and the other set is for ko.

    • @3stepsahead704
      @3stepsahead704 2 years ago

      @@statquest I think my confusion comes from the fact that these will make the two groups different from one another (all w's different from ko's), but I wouldn't predict them to be similar within the group (wt1 is close in vertical to wt2, and to wt3...,), thus I tend to believe PCA should tell them apart, but not in exactly two groups (wt's vs ko's), I would predict more like two clouds instead of two "vertical line of points" in the 2-D.

    • @statquest
      @statquest  2 years ago

      @@3stepsahead704 Remember how PCA actually works: it finds the axis that has the most variation (which is between WT and KO) and focuses on that. Then it finds the secondary differences (among the WT and KO). However, because the differences between WT and KO are big, the scale on the x-axis will be much bigger than the scale on the y-axis. Thus, the samples will appear to be in a vertical line rather than spaced apart like you might guess they should be. In short, check the scales of the axes; they will explain the difference between what you think you see and what you expect.

    • @3stepsahead704
      @3stepsahead704 2 years ago +1

      @@statquest Thank you very much for taking the time to explain this. I now get it!

  • @innocenceesstt1
    @innocenceesstt1 3 years ago

    Thank you very much for this tutorial. Please can you explain how to get correlation matrix

    • @statquest
      @statquest  3 years ago +1

      With numpy, you use corrcoef().

    • @innocenceesstt1
      @innocenceesstt1 3 years ago +1

      @@statquest Thank you very much
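
A tiny example of `corrcoef()` on made-up numbers (the `values` array is an assumption). The rows are constructed so that the expected correlations are obvious.

```python
import numpy as np

# Hypothetical measurements: 3 variables (rows) x 5 observations (columns)
values = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],   # exactly 2x the first row
    [5.0, 4.0, 3.0, 2.0, 1.0],    # the first row reversed
])

# By default each row is treated as a variable;
# pass rowvar=False if the variables are in columns instead
corr = np.corrcoef(values)
```

Here `corr[0, 1]` is 1 (perfect positive correlation) and `corr[0, 2]` is -1 (perfect negative correlation).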

  • @shreyjain6447
    @shreyjain6447 2 years ago

    What if I get 4 variables with maximum variation in the scree plot? How would I then plot a PCA plot?

    • @statquest
      @statquest  2 years ago

      You can draw multiple pca graphs (PC1 vs PC2, PC1 vs PC3 etc.)

  • @gmo2827
    @gmo2827 5 years ago

    Hope you can make R and Python tutorials.

  • @advaitshirvaikar4751
    @advaitshirvaikar4751 3 years ago

    Hey Josh, how do I find out which feature in the original dataset is to be removed(the one that least affects the variance im assuming)?
    I know we use PCA for the same, but I just can't understand how we select the unimportant feature from the original dataset using PCA.

    • @statquest
      @statquest  3 years ago +1

      You can set a threshold for the loading scores. All features with loading scores below that threshold can be discarded.

    • @advaitshirvaikar4751
      @advaitshirvaikar4751 3 years ago +1

      @@statquest okay, thanks a lot!
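
The thresholding idea might look like the sketch below. The loading scores and the cutoff of 0.1 are made up for illustration; in practice the scores would come from `pca.components_[0]` and the cutoff is a judgment call.

```python
import pandas as pd

# Hypothetical PC1 loading scores, standing in for
# pd.Series(pca.components_[0], index=genes)
loading_scores = pd.Series(
    [0.62, -0.55, 0.04, -0.02, 0.48, 0.01],
    index=['gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6'],
)

threshold = 0.1   # arbitrary cutoff chosen for this sketch

# Compare magnitudes: a large negative loading matters as much as a large positive one
magnitudes = loading_scores.abs()
kept = magnitudes[magnitudes >= threshold].sort_values(ascending=False).index.tolist()
dropped = magnitudes[magnitudes < threshold].index.tolist()
```

With these numbers, gene1, gene2, and gene5 are kept and the other three are dropped. This only considers PC1; features that matter for later PCs would need the same check on those components.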

  • @Cat_Sterling
    @Cat_Sterling 1 year ago

    Thank you!!! When we are speaking about variation in PCA, is that the same as variance?

    • @statquest
      @statquest  1 year ago

      Yep.

    • @Cat_Sterling
      @Cat_Sterling 1 year ago

      @@statquest Thank you very much for the clarification! I googled it, and seems that it's two different things, but sometimes they can be used interchangeably or be the same thing.

    • @statquest
      @statquest  1 year ago

      @@Cat_Sterling Yes, I guess it depends on how you want to use them and whether you divide by 'n' or 'n-1', but, at least on a conceptual level, they are the same.

    • @Cat_Sterling
      @Cat_Sterling 1 year ago +1

      @@statquest Thank you so much again! Really appreciate your reply! Your channel helped me so much!!!
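
The 'n' versus 'n-1' distinction is easy to see in NumPy. The numbers below are made up for illustration; the same difference in denominators is behind the note about scikit-learn's scaling (which divides by n) versus `scale()` in R (which divides by n-1).

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

var_n = np.var(x)                  # divides by n (NumPy's default, ddof=0)
var_n_minus_1 = np.var(x, ddof=1)  # divides by n - 1
```

Here the mean is 5 and the squared deviations add up to 32, so `var_n` is 32/8 = 4.0 and `var_n_minus_1` is 32/7. The two differ only by the factor n/(n-1), which does not change the PCA plot's shape or the percentages in the scree plot.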