Principal Component Analysis in Python | How to Apply PCA | Scree Plot, Biplot, Elbow & Kaisers Rule

  • Published on 14 Oct 2024
  • This video explains how to apply a Principal Component Analysis (PCA) in Python. More details: statisticsglob...
    The video is presented by Cansu Kebabci, a data scientist and statistician at Statistics Globe. Find more information about Cansu here: statisticsglob...
    In the video, Cansu explains the steps and application of a Principal Component Analysis in Python. Watch the video to learn more on this topic!
    Here you can find the previous videos in this series:
    Introduction to Principal Component Analysis (Pt. 1 - Theory): • Introduction to Princi...
    Principal Component Analysis in R Programming (Pt. 2 - PCA in R): • Principal Component An...
    Links to the tutorials mentioned in the video:
    PCA Using Correlation & Covariance Matrix (Examples): statisticsglob...
    Biplot for PCA Explained: statisticsglob...
    Python code of this video:
    Install libraries
    !pip install scikit-learn
    !pip install pandas
    !pip install matplotlib
    !pip install numpy
    Load Libraries & Modules
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    import matplotlib.pyplot as plt
    Load Breast Cancer Dataset
    breast_cancer = load_breast_cancer()
    Data Elements of breast_cancer
    breast_cancer.keys()
    breast_cancer.data.shape
    breast_cancer.feature_names
    Print Data in DataFrame Format
    DF = pd.DataFrame(data = breast_cancer.data[:, :10], # Create DataFrame DF
    columns = breast_cancer.feature_names[:10])
    DF.head(6) # Print first 6 rows of DF
    Standardize Data
    scaler = StandardScaler() # Create scaler
    data_scaled = scaler.fit_transform(DF) # Fit scaler & transform data
    print(data_scaled) # Print standardized data
    Print Standardized Data in DataFrame Format
    DF_scaled = pd.DataFrame(data = data_scaled, # Create DataFrame DF_scaled
    columns = breast_cancer.feature_names[:10])
    DF_scaled.head(6) # Print first 6 rows of DF_scaled
    Ideal Number of Components
    pca = PCA(n_components = 10) # Create PCA object forming 10 PCs
    pca_trans = pca.fit_transform(DF_scaled) # Transform data
    print(pca_trans) # Print transformed data
    print(pca_trans.shape) # Print dimensions of transformed data
    prop_var = pca.explained_variance_ratio_ # Extract proportion of explained variance
    print(prop_var) # Print proportion of explained variance
    PC_number = np.arange(pca.n_components_) + 1 # Enumerate component numbers
    print(PC_number) # Print component numbers
    Scree Plot
    plt.figure(figsize=(10, 6)) # Set figure and size
    plt.plot(PC_number, # Plot prop var
    prop_var,
    'ro-')
    plt.title('Scree Plot (Elbow Method)', # Plot Annotations
    fontsize = 15)
    plt.xlabel('Component Number',
    fontsize = 15)
    plt.ylabel('Proportion of Variance',
    fontsize = 15)
    plt.grid() # Add grid lines
    plt.show() # Print graph
    #Alternative Scree Plot Data
    var = pca.explained_variance_ # Extract explained variance
    print(var) # Print explained variance
    The remaining code is unfortunately too long for a YouTube description.
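    Since the rest of the video's code doesn't fit here, this is a rough sketch (not the video's actual code) of how Kaiser's rule could be applied to the same standardized data: keep only the components whose eigenvalue exceeds 1.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data[:, :10]           # first 10 features, as in the video
X_scaled = StandardScaler().fit_transform(X)    # standardize the data

pca = PCA(n_components=10).fit(X_scaled)
eigenvalues = pca.explained_variance_           # one eigenvalue per component
n_keep = int(np.sum(eigenvalues > 1))           # Kaiser's rule: eigenvalue > 1
print("Components retained by Kaiser's rule:", n_keep)
```

    For standardized data the eigenvalues sum to (approximately) the number of variables, so an eigenvalue above 1 means the component explains more variance than a single original variable.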
    Follow me on Social Media:
    Facebook - Statistics Globe Page: / statisticsglobecom
    Facebook - R Programming Group for Discussions & Questions: / statisticsglobe
    Facebook - Python Programming Group for Discussions & Questions: / statisticsglobepython
    LinkedIn - Statistics Globe Page: / statisticsglobe
    LinkedIn - R Programming Group for Discussions & Questions: / 12555223
    LinkedIn - Python Programming Group for Discussions & Questions: / 12673534
    Twitter: / joachimschork
    Instagram: / statisticsglobecom
    TikTok: / statisticsglobe

Comments • 24

  • @albertoavendano7196
    @albertoavendano7196 10 months ago +2

    Let me tell you something: this might be the clearest PCA video of all... Simple, clear, and the concepts are covered in the other videos... Which is great!!! Thank you a lot... Just one question: is PCA useful for supervised learning as well, or do we use RFECV for that? Those are the two methods for feature reduction I have used...

    • @cansustatisticsglobe
      @cansustatisticsglobe 10 months ago

      Hello Alberto,
      I'm deeply honored by your feedback. It's wonderful to know that diligent effort pays off. Regarding your question, Principal Component Analysis (PCA) and Recursive Feature Elimination with Cross-Validation (RFECV) are both techniques used for feature reduction, but they are useful in different contexts within the realm of supervised learning.
      PCA can be used in supervised learning but with caution. It's great for reducing the number of features and, hence, computation time. However, the principal components are linear combinations of the original features and might not have a straightforward interpretation. The main limitation in the context of supervised learning is that PCA does not consider the response variable. It only focuses on explaining the variance in the predictors, which might not always align with the predictive power with respect to the response variable.
      In summary, PCA can be used in supervised learning but would come with the aforementioned limitations.
      Best,
      Cansu
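
      To illustrate the point above, here is a hypothetical sketch (not from the video) of PCA used as a preprocessing step in a supervised pipeline. Note that the PCA step never sees the response y: the components maximize variance in the predictors, not predictive power.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale, project onto 5 PCs, then classify; the choice of 5 components
# is an illustrative assumption, not a recommendation.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=5),
                      LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy:", round(scores.mean(), 3))
```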

  • @darrylmorgan
    @darrylmorgan 1 year ago +1

    Thank you Cansu and Joachim for the awesome PCA in Python tutorial :)

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello Darryl,
      We are glad to hear that you liked the video.
      Have a good one!
      Cansu

  • @thaynalg
    @thaynalg 1 year ago +2

    That was so helpful, great job. I can't thank you enough.

    • @matthias.statisticsglobe
      @matthias.statisticsglobe 1 year ago

      Thank you very much for the nice words and your support. It's great to hear that the video has been helpful for you!

  • @TeunXt5
    @TeunXt5 6 months ago +1

    What program do you use to run Python? I am using VS Code, but this looks more convenient.

    • @StatisticsGlobe
      @StatisticsGlobe 6 months ago

      Hey, we are using Jupyter Notebook in this video. I think it's great! :)

  • @TheSerbes
    @TheSerbes 18 days ago +1

    Hi, I will build hourly time series models with an LSTM. I need to select features beforehand. How can I do that? Can I learn which parameter has the most effect with PCA?

    • @StatisticsGlobe
      @StatisticsGlobe 18 days ago

      Hey, to select features for your LSTM model, you can use techniques like feature importance from models (e.g., Random Forests), correlation analysis, or recursive feature elimination. PCA can help reduce dimensionality but won't directly tell you which features have the most effect.
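
      As a minimal sketch of the feature-importance approach mentioned above (using a Random Forest on the same breast cancer data; the choice of model and parameters is illustrative, not from the video):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(data.data, data.target)

# Importances are normalized to sum to 1; higher means more influential.
importances = pd.Series(rf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head())  # top 5 features
```

      Unlike PCA, this ranking refers to the original features, so it directly answers "which feature matters most".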

    • @TheSerbes
      @TheSerbes 18 days ago

      @@StatisticsGlobe OK, can tsfresh or SHAP be used for that?

    • @TheSerbes
      @TheSerbes 18 days ago

      @@StatisticsGlobe I will use feature importance, but a random forest cannot see the previous data. Do I need to create lag values or a window for this?

  • @popuriann8983
    @popuriann8983 6 months ago +1

    Hello, thank you for the very comprehensive tutorial. However, I have some questions: in my case I have 3 PCs, so how do I analyze them in a biplot? And secondly, I am a bit confused about the loading scores; if one shows a negative value, what does that mean?

    • @StatisticsGlobe
      @StatisticsGlobe 6 months ago +1

      Hey, thanks for the kind words, glad the tutorial was helpful! You may have a look at 3D plots when you want to analyze 3 components: statisticsglobe.com/3d-plot-pca-python A negative loading score in PCA means that the original variable inversely correlates with the principal component. I hope this helps! Joachim
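
      To make the loading signs concrete, here is a small sketch (not from the video; "loadings" here are taken simply as the eigenvector weights in `pca.components_`): a negative entry means the variable and the component score move in opposite directions.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data[:, :10])
pca = PCA(n_components=2).fit(X_scaled)

# Rows are the original variables, columns the components.
loadings = pd.DataFrame(pca.components_.T,
                        index=data.feature_names[:10],
                        columns=['PC1', 'PC2'])
print(loadings)
```

      Note that the overall sign of each component is arbitrary: flipping all loadings of a PC (and its scores) describes the same solution.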

  • @mjacfardk
    @mjacfardk 1 year ago

    🙏 Thanks for your great tutorial.

  • @AnkitKumar-i1f8b
    @AnkitKumar-i1f8b 1 year ago +1

    Hello, thanks. Suppose I have data with two replications for several genotypes: the first column lists the genotype names, the second alternates between R1 and R2, and the following columns hold the variable data. How can I get the mean values of R1 and R2 together for all variables? Thanks in advance!

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello,
      You are welcome. I couldn't get your data setting. Could you please be more specific?
      Best,
      Cansu

    • @AnkitKumar-i1f8b
      @AnkitKumar-i1f8b 1 year ago +1

      @@cansustatisticsglobe Thanks a lot for your reply. Can I send you my file by email? I will describe it explicitly there.

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello @@AnkitKumar-i1f8b !
      Please try to explain it here so that the other visitors struggling with similar issues can benefit from the solution. I will do my best to help in the comments.
      Best,
      Cansu

  • @planq521
    @planq521 6 months ago +1

    What if we use more than two components? How can we plot them in a graph?

    • @StatisticsGlobe
      @StatisticsGlobe 6 months ago

      Hey, this really depends on what you want to analyze. However, when using more than two components in PCA, visualization typically involves plotting the first three principal components in 3D scatter plots. For more components, parallel coordinates plots or heatmaps can effectively represent higher-dimensional data.
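
      A minimal sketch of the 3D option mentioned above, using matplotlib's built-in 3D projection on the same data (illustrative, not the video's code):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data[:, :10])
scores = PCA(n_components=3).fit_transform(X_scaled)  # first 3 PC scores

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.scatter(scores[:, 0], scores[:, 1], scores[:, 2],
           c=data.target, s=10)  # color points by diagnosis
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()
```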

  • @northernswedenstories1028
    @northernswedenstories1028 1 year ago

    PCA and MCA are damn cool.