Principal Component Analysis in Python | How to Apply PCA | Scree Plot, Biplot, Elbow & Kaisers Rule

  • Published on 14 Oct 2024
  • This video explains how to apply a Principal Component Analysis (PCA) in Python. More details: statisticsglob...
    The video is presented by Cansu Kebabci, a data scientist and statistician at Statistics Globe. Find more information about Cansu here: statisticsglob...
    In the video, Cansu explains the steps and application of a Principal Component Analysis in Python. Watch the video to learn more on this topic!
    Here you can find the previous videos in this series:
    Introduction to Principal Component Analysis (Pt. 1 - Theory): • Introduction to Princi...
    Principal Component Analysis in R Programming (Pt. 2 - PCA in R): • Principal Component An...
    Links to the tutorials mentioned in the video:
    PCA Using Correlation & Covariance Matrix (Examples): statisticsglob...
    Biplot for PCA Explained: statisticsglob...
    Python code of this video:
    Install libraries
    !pip install scikit-learn
    !pip install pandas
    !pip install matplotlib
    !pip install numpy
    Load Libraries & Modules
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    import matplotlib.pyplot as plt
    Load Breast Cancer Dataset
    breast_cancer = load_breast_cancer()
    Data Elements of breast_cancer
    breast_cancer.keys()
    breast_cancer.data.shape
    breast_cancer.feature_names
    Print Data in DataFrame Format
    DF = pd.DataFrame(data = breast_cancer.data[:, :10], # Create DataFrame DF
    columns = breast_cancer.feature_names[:10])
    DF.head(6) # Print first 6 rows of DF
    Standardize Data
    scaler = StandardScaler() # Create scaler
    data_scaled = scaler.fit_transform(DF) # Fit scaler & transform data
    print(data_scaled) # Print standardized data
    Print Standardized Data in DataFrame Format
    DF_scaled = pd.DataFrame(data = data_scaled, # Create DataFrame DF_scaled
    columns = breast_cancer.feature_names[:10])
    DF_scaled.head(6) # Print first 6 rows of DF_scaled
    Ideal Number of Components
    pca = PCA(n_components = 10) # Create PCA object forming 10 PCs
    pca_trans = pca.fit_transform(DF_scaled) # Transform data
    print(pca_trans) # Print transformed data
    print(pca_trans.shape) # Print dimensions of transformed data
    prop_var = pca.explained_variance_ratio_ # Extract proportion of explained variance
    print(prop_var) # Print proportion of explained variance
    PC_number = np.arange(pca.n_components_) + 1 # Enumerate component numbers
    print(PC_number) # Print component numbers
    Scree Plot
    plt.figure(figsize=(10, 6)) # Set figure and size
    plt.plot(PC_number, # Plot prop var
    prop_var,
    'ro-')
    plt.title('Scree Plot (Elbow Method)', # Plot Annotations
    fontsize = 15)
    plt.xlabel('Component Number',
    fontsize = 15)
    plt.ylabel('Proportion of Variance',
    fontsize = 15)
    plt.grid() # Add grid lines
    plt.show() # Print graph
    #Alternative Scree Plot Data
    var = pca.explained_variance_ # Extract explained variance
    print(var) # Print explained variance
    The remaining code is unfortunately too long for a YouTube description.
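    Since the rest of the video's code doesn't fit here, this is a rough sketch (not the video's actual code) of how Kaiser's rule could be applied to the same standardized data: keep only the components whose eigenvalue exceeds 1.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data[:, :10]           # first 10 features, as in the video
X_scaled = StandardScaler().fit_transform(X)    # standardize the data

pca = PCA(n_components=10).fit(X_scaled)
eigenvalues = pca.explained_variance_           # one eigenvalue per component
n_keep = int(np.sum(eigenvalues > 1))           # Kaiser's rule: eigenvalue > 1
print("Components retained by Kaiser's rule:", n_keep)
```

    For standardized data the eigenvalues sum to (approximately) the number of variables, so an eigenvalue above 1 means the component explains more variance than a single original variable.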
    Follow me on Social Media:
    Facebook - Statistics Globe Page: / statisticsglobecom
    Facebook - R Programming Group for Discussions & Questions: / statisticsglobe
    Facebook - Python Programming Group for Discussions & Questions: / statisticsglobepython
    LinkedIn - Statistics Globe Page: / statisticsglobe
    LinkedIn - R Programming Group for Discussions & Questions: / 12555223
    LinkedIn - Python Programming Group for Discussions & Questions: / 12673534
    Twitter: / joachimschork
    Instagram: / statisticsglobecom
    TikTok: / statisticsglobe

Comments • 24

  • @albertoavendano7196
    @albertoavendano7196 10 months ago +2

    Let me tell you something: this might be the clearest PCA video of all... Simple, clear, and the concepts are covered in the other videos... Which is great!!! Thank you a lot... Just one question: is PCA useful for supervised learning as well, or do we use RFECV for that? Those are the two methods for feature reduction I have used...

    • @cansustatisticsglobe
      @cansustatisticsglobe 10 months ago

      Hello Alberto,
      I'm deeply honored by your feedback. It's wonderful to know that diligent effort pays off. Regarding your question, Principal Component Analysis (PCA) and Recursive Feature Elimination with Cross-Validation (RFECV) are both techniques used for feature reduction, but they are useful in different contexts within the realm of supervised learning.
      PCA can be used in supervised learning but with caution. It's great for reducing the number of features and, hence, computation time. However, the principal components are linear combinations of the original features and might not have a straightforward interpretation. The main limitation in the context of supervised learning is that PCA does not consider the response variable. It only focuses on explaining the variance in the predictors, which might not always align with the predictive power with respect to the response variable.
      In summary, PCA can be used in supervised learning but would come with the aforementioned limitations.
      Best,
      Cansu
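
      To illustrate the point above, here is a hypothetical sketch (not from the video) of PCA used as a preprocessing step in a supervised pipeline. Note that the PCA step never sees the response y: the components maximize variance in the predictors, not predictive power.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale, project onto 5 PCs, then classify; the choice of 5 components
# is an illustrative assumption, not a recommendation.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=5),
                      LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy:", round(scores.mean(), 3))
```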

  • @darrylmorgan
    @darrylmorgan 1 year ago +1

    Thank you Cansu and Joachim for the awesome PCA in Python tutorial :)

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello Darryl,
      We are glad to hear that you liked the video.
      Have a good one!
      Cansu

  • @thaynalg
    @thaynalg 1 year ago +2

    That was so helpful, great job. I can't thank you enough.

    • @matthias.statisticsglobe
      @matthias.statisticsglobe 1 year ago

      Thank you very much for the nice words and your support. It's great to hear that the video has been helpful for you!

  • @TeunXt5
    @TeunXt5 6 months ago +1

    What program do you use to run Python? I am using VS Code, but this looks more convenient.

    • @StatisticsGlobe
      @StatisticsGlobe 6 months ago

      Hey, we are using Jupyter Notebook in this video. I think it's great! :)

  • @TheSerbes
    @TheSerbes 18 days ago +1

    Hi, I will build hourly time series models with an LSTM. I need to select features beforehand. How can I do that? Can I learn which parameter has the most effect with PCA?

    • @StatisticsGlobe
      @StatisticsGlobe 18 days ago

      Hey, to select features for your LSTM model, you can use techniques like feature importance from models (e.g., Random Forests), correlation analysis, or recursive feature elimination. PCA can help reduce dimensionality but won't directly tell you which features have the most effect.
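
      As a minimal sketch of the feature-importance approach mentioned above (using a Random Forest on the same breast cancer data; the choice of model and parameters is illustrative, not from the video):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(data.data, data.target)

# Importances are normalized to sum to 1; higher means more influential.
importances = pd.Series(rf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head())  # top 5 features
```

      Unlike PCA, this ranking refers to the original features, so it directly answers "which feature matters most".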

    • @TheSerbes
      @TheSerbes 18 days ago

      @@StatisticsGlobe OK, can tsfresh or SHAP be used for that?

    • @TheSerbes
      @TheSerbes 18 days ago

      @@StatisticsGlobe I will use feature importance, but a random forest cannot see the previous data. Do I need to create lag values or a window for this?

  • @popuriann8983
    @popuriann8983 6 months ago +1

    Hello, thank you for the very comprehensive tutorial. However, I have some questions: in my case I have 3 PCs, so how do I analyze them in a biplot? And secondly, I am a bit confused about the loading scores; if one shows a negative value, what does that mean?

    • @StatisticsGlobe
      @StatisticsGlobe 6 months ago +1

      Hey, thanks for the kind words, glad the tutorial was helpful! You may have a look at 3D plots when you want to analyze 3 components: statisticsglobe.com/3d-plot-pca-python A negative loading score in PCA means that the original variable inversely correlates with the principal component. I hope this helps! Joachim
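
      To make the loading signs concrete, here is a small sketch (not from the video; "loadings" here are taken simply as the eigenvector weights in `pca.components_`): a negative entry means the variable and the component score move in opposite directions.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data[:, :10])
pca = PCA(n_components=2).fit(X_scaled)

# Rows are the original variables, columns the components.
loadings = pd.DataFrame(pca.components_.T,
                        index=data.feature_names[:10],
                        columns=['PC1', 'PC2'])
print(loadings)
```

      Note that the overall sign of each component is arbitrary: flipping all loadings of a PC (and its scores) describes the same solution.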

  • @mjacfardk
    @mjacfardk 1 year ago

    🙏 Thanks for your great tutorial.

  • @AnkitKumar-i1f8b
    @AnkitKumar-i1f8b 1 year ago +1

    Hello, thanks. Suppose I have data with two replications for several genotypes: the first column lists the genotype names, the second alternates between R1 and R2, and the following columns hold the variable data. How can I get the mean values of R1 and R2 together for all variables? Thanks in advance!

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello,
      You are welcome. I couldn't get your data setting. Could you please be more specific?
      Best,
      Cansu

    • @AnkitKumar-i1f8b
      @AnkitKumar-i1f8b 1 year ago +1

      @@cansustatisticsglobe Thanks a lot for your reply. Can I send you my file by email? I will describe it explicitly there.

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello @@AnkitKumar-i1f8b !
      Please try to explain it here so that the other visitors struggling with similar issues can benefit from the solution. I will do my best to help in the comments.
      Best,
      Cansu

  • @planq521
    @planq521 6 months ago +1

    What if we use more than two components? How can we plot them in a graph?

    • @StatisticsGlobe
      @StatisticsGlobe 6 months ago

      Hey, this really depends on what you want to analyze. However, when using more than two components in PCA, visualization typically involves plotting the first three principal components in 3D scatter plots. For more components, parallel coordinates plots or heatmaps can effectively represent higher-dimensional data.
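
      A minimal sketch of the 3D option mentioned above, using matplotlib's built-in 3D projection on the same data (illustrative, not the video's code):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data[:, :10])
scores = PCA(n_components=3).fit_transform(X_scaled)  # first 3 PC scores

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.scatter(scores[:, 0], scores[:, 1], scores[:, 2],
           c=data.target, s=10)  # color points by diagnosis
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()
```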

  • @northernswedenstories1028
    @northernswedenstories1028 1 year ago

    PCA and MCA are damn cool.