Anyone interested in a tutorial video on Bioinformatics (using Python)? Comment down below 👇
If you find value in this video, please give it a Like 👍 and Subscribe ❤️ for more videos on Data Science. 😃
I do 😁
@@marcofestu It's coming up 😃
Hey, I love your videos! They made me want to start my own YouTube channel explaining (popularizing) the most used terms in artificial intelligence for everyone not in the field, so people know more about this "black box" of our domain.
I use some simple animations to help describe those terms in the simplest and most concise way possible. I also use an artificial intelligence voice, since my own accent and voice wouldn't work well. (And I loved the idea of an AI teaching AI, haha!)
That said, I'd love to collaborate with you on a video: explaining a term you use in one of your videos, say as a little 30-second to 1-minute segment from me inserted into your video, or anything you'd like to do!
If you are interested or want to know more, feel free to check out some of my videos on my account or contact me! (Sorry for reaching out via comment; I couldn't find any email to message you directly.)
Yes, I am. Working on protein-drug interactions using deep learning and machine learning.
Yes! I don't mind if you do a full tutorial of a bioinformatics project, like an ML application on real NGS data. I am also interested in learning scRNA-seq and ChIP-seq from you. Sorry, I asked for too much, lol. Love your videos, keep it up!
I'm taking an online class from a brick-and-mortar school. This was part of this week's "lecture".
I have to say, this all seems thoughtful and very well presented.
If I were part of this program and knew what you were talking about, I bet it would be great 👍
Thank you for this video, I've been struggling with understanding PCA for a good minute, but your video explained it extremely well! Please keep posting more like this.
You're very welcome! Glad it helped :)
Thank you Data Professor, excellent intro! Subscribed, at once!
It is important to analyze the correlation matrix to identify highly intercorrelated variables, and then the loadings, in order to interpret each component semantically: the meaning of PC1, PC2 and PC3.
Welcome aboard!
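A minimal sketch of the workflow described in the comment above, assuming a numeric pandas DataFrame named df (here built from the iris data as a stand-in for your own dataset): inspect the correlation matrix first, then fit PCA and read the loadings to interpret each component.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Example data; replace with your own DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# 1. Correlation matrix: spot highly intercorrelated variables
print(df.corr().round(2))

# 2. Fit PCA on standardized data and inspect the loadings
X_scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=3).fit(X_scaled)
loadings = pd.DataFrame(pca.components_.T,          # rows = original variables
                        index=df.columns,
                        columns=['PC1', 'PC2', 'PC3'])
print(loadings.round(2))   # a large |loading| means that variable drives that PC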
Hi Professor, thank you so much for such an educational tutorial. I have a dataset with shape (99, 25); how can I use PCA to select the 10 or 8 best features that explain at least 90% of the variance of the dataset? In summary, how do I use PCA for feature selection without transforming the features into principal components (i.e., PC1, PC2, ...)? I just need the features for further classification.
Thanks, one question though - how come most tutorials run the main sklearn PCA function before determining how many components to use in the PCA itself? Wouldn't it be better to determine the variance ratio between the components BEFORE choosing how many components to use (e.g. 1, 2 or 3)? Isn't it a potential waste of time to start with a PCA, then find out in the scree plot afterwards that you should (or could) have used more/fewer components? I think I'm missing something.
Insane. I love it. Please continue with these videos.
Glad it's helpful!
This would've been more helpful if you explained how to determine the number of components. Because it seems like you just assumed it would be 3 because you knew there were three target labels (the three different species). If you didn't already have output/target labels and this was TRULY an unsupervised approach, it would have been useful to see how you arrive at 3 components.
Hello Professor, I really want to appreciate the beautiful work you are doing on your channel. I have watched some of your videos, and I will say the simplicity with which you deliver your lectures blows me away. I am trying to do a PCA on some data stored in an npy file, but I have had no luck because I don't know how to go about using an npy file for PCA. I would appreciate your guidance. Thank you.
Hi, thanks for the kind words, and I am glad that you're finding this channel helpful. On to your question: since npy files are generated by NumPy, after reading the file in, you can try converting it to a pandas DataFrame; from there you can follow the steps mentioned in this video. Hope this helps.
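A minimal sketch of that conversion, assuming a 2-D array stored in data.npy (a hypothetical file name) with samples as rows and features as columns:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

arr = np.load('data.npy')          # shape: (n_samples, n_features)
df = pd.DataFrame(arr)             # optionally pass columns=[...] to name features

pca = PCA(n_components=3)
scores = pca.fit_transform(df)     # rows = samples, columns = PC1..PC3
print(pca.explained_variance_ratio_)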
Very informative! Thanks for this important lesson 🙂 keep it up!
Thanks for the kind words of inspiration! 😃
This video helped me a lot! Thanks!
Interesting! I was just following an online course about PCA😄
Thanks for your comment! Glad you found us. 😃
Which course are you taking, Marcy?
I'd just like to remind people using PCA to consider centering their data. Also consider both the variance and covariance PCA.
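To illustrate the centering/scaling choice raised above, a small sketch using the iris data as a stand-in: scikit-learn's PCA centers the data internally (effectively a covariance-matrix PCA), while standardizing the features first gives a correlation-matrix PCA.

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data

# Covariance PCA: centering only (PCA centers the data itself)
cov_pca = PCA(n_components=2).fit(X)

# Correlation PCA: center and scale each feature to unit variance first
X_std = StandardScaler().fit_transform(X)
corr_pca = PCA(n_components=2).fit(X_std)

print(cov_pca.explained_variance_ratio_)   # dominated by large-scale features
print(corr_pca.explained_variance_ratio_)  # all features weighted equally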
Hi sir, thank you very much for the video! I have an experimental dataset of time-dependent signals: 800 (time) x 49 (signals) voltage values. I used PCA and reduced it to 800 x 2. How can I reduce it further and extract information from this set for an ML application, and is there any other feature extraction method that you can advise for signal feature extraction?
As you said, your dataset of 800 x 49 is 800 rows x 49 columns, so using PCA the columns reduce to 2, which are the 2 PCs. One important thing to consider is to calculate the cumulative variance of these 2 PCs (the sum of the explained variances for PC1 and PC2); if it is too low, then you can consider using more PCs, such that the cumulative variance is within an acceptable range (hopefully 70%).
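A quick, self-contained way to check that cumulative variance (the 800 x 49 matrix below is random placeholder data standing in for the real signals):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(800, 49)        # placeholder for the 800 x 49 signal matrix

pca = PCA().fit(X)                 # fit all components first
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.70)) + 1   # smallest k reaching 70%
print(cumulative[:5])
print(n_keep, 'PCs reach 70% cumulative variance')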
My data is quite similar to yours! I'm looking to reduce dimensionality as well, haha.
Wow I loved those plots 😍
Yes, they look great and are very informative too! 😃
Thanks for the video. I have a question: I have >100K sensors, and each sensor has between 2000 and 4000 values. My DataFrame is in long-table format with columns ['sensorNr', 'values']. How can I classify/cluster my data series into two categories?
Dear Prof,
Which variables do PC1, PC2, and PC3 represent? Certain variables, or combinations of variables? It is very important to explain this when we have a 3-D graph for non-technical users/audiences. Thanks in advance.
thank you so much for the tutorial
Hi Professor, where is the input file, the iris dataset? Is it a CSV file somewhere on your GitHub? Peter
Hi, as mentioned in the provided Jupyter notebook, the iris dataset used herein was from the sklearn library as follows:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target
@@DataProfessor Ah OK, thanks Professor. I'm quite new to Python, but can I ask: was this an example dataset already within the sklearn library, or was it something you actually imported into that library?
Regards
Peter
@@petelok9969 Hi, no worries, with some practice you'll be on your way to mastering this. The one used in this tutorial is an example dataset from the sklearn library. That said, there are various ways to get data in; for instance, you can use pandas to read in a CSV file both locally and remotely from the cloud.
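A tiny sketch of both options (the file path and URL below are placeholders, not actual project files):

import pandas as pd

df_local = pd.read_csv('iris.csv')                        # local CSV file
df_remote = pd.read_csv('https://example.com/iris.csv')   # remote CSV via URL
print(df_local.head())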
Excellent, thank you for sharing
My pleasure!
I don't know why, but although all the code executes successfully on my dataset, the final scree plot is not appearing; only a blank output is shown.
What could the reason be?
Incredible tutorial, congratulations! The 3D visualizations are awesome and help a lot in understanding the relationship between loadings and attributes. I was wondering whether it is possible to access the scoring coefficient matrix that is used internally by transform(X). Do you know how I can achieve that?
hi, why not use the PCA function directly?
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(X)
That works too 😃
Yes, and why do this function and the other approach give different outcomes, then?
How do you get it to print the feature names and target names? When I do the same, using the same dataset, I get an error saying: 'DataFrame' object has no attribute 'feature_names'. The same goes for the step "Assigning Input (X) and Output (Y)": it simply tells me that the DataFrame has no attribute called "data" or "target".
That works if it's the example dataset from scikit-learn. For any other dataset read from a CSV file, the feature names should automatically be read from the first line (header row) of the CSV file.
@@DataProfessor Ahh I see. I used the CSV-file. Thanks a lot!
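For a CSV-based dataset, a sketch of the equivalent steps might look like this (the file name and the 'species' label column are hypothetical):

import pandas as pd

df = pd.read_csv('iris.csv')               # header row supplies the feature names
feature_names = df.columns[:-1].tolist()   # all columns except the label column
X = df[feature_names]                      # input features
Y = df['species']                          # output/target column (hypothetical name)
print(feature_names, Y.unique())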
Great video! I love your tutorial; it helps me a lot in understanding PCA. Btw, can we eliminate PC3, with only 3% of the variance, and run the model again with n_components=2? That 3% of variance doesn't seem very critical to the model.
Great video! Thanks a lot. I have a question: what should I do in order to classify data series (100k curves, with only two columns ['sensorid', 'values']; each sensor has 2000-4000 point values)? The idea is to cluster/classify each sensor curve into two groups!
Can we also make it plot 2D scatter plots of different combinations instead of 3D, with a for loop in Plotly, or do we need to use matplotlib for that? Also, what could be the uses of PCA in bioinformatics?
Yes, we can definitely iterate through the features and display different combinations of 2D scatter plots. There are many uses of PCA; it's really a useful algorithm. Here are some:
PCA can generally be used to provide a visualization of the distribution of the data samples. It also allows quick visualization and comparative analysis between two datasets, to see whether they share a similar distribution. In bioinformatics and cheminformatics, the PCA scores plot has been used for segmenting datasets. Additionally, PC coefficients can be used as a compressed form of all the variables in the dataset. Furthermore, analysis of the PCA loadings plot may allow us to gain insights into feature importance.
@@DataProfessor Thanks for the answer, but how do we trust PCA and other unsupervised techniques, since when we change some parameters during PCA and other unsupervised techniques (like k-means, etc.) the score plots can change slightly? So do we also need to perform cross-validation like we do with supervised techniques? Also, as you might know, orthogonalization or normalization applied to the raw data can change the results on the score and loadings plots of the PCA, so how do we decide which one is correct then?
Good Explanation. Thank you for sharing.
Thanks a lot, Professor.
Dear Professor, can I ask would you consider these interpretations as more or less correct:
The scores plot shows to what extent each PC captures the behaviour of each sample, whereas the loadings plot is an indicator of how much each variable weights each PC.
Also, is it possible to derive the loadings from the score values?
Btw thanks very much, I'm getting so much out of your tutorials.
Peter
Hello! You demonstrated brilliantly everything there is to do with simple datasets where each data sample's features are single numbers. But can this be implemented for a 3-dimensional dataset as well? More specifically, my dataset has shape (2000, 64, 400): 2000 data samples, each with 64 streams of 400 values, and I'm looking to reduce these 64 streams to some lower number...
Great stuff, Data Professor. Would this work for 200 features? Thanks!
Yes, and more as well.
How do you use this method in spectral data analysis? For example, there are 4 different samples and they have values at various wavelengths. How do I reduce the wavelength features?
The spectral data could be used as input to PCA. The resulting scores plot would allow you to investigate the similarities or differences among the samples, while the loadings plot would allow you to investigate the relative importance of the features to the similarities/differences shown by the scores plot.
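A sketch of that idea for spectra, using a random matrix as a stand-in for real measurements (the 4 samples x 200 wavelengths below are hypothetical):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

wavelengths = np.linspace(400, 800, 200)       # hypothetical wavelength grid (nm)
spectra = np.random.rand(4, 200)               # placeholder for 4 measured spectra

pca = PCA(n_components=2)
scores = pca.fit_transform(spectra)            # sample similarities/differences
loadings = pd.DataFrame(pca.components_.T, index=wavelengths,
                        columns=['PC1', 'PC2'])

# Wavelengths with the largest |loading| drive the separation seen in the scores
print(loadings.abs().sort_values('PC1', ascending=False).head())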
How do you transform your own dataset into sklearn's format?
Hi Nathan, let's say that our dataset is in a CSV file.
Here is what we can do:
1. Read CSV file using pandas pd.read_csv() and assign it to a variable (e.g. df)
2. Using the dataset assigned to df (in step 1), apply the following code to split the dataset into training and test sets:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
3. That's all; the resulting DataFrame can be used by scikit-learn to build ML models (a runnable sketch follows below)
Hope this helps 😃
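A runnable version of those steps, with a hypothetical file name and target column:

import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Read the CSV file into a DataFrame
df = pd.read_csv('data.csv')               # placeholder file name

# 2. Split into features (X) and target (Y), then into train/test sets
X = df.drop(columns=['target'])            # 'target' is a placeholder column name
Y = df['target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# 3. X_train/Y_train can now be fed to any scikit-learn estimator
print(X_train.shape, X_test.shape)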
I'm very much an amateur with Python and I always struggle with the first part of these tutorials. I have an Excel file that I want to structure in the same way as the dataset you're importing into Python, so that the code works. How do I view the dataset you are importing, to compare my file against it?
So, by default, the first descriptor always includes the most variance, regardless of what that variable contains? That seems pretty flawed if that's the case. I'm dealing with a dataset where the first variable is zip code, which has little if anything to do with the target variable. It seems like you could sort your data however you want and the very first variable will always be the "most important."
Hi, by the first descriptor I presume you're referring to the first principal component (PC1), and yes, it contains the most variance, with subsequent PCs containing less and less. As for the contents of each individual PC, no, it does not represent the first descriptor in your dataset. Rather, it is a fusion of all descriptors in your dataset, where the signal from all of these descriptors is concentrated in the first few PCs, whereas the noise occupies the latter PCs.
@@DataProfessor Thank you for the clarification and for taking the time to respond. I think I had a misunderstanding of what the outputs of running a PCA really represent. I thought that since I had 50 variables, that was why I'd have 50 principal components. Additionally, I thought the purpose of running a PCA was to trim down a dataset - so if you had 50 variables, you'd realize you only "need", let's say, 5 to represent the same amount of variance, AND that those 5 correspond to variables in your dataset. Since this does not seem to be the case, I am at a loss for understanding how PCA saves you any time, since you have to run your PCA with your full dataset each time, given there is nothing telling you how meaningful each variable is. I'll have to do some more research. Thank you again.
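To make the distinction above concrete: each PC is a weighted combination of all original variables, and the weights (loadings) are stored in the components_ matrix. A minimal sketch with the iris data standing in for a real dataset:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA().fit(iris.data)

weights = pd.DataFrame(pca.components_,
                       columns=iris.feature_names,
                       index=[f'PC{i+1}' for i in range(pca.n_components_)])
print(weights.round(2))                        # every PC mixes all four variables
print(pca.explained_variance_ratio_.round(3))  # variance drops off PC by PC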
I took n_components=10, but their cumulative sum is still only 58%. What does that say about my data, and what should I do next? (My data contains 31000 rows and 280 features, and the target column has two categories.)
Thanks!!
This indicates that the first 10 PC components account for 58% of the data’s variance.
Hi sir, I'm trying to do PCA in a different programming language. If the loading values and scores have the same magnitudes but different signs (positive or negative), does it matter?
The values are the relative magnitudes of likeness (similarities and dissimilarities). I would think the signs are indicative of the direction and spatial location of the data samples (in the scores plot) and variables (in the loadings plot).
@@DataProfessor thank you very much
How could we know at the start that the number of components needs to be only 3?
pca = PCA(n_components=3)
Great question. We generally look at the cumulative variance: if it accounts for more than 70% of the total variance, then that is the optimal number of PCs to use. Or you could follow the Haaland and Thomas approach, where you look at the MSE versus the number of PCs; if increasing the number of PCs does not provide a marked improvement, then we select the optimal number at the earliest point at which no further improvement of total variance is observed. Hope this helps.
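One convenient shortcut for the cumulative-variance rule of thumb: scikit-learn's PCA accepts a float for n_components and keeps the smallest number of PCs whose cumulative explained variance reaches that fraction. A sketch using the iris data as a stand-in:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA(n_components=0.70)       # keep enough PCs to explain >= 70% variance
scores = pca.fit_transform(X)

print(pca.n_components_)                    # number of PCs actually kept
print(pca.explained_variance_ratio_.sum())  # cumulative variance of the kept PCs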
How can we merge the 3D graphs (plots) of the samples (flowers) and the variables together? (To do so, we must standardize the sample data, mustn't we?)
A professor indeed
Thanks!
Thanks professor 😊
A pleasure!
you are my hero!
Thanks!
Hello, a question: I have problems running this part, can anyone please help?
Y_label = []
for i in Y:
    if i == 0:
        Y_label.append('Setosa')
    elif i == 1:
        Y_label.append('Versicolor')
    else:
        Y_label.append('Virginica')
Species = pd.DataFrame(Y_label, columns=['Species'])
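The snippet above usually fails when Y or pandas is not defined in the session; a self-contained version (plus a shorter map-based alternative) might look like this:

import pandas as pd
from sklearn import datasets

Y = datasets.load_iris().target            # 0, 1, 2 for the three iris species

Y_label = []
for i in Y:
    if i == 0:
        Y_label.append('Setosa')
    elif i == 1:
        Y_label.append('Versicolor')
    else:
        Y_label.append('Virginica')
Species = pd.DataFrame(Y_label, columns=['Species'])

# Equivalent one-liner using a mapping
Species = pd.Series(Y).map({0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'}).to_frame('Species')
print(Species['Species'].value_counts())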
My gdown guy!
I like the vid, but I'm trying to see how I can find out which columns of the original data frame are most related to each principal component.
Hi, you can look into the loading factors, which compress the original variables into the lower-dimensional space spanned by the new PCs (e.g. PC1, PC2, PC3, etc.).
@@DataProfessor Thanks, yeah. I thought about that later. So the data frame that shows the PC's will have the same number of rows as the number of columns of the data frame we are working with I think is what you're saying?
@@bryanchambers1964 Yes, the number of rows will stay the same, while the number of columns will be significantly reduced (compressed from high dimension to low dimension).
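Building on the loadings idea from the reply above, a sketch that ranks the original columns by their contribution to each PC (the iris DataFrame here is a stand-in for your own data):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(df))
loadings = pd.DataFrame(pca.components_.T, index=df.columns,
                        columns=['PC1', 'PC2', 'PC3'])

# For each PC, the original column with the largest absolute loading
print(loadings.abs().idxmax())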
I need PCA code in Spyder 3.8.