Hi everyone, thanks for watching! Here's the link to the code on my GitHub in case you want to try it for yourself (also feel free to add me as a friend on GitHub if you want 🙂):
github.com/mikesaint-antoine/Comp_Bio_Tutorials/tree/main/more_comp_bio/heatmap
Whoa, this is cool- I had no idea this was possible to make using python
Thanks a lot! This is exactly what I was looking for!🙏
Thanks for watching! 🙂
You saved my life with this video hahahahahaha, very useful. If you want a suggestion for a new video, I suggest teaching how to make a heatmap in Python by taking a raw SRA sample and passing through all the phases to reach the gene expression CSV.
Yeah that's a good idea! I'll try to make a video on that in the future.
Thank you, your videos are very important!
Thanks for watching! 🙂
Hi! I had a quick question. How do I change the order of samples (columns) in the output image/pdf? I notice that sns.clustermap() is taking the arguments in an array but is rearranging them for the final heatmap.
Hi Varsha, good question. If you want to stop the columns from being rearranged into clusters, you need to set col_cluster=False in the sns.clustermap() function. So it should look like:
sns_plot = sns.clustermap(data, xticklabels=sample_names, yticklabels=genes, col_cluster=False)
Then the columns will stay in the original order without being clustered.
Thanks for watching and let me know if you have any more questions! 🙂
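If the goal is a specific column order rather than just the original one, one option is to reorder the data yourself before plotting and keep col_cluster=False. Here is a minimal sketch, assuming the data is a list of rows (one per gene). The reorder_columns helper and the sample names are hypothetical, made up for illustration:

```python
def reorder_columns(data, sample_names, desired_order):
    """Return a copy of data (list of rows) with columns arranged as in desired_order."""
    # Map each sample name to its current column position.
    position = {name: i for i, name in enumerate(sample_names)}
    missing = [name for name in desired_order if name not in position]
    if missing:
        raise ValueError(f"Unknown sample names: {missing}")
    indices = [position[name] for name in desired_order]
    return [[row[i] for i in indices] for row in data]


# Hypothetical example: 2 genes x 3 samples.
sample_names = ["control_1", "tumor_1", "control_2"]
data = [
    [1, 9, 2],
    [3, 8, 4],
]
new_order = ["control_1", "control_2", "tumor_1"]
reordered = reorder_columns(data, sample_names, new_order)
print(reordered)  # → [[1, 2, 9], [3, 4, 8]]
```

The reordered data and new_order could then be passed to sns.clustermap() with xticklabels=new_order and col_cluster=False, so the columns are drawn in exactly that order.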
@@MikeSaintAntoine Thank you, it worked perfectly!
@@VarshaAkinepalli No problem, let me know if you have any more questions! 🙂
Your videos are very helpful! Thank you 😊
Thanks for watching!
Where can I find real data to test with this code? I tried to get a fastq from a sample on NCBI, converted it to CSV, and the format is different.
My CSV has ID, Sequence, Quality
Like:
SRR6971.1.1, AAAATCGGGCAA, "[30, 30, 30, 30, 30, 30]"
The real quality array is longer: it has 54 values of 30 per line.
Hey Victor, sorry about the late response! Yeah when I'm making these videos I purposely make up mock datasets that are in a very nice, easy to work with format. But in real life sometimes the hard part is just working with the raw data to get it into the right format so that you can do the analysis.
But sometimes you can find ones that are already formatted nicely. For example here's one:
www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE150910
If you download the "GSE150910_gene-level_count_file.csv.gz" dataset from that page, it'll give you a raw count gene expression matrix, with each sample labeled according to the experimental condition. Then you can just apply a simple normalization, like converting to z-scores for each row, to be able to practice with the heatmap plots.
Hope that helps! Thanks for watching my videos 🙂
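The row-wise z-score normalization mentioned above could look something like the sketch below. It assumes the counts are already loaded as a list of rows (one per gene), and it uses the population standard deviation, which is a choice made for this example rather than anything the dataset prescribes. The function name and the numbers are hypothetical:

```python
import statistics


def zscore_rows(matrix):
    """Convert each row to z-scores: (value - row mean) / row standard deviation."""
    scaled = []
    for row in matrix:
        mean = statistics.fmean(row)
        std = statistics.pstdev(row)  # population standard deviation (an assumption)
        if std == 0:
            # A gene with identical counts in every sample has no variation to scale.
            scaled.append([0.0 for _ in row])
        else:
            scaled.append([(value - mean) / std for value in row])
    return scaled


# Hypothetical raw counts: 2 genes x 4 samples.
counts = [
    [10, 20, 30, 40],
    [5, 5, 5, 5],
]
for row in zscore_rows(counts):
    print([round(value, 2) for value in row])
```

After this step every gene is centered on zero, so the heatmap colors show which samples are above or below that gene's own average instead of being dominated by the most highly expressed genes.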
Very helpful! I need to generate sample data sets for my students to analyze, and this set is very similar to what I envisioned. You mentioned a link to your set and another to instructions for generating a set, but I don’t see the links. Could you post them? Many thanks!!
Hi Amy! Yes, this is the link I was talking about:
github.com/mikesaint-antoine/Comp_Bio_Tutorials/tree/main/more_comp_bio/heatmap
The CSV file is called "fake_gene_expression_data.csv", and the Python code I used to generate it is called "make_fake_data.py".
I would say as a disclaimer though that this data is just a bunch of random integers I generated, and then I purposely made the "tumor" samples higher than the controls. So it isn't a realistic simulation of actual gene expression data, but just something quick and easy I made to demonstrate the heatmap plot. If I had more time I would have liked to do some kind of actual simulation to generate something more realistic, but I didn't think I had time for that.
Maybe it will still be ok for your class though if the data doesn't have to be too realistic.
Thanks for watching and let me know if you have any more questions about the code!
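For anyone who wants to build a similar practice set from scratch, here is a rough sketch of the idea described above: random integers, with the "tumor" samples shifted higher on purpose. This is not the actual make_fake_data.py from the repo. The function name, column names, value ranges, and the size of the shift are all assumptions made for illustration:

```python
import csv
import io
import random


def make_fake_expression(n_genes=20, n_control=5, n_tumor=5, tumor_boost=50, seed=0):
    """Return (header, rows) for a made-up gene expression table."""
    rng = random.Random(seed)  # fixed seed so the same table comes out every time
    header = ["gene"]
    header += [f"control_{i + 1}" for i in range(n_control)]
    header += [f"tumor_{i + 1}" for i in range(n_tumor)]
    rows = []
    for g in range(n_genes):
        control = [rng.randint(0, 100) for _ in range(n_control)]
        # Tumor values are drawn the same way, then shifted up on purpose.
        tumor = [rng.randint(0, 100) + tumor_boost for _ in range(n_tumor)]
        rows.append([f"gene_{g + 1}"] + control + tumor)
    return header, rows


header, rows = make_fake_expression()

# Write to an in-memory buffer; use open("some_file.csv", "w", newline="") to save to disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue().splitlines()[0])
```

Like the original, this is only good for demonstrating the plot. Real expression data is not uniformly distributed, and real tumor samples do not differ from controls by a constant shift.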
@@MikeSaintAntoine That’s fine! I can use it as a jumping-off point. I can also grab files from the Cancer Genome Atlas. I teach a series of Biomedical Science courses to AP-level students interested in health science. We do a very limited microarray experiment and I give them canned data sets to perform correlation analyses on, but I want to make that unit more robust. Your data set would help me with that, plus I need a set to experiment with. Thank you!
@@amymasi9110 That sounds awesome! I wish I could've taken a class like that in high school, sounds very cool.
@@MikeSaintAntoine It’s a very popular program. We have 66 slots open to students from 7 big high schools. My students are scary smart. I’m always looking for activities to challenge them!
Hi! I am new to the field. Can you please tell me what those evolutionary-tree-like branches mean?
Great question! Those branches show how similar genes and samples are to each other in terms of expression patterns. For example, if two genes are close to each other in the branching, that means they had a similar expression pattern to each other across the samples. And if two samples are close to each other in the branching, that means they had a similar expression pattern across genes.
Here's a good resource with some more information about the Seaborn library specifically, if you're interested:
www.geeksforgeeks.org/hierarchically-clustered-heatmap-in-python-with-seaborn-clustermap/
Thanks for watching and let me know if you have any more questions!
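To make "close to each other in the branching" a bit more concrete: hierarchical clustering builds the tree by repeatedly joining the closest items, so the pair with the smallest distance ends up connected by the lowest branch. The sketch below only finds that first pair. It assumes Euclidean distance (which, as far as I know, is what sns.clustermap() uses by default), and the gene names and values are made up:

```python
import math
from itertools import combinations


def closest_pair(profiles):
    """Return the two names whose expression profiles are nearest (Euclidean distance)."""
    best = None
    for (name_a, values_a), (name_b, values_b) in combinations(profiles.items(), 2):
        distance = math.dist(values_a, values_b)
        if best is None or distance < best[0]:
            best = (distance, name_a, name_b)
    return best[1], best[2]


# Hypothetical expression values for 3 genes across 4 samples.
genes = {
    "gene_A": [1.0, 1.2, 5.0, 5.1],
    "gene_B": [1.1, 1.0, 5.2, 4.9],
    "gene_C": [5.0, 4.8, 1.0, 1.3],
}
print(closest_pair(genes))  # → ('gene_A', 'gene_B')
```

Here gene_A and gene_B rise and fall together across the samples, so they would sit on neighboring branches, while gene_C has the opposite pattern and would only join them higher up in the tree.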
@@MikeSaintAntoine Thanks a lot 😬
Hello sir.
I am getting some errors with my dataset. I am following your method, but I am at a basic level of Python and was given a task like this.
If you can help me, I will be obliged.
Hi Chintu, thanks for watching! Yes I can try to help with that. Can you email your code and data to mikest@udel.edu? Then I will take a look and see if I can figure it out!
@@MikeSaintAntoine sure. And thank you.
Is there a way to replace the 'with open' function with pandas pd.read_csv?
Hi Elan, good question! Yes, this is how you can read in the CSV file with Pandas:
data = pd.read_csv("fake_gene_expression_data.csv",header=0, index_col=0)
genes = data.index
sample_names = data.columns
Pandas is easier in the sense that it requires fewer lines of code, but personally I usually prefer to read in CSVs the old-fashioned way because I think that makes it easier to check over the data, save only what you need to save, and do any calculations or manipulations you need to do. Also, I've found Pandas can get a bit screwed up if you're working with a sloppy dataset, like if it has missing data, NaNs, etc. But it can definitely be pretty convenient if you have a nice dataset.
Thanks for watching and let me know if you have any more questions!
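For comparison, here is a rough sketch of the "old-fashioned" approach with the built-in csv module, including one way to check over the data and skip rows with missing values or NaNs. It assumes the first column holds gene names and the first row holds sample names. The function name and the example contents are hypothetical, not the code from the video:

```python
import csv
import io


def read_expression_csv(handle):
    """Read a gene-by-sample CSV; skip rows with missing or non-numeric values."""
    reader = csv.reader(handle)
    header = next(reader)
    sample_names = header[1:]  # first column is assumed to hold gene names
    genes, data, skipped = [], [], []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        gene, raw_values = row[0], row[1:]
        try:
            values = [float(value) for value in raw_values]
        except ValueError:
            skipped.append(gene)  # empty cell or text like "NA"
            continue
        if len(values) != len(sample_names) or any(v != v for v in values):
            skipped.append(gene)  # wrong column count, or a literal NaN
            continue
        genes.append(gene)
        data.append(values)
    return genes, sample_names, data, skipped


# Hypothetical file contents; with a real file, pass open("your_file.csv", newline="") instead.
text = "gene,s1,s2\ngene_1,3,4\ngene_2,,7\ngene_3,NaN,2\ngene_4,5,6\n"
genes, sample_names, data, skipped = read_expression_csv(io.StringIO(text))
print(genes, skipped)  # → ['gene_1', 'gene_4'] ['gene_2', 'gene_3']
```

Keeping a list of skipped genes makes it easy to see how much of a sloppy dataset was dropped before plotting.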
@@MikeSaintAntoine Perfect, thanks so much!
@@IshamaelMetal No problem, thanks for watching!
@@MikeSaintAntoine Absolutely! If you don't mind, I had another kind of general question. I'm a PhD candidate in biomed--I've been trying to explore more about data science and how to translate it into my own research. I was just wondering if you had any resources you'd recommend?
@@IshamaelMetal Sure! There are a lot of great resources out there that I've found super helpful in learning this stuff. A couple off the top of my head are:
Caleb Curry's channel, for learning the basics of Python, Linux/command line, SQL, and data structures:
th-cam.com/users/CalebTheVideoMaker2
Sentdex, for advanced Python and machine learning. His course on coding a neural network from scratch is amazing, and great for building intuition about deep learning:
th-cam.com/users/sentdex
StatQuest is great for explaining statistics and data science concepts in a way that's easy to understand:
th-cam.com/users/joshstarmer
There's also a great MIT course on YouTube on systems biology. This is really more about math modeling than data science, but you might still find it useful:
th-cam.com/video/gc3O2sKIsX4/w-d-xo.html
Hope this helps! Let me know if you have any more questions, and good luck with your PhD program!
nice job :) thanks
Thanks for watching!
I tried this on Spyder (from Anaconda Navigator) and the code seems to work (no error), but no heatmap was generated and I got a message saying "UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance. warnings.warn(msg)." How should I fix it? :(
Hi CellRus! Hmm I don't really have any experience with Spyder, but maybe I can still help. Did you remember to include plt.show() at the end of your code? If that wasn't the issue, you can email me your code if you want (my email is mikest@udel.edu) and I'll take a look!
@@MikeSaintAntoine Hello, yes I did. I tried it on Jupyter notebook too and got the same message. The code is exactly the same as yours, but I have a feeling that it might be because there are too many data points in my dataset and maybe Spyder or Jupyter cannot handle it? It asks me to install fastcluster but I'm not sure how to do this. I'm aware that Anaconda Navigator has a slightly different way to install packages or something like that, but I'm not sure how different.
@@CellRus Hmmm that sounds like it could be it. Just curious, how large is your dataset? How many rows and columns? I've never had a problem using this library with large gene expression datasets, but maybe it's a problem specifically with Spyder and Jupyter.
@@MikeSaintAntoine Yeah, your code was very easy to follow and I was really looking forward to using it on my gene set, but for some reason no plot was generated. My gene set has about 3000 genes with 17 columns. Maybe that's too much for Spyder?? I have no idea. I'm very new to Python.
@@CellRus I have the same issue.
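For what it's worth, that UserWarning reads as a performance hint rather than an error, so it is probably not what stops the plot from appearing. One way to narrow this down is to check whether the optional fastcluster package is importable, and to save the figure to a file so the result does not depend on whether Spyder or Jupyter opens a plot window. This is only a sketch: the helper names and the output filename are hypothetical, and it has not been tried against this particular dataset:

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def save_clustermap(data, out_path="heatmap.png"):
    """Draw a clustermap and write it to out_path instead of relying on a plot window."""
    import seaborn as sns  # imported here so the check below runs even without seaborn

    grid = sns.clustermap(data)
    grid.savefig(out_path)  # the object returned by clustermap has its own savefig()
    return out_path


for name in ("seaborn", "fastcluster"):
    status = "installed" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```

If fastcluster turns out to be missing, it can usually be installed with pip (pip install fastcluster) or from the conda-forge channel in an Anaconda environment. It should only speed up the clustering, not change the plot. Calling save_clustermap(data) should then leave a heatmap.png in the working directory, which can be opened outside of Spyder or Jupyter.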