Would it be possible for goatools to work with Ensembl IDs instead of NCBI Entrez IDs?
Love all your tutorials! I'm stuck at the section where you convert the NCBI gene results to Python. Did they change the code by any chance? I can't find that Python file.
Run Python:
import goatools
goatools.__file__
That shows where the package is installed. Go to that directory and find the 'cli' folder.
Then run Python in the cli folder:
import ncbi_gene_results_to_python as n2p
n2p.ncbi_tsv_to_py('gene_result.txt', 'genes_NCBI_7227.py')
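The lookup described in the reply above can be sketched in one cell. This is only a sketch: `find_module_file` is a hypothetical helper, and it assumes the script still lives under `cli/` in your installed goatools, which this thread does not confirm.

```python
import importlib.util
from pathlib import Path


def find_module_file(package, relative_path):
    """Return the path to a file inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # package is not installed, or is a single-file module
    for location in spec.submodule_search_locations:
        candidate = Path(location) / relative_path
        if candidate.is_file():
            return candidate
    return None


# Assumption: the converter script sits in goatools/cli/ as described above.
script = find_module_file("goatools", "cli/ncbi_gene_results_to_python.py")
print(script if script else "script not found in this goatools install")
```

If a path is printed, you can pass it (in quotes) to `!python`, followed by the same `-o genes_ncbi.py gene_result.txt` arguments.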
When I tried to execute "!python C:\Users\anaconda3\Lib\site-packages\goatools\cli\ncbi_gene_results_to_python.py -o genes_ncbi.py gene_result.txt", the output file was not created.
This is indeed a great help for bringing everything to Python. Thank you for helping us. The community will benefit a lot from your instructional videos.
Thank you! I agree. I hope to see more and more transition from R to Python.
Hi! Thank you so much for this tutorial, it's super useful for what I need!! I have two questions, if you don't mind helping :) How do you calculate p and p_corr, and what do they mean? Do you think there are dictionaries with more general GO terms, like metabolism, cell division, migration, or do I have to make a dictionary, e.g. one called metabolism, that includes all the metabolic GO terms? Thank you again!!
No problem! GO has multiple levels. It's like a tree, with more specific terms branching out from broader terms. You can filter the results however you want, e.g., to include only the broad terms. The p value is from a Fisher's exact test. The package does that automatically. p_corr is just a multiple-testing correction. I'm not sure off the top of my head which one they use, but it is probably Benjamini-Hochberg (BH). If you are interested in how to do p value correction, I have a video that goes over just that.
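To make the two columns concrete: a one-sided Fisher's exact test for over-representation is the upper tail of a hypergeometric distribution, and BH correction rescales the sorted p values. The sketch below uses only the standard library. The function names are made up for illustration, the package may report a two-sided p value instead, and BH is an assumption, since the reply is not sure which correction is applied.

```python
from math import comb


def enrichment_p_value(study_hits, study_size, pop_hits, pop_size):
    """P(X >= study_hits) when drawing study_size genes from pop_size genes,
    of which pop_hits are annotated to the GO term (hypergeometric tail)."""
    total = comb(pop_size, study_size)
    upper = min(study_size, pop_hits)
    tail = sum(
        comb(pop_hits, i) * comb(pop_size - pop_hits, study_size - i)
        for i in range(study_hits, upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In this picture, p is the chance of seeing at least that many of your genes in a term by luck alone, and p_corr is that value after accounting for the fact that many GO terms were tested at once.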
So does red indicate positive and blue negative?
The color represents the significance. These were all upregulated pathways.
Thanks for the great instructions and educational effort! I'm actually an R user, trying to learn Python for bioinformatic analysis. It looks like the input genes for the analysis are 'gene symbols'. I can see your example input genes are not capitalized. Mine are all capitalized and it seems to work. But I'm curious if there is any chance this will not work if the symbols are capitalized.
Hi, this is a good question. Human gene symbols are normally all uppercase, while mouse gene symbols normally have only the first letter capitalized. Python is case sensitive, so you will have to make sure they match. If you are new this might be a little challenging, but what you can do is map something like:
df['gene symbols'] = df['gene symbols'].map(lambda x: x.upper())
But also, if you use the human database, you won't need to change the capitalization, I think.
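If you would rather not change your symbols at all, one option is a case-insensitive lookup against whatever spelling the reference uses. A minimal sketch in plain Python follows. `match_symbols` and the example symbols are illustrative, not part of goatools, and it assumes no two reference symbols differ only by case.

```python
def match_symbols(query_symbols, reference_symbols):
    """Map each query symbol to the reference spelling, ignoring case.

    Returns (matched, missing): a dict of query -> reference spelling,
    and a list of query symbols with no case-insensitive match.
    """
    lookup = {ref.upper(): ref for ref in reference_symbols}
    matched, missing = {}, []
    for symbol in query_symbols:
        ref = lookup.get(symbol.upper())
        if ref is None:
            missing.append(symbol)
        else:
            matched[symbol] = ref
    return matched, missing


# Uppercase (human-style) input matched against mouse-style spellings.
matched, missing = match_symbols(["ACTB", "GAPDH", "FOO"], ["Actb", "Gapdh"])
print(matched, missing)
```

Checking the `missing` list is a quick way to see whether capitalization (or something else) is silently dropping genes before the enrichment runs.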
What would be a way to generate and graph the "log fold enrichment" of the GO term?
I'm not sure I have an easy answer for this. First you would need something to compare it to. Then you have to decide what value you are comparing. If you do it from the Fisher/hypergeometric enrichment, then it doesn't take into account the actual log-fold change of the gene itself, just whether it was DE. Maybe you could use GSEA and the enrichment score. See my other video(s) on GSEA. However, there might be a better answer out there that I am unaware of.
@sanbomics ok! I am just wondering because when you enter a set of genes for GO analysis on PANTHER, one of the column statistics given is "fold enrichment", so it may be a useful stat to add to the output of the function here.
I misinterpreted your question. You mean log fold over what is expected by random chance? I'm not sure how PANTHER does it, but you can likely do something similar with a hypergeometric distribution (see my hypergeometric video). In my opinion, fold enrichment is somewhat redundant with other statistics. If you can find out how PANTHER does it, I can likely give you a better pythonic answer.
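One common definition of fold enrichment is the observed number of hits divided by the number expected by chance. Whether PANTHER computes it exactly this way is not confirmed in this thread, so treat this as a sketch with illustrative function and argument names:

```python
import math


def fold_enrichment(study_hits, study_size, pop_hits, pop_size):
    """Observed hits divided by the hits expected by chance."""
    expected = study_size * pop_hits / pop_size
    if expected == 0:
        raise ValueError("expected count is zero; fold enrichment is undefined")
    return study_hits / expected


def log2_fold_enrichment(study_hits, study_size, pop_hits, pop_size):
    """log2 of the fold enrichment; -inf when there are no observed hits."""
    fe = fold_enrichment(study_hits, study_size, pop_hits, pop_size)
    return math.log2(fe) if fe > 0 else float("-inf")


# 8 of 100 study genes hit a term that covers 200 of 20000 background genes.
print(fold_enrichment(8, 100, 200, 20000))
print(log2_fold_enrichment(8, 100, 200, 20000))
```

The four inputs are the same counts the enrichment test already uses, so this can be added as an extra column next to p and p_corr and plotted like any other value.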
Also, what does "per" mean on the axis? I understand it is the number of genes over the number of GO terms, but what does that mean?
Percent of genes in the GO term that were in your DE genes.
What does p_corr mean?
This is a great video; thank you very much. I think !mv is for Linux. How can we move the created file to the default import location in Windows?
On Windows, use move instead of mv.
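If you want something that behaves the same on Windows, macOS, and Linux, Python's own `shutil.move` avoids the shell command entirely. A small sketch follows; `move_into` is a hypothetical helper, and the destination folder is whatever location you import from.

```python
import shutil
from pathlib import Path


def move_into(src, dest_dir):
    """Move a file into dest_dir (created if needed) and return its new path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / Path(src).name
    return Path(shutil.move(str(src), str(target)))
```

For example, `move_into("genes_ncbi.py", "some_folder")` would place the generated file in `some_folder`, where the folder name is just a placeholder.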
How would you probe for gene sets that are downregulated?
Hi! You should run it on the downregulated genes separately.
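Splitting the DE genes by direction before running the enrichment could look like the sketch below. It assumes you already have significant genes paired with a log2 fold change; the function name and the example values are made up for illustration.

```python
def split_by_direction(de_genes):
    """Split (symbol, log2_fold_change) pairs into up- and downregulated lists."""
    de_genes = list(de_genes)
    up = [gene for gene, lfc in de_genes if lfc > 0]
    down = [gene for gene, lfc in de_genes if lfc < 0]
    return up, down


# Hypothetical significant genes with their log2 fold changes.
up, down = split_by_direction([("GeneA", 1.5), ("GeneB", -2.0), ("GeneC", 0.3)])
print(up, down)
```

Each list is then passed to the enrichment step on its own, so the resulting terms describe only upregulated or only downregulated genes.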
Nice content for creating reproducible code. Thanks.