Based on the videos I've seen from your channel, everything is really great! Everything we bioinformatics beginners need most: theory explained without complication, in a way that is easy to understand, and guiding us step by step through the process with a list of videos in logical order. I just have to thank you for all your commitment and 100% accurate work on these videos, PLEASE continue! I will credit you in several of my presentations, thank you very much!
I really appreciate your kind words, really encourages me to keep doing this :) Thank you very much!
You explain this in a very clear and logical way! Appreciate it
It is superrrrr helpful!!!!!!! This is the best video about DESeq for someone with zero background like me!
I am writing a thesis that is partially reliant on bioinformatics and have no experience with deseq2. This video was immensely helpful in getting me up to speed in general understanding. Thank you very much!
You are a great teacher! I enjoy watching the detailed theory and analysis. Your tutorial is very helpful. Cheers!
Very impressive. More scientists should be engaging the way you do.
I am so thankful for your tutorials.. can you please make one video on like, how to manage so many genes and how to come to some conclusion after getting so many genes
Simply excellent. Everything was explained using lucid examples. Very good for beginners.
Very well done! I have watched lots of videos on DESeq2 and nobody explains the underlying math!
Really very helpful video tutorial. I appreciate the effort you made in explaining the DESeq2 background statistics. You explain them perfectly and in a very simple manner. very helpful for us. Thanks a lot and keep sharing such informative videos.
Thank you! This was helpful! My study design was complex, as I was looking at 4 different conditions, with one reference level.
this is amazingly helpful as a beginner, thank you
Very helpful, clear and accurate explanation. Thank you.
This was incredibly helpful. I plan to watch it again and take detailed notes along the way. Thank You!
Thank you so much! This is very clear. I like the series of your video talking about the logic behind each bioinformatic package. I think it's extremely important for me with biology background to know the basics of each package and identify the best tool to use when I get my data.
Thank you for being so helpful to everyone!
U made it so simplified..loved ur explanation..thank you
Really love that. After watching lots of videos on YouTube, I finally understood what's going on from your video. The only part I could not understand was MLE; if it is feasible for you, please make a video to elaborate on it in more detail.
Thanks a lot
I will think about making a separate video explaining MLE. Thanks :)
Amazing job explaining.
Great video and explanation!!!
really really thanks ma'am, it's amazing, I owe you.
Really superb! Thank you!
Very smart person. Great explanation
hello, I truly appreciate your videos and explanations. They are very clear and concise. I do have a request though for a future video. Could you do a how-to on gene set analysis using a GO class annotation and how to filter the desired genes from the completed DE analysis data frame. Thank you for all you do, Keep it up!
This is such a valuable and informative video, thanks so much!
very detailed and simplest explanation. 👌
Keep up the good work! Would love to see a tutorial on edgeR time-series differential analysis.
Will plan a video covering this. Thanks for the suggestion :)
That was epic.. Many thanks
This is SO helpful! Thank you!!!
Thanks a lot! Though can you explain how normalised values are calculated? Say for gene A that is 2 and 10 (for untreated and treated) how are these 4.016 and 5.62 received? Thanks again
I am confused between the normalization method explained in this video and the normalization method explained in another video [Difference between RPKM/FPKM and TPM | RNA-Seq Normalization Methods | Bioinformatics 101]. Which normalization is correct?
Is this the video where design factor was explained? I'm coming from another of your videos where you say "if you don't know design factor, look at my previous video" but you never said which one. I think this one was a good candidate, however, I am still very confused as to how to use the design factor.. that was x= 0 or x =1? or what was that when you added two conditions? I'm super lost with the last 2 seconds of explanation there.. if you have another video explaining this, which one is it? Thanks! Everything else is on point!
Hi there, i just wanted to ask that if we can use DEseq analysis for unpaired data. I have 11 samples of normal (control) and about 160 tumor samples. Or we should go with paired data?
Great video! Congrats!
it was great 100 out of 100.
Awesome video!
very helpful!! Thanks for teaching!
I may have missed it, but what do we do with the replicates? You mentioned the replicates in the study design segment (00:38), but the calculations you display are about one group. Should we take the mean of the samples and make them into one column? One column for the treated (mean of the b1, b2 and b3 for t1 and b1, b2 and b3 for t2) and one column for untreated (mean of the B1, B2 and B3 for T1 and B1, B2 and B3 for T2)?
Apologies if I wasn't clear in my video, there are ways to handle technical replicates. Check this section out from DESeq2 vignette: bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#collapsing-technical-replicates
With regards to biological replicates, you should NOT collapse biological replicates.
Hey Khushbu, really nice explanation.😊
Could i ask you what are the range of x and y axis you used in mean vs variance plot at 6:27 min
I might be mistaken, but are you sure the values for calculating the median in step 3 (est. size factors) are correct? When I calculate them with R, I get 0.45, for instance, for the untreated normalization factor. Shouldn't the median be one of the values? Apart from that: great video, helped me a lot!
Thanks for reaching out! I am sure, the median of values 0, 0.45, 0.55, 0.58 is 0.5. I calculated it using R as well.
Hey, the calculations in the video are correct. But maybe you were confused because those are medians, not means. In the case of 4 values, you have to take two values in the middle and then the average of them;) So we take 0.45 and 0.55 and get 0.50.
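For anyone who wants to check the even-count case themselves, here is a minimal Python sketch (Python rather than R purely for illustration) using the four ratios quoted above. With an even number of values, the median is the average of the two middle ones, so it does not have to be one of the inputs. This only checks the arithmetic on the values as quoted.

```python
from statistics import median

# The four untreated/reference ratios quoted in this thread.
ratios = [0, 0.45, 0.55, 0.58]

# Sort, then average the two middle values (even number of inputs).
ordered = sorted(ratios)
middle_pair = ordered[1:3]          # the 2nd and 3rd values: 0.45 and 0.55
by_hand = sum(middle_pair) / 2

# statistics.median applies the same rule for an even-length list.
print(by_hand, median(ratios))
```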
Thanks for that video, You are genius :)
Great video, but I'm still confused about the dispersion α. For one gene, the α was estimated separately in the control group and treatment group (So, there are 2 α for one gene)?
Or there is only one α for each gene which means the mean and the variance were calculated cross the control and treatment group?
As far as my understanding goes, it's the latter. The dispersion is estimated using all samples across both groups, so there is only one α for each gene.
Really love your video and is inspiring me to also try making my own videos and test my knowledge. At 14:30 there's an error when you are estimating the size factor. The geometric mean is calculated by the mean of the natural log of the counts (ln because that is what DESeq2 uses). Taking the log turns the Pi symbol in the paper into a sigma of logs. Might also be good to mention that it isn't square root if you have more than two conditions. If I'm wrong though, someone please let me know!
Thank you for pointing that out. You are right: DESeq2 works with natural logs, and for n samples the geometric mean is the nth root of the product of the counts (a square root only when n = 2). I should have mentioned it. The two methods are mathematically equivalent, so they give the same value; any small difference comes from rounding intermediate steps. Just for the explanation, I chose the multiplying method because it has fewer steps, which makes it easier to understand and gets the point across :)
Geometric mean with log method:
log(2) + log(10) = 2.995732
2.995732 / 2 = 1.497866
2.718281828459^1.497866 OR exp(1.497866) (taking antilog)
= 4.472136
(If the sum is rounded to 2.99 first, exp(1.495) = 4.459337, which is only a rounding artifact.)
Geometric mean with multiply method:
sqrt(2*10)
= 4.472136
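A quick way to see that the log route and the multiply route are the same quantity is to run both without rounding. This is a small Python sketch, not DESeq2 code; the function names are made up for illustration, and the counts 2 and 10 are the ones from the example. It also shows that rounding the sum of logs to 2.99 before dividing is what produces 4.459337.

```python
import math

def geometric_mean_product(counts):
    # nth root of the product of the counts
    return math.prod(counts) ** (1 / len(counts))

def geometric_mean_logs(counts):
    # exp of the mean of the natural logs (only valid for positive counts)
    return math.exp(sum(math.log(c) for c in counts) / len(counts))

counts = [2, 10]
via_product = geometric_mean_product(counts)
via_logs = geometric_mean_logs(counts)
rounded_route = math.exp(2.99 / 2)   # sum of logs rounded to 2.99 first

print(round(via_product, 6), round(via_logs, 6), round(rounded_route, 6))
```

With three or more samples the same two functions apply unchanged; the square root simply becomes an nth root.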
That's a great point and shows why we take the log! With large outliers, the average of the logs is less affected than the regular average, but the two barely differ when the values are close. Also, do you plan on making a video on the dispersion in DESeq2 in more detail? There's so much more in the paper I didn't understand at all.
@kevinradja I will surely think about making a video on dispersion in more detail :)
Hi Dr. Khushbu,
Thank you for the very informative videos. I am learning a lot from these. I had a query: if we have a time series of treated and untreated samples, should the pairs of treated and untreated at each time point be considered separately for estimating size factors?
Thank you for this video !!
Before getting the counts... do we need to align our reads?
Thank you so much, the paper about this algorithm is complex asf
Thanks for this info. I am just a bit lost, especially when I try to calculate using gene D, which resulted in a GM of 0 and reference values of 0. Wouldn't the following steps result in 0 (assuming that values divided by 0 are just set to 0)?
This is super helpful!
Very clear!
Very excellent explanation. Thank you! I am too new to the field. I have questions regarding how we can use or what values we will use to make heatmap, Venn diagram, etc. In 15.49, once we get median of ratio and normalize our samples with this value to obtain norm_values for each gene of each sample. Before I use these value to plot heatmap. Do I need to again transform to log2? Or do I need to convert to z-Score? if yes, how to get z-score for each gene in each sample? Sorry for asking so many questions. Thanks in advance!
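For visualizations such as heatmaps, you typically scale the values (i.e. calculate z-scores per gene). You can use the scale() function in R; note that it scales columns, so if your genes are in rows, transpose first: t(scale(t(x))).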
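To make the per-gene z-score concrete, here is a rough Python sketch of what scaling one gene does: log2-transform with a pseudocount, then subtract the gene's mean and divide by its sample standard deviation. The normalized counts are made-up numbers, the function name is hypothetical, and the pseudocount of 1 is an assumption; for real data, DESeq2's own transformations (vst or rlog) are the usual starting point before scaling.

```python
import math
from statistics import mean, stdev

def gene_z_scores(normalized_counts, pseudocount=1.0):
    # log2 transform, then centre and scale one gene's values across samples
    logged = [math.log2(x + pseudocount) for x in normalized_counts]
    mu, sd = mean(logged), stdev(logged)   # stdev divides by n - 1, like R's sd()
    if sd == 0:
        return [0.0] * len(logged)         # constant gene: no variation to scale
    return [(x - mu) / sd for x in logged]

# Hypothetical normalized counts for one gene in four samples.
gene = [4.0, 5.0, 20.0, 22.0]
z = gene_z_scores(gene)
print([round(v, 2) for v in z])
```

After scaling, each gene has mean 0 and standard deviation 1 across samples, which is why genes with very different expression levels become comparable in one heatmap.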
Amazing💯
Hi,
Really like your video. thank you for the channel once again. Its a blessing.
I have a small doubt.
@11:27 you said that since gene D is not expressed in the treated condition, the total of 42 from untreated needs to be divided among the 3 expressed genes, causing them to be inflated. How is that? Could you please explain?
Thanks in advance
Hi may I ask if we have n=3 biological replicates/2 groups how can we put in 2 groups? Just calculate mean of read counts for each genes in each group?
excellent video.
this is secretly genius
crispy clear
Mam please do put videos for how to do DGE for raw 16srDNA paired end data in fastq format ?
Hello, this was an awesome and very informative video! I've been trying to learn more about CRISPR screen analysis (specifically MAGeCK). Are you familiar at all with analysis of CRISPR screens and would you say that the concepts in this video would be transferable? Thank you so much!
Unfortunately, I have not worked with CRISPR screen data before, so I am unable to answer whether these concepts are transferable.
@Bioinformagician No problem!
Excellent!!!
In 22:53, why do you say that "y - B0 = log(y) - log (B0)" ???? isn't that incorrect?
very nice
wow that was magical
can I perform deseq2 in galaxy for finding differentially expressed mirnas
Awesome !
Hi can you pls explain density in count plot?
question regarding the coefficients for the fitting the linear model - from my understanding, based on this explanation, the linear model can accommodate theoretically infinite number of coefficients. in the vignette for deseq2, michael love mentions that while deseq2 can do this, it is perhaps easier to concatenate multiple factors into a single variable and have deseq2 perform its linear modeling this way. can you explain why this is the case? and how this can extend from a 2-factor design to a n-number design and so forth?
Can you point me to the section in the vignette where Michael Love talks about concatenating multiple factors into a single variable?
@Bioinformagician In the vignette, this is under the "Interactions" subheading; copied and pasted from the vignette, Love writes:
Initial note: Many users begin to add interaction terms to the design formula, when in fact a much simpler approach would give all the results tables that are desired. We will explain this approach first, because it is much simpler to perform. If the comparisons of interest are, for example, the effect of a condition for different sets of samples, a simpler approach than adding interaction terms explicitly to the design formula is to perform the following steps:
combine the factors of interest into a single factor with all combinations of the original factors
change the design to include just this factor, e.g. ~ group
Using this design is similar to adding an interaction term, in that it models multiple condition effects which can be easily extracted with results.
Thank you for pointing me to this.
I want to bring in a little context here; without it, this can be misleading.
I have tried to explain it here: khushbupatel.notion.site/Interaction-terms-DESeq2-5a4a75b83adc4fe89576e6ee9b00daf0
Hope this clears your confusion and answers your question. Thanks! :)
I use docker and command line to run deseq2. How to save plots to png files?
Hello mam. Your video is very helpful especially for beginners like me. I have some queries and I would be very grateful if you can help me out. We got RNAseq done from a company and they have provided us with analyzed data. My queries are :
1. They have provided PCA plot and they have mentioned the following, "DESeq2 generates PCA plot based on a matrix of normalized read counts,the result typically depends only on the few most strongly expressed transcripts because of showing largest absolute differences between control and treated samples." The plot they provided showed very high variance among the biological replicates of one treatment group (due to lower read count in some samples). Is there any way to get around this by considering some other features (apart from read counts) to compute variances ?
2. They have also provided RPKM values of various genes that are unique to specific treatment groups. I observed some of the genes had 'zero' reads in some of the replicates of the same treatment group. Can we consider these genes for our analyses ?
3. I also observed completely identical RPKM values for many genes in the list (identical even upto 9 decimal places). What could be the reason for this and can we proceed with the analyses of such genes ?
Any help from your side would be highly appreciated. 😊
1. Do you happen to know how low the read counts are among the biological replicates of that one treatment group? You could take a look at the pre-alignment and post-alignment QC, especially the total number of reads and the total number of uniquely mapped reads for each sample. Another way to identify noisy/problematic samples is to use a distance matrix to get similarities or dissimilarities across samples.
2. You could get total counts for genes across all samples and see if these genes with 0 reads have consistent low read counts across other samples as well. We would ideally want to remove genes with less than 10 total read counts across all samples. You could be more stringent and set a higher number.
3. This seems suspicious. I would recommend to generate RPKM/TPM values yourself.
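The low-count filter in point 2 can be sketched like this. It is plain Python on a made-up count table (gene names, numbers and the function name are all hypothetical); the threshold of 10 total reads is the one suggested above.

```python
# Hypothetical raw counts: gene -> counts in each sample.
counts = {
    "geneA": [2, 10, 4, 7],
    "geneB": [0, 1, 0, 2],
    "geneC": [150, 90, 210, 175],
    "geneD": [0, 0, 9, 0],
}

def filter_low_counts(table, min_total=10):
    # Keep genes whose summed count across all samples reaches min_total.
    return {gene: c for gene, c in table.items() if sum(c) >= min_total}

kept = filter_low_counts(counts)
print(sorted(kept))
```

Raising min_total makes the filter more stringent, as mentioned in the reply.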
Thank you very much for your response mam. I am very new to such data types. I am learning everything from scratch so I will try my best to carry out whatever you suggested.
Very helpful!
13:40 step1
Thank you so much mam
thank you
Mam can u help me analyse rna sequence database using deseq2 tool pls
Awesome! but how is 2/0.5 = 4.016...? isn't it just 4? (16:14) and same with the other numbers from the untreated.
You’re right. The discrepancy is due to rounding off. If you don’t round the numbers, you would get 4.016 instead of 4
bro i love you
When you normalize counts, and have 0/0 (your gene D), why do you assign 0?
In step 1, to calculate the geometric mean, we take the square root of the product of the counts across the two samples. For gene D, the product is 30 x 0 = 0, and the square root of 0 is 0. Hence 0.
yes, but then, in Step 2, you divide 30/0 (which is infinity) and even 0/0 (which is undefined) - so why you get 0's for untreated/ref and treated/ref? is this some kind of a convention or just a mistake?
@pgresner It’s a mistake. They should be non-finite values (Inf for 30/0 and NaN for 0/0) instead of 0s. I didn’t mention a very important point: non-finite values (i.e. Inf, -Inf and NaN) are filtered out and not used to calculate the median. Thank you for pointing it out, I shall put a note about this in the description.
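To tie the corrected steps together, here is a hedged Python sketch of median-of-ratios size factors that drops any gene with a zero count before taking the median, so gene D never contributes an Inf or NaN ratio. The counts for genes A and D are the ones discussed above; genes B and C are made-up values so the example has enough rows, and the function name is just for illustration. It is a sketch of the idea, not DESeq2's actual code.

```python
import math
from statistics import median

def size_factors(counts_by_gene):
    n_samples = len(next(iter(counts_by_gene.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for counts in counts_by_gene.values():
        if any(c == 0 for c in counts):
            continue  # geometric mean is 0, so the ratios would not be finite
        log_geo_mean = sum(math.log(c) for c in counts) / len(counts)
        for j, c in enumerate(counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # One size factor per sample: exp of the median log ratio.
    return [math.exp(median(r)) for r in log_ratios]

counts = {
    "A": [2, 10],    # untreated, treated (from the video example)
    "B": [8, 32],    # hypothetical
    "C": [12, 30],   # hypothetical
    "D": [30, 0],    # zero in treated: excluded from the median
}
sf = size_factors(counts)
normalized_A = [c / s for c, s in zip(counts["A"], sf)]
print([round(s, 3) for s in sf], [round(v, 3) for v in normalized_A])
```

Because genes B and C are invented, the size factors here will not match the 4.016 and 5.62 from the video; the point is only the order of operations and the filtering step.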
Can you show the source pipeline code? My brain are overheated
Generalized linear model equation explanation was not very basic, otherwise a great presentation
Biocutieisian