Bioinformatics - Visualizing Counts Data

Alex Soupir

9,305 views

Published on Oct 9, 2020
In this video we continue where we left off after importing the counts data into R! Please subscribe if you find these helpful; it gives me a sense of how many people can use the information provided. Thank you all!
An important step in working with next-generation sequencing data is visualizing the information that you have. Visualizing is probably the easiest way to see whether something is wrong with the data or whether there is some indication that the processing has worked so far. Often, it's easier to take in a plot than a table with a ton of values.
Here, we explore the counts data of all the samples by plotting histograms of the libraries as well as of individual samples. We also plot replicates against each other to check whether replicates of the same treatment are consistent. A great way to turn the many gene values into something easier to see is a principal component analysis (PCA) plot. This is especially useful if you have sequencing runs from different days or different people handling the sample processing: you'll likely see some division of the samples in the PCA plot.
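The three plots described above can be sketched in base R. This is a minimal sketch, not the code from the video: the counts are simulated, and the names `countdata`, `log2counts`, and `groups` are placeholders chosen for illustration.

```r
# Simulated counts stand in for real data: 500 genes x 6 samples (hypothetical names).
set.seed(1)
groups <- rep(c("control", "treated"), each = 3)
base_mean <- rexp(500, rate = 1 / 200)          # per-gene baseline expression
effect <- ifelse(seq_len(500) <= 50, 4, 1)      # first 50 genes respond to treatment
countdata <- sapply(seq_along(groups), function(j) {
  mu <- base_mean * ifelse(groups[j] == "treated", effect, 1)
  rnbinom(500, mu = mu, size = 5)
})
rownames(countdata) <- paste0("gene", seq_len(500))
colnames(countdata) <- paste0(groups, "_rep", rep(1:3, times = 2))

# Log-transform with a pseudocount so that zero counts stay defined.
log2counts <- log2(countdata + 1)

# Histogram of a single sample's library.
hist(log2counts[, 1], breaks = 50,
     main = colnames(log2counts)[1], xlab = "log2(count + 1)")

# Replicate against replicate: consistent replicates hug the diagonal.
plot(log2counts[, 1], log2counts[, 2], pch = 20,
     xlab = colnames(log2counts)[1], ylab = colnames(log2counts)[2])
abline(0, 1, col = "red")

# PCA on samples: prcomp() expects samples in rows, so transpose first.
pca <- prcomp(t(log2counts))
plot(pca$x[, 1], pca$x[, 2], pch = 19,
     col = as.integer(factor(groups)), xlab = "PC1", ylab = "PC2")
```

With real data, `countdata` would be the imported counts matrix, and the point colors could encode sequencing day or handler instead of treatment to look for batch effects.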
Finally, the coolest looking plot: the heat map. There was a lot to type for this one, so I copied it from my notes window, but it's all available:
github.com/ACSoupir/Bioinform...
If you want some reading for next time, you can look ahead to the full project or read from the DESeq2 Vignette:
Project: github.com/ACSoupir/Bioinform...
DESeq2: www.bioconductor.org/packages...
As always, I didn't go to school for bioinformatics. Bioinformatics is something that I find fun and interesting, turning sequencing data into some tangible evidence. I use these videos as experience talking and teaching, as well as learning from others through comments and suggestions. These are meant to be a learning experience for everyone.
The image at the bottom left of the thumbnail is modified from AllGenetics.EU.
Please consider contributing to my Patreon where I may do merch and gather ideas for future content:
/ alexsoupir
Howto & Style

Comments • 33

@xz1685 3 years ago ⁺⁷
Thank you so much for the valuable videos! These could be a series of paid courses, but you let us learn for free. I really appreciate it!
@alexsoupir 1 year ago
You're very welcome! Love helping others learn.
@biogamer93 2 years ago ⁺¹
Amazing work!!! Would love a tutorial with edgeR, featureCounts, and how to prep your data for volcano plots in the case of DEGs for control vs. a condition.
@thesparrowtalks5019 3 years ago ⁺⁴
Thanks a lot, brother. Your videos are highly useful for newbies like me. Keep up the good work.
@alexsoupir 3 years ago ⁺³
Hey, thanks! That was the main driver behind starting to make them: helping people who are new to it and don't know where to begin. Plus it's nice to help people out because we can all learn something along the way.
@nezuko1957 2 years ago ⁺¹
I'm getting a lot of errors in R. I'm not familiar with it, so I could not fix them.
@florenciamascardi4932 3 years ago ⁺¹
This has been of great help. Thank you!!!!!
@liqingjin8929 3 years ago
Thanks Alex, very helpful!
@soumyarao8006 2 years ago
Hi Alex, thanks a lot for the great tutorials! What parameters do we need to change if we want the heatmap to show all the gene names on the graph?
@alvaroruiztabas5627 2 years ago
Very helpful videos. Thank you very much. I will need to mention you in the acknowledgements of my final degree project xd
@marziyehsalehi2290 1 year ago
Thank you so much for this comprehensive tutorial. I ran the code exactly as you did, but at the end, when I try to make the heatmap, it does not work and gives me this error:
Error in match.arg(method) : 'arg' should be one of "pearson", "kendall", "spearman"
How can I fix it?
@marceno4963 3 years ago
Hi Alex! This is very helpful :) Thanks
@wiggiag 3 years ago ⁺¹
Great stuff, and you are helping me with my data. One suggestion is to make the global environment box bigger so we can see more variables and functions.
When I use your code to create the heatmap with my data matrix, only 25 of the 50 gene names appear. How can I either make the boxes in the heatmap bigger and/or make all 50 gene names appear on the right y-axis?
@alexsoupir 3 years ago
In order to do this, you would need to look at the function help for drawing the heatmap. Unfortunately I don't know this off the top of my head, but with a little searching it should be possible.
Alternatively, a 'rough' way to do it is to make the drawing area bigger for your plot, or to save it to a device (like png or pdf) and make the figure really large. I don't have the code for this off the top of my head either, but it should be an easy Google search.
Start by looking at the function help for something to do with font sizes. That will look and handle better than the second option for sure.
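Both suggestions in the reply above can be sketched with base R's `heatmap()`. This is a hedged example under assumptions: the video may use a different heatmap function with different argument names, and `mat` and `out_file` are made-up placeholders. `cexRow` controls the row label size in `heatmap()`, and a tall `pdf()` device gives each row more room.

```r
set.seed(42)
# Hypothetical 50-gene x 6-sample matrix standing in for real log2 counts.
mat <- matrix(rnorm(50 * 6), nrow = 50,
              dimnames = list(paste0("gene", 1:50), paste0("sample", 1:6)))

out_file <- file.path(tempdir(), "heatmap_all_genes.pdf")

# A tall page (in inches) leaves enough vertical space for every row label.
pdf(out_file, width = 6, height = 14)
heatmap(mat,
        cexRow = 0.7,        # shrink the row (gene) label text
        margins = c(6, 8))   # extra room for column and row labels
dev.off()
```

If labels are still skipped, lowering `cexRow` further or increasing `height` are the two knobs to try first.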
@amrsalaheldinabdallahhammo663 1 year ago
How do I get the read length from the count data file? Is it by summary(countdata)?
@sanjaisrao484 2 years ago
Thank you very much
@anonymoustiger 3 years ago
Thanks so much for this. Can you explain the coding for the gsub() part?
I'm familiar with gsub, but the _Rep$ vs _rep$ vs REP$ is new to me. $ tells you you're at the end of the string. What is the _REP vs _Rep vs _rep for?
@alexsoupir 3 years ago
So what gsub is looking for here is the string ending in rep. The `$` is an anchor meaning those have to be the final characters of the string. Basically we are removing everything from the end methodically so all that's left is the treatment, or group, that the sample came from. Does this make sense?
Edit: usually there's some tag at the end of the string with "rep" or a derivative we want to remove. So it could be 'p53knockout_rep1', '..._rep2', etc.
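To illustrate the reply above, here is a small sketch of stripping a replicate tag with `gsub()`. The sample names are invented for the example (only 'p53knockout_rep1' appears in the thread). The `_Rep`, `_rep`, and `_REP` patterns are just case variants of the same tag, so `ignore.case = TRUE` can cover all of them in one call.

```r
# Hypothetical sample names with differently cased replicate tags.
samples <- c("p53knockout_rep1", "p53knockout_Rep2", "control_REP1", "control_rep2")

# One pass: remove "_rep" (any case) plus trailing digits, anchored at the end by $.
groups <- gsub("_rep[0-9]*$", "", samples, ignore.case = TRUE)

# Step by step, working inward from the end of the string.
no_digits <- gsub("[0-9]+$", "", samples)                        # "p53knockout_rep", ...
groups_stepwise <- gsub("_rep$", "", no_digits, ignore.case = TRUE)

print(groups)
```

Because of the `$` anchor, a "rep" that appears in the middle of a sample name is left alone.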
@learningtime1367 2 years ago
Thank you for the great video!! I wanted to ask what happens if we put log2countdata in the countData argument of DESeqDataSetFromMatrix?
@alexsoupir 2 years ago ⁺¹
As far as I know (I might be incorrect), DESeq2 likes raw counts, because DESeq2 does an internal median-of-ratios normalization. If you have previously transformed counts, I hear from the biostatistics department here that voom is valuable and does a great job.
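The idea behind the internal normalization mentioned above can be sketched in a few lines of base R. This is a conceptual illustration of median-of-ratios size factors, not DESeq2's own code; the function name `median_of_ratios` and the toy matrix are assumptions made for the example. The calculation assumes values on the raw count scale, which is one reason log-transformed input is not appropriate.

```r
# Toy raw counts: sampleB was sequenced exactly twice as deeply as sampleA.
counts <- matrix(c(10,  20,
                   100, 200,
                   50,  100,
                   0,   0),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB")))

median_of_ratios <- function(counts) {
  # Log of each gene's geometric mean across samples (-Inf if any count is zero).
  log_geo_means <- rowMeans(log(counts))
  keep <- is.finite(log_geo_means)
  # Per sample: median ratio of its counts to the per-gene geometric means.
  apply(counts[keep, , drop = FALSE], 2, function(col) {
    exp(median(log(col) - log_geo_means[keep]))
  })
}

size_factors <- median_of_ratios(counts)
normalized <- sweep(counts, 2, size_factors, "/")   # divide each column by its factor
print(size_factors)
print(normalized)
```

After dividing by the size factors, the two samples in this toy example have identical values for every gene, which is the intended effect of correcting for sequencing depth.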
@learningtime1367 2 years ago
@@alexsoupir Thanks for the explanation, makes sense. Would you do a video on how to download raw counts from NCBI GEO? I'm confused about whether to take the raw counts given at the bottom of the NCBI GEO page in the "supplementary files" section or to get the sequence reads from SRA and do the aligning etc. (which I don't know how to do).
@alexsoupir 2 years ago ⁺¹
@@learningtime1367 I sure can look into it! I've been incredibly busy and struggling personally, but I agree that this would be valuable. There's so much data out there waiting to be analyzed for discoveries.
In the future I might do a video about a curated data set for cancer that we are working on, which starts with GEO array files. Personally, I think data should be provided raw because, in the case of RNA-seq, aligners are always getting better and you might find something that isn't in a precalculated counts file. Thankfully, creating scripts to perform these steps is easy, and with current computing capabilities they can be run in only hours.
Thank you for the suggestion. I think it would be useful to the community!
@learningtime1367 2 years ago
@@alexsoupir Thank you so much, Alex, for all the hard work you put in for everyone! Truly grateful and hope to see your channel grow👍
@MTMTMT_ 3 years ago
I think that in the colData = data.frame(cbind(...... line,
current R versions default data frames to stringsAsFactors = FALSE,
causing a DESeq2 warning: In DESeqDataSet(se, design = design, ignoreRank) :
some variables in design formula are characters, converting to factors
Setting the data frame to stringsAsFactors = TRUE solves the problem:
colData
@alexsoupir 3 years ago
Good catch! Since it was a warning I didn't pay much attention, but it is smart to set it as a factor from the beginning. Alternatively, p53 and treatment could be converted with `as.factor()` too, and that should also fix the warning.
Thanks for mentioning this with a fix. I appreciate it!
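Both fixes discussed in this thread can be sketched as follows. The column names `p53` and `treatment` come from the comments; the level labels and sample names are invented for illustration, and the exact `colData` construction in the video is not reproduced here. Since R 4.0.0, `data.frame()` defaults to `stringsAsFactors = FALSE`, which is why the warning appears on newer installs.

```r
# Option 1: ask data.frame() to convert character columns to factors.
colData <- data.frame(
  p53       = c("wt", "wt", "null", "null"),     # hypothetical level labels
  treatment = c("mock", "IR", "mock", "IR"),     # hypothetical level labels
  row.names = paste0("sample", 1:4),
  stringsAsFactors = TRUE
)

# Option 2: build with characters, then convert explicitly.
# factor() with levels = ... also lets you choose the reference level.
colData2 <- data.frame(
  p53       = c("wt", "wt", "null", "null"),
  treatment = c("mock", "IR", "mock", "IR"),
  row.names = paste0("sample", 1:4)
)
colData2$p53 <- factor(colData2$p53, levels = c("wt", "null"))
colData2$treatment <- as.factor(colData2$treatment)

str(colData)
str(colData2)
```

Option 2 is slightly more verbose but makes the reference level explicit instead of relying on alphabetical order.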
@abdelrahmanmahany133 5 months ago
Thanks for the great video. I get something different when visualizing the heat map: the y-axis shows numbers instead of gene names as in the video. What could be the problem?
@alexsoupir 5 months ago
Hm.. I would check the row/column names of the matrix that you are using for plotting the data. Some functions, like ComplexHeatmap, take the y- or x-axis names from the matrix that is passed in. This also (usually) removes the need to pass in another vector with the specific values to display, though that can be done in certain functions.
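The check suggested above takes a couple of lines. This sketch uses base R's `heatmap()`, which labels rows by index when the matrix has no row names; the matrix and gene IDs are made up, and other heatmap functions may behave differently.

```r
# A matrix built without dimnames has no row names to use as labels.
mat <- matrix(c(5.1, 7.3, 2.2, 6.8, 1.9, 4.4), nrow = 3)
had_no_names <- is.null(rownames(mat))
print(had_no_names)

# Attaching gene IDs as row names gives the plot something to label with.
rownames(mat) <- c("geneA", "geneB", "geneC")   # hypothetical gene IDs
colnames(mat) <- c("sample1", "sample2")

# Clustering and scaling are switched off to keep the example minimal.
heatmap(mat, Rowv = NA, Colv = NA, scale = "none")
```

A common way to lose row names is converting or subsetting the counts object in a chunk that was skipped or run out of order, so re-running everything from the top is a reasonable first step.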
@abdelrahmanmahany133 5 months ago
@@alexsoupir I think the problem was in running the chunks that define the variables. When I opened the project today and ran all the chunks again, the heatmap was drawn as expected.
@alexsoupir 5 months ago ⁺¹
@@abdelrahmanmahany133 I usually refer to that as "magic". No idea how it ended up working, but it does now so can't complain.
@abdelrahmanmahany133 5 months ago
@@alexsoupir Exactly 😅
@ZahidHussain-xb8it 2 years ago
How do I deal with non-integer data with decimals? The matrix function doesn't work on non-integer count data. Please, I need your help.
@alexsoupir 2 years ago
Hey, Zahid.
For non-integer data, I would recommend looking at limma voom or edgeR. The reason DESeq2 requires integer data is that, internally, it makes some assumptions and normalizes its own way. A few people I've talked to really like the voom method because it sounds like it normalizes better to a Gaussian distribution and makes statistical tests more powerful (if I remember what they said correctly?). Hope this helps!
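Before switching tools, it can help to see how many values are actually non-integer. The sketch below uses a made-up matrix. Rounding is only reasonable, as an assumption to check for your own data, when the decimals come from estimated counts (for example from a transcript quantifier); values that are already normalized, such as FPKM or TPM, should not be rounded and passed off as counts.

```r
# Hypothetical matrix with two non-integer values.
countdata <- matrix(c(10, 3.7, 0, 250.2, 18, 42), nrow = 3,
                    dimnames = list(paste0("gene", 1:3), c("sample1", "sample2")))

# Flag entries that are not whole numbers.
non_integer <- countdata != round(countdata)
n_non_integer <- sum(non_integer)
print(n_non_integer)

# Only if these are estimated counts: round and store as integers.
rounded <- round(countdata)
storage.mode(rounded) <- "integer"   # keeps dimensions and dimnames
print(rounded)
```

If most of the matrix is non-integer, that is a hint the file holds normalized values rather than counts, and the voom route suggested above is the safer choice.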
@ZahidHussain-xb8it 2 years ago
@@alexsoupir Thank you so much for your valuable suggestion. I will try edgeR or voom for my data.
