Normalization methods for single-cell RNA-Seq data (high-level overview)

Florian Wagner

11,541 views

Published on May 18, 2021
In this video, I provide a high-level overview of different scRNA-Seq normalization methods. In particular, I discuss the differences between log transforms, square root transforms, and Pearson residuals. My Twitter: / flo_compbio
DOI of this video (for citations): doi.org/10.5281/zenodo.4772518
While discussing the scaling step, I forgot to mention that scaling should be done to the median transcript count of all cells in the dataset (approx. 9,000 in the example), not to an arbitrary number like 1 or 1,000,000. Otherwise, this can really throw off the following transformation step and lead to completely useless analysis results.
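As a rough illustration of the scaling note above, here is a minimal Python sketch that scales every cell to the median total transcript count of the dataset. The function name `scale_to_median`, the toy matrix, and the cells-as-rows layout are assumptions made for illustration; none of them come from the video.

```python
import numpy as np

def scale_to_median(counts):
    """Scale each cell (row) so that its total equals the median cell total."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)          # transcripts per cell
    target = np.median(totals)           # median transcript count across cells
    # Cells with zero counts get a factor of 0 instead of a division error.
    factors = np.divide(target, totals,
                        out=np.zeros_like(totals), where=totals > 0)
    return counts * factors[:, np.newaxis]

# Hypothetical toy data: 3 cells (rows) x 2 genes (columns).
counts = np.array([[10, 30],
                   [20, 60],
                   [40, 120]])
scaled = scale_to_median(counts)
print(scaled.sum(axis=1))  # every cell now has the same total
```

Because the target is the median of the observed totals, the scaled values stay on the same order of magnitude as the raw counts, which is the point made in the note.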
Further reading
-------------------------
1. "Validation of noise models for single-cell transcriptomics" (Grün et al., 2015) doi.org/10.1038/nmeth.2930
2. "Comprehensive Integration of Single-Cell Data" (Stuart et al., 2019) doi.org/10.1016/j.cell.2019.0...
3. "K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data" (Wagner et al., 2018) doi.org/10.1101/217737
4. "Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression" (Hafemeister and Satija, 2019) doi.org/10.1186/s13059-019-18...
5. "Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data" (Lause et al., 2021) doi.org/10.1101/2020.12.01.40...
Data sources
-------------------------
1. Technical noise experiment: "Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells" (Klein et al., 2015) doi.org/10.1016/j.cell.2015.0...
2. PBMC data: "10k PBMCs from a Healthy Donor (v3 chemistry)" (10x Genomics) support.10xgenomics.com/singl...
Science & Technology

Comments • 45

@shahidwani7586 2 years ago ⁺²
Hey Florian, this is the best video explaining single-cell normalization. 👍🏽👍🏽👍🏽
@abasu000 3 years ago ⁺⁵
Clear and accessible explanation, thanks for the tutorial.
@derricmorgan2282 3 years ago ⁺²
Thanks a lot, really useful even after having read several papers and articles concerning the matter.
@muratseker6406 2 years ago ⁺¹
Thank you for the video, it is clearly explained! Looking forward to seeing more videos on scRNA :)
@andreacurtabbi7319 a month ago
Great content
@nikakhoshnevis6574 11 months ago
Thank you so, so much. Very informative, and it cleared up things that I was confused about.
@marcelochocki6281 2 years ago ⁺¹
Thank you so much for that video. Keep going :)
@asunnyday3749 4 months ago
Well done
@offswitcher3159 2 years ago
great video!!
@sailingintosunshine 1 year ago
really helpful, thanks!
@tommasogiacomello7870 1 year ago ⁺¹
Hi! Really clear explanation, thanks a lot, it was very useful. I have a question: how do I choose the scaling factor?
@user-ck3ki9hq9t 9 months ago
Wow, you're one of my new YT favorites work wise. Just FYI, they are adding ads with shamelessness!
@user-ck3ki9hq9t 9 months ago
Just a question about scaling. Shouldn't the amount of RNA be used in cell type characterization, or in quality control? Seems weird to scale it all away.
@sfmambero 1 year ago ⁺¹
Thank you for the clear explanation!
The toy examples really helped in understanding the effects of the different types of normalization.
What did you mean by “clipping” though when you talked about Pearson residuals?
@florianwagner1255 1 year ago ⁺¹
I was referring to a situation where the evidence of non-uniform expression for a gene is so strong that the Pearson residuals become very large. This happens, for example, if there is a rare cell type that has very high and specific expression of certain genes (e.g., hemoglobin genes in a few red blood cells that are contaminants in PBMC samples). "Clipping at X" means setting all values larger than a certain number X to X. The motivation for clipping is that there isn't any benefit to letting Pearson residuals grow arbitrarily large, and doing so may result in strange outliers in certain analyses. I don't think clipping is always necessary, but it has been described in the literature, so I mentioned it here.
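For readers who want to see "clipping at X" concretely, here is a minimal Python sketch of the definition given in the reply above. The residual values and the threshold `clip_at = 10.0` are made up for illustration; the reply does not recommend a specific threshold.

```python
import numpy as np

# Hypothetical Pearson residuals: 2 cells (rows) x 3 genes (columns).
residuals = np.array([[-1.2, 0.5, 250.0],
                      [0.3, 12.0, -0.8]])

clip_at = 10.0  # assumed threshold X, chosen only for this example

# "Clipping at X": every value larger than X is set to X;
# smaller values (including negative ones) are left unchanged.
clipped = np.minimum(residuals, clip_at)
print(clipped)
```

A symmetric variant that also bounds large negative values could use `np.clip(residuals, -clip_at, clip_at)` instead.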
@sfmambero 1 year ago
@@florianwagner1255 Understood. Thank you again!
@jordanwilson8277 2 years ago ⁺¹
Awesome!
@mohamedrefaat197 2 years ago
Thanks a ton!
@timazebardast1096 2 years ago
Great, Thanks.
@Mirabell97 2 years ago ⁺²
Hey! Thanks for the great explanation, it helped a ton! Did I understand correctly that for Pearson-residual-based normalization, no scaling is done prior to the multiplication with the weight?
@florianwagner1255 2 years ago ⁺¹
No, in the way that I've explained it, the same scaling applies. I'm always using this method to get rid of "efficiency noise", which would otherwise throw off these very simple approaches to normalization.
@Mirabell97 2 years ago
@@florianwagner1255 thanks a lot!
@Yglandir 3 years ago ⁺⁴
Hi Florian,
Thanks for the great explanation.
I have a question though: in your last example concerning Pearson residuals, how do you get to these numbers? If I try to follow the formula on the previous slide, I get different results. Did you simplify that formula and instead use the formula stated in Hafemeister and Satija (2019) or Lause et al. (2021) for the calculations? Did you do something else, or am I just confused?
@florianwagner1255 3 years ago ⁺²
Thank you! I could be confused, you could be confused, or we could both be confused :) Can you tell me why you think my math is off? For gene 1 I calculated a mean of 4, so you divide all the measurements by sqrt(4)=2. 8/2=4. For gene 2 I calculated a mean of 0.09, so you divide all the measurements by sqrt(0.09)=0.3. 4.5/0.3=15. Does that make sense?
@Yglandir 3 years ago
@@florianwagner1255 Thanks for your quick response! My confusion originates in the question how do you calculate the mean expression for each gene? For me the mean of gene 1 is (0+8+8)/3 = 5.333 and gene 2 (0+0+4.5)/3 = 1.5.
Therefore "my" pearson residuals are 8/sqrt(5.333) = 3.46 (gene1) and 4.5/sqrt(1.5) = 3.67 (gene 2).
@Yglandir 3 years ago ⁺¹
I think I finally found my mistake! I did not take the percentages into account. If I do, then my mean for gene 1 is 0*0.5 + 8*0.48 + 8*0.02 = 4. Following the same logic, the mean for gene 2 is 4.5*0.02 = 0.09.
Thanks for helping finding my mistake! =)
@florianwagner1255 3 years ago ⁺¹
@@Yglandir oh I think you are ignoring the cell type proportions specified in the example... Gene 1 has an expression of 8 in exactly 50% of the cells and 0 in the other 50%, so the mean is 4. Similarly, Gene 2 is only expressed in 2% of the cells. I hope that makes sense.
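The arithmetic in this thread can be checked with a short Python sketch. It assumes three cell populations with proportions 50%, 48%, and 2% (a split inferred from the numbers quoted in the thread, not stated explicitly) and uses the simplified calculation from the reply above, dividing each value by the square root of the gene's mean. Variable names are chosen here for illustration.

```python
import numpy as np

# Assumed cell type proportions (inferred from the thread: 50% / 48% / 2%).
proportions = np.array([0.50, 0.48, 0.02])

# Expression of each gene in the three cell types, as quoted in the thread.
gene1 = np.array([0.0, 8.0, 8.0])
gene2 = np.array([0.0, 0.0, 4.5])

# The mean over all cells is the proportion-weighted mean over cell types.
mean1 = float(np.sum(gene1 * proportions))  # approx. 4
mean2 = float(np.sum(gene2 * proportions))  # approx. 0.09

# Simplified calculation from the reply: divide by sqrt(mean).
res1 = gene1 / np.sqrt(mean1)  # expressing cells: 8 / 2
res2 = gene2 / np.sqrt(mean2)  # expressing cells: 4.5 / 0.3

print(mean1, mean2)
print(res1.max(), res2.max())
```

The unweighted means (5.333 and 1.5) treat the three cell types as equally common, which is why they give different results.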
@user-ck3ki9hq9t 9 months ago
Hi Yglandir! Thanks for open honest questioning - scientists need to do this more. Might I ask where you're from?
@bio_mark 3 years ago ⁺²
Hi Florian, thank you for your clear explanation. I am not an expert on RNA-Seq analysis and I am trying to learn on my own. I just have one (maybe big?) question: how would you conduct differential expression analysis after scaling and transforming your data as you explained? I know DESeq2 in R cannot be used with previously normalized data. Which R pipeline would you use after this? Thank you
@bio_mark 3 years ago
Or is it possible to continue with DESeq2 after these steps? Thank you
@florianwagner1255 3 years ago ⁺¹
Hi Marcos, I think most of the things I talk about in this video are not directly relevant to differential expression (DE) analysis. I think in many cases you probably still want to do a scaling step, but I don't think the transformations are very useful in the context of DE analysis. I wouldn't claim to be an expert on DE analysis of scRNA-Seq data, but I think this website might be interesting for you: biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/dechapter.html
@bio_mark 3 years ago
@@florianwagner1255 thank you for your reply and for the material!
@SuperMixedd 4 months ago
@@bio_mark DESeq2 always works on counts, so you'd be better off with raw counts if you work with 10x data
@wasima4463 1 year ago
The example data structures are transposed relative to the theoretical data structure (1:38), which creates confusion.
@davidvanbergen2283 1 year ago
Thanks for the great explanation! One question: why consider the delta (10:52) and not the fold change? (In my understanding, the fold change is more biologically relevant.)
@florianwagner1255 1 year ago
Thank you! I am discussing fold changes while I'm talking about the examples on the slide.
@muratseker6406 2 years ago
When we look at the raw data, how can we get an idea of what the raw data look like across all cells, so that we can make a determination like in your example?
@pariaalipour61 1 year ago
Thank you so much for this helpful video. I have two questions, if you don't mind. First, does the order of normalization and scaling matter? You mentioned scaling first; however, in the Satija vignette normalization is done first. What is the difference? Second, I realized that normalization is separate from scaling. In that case, is normalization the same as transformation?
@florianwagner1255 1 year ago ⁺¹
Thank you! The goal of the scaling step is to get rid of efficiency noise and convert from absolute expression levels to concentrations. This needs to be done first, because the transformation step is non-linear, so scaling after transformation doesn't have the same effect. Yes, "normalization" is sometimes used to mean transformation, but I've defined the term here to include both scaling and transformation, which I thought is more common.
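The point about ordering can be illustrated with a small Python sketch. It builds two hypothetical cells with identical composition but a fourfold difference in captured transcripts, then compares scaling before a log transform with scaling after it. The data, the use of `log1p` as the transform, and the cells-as-rows layout are assumptions made for this example.

```python
import numpy as np

# Two hypothetical cells (rows) with the same gene proportions;
# cell 2 was captured four times more efficiently than cell 1.
counts = np.array([[10.0, 30.0],
                   [40.0, 120.0]])

totals = counts.sum(axis=1, keepdims=True)
target = np.median(totals)

# Order 1: scale first, then apply the non-linear transform.
scale_then_log = np.log1p(counts / totals * target)

# Order 2: transform first, then scale the transformed values.
logged = np.log1p(counts)
log_totals = logged.sum(axis=1, keepdims=True)
log_then_scale = logged / log_totals * np.median(log_totals)

print(scale_then_log)  # both rows are identical
print(log_then_scale)  # rows differ: the efficiency difference leaks through
```

Only the first ordering removes the difference between the two cells, because the log transform does not commute with multiplication by a per-cell factor.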
@pariaalipour61 1 year ago
@@florianwagner1255 Thanks a lot for your explanation. Sorry, I'm trying to compare with the Seurat vignette. I think the scaling + transformation you mentioned here is done by NormalizeData in Seurat. Please correct me if I'm wrong. But what about ScaleData in Seurat? Did you mention it, or is it something else?
@florianwagner1255 1 year ago ⁺¹
@@pariaalipour61 Yes I think you're right, NormalizeData does both scaling and transformation. ScaleData does something completely different, it subtracts the mean of each gene and divides by its standard deviation, which is usually called (feature) standardization or z-score normalization: satijalab.org/seurat/reference/scaledata
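To make the distinction concrete, here is a minimal Python sketch of per-gene standardization (z-scores) as described in the reply above. It is not Seurat's code; the function name `standardize_genes`, the cells-as-rows layout, the use of the sample standard deviation (`ddof=1`), and the handling of constant genes are all assumptions made for illustration.

```python
import numpy as np

def standardize_genes(x):
    """Subtract each gene's mean and divide by its standard deviation."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)           # per-gene mean (genes are columns)
    std = x.std(axis=0, ddof=1)     # per-gene sample standard deviation
    std[std == 0] = 1.0             # constant genes: avoid division by zero
    return (x - mean) / std

# Hypothetical transformed expression values: 3 cells x 2 genes.
expr = np.array([[1.0, 2.0],
                 [3.0, 2.0],
                 [5.0, 2.0]])
z = standardize_genes(expr)
print(z)
```

After this step every non-constant gene has mean 0 and standard deviation 1, so it operates per gene, whereas the scaling discussed in the video operates per cell.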
@pariaalipour61 1 year ago
@@florianwagner1255 you didn't mention that. Do you think it's not necessary for downstream analysis?
@pancake9191 2 years ago
For your example at 10:15, if you assume this matrix has already been thru scaling, why are the total number of reads in two cells still so different?
@SunilDhasmana 2 months ago
1:56 @florianwagner1255 Could you please explain how to generate this plot with 10X scRNA-seq data in R?