Single-cell background decontamination in R and Python with SoupX

  • Published on 2 Jun 2024
  • SoupX is an essential tool for ambient RNA decontamination in single-cell RNA sequencing data. Ambient RNA in solution is partitioned into droplets and confounds downstream analyses. This concise tutorial covers SoupX implementation for both R (Seurat objects) and Python (Scanpy objects), offering step-by-step guidance and expert insights to improve data quality and accuracy in your single-cell analyses.
    Other tools exist, such as DecontX, CellBender, and scAR (available through scvi-tools). They all address the same problem but perform differently depending on the situation.
    GitHub:
    github.com/mousepixels/sanbom...
    0:00 - Introduction
    1:05 - R and Seurat
    5:50 - Python and Scanpy
    10:54 - Results
  • Science & Technology

Comments • 43

  • @avp300
    @avp300 1 year ago +1

    Thanks, great video!

  • @hyeokome
    @hyeokome 1 year ago +1

    Another great tutorial!! :) Do you think this could be (or should be) part of the preprocessing pipeline for bulk RNA-seq data or pseudo-bulk data like Visium spatial data?

    • @sanbomics
      @sanbomics 1 year ago +2

      Thanks! Good question. Ambient RNA is very specific to droplet-based sequencing, so there is no need to remove it in bulk or Visium data.

  • @freshwaterman
    @freshwaterman 8 months ago

    Thanks for the great tutorial. Would you please cover "SCVI-scar" for a Python-only workflow? Many thanks in advance.

  • @user-bg3oj3cq3y
    @user-bg3oj3cq3y 3 months ago

    Thanks a lot for this video, it is really great! I have a problem with rpy2 version compatibility with anndata2ri. Could you share which versions you used for this? Thanks in advance!

  • @konstantinleskov
    @konstantinleskov 11 months ago

    Thanks for the video. What method would you recommend to denoise BD Rhapsody-generated scRNAseq and ABseq data? Similarly to 10x, it has counts for empty wells. How would you identify which wells are truly empty, and what is the best way to integrate this data with the existing 10x-oriented denoising packages?

    • @sanbomics
      @sanbomics 11 months ago +1

      I'm not too familiar with that workflow, but if you know there are 1) identified empty wells with background reads and 2) wells with doublets I don't think the underlying theory will be much different. Does the BD workflow not identify wells that are empty? You can try employing a similar strategy to the 10x elbow if you know for certain there are some.

  • @paulinakaczorowska3402
    @paulinakaczorowska3402 5 months ago

    Hello! Thank you for your tutorial. I am still very new to scSeq. I am encountering some issues with setClusters(sc, sobj$soup_group). Beforehand I ran FindClusters, and after checking data_list[1]$my sample[[]] I can clearly see that I have values in the soup_group column. Yet I get the error "NAs found in cluster names. Ensure a (non-na) mapping to cluster is provided for each cell." Any ideas how I could fix that? Thank you!

  • @yjk-kjy
    @yjk-kjy 10 months ago

    thanks so much for this great video! quick question about your make_soup function (in your python notebook) - why are you specifying the soup profile instead of using the default sc = SoupChannel(data, raw) function? And why are you using filtered_feature_bc_matrix to create the soup profile (soupProf = data.frame(row.names = rownames(data), est = rowSums(data)/sum(data), counts = rowSums(data)))? Shouldn't we use the empty droplets from the raw matrix to estimate the soup profile?

    • @sanbomics
      @sanbomics 10 months ago

      Great question. It actually has to do with passing the data from Python to R. If you look at my R code, I use the defaults. But when you tunnel it to R you need to pass those extra arguments and a manual soup profile. It's still calculating the background from the empty droplets. Here is a reference that uses the same technique: www.sc-best-practices.org/preprocessing_visualization/quality_control.html

    • @yjk-kjy
      @yjk-kjy 10 months ago

      @@sanbomics thanks so much for the response! Yes, I saw the R code and the Theis lab single-cell documentation (also awesome), which is also why I was a bit confused. It is similar to a portion of the SoupX package documentation/demonstration where the author writes "Usually the only reason to not have estimateSoup run automatically is if you want to change the default parameters or have some other way of calculating the soup profile. One case where you may want to do the latter is if you only have the table of counts available and not the empty droplets. In this case you can proceed by running
      toc = sc$toc
      scNoDrops = SoupChannel(toc, toc, calcSoupProfile = FALSE)
      # Calculate soup profile
      soupProf = data.frame(row.names = rownames(toc), est = rowSums(toc)/sum(toc), counts = rowSums(toc))
      scNoDrops = setSoupProfile(scNoDrops, soupProf)
      " (cran.r-project.org/web/packages/SoupX/vignettes/pbmcTutorial.html). The author's explanation tracks to me because you're not using the unfiltered data in any part of the code calculating the soup profile. Sorry this question is so long, I'm still new to this! And thank you again for all of your amazing tutorials!
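The soup-profile formula quoted above is each gene's share of all counts in whatever matrix it is given. A minimal NumPy sketch of that arithmetic follows; the function name `soup_profile` and the toy matrix are made up for illustration and are not part of SoupX.

```python
import numpy as np

def soup_profile(counts):
    """Mirror of est = rowSums(toc) / sum(toc) for a genes x droplets matrix."""
    counts = np.asarray(counts, dtype=float)
    gene_totals = counts.sum(axis=1)        # rowSums(toc)
    est = gene_totals / gene_totals.sum()   # each gene's fraction of all counts
    return est, gene_totals

# Toy 3-gene x 4-droplet matrix (hypothetical values).
toy = np.array([[4, 0, 2, 2],
                [1, 1, 0, 0],
                [0, 5, 3, 2]])
est, gene_totals = soup_profile(toy)
print(est)          # per-gene fractions, summing to 1
print(gene_totals)  # per-gene total counts
```

Which droplets go into the matrix decides what the profile represents: computed on the cell-containing droplets only, it is an average expression profile of the cells rather than a measurement taken from empty droplets.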

  • @carolinepass9112
    @carolinepass9112 1 year ago

    Hi, great video! How would I export the filtered soupX files for each of the sample_ids for downstream analysis on each independently?
    Thanks!

    • @sanbomics
      @sanbomics 1 year ago

      Did I have them as a list of adata objects named adata_list? You can do the analysis directly on the objects within the list. Each item in the vector/list is a seurat/scanpy object. You can access them individually from the indexed list or merge them together. I might be able to help more if you told me exactly what you are trying to accomplish. Hope this helps!

    • @carolinepass9112
      @carolinepass9112 1 year ago

      @@sanbomics Thank you for the fast response! I am new to working in Python, so I am not extremely comfortable navigating around it quite yet... I am working with six datasets that I ran the SoupX code on (in Python), and I am now trying to generate individual files so I can perform doublet removal and filtering on each sample individually. Essentially I am trying to generate individual h5 data files of the samples, which are the "new" datasets with SoupX filtering. Hope this makes sense!
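One way to get a file per sample, sketched under assumptions: the corrected objects are AnnData objects keyed by sample ID, and the `.h5ad` format is acceptable in place of plain h5. `export_samples` and the `_soupx` file suffix are hypothetical names; `write_h5ad` is AnnData's own method.

```python
from pathlib import Path

def export_samples(adata_by_sample, out_dir):
    """Write each object to <out_dir>/<sample_id>_soupx.h5ad and return the paths.

    Assumes every value exposes AnnData's `write_h5ad(path)` method.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for sample_id, adata in adata_by_sample.items():
        path = out_dir / f"{sample_id}_soupx.h5ad"
        adata.write_h5ad(path)  # one file per sample
        paths[sample_id] = path
    return paths
```

With a list of objects and a matching list of IDs, a call might look like `export_samples(dict(zip(sample_ids, adata_list)), "soupx_out")`, where `sample_ids` and the output folder name are placeholders.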

  • @Fereshteh_Fallah_2411
    @Fereshteh_Fallah_2411 2 months ago

    Thanks a lot for this video. However, I've encountered a challenge in my own research that I hope you or the community might be able to shed some light on. I'm currently working with the GSE161529 dataset, which unfortunately only provides one matrix per sample, leaving me without access to the raw count data. How can I proceed with them?

  • @efstratioskirtsios298
    @efstratioskirtsios298 11 months ago

    Life saving. It would have been amazing if we had the code for the plotting ideas in R as well, to effectively explore the impact of ambient RNA correction. 🎉

    • @sanbomics
      @sanbomics 11 months ago

      Thank you! Sorry! It shouldn't be too difficult as a lot of scanpy/seurat are intuitively similar if you know the underlying structure of the object

  • @pankajagarwal7804
    @pankajagarwal7804 9 months ago

    Thanks for the great tutorial. I have output from Drop-seq (not 10x), which only produces one output matrix. Can I use SoupX with Drop-seq output? Thanks.

    • @sanbomics
      @sanbomics 9 months ago +1

      Does the matrix even contain empty droplets? It should be much bigger than your actual estimated number of cells. If so, then you can. But if not, you can't. From what I remember, the Drop-seq pipeline drops cells below a certain count or number of genes, so there is a good chance you can't.
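A rough way to run the check described above is to compare the number of barcodes in the matrix with the expected number of cells and look at how many counts the extra barcodes carry. The sketch below assumes a genes x barcodes count matrix; `barcode_summary` and the toy numbers are hypothetical, and no particular cutoff is implied.

```python
import numpy as np

def barcode_summary(counts, expected_cells):
    """Summarise whether a genes x barcodes matrix plausibly includes empty droplets."""
    totals = np.asarray(counts).sum(axis=0)   # total counts per barcode
    ranked = np.sort(totals)[::-1]            # highest-count barcodes first
    beyond = ranked[expected_cells:]          # barcodes past the expected cell number
    return {
        "n_barcodes": int(totals.size),
        "ratio_to_expected": totals.size / expected_cells,
        "median_counts_beyond_expected": float(np.median(beyond)) if beyond.size else None,
    }

# Toy data: 3 cell-like barcodes and 9 near-empty ones (hypothetical values).
cells = np.full((2, 3), 500)
empties = np.ones((2, 9), dtype=int)
summary = barcode_summary(np.hstack([cells, empties]), expected_cells=3)
print(summary)
```

Many more barcodes than expected cells, with very low counts among the extras, suggests the empty droplets were kept. A barcode count close to the expected cell number suggests they were already filtered out.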

  • @gjones33
    @gjones33 5 months ago

    Amazing video!
    Is there a problem if we use SCTransform to normalize instead when you make your soup groups?
    Also, I am using Seurat v5, and when I try to map the outs back onto my original Seurat objects and set them to default, it keeps giving me errors regarding the use of "counts". Changing it to layer = counts doesn't fix it. Any idea why this keeps happening?

    • @sanbomics
      @sanbomics 4 months ago

      Nope, that is fine. Soup groups is just a way of putting similar cells together. How you get there can vary. Which way is the best is open for discussion.
      No idea, I would have to see the code and data. Were you able to figure it out? Sorry for slow response.

  • @ruruyang914
    @ruruyang914 1 month ago

    Thanks for the great video. I am curious how I should use SoupX on an already analysed Seurat object. It feels like too much work to go back and start from the very beginning. Thanks

    • @sanbomics
      @sanbomics 20 days ago

      Unfortunately, SoupX has to be run at the beginning. You could keep the work you have done so far and add the corrected counts in a new layer/assay for any downstream analysis you still need, but that isn't ideal.

  • @marwanmohamed6575
    @marwanmohamed6575 6 months ago

    Hey Sam, is ambient RNA specific to 10x (drop-seq), or does it also exist in other single-cell sequencing methods?

    • @sanbomics
      @sanbomics 5 months ago

      Nope! Any single-cell method where single-cell droplets are made from a cell solution! Any instance where there could be RNA floating in solution when the cell is lysed and processed. Some do not have this issue though.

  • @jojomagicjourney
    @jojomagicjourney 11 months ago

    Hi, this might be a stupid question. What value did you give for nmads in the outlier function? Is it a default value in SoupX? Thanks

    • @sanbomics
      @sanbomics 11 months ago +1

      MADs was just how I did quality control. It is independent of soupx itself. I think I used 5 and 3
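For readers unfamiliar with MAD-based QC, here is a minimal sketch of an outlier test of that kind. The function name `is_outlier`, the toy metric, and the threshold used are illustrative assumptions, not the notebook's actual code.

```python
import numpy as np

def is_outlier(values, nmads):
    """Flag values more than `nmads` median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    return np.abs(values - median) > nmads * mad

# Hypothetical per-cell QC metric, e.g. log-transformed total counts.
log_counts = np.array([8.1, 8.3, 7.9, 8.0, 8.2, 3.0, 8.4])
flags = is_outlier(log_counts, nmads=5)
print(flags)  # only the 3.0 entry lies far enough from the median to be flagged
```

A smaller `nmads` flags more cells. None of this touches SoupX itself; it only decides which cells are kept before or after decontamination.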

    • @jojomagicjourney
      @jojomagicjourney 11 months ago

      @@sanbomics Hi, I am confused by the {sample id}. What if I do not have a sample ID, just one sample tissue?

  • @katarinavalentincic9621
    @katarinavalentincic9621 1 year ago

    hi! is it possible to remove certain genes by their names from adata since they were previously determined as ambient genes?

    • @katarinavalentincic9621
      @katarinavalentincic9621 1 year ago

      and in python if possible haha :)

    • @sanbomics
      @sanbomics 1 year ago +2

      yup! adata[:, ~adata.var.index.isin(genes_to_remove)]

    • @katarinavalentincic9621
      @katarinavalentincic9621 1 year ago

      @@sanbomics Thanks! :) Also, your tutorials are great. Another question, if you don't mind: can you subset a data object to the cells that are present in another data object, because there they were already filtered by QC?
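Both requests in this thread come down to boolean masks over names. The sketch below shows the idea on plain NumPy arrays standing in for `adata.X`, `adata.obs_names` and `adata.var_names`; all values are made up. On real AnnData objects the cell subset would presumably be `adata[adata.obs_names.isin(other.obs_names)]`, with `other` a hypothetical name for the QC-filtered object, following the same pattern as the gene filter in the reply above.

```python
import numpy as np

# Toy stand-ins for adata.X, adata.obs_names and adata.var_names (hypothetical values).
X = np.arange(12).reshape(4, 3)            # cells x genes
cell_names = np.array(["c1", "c2", "c3", "c4"])
gene_names = np.array(["HBB", "ACTB", "CD3E"])

genes_to_remove = ["HBB"]                  # genes previously flagged as ambient
qc_passed_cells = ["c2", "c4", "c9"]       # barcodes kept in the other object

keep_genes = ~np.isin(gene_names, genes_to_remove)   # drop the listed genes
keep_cells = np.isin(cell_names, qc_passed_cells)    # keep only shared barcodes

X_sub = X[keep_cells][:, keep_genes]
print(cell_names[keep_cells], gene_names[keep_genes], X_sub.shape)
```

Barcodes that exist only in the other object (like `c9` here) are simply ignored by the mask.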

  • @Brickkzz
    @Brickkzz 1 year ago

    Can batch effect normalisation be done in python? I'm trying to integrate data from multiple experiments

    • @sanbomics
      @sanbomics 1 year ago

      Yup, there are several ways. SCVI, scanorama, etc. I think the first two are better than any R alternative as well

    • @sanbomics
      @sanbomics 1 year ago

      Check out my integration comparison video

  • @simonerossi4714
    @simonerossi4714 1 year ago

    Hi! I am having some issues reading the matrix files in RStudio. It looks like the raw matrix is much bigger than the filtered one, so the creation of the soup gives me an error saying the number of rows differs between the two matrices. I tried to convert the matrix to a data frame, but it seems to be too big and the memory runs out. Did anyone have the same problem? Are the two matrices supposed to be equal? Thank you!!

    • @sanbomics
      @sanbomics 1 year ago

      Hmmm.. they SHOULD be different because the raw one contains all droplets, not just the filtered ones. Not sure why you are getting that error. Did you have any luck yet?

    • @simonerossi4714
      @simonerossi4714 1 year ago

      @@sanbomics Unfortunately no, apparently there is an error during the creation of the soup object because the matrices have different numbers of rows, I checked both of them and the number of rows is equal so I guess it is a problem that occurs during the reading of the files

    • @giorgiacaspani8273
      @giorgiacaspani8273 11 months ago +1

      Yeah, I was having the same problem. SoupChannel() requires the table of droplets (tod = raw_feature_bc_matrix) and the table of counts, i.e. just those columns of tod that contain cells (toc = filtered_feature_bc_matrix) to have the same number of genes. In his pre-processing workflow, @Sanbomics only removed outlier cells, while I suspect you (like me) also removed some outlier genes in the process. If this is the case, the function will not work.

    • @jianwu4593
      @jianwu4593 11 months ago +1

      you need to make sure not to filter out any genes during preprocessing, typically by setting min.cells=0 when creating a seurat object.
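A quick diagnostic for the mismatch discussed in this thread is to compare the gene names of the two matrices before building the SoupChannel. The sketch below does that on plain name lists; `check_same_genes` and the toy gene names are hypothetical.

```python
import numpy as np

def check_same_genes(raw_genes, filtered_genes):
    """Return the genes that appear in only one of the two matrices."""
    raw_genes = np.asarray(raw_genes)
    filtered_genes = np.asarray(filtered_genes)
    only_in_raw = np.setdiff1d(raw_genes, filtered_genes)
    only_in_filtered = np.setdiff1d(filtered_genes, raw_genes)
    return only_in_raw, only_in_filtered

# Toy example: one gene was dropped from the filtered matrix during preprocessing.
raw = ["G1", "G2", "G3", "G4"]
filtered = ["G1", "G2", "G4"]
only_in_raw, only_in_filtered = check_same_genes(raw, filtered)
if len(only_in_raw) or len(only_in_filtered):
    print("Gene sets differ:", only_in_raw.tolist(), only_in_filtered.tolist())
```

If either list is non-empty, some gene filtering happened between the raw and filtered matrices, which matches the cause suggested above.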

  • @lly6115
    @lly6115 1 year ago

    Thank you! now someone please make this python oriented.

    • @sanbomics
      @sanbomics 1 year ago +1

      Yeah rpy2 can be bulky. You can check out SCVI+scAR for a python only alternative