Complete single-cell RNAseq analysis walkthrough | Advanced introduction

  • Published on Jun 2, 2024
  • This is a comprehensive introduction to single-cell analysis in Python. I recreate the main single-cell analyses from a recent Nature publication. I explain the basics of single-cell sequencing analysis and also introduce more advanced topics. I cover doublet removal, preprocessing, integration, clustering, cell identification, differential expression, gene-set enrichment, non-parametric statistical testing, single-cell gene signature scoring, plotting, and more. This tutorial is suitable for both advanced and new single-cell users. I use the scanpy and scvi packages heavily.
    Notebook:
    github.com/mousepixels/sanbom...
    Reference:
    www.nature.com/articles/s4158...
    0:00 intro
    1:18 data
    6:35 doublet removal
    13:03 preprocessing
    23:12 clustering
    27:42 integration
    39:56 label cell types
    58:28 analysis
  • Science & Technology

Comments • 308

  • @remia5
    @remia5 ปีที่แล้ว +7

    I'm glad you followed through with your promise to make this long video. It's already received thousands of views in just a few months. Cool! Please keep them coming!

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      I've been in a bit of a hiatus, but more will be coming after the start of the new year!

  • @hyang333
    @hyang333 ปีที่แล้ว +11

    This is the best tutorial of scRNA-seq analysis with python I've ever seen!

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      Thank you! :)

    • @vladi1475S
      @vladi1475S ปีที่แล้ว

      I agree 100%!! The best scRNA-seq analysis tutorial in Python!!!! THANK YOU!!

  • @2kvirag
    @2kvirag ปีที่แล้ว +3

    Such a useful resource, especially because tutorials covering Scanpy are fewer than those for Seurat. Great job and thanks a ton.

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      Thanks for the kind words! Seurat is great too, but I personally like scanpy much more.

    • @2kvirag
      @2kvirag ปีที่แล้ว +1

      @@sanbomics Absolutely. I think it is a question of Python versus R and not Scanpy versus Seurat :)

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      haha yes you are right

  • @mocabeentrill
    @mocabeentrill ปีที่แล้ว +13

    Shout out to you bro! After years of wet lab practice, I'm transitioning to bioinformatics and you're one of my inspirations.

    • @sanbomics
      @sanbomics  ปีที่แล้ว +3

      That's how I started out too! about 5 or so years ago. I started as the one doing the library prep/sequencing. I still step into the lab every once in a while xD

  • @that_guy4690
    @that_guy4690 4 หลายเดือนก่อน

    You are a legend! Thank you so much for your work

  • @mostafaismail4253
    @mostafaismail4253 ปีที่แล้ว +1

    You are great , please don't stop, big support ❤️❤️❤️

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Thank you!

    • @LiptonTiptonTea
      @LiptonTiptonTea ปีที่แล้ว

      Agreed. You're a gifted teacher, keep it up.

  • @johachinjoel2076
    @johachinjoel2076 2 หลายเดือนก่อน

    Thanks a lot. My PI asked me to analyze some single-cell data. I had no clue how to approach the problem, as I am mostly a wet-lab researcher. You helped me tremendously.

    • @sanbomics
      @sanbomics  หลายเดือนก่อน

      Glad I could help!

  • @cryan3240
    @cryan3240 9 หลายเดือนก่อน

    Thank you dude, this really helps me a lot!

  • @yuewang9772
    @yuewang9772 ปีที่แล้ว

    Thank you so much for this amazing tutorial!!! This helps a lot!

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Glad it helped!

  • @fabiomarchiano
    @fabiomarchiano หลายเดือนก่อน

    Thank you so much, great and valuable resource! Keep it up 🔥

  • @user-pk2ec4pl3w
    @user-pk2ec4pl3w 4 หลายเดือนก่อน

    you are an angel for biology students

  • @daffy_duck_phd
    @daffy_duck_phd ปีที่แล้ว +24

    This video is so helpful. I'm just starting a PhD in bioinformatics and solid resources are scarce, so thank you! The only thing I would say is that a little more explanation of why you are doing some of the things you are doing would be super helpful for a beginner like me. But honestly, this is a great video, and I'm looking forward to binge-watching the rest of your videos.

    • @sanbomics
      @sanbomics  ปีที่แล้ว +2

      Thank you! Unfortunately, I have to cut out a lot of explanations because nobody wants to watch a 4 hour video haha. (I also don't want to edit a 4 hour video xD) It might be helpful for you to go through it slowly: understand what I'm doing in each line, bring up the docs for each command, etc.

    • @daffy_duck_phd
      @daffy_duck_phd ปีที่แล้ว

      @@sanbomics I totally understand! I went through the video again and made note of anything i didn't understand. Thanks again and looking forward to future videos!

    • @chrisjmolina
      @chrisjmolina ปีที่แล้ว +3

      Great video! I would also love to see the 4 hour director’s cut :)

    • @olyabielska764
      @olyabielska764 11 หลายเดือนก่อน +2

      @@sanbomics I'd love to watch a 4hour video with a lot of explanations :) I am a molecular biologist and I have a difficult relationship with bioinformatics...

  • @user-lv6rj7mc8s
    @user-lv6rj7mc8s ปีที่แล้ว +1

    It was 11:00 p.m. in China when I clicked on this video, and now it's already 2:00 a.m.
    This is the most exciting single-cell tutorial I've ever seen.
    You are so good!!!

    • @sanbomics
      @sanbomics  ปีที่แล้ว +2

      Thank you!!!

    • @user-lv6rj7mc8s
      @user-lv6rj7mc8s ปีที่แล้ว

      @@sanbomics Hello, I have a new question. If I do DE with scVI, how can I use the scVI result to do GSEA, as mentioned in your 'easy gsea in python' video?

  • @erinjane1200
    @erinjane1200 7 หลายเดือนก่อน

    Thank you sooo much💗💗💗 It's so helpful!

  • @Aviad3587
    @Aviad3587 ปีที่แล้ว

    This is unbelievable... such a great video!

  • @ryanreis5830
    @ryanreis5830 ปีที่แล้ว +1

    Fantastic video -- thank you for what you do for the community! Any plans to do something similar using R or are you making the switch to python?

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      I don't know if I'll do a scRNA video in R. But I will probably do a scATAC or scATAC+RNA video in R in the near future.

    • @preciouschigamba1742
      @preciouschigamba1742 ปีที่แล้ว

      @@sanbomics Please do one for R too;
      I'm having challenges

  • @AM-fw6jl
    @AM-fw6jl ปีที่แล้ว

    Thank you! Super appreciated.

  • @demetronix
    @demetronix 10 หลายเดือนก่อน

    incredible resource thank you so much for this!

    • @sanbomics
      @sanbomics  10 หลายเดือนก่อน

      You're welcome :)

  • @ayaqz3144
    @ayaqz3144 4 หลายเดือนก่อน

    thank you very much , you are doing very great job

  • @emanueleraggi272
    @emanueleraggi272 ปีที่แล้ว +2

    Thank you for this amazing video. Is there any possibility for a PCA and ANOVA analysis for the next tutorials? Thanks for sharing your knowledge!

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Sure! I'll keep this in mind for an upcoming video

  • @nadavyklein
    @nadavyklein ปีที่แล้ว

    Thank you for the video, really helpful and teaching :)

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Glad it was helpful!

  • @giorgiatosoni2307
    @giorgiatosoni2307 ปีที่แล้ว +1

    Thank you so much for the super clear videos, you are making my life much easier!!
    I was wondering, what are the (dis)advantages of scVI over, for example, Harmony? I find it very difficult to understand which tool would work better for data integration, especially for human samples in which the variability across individuals is huge!

    • @sanbomics
      @sanbomics  ปีที่แล้ว +2

      It's hard to say without doing a direct comparison, and likely one might perform better in a given context than the other. Both are good. I prefer scVI because it is in python and does a lot more than just integration. scVI does a great job when there is big variation between individuals and even very large differences between the technology used to make the libraries. It will likely be a preference-based choice. Of course, if you want to follow along with the video you will have to use scvi xD

  • @Max-so2ij
    @Max-so2ij ปีที่แล้ว

    thank you for your amazing video!

  • @ilyasimutin
    @ilyasimutin ปีที่แล้ว

    Fantastic!!

  • @irfanalahi380
    @irfanalahi380 ปีที่แล้ว

    I think this is the best video in scRNA-seq. Thank you so much. Wondering if it is possible to make a similar tutorial on WGBS data analysis?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Thank you! It is possible but I might not get to something like that for a while, sorry :(

    • @irfanalahi380
      @irfanalahi380 ปีที่แล้ว

      @@sanbomics Thanks. If it takes a while, can you suggest some books or other resources with practical end-to-end code examples (like yours) on WGBS data?

  • @jalv1499
    @jalv1499 8 หลายเดือนก่อน

    Thank you for this amazingly helpful video! In this video, you also gave an example of differential gene expression analysis between two conditions rather than celltypes, what's the difference between DEG analysis across conditions and pseudobulk DEG analysis in another video you created? Which approach would you recommend? Thank you!

    • @sanbomics
      @sanbomics  7 หลายเดือนก่อน

      I personally use pseudobulk now exclusively, mostly because diffxpy can be a pain to work with sometimes and pseudobulk is a more recognizable technique (hard to say which is actually better, though).

  • @jenniliu4842
    @jenniliu4842 7 หลายเดือนก่อน

    Hi! This is an amazing tutorial. Thank you for your comprehensive walk-through.
    I have one question: why are we filtering the genes twice, in lines 9 and 48?
    Is it because you accounted for the doublets?
    A follow-up question on that: I just recently started and have never considered processing for doublets. What impact does doing it or not doing it have?

    • @sanbomics
      @sanbomics  6 หลายเดือนก่อน

      The first filtering is just to make the data smaller for faster/better processing for doublet removal. I then reload the raw counts so I have all the genes instead of a small subset. Some data have more doublets than others. Little clusters of just doublets will form, and you will have contamination in your other clusters from random spurious genes. Better to remove, but not always necessary depending on your technology and doublet rate.
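
      A minimal sketch of that two-pass idea, assuming a single 10x sample already loaded as adata and using the SOLO calls mentioned elsewhere in these comments (variable names are illustrative, not the exact notebook code):

          import scanpy as sc
          import scvi

          raw = adata.copy()                      # keep the full raw counts for later

          # slim the object down just for doublet detection
          sc.pp.filter_genes(adata, min_cells=10)
          sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor='seurat_v3')

          scvi.model.SCVI.setup_anndata(adata)
          vae = scvi.model.SCVI(adata)
          vae.train()

          solo = scvi.external.SOLO.from_scvi_model(vae)
          solo.train()
          df = solo.predict()                     # doublet/singlet probabilities per cell
          df['prediction'] = solo.predict(soft=False)

          # go back to the full gene set, minus the called doublets
          singlets = df[df.prediction == 'singlet'].index
          adata = raw[raw.obs.index.isin(singlets)].copy()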

  • @xishengliu5290
    @xishengliu5290 ปีที่แล้ว

    Thank you for sharing. Here I got a problem saying that module 'scvi' has no attribute 'model'. Do you know what the possible problem could be? I reinstalled scvi, and it is the newest version, but the error is still the same.

  • @lly6115
    @lly6115 ปีที่แล้ว

    let's just give this man a round of applause.

  • @rajashreechakraborty747
    @rajashreechakraborty747 7 หลายเดือนก่อน

    You are a life saver!!!

    • @sanbomics
      @sanbomics  7 หลายเดือนก่อน

      At that point I would just import the metadata from csv, assuming you have a file that has all the condition data

  • @gijs106
    @gijs106 11 หลายเดือนก่อน +1

    Thank you, this video is of great help to me. I was wondering, in the initial processing of only one sample, you used sc.pp.scale to scale the data. However, when processing/integrating multiple samples, I did not see explicit scaling of the data, only normalization and log-transformation, is that correct? Or am I missing the step where this happens? Thanks again!

    • @sanbomics
      @sanbomics  11 หลายเดือนก่อน +1

      You are correct. It is because I used scvi to integrate, which gives you normalized counts and embeddings for clustering. You will almost only ever use scaled data for clustering and UMAP. If you have embeddings from scvi (or something else) you will use those instead and don't need to scale.

    • @gijs106
      @gijs106 11 หลายเดือนก่อน

      @@sanbomics That clarifies things for me. Thanks for taking the time to reply, I appreciate it!

  • @ismailgumustop7527
    @ismailgumustop7527 5 หลายเดือนก่อน +1

    Thank you so much for the tutorial! It's quite useful for hands-on learning of scRNA Seq. I have some questions related to "integration" part. While I was working on my dataset, I realized the number of cells (observations) on some samples was quite different. For example:
    Sample 1 - 1600 cells
    Sample 2 - 560 cells
    Sample 3 - 3000 cells
    1. I would like to know whether this affects my analysis.
    2. Do I need to apply something like scaling or oversampling, or something else?
    I would be grateful if you can help.
    Best

    • @sanbomics
      @sanbomics  4 หลายเดือนก่อน +1

      For scvi the total number of cells in the training datasets matters more than the number of cells in individual samples. If you only have about 5k cells then you will have to keep the total number of features pretty low in the model (e.g.

  • @dardas15
    @dardas15 ปีที่แล้ว +1

    Thanks this is great! General question- is there a tutorial if you want to pick a specific cluster and reanalyze it specifically - for example isolating the CD8+ T-cell cluster and then identifying subpopulations of cd8+ T-cells within that cluster? Thanks again!

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      Yeah you can definitely do that. What I would do is just subset the adata based on the CD8+ label then reset it to adata.raw. Then you can reprocess it. Or if you need the true raw counts, reload the data and just use the cell ids from the CD8+ labeled cells to subset the fresh data. I have a video for filtering adata if you don't know how to do these
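
      A minimal sketch of that subsetting, assuming the cell-type labels live in adata.obs['cell_type'] (the column name and paths are illustrative):

          # option 1: subset the processed object and work from the counts you saved
          cd8 = adata[adata.obs.cell_type == 'CD8+ T-cell'].copy()

          # option 2: reload the true raw counts and keep only the CD8+ barcodes
          cd8_barcodes = adata.obs.index[adata.obs.cell_type == 'CD8+ T-cell']
          fresh = sc.read_h5ad('raw_counts.h5ad')            # hypothetical path to the raw data
          cd8_raw = fresh[fresh.obs.index.isin(cd8_barcodes)].copy()

      From there you would re-run the usual steps (highly variable genes, scvi/PCA, neighbors, clustering) on just the subset.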

    • @dardas15
      @dardas15 ปีที่แล้ว

      @@sanbomics perfect thanks!

    • @Sub-C-160
      @Sub-C-160 5 หลายเดือนก่อน

      @@sanbomics In the tutorial, after concatenation we saved the normalized and log-transformed counts to adata.raw. So by reprocessing it after subsetting the adata, do you mean starting with highly variable genes and training the scvi model again?

  • @quentinchartreux6085
    @quentinchartreux6085 8 หลายเดือนก่อน

    Great video. I was wondering: after identifying the cell types, when you want to perform differential expression between two conditions in only one cell type, should we subset and redo the scvi model?

    • @sanbomics
      @sanbomics  7 หลายเดือนก่อน

      That's a great question, and I am not sure I know the right answer. On one hand you will have fewer samples to train the model, but the model may be more specific for the cell types you are interested in. If you try both, I would be interested to know how they compare.

  • @toshiyukiitai1067
    @toshiyukiitai1067 ปีที่แล้ว

    Thank you for this great tutorial and other videos. I have been learning a lot from your tutorials. I have one question about cell-cycle scoring and regression, is it a must process for scRNA-seq or optional? If it is better to do it, which is better, before or after integrating multiple datasets?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      I'm not sure it is necessary, but it can be useful in some situations especially if you are looking at readily cycling cells. You can calculate the score on each individual sample and add it to obs. When you are training the SCVI model you can include the scores as continuous covariates.
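
      A rough sketch of that idea, assuming you already have S-phase and G2M gene lists (s_genes / g2m_genes are placeholders) and a 'counts' layer as in this tutorial:

          # score each cell; adds 'S_score' and 'G2M_score' to adata.obs
          sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)

          # then include the scores as continuous covariates when setting up scvi
          scvi.model.SCVI.setup_anndata(
              adata,
              layer='counts',
              categorical_covariate_keys=['Sample'],
              continuous_covariate_keys=['S_score', 'G2M_score'],
          )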

    • @toshiyukiitai1067
      @toshiyukiitai1067 ปีที่แล้ว +1

      @@sanbomics Thank you for your reply! As you mentioned, I looked for the information, but I have not found a good answer on how to handle the cell-cycling score. Many thanks again for your advice and many videos. I am a physician-scientist and a newbie in bioinformatics. Your videos are so helpful!

    • @chrisjmolina
      @chrisjmolina 8 หลายเดือนก่อน

      I would also be interested in how to handle the cell cycle scores@@toshiyukiitai1067

  • @henryren2790
    @henryren2790 11 หลายเดือนก่อน

    Thanks for the walkthrough. How did you get the pulldown menu for all the different types of files under 'sc.read' at 5:04?

    • @henryren2790
      @henryren2790 11 หลายเดือนก่อน

      I see, I used the "tab" key to get the pulldown menu... is that a common practice in Python, or is it just a trick in scanpy?

    • @sanbomics
      @sanbomics  11 หลายเดือนก่อน

      It's a trick inside Jupyter Notebook. You can use tab to autocomplete anything you are typing, which includes modules you have loaded. Another nice trick is to add ? after a function to see its manual, for example: sc.read_h5ad?

  • @YC-ut1ff
    @YC-ut1ff ปีที่แล้ว +1

    Thank you for this helpful video!!! I found that the loss of the model (37:17) is actually quite high. Would this influence the performance of the model?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      I wouldn't worry about it. I had a lot of cells and to conserve time it automatically decreases the number of epochs. If you were more patient potentially you could increase the number and see if the loss also decreases. There are other parameters you could fine tune as well. But again, the data seem integrated well and I have never seen any issues arising with default settings.

  • @romansmirnov2531
    @romansmirnov2531 ปีที่แล้ว

    Thanks for the useful tutorial!
    I'm wondering, is it possible to use diffxpy for marker identification? If so, could you please give an example of how it can be done?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Yup! You will just have to compare the subpopulation to the rest of the cells.
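
      A hedged sketch of that one-vs-rest comparison, borrowing the de.test.wald call pattern quoted later in these comments (the 'is_marker_group' column and cell-type name are made up for illustration):

          import diffxpy.api as de

          # flag the subpopulation of interest vs everything else
          adata.obs['is_marker_group'] = (adata.obs.cell_type == 'NK cell').astype(str)

          res = de.test.wald(
              data=adata,
              formula_loc='~ 1 + is_marker_group',
              factor_loc_totest='is_marker_group',
          )
          res.summary().head()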

  • @trinhthuylinh7719
    @trinhthuylinh7719 ปีที่แล้ว

    Hi, thank you so much for doing this video! It is really helpful. I hit a glitch practicing with another dataset at the DE analysis step with diffxpy. The dataset I used has negative values, so I assume it was regressed during normalization. However, there are no .to_adata or .toarray attributes on my object. How should I deal with this situation?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Unfortunately you will need the raw data. So there are a few options: 1) Regenerate the counts if you have access to the fastq files. 2) Use their processed data for analysis and just skip the preprocessing, scaling, and integration. Start at PCA, neighbors, UMAP, etc... which means you'll be stuck with basic DE analysis within scanpy or seurat.

  • @irfanalahi380
    @irfanalahi380 ปีที่แล้ว

    Thanks again for the awesome video. I have a question. In the def pp(csv_path) function you read a csv file twice. Can we avoid reading the same file twice? I think file reading takes some time. Thanks.

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      Yup! You can make a copy after reading it in the first time instead. The way I have it will add an extra ~20 sec per sample.
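
      A small sketch of that change, assuming a pp()-style function like the one in the video (details are illustrative):

          def pp(csv_path):
              adata = sc.read_csv(csv_path).T      # read the file once
              raw = adata.copy()                   # keep a copy instead of re-reading the csv later
              # ... filter and run doublet detection on `adata` ...
              # ... then subset `raw` (the untouched copy) by the surviving barcodes ...
              return raw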

  • @3stoogettes
    @3stoogettes 5 หลายเดือนก่อน

    you are a HERO

  • @Saed7630
    @Saed7630 9 หลายเดือนก่อน

    Great job! However, it's crucial to consider the significant questions we're aiming to address through these intricate scripting processes. Is the effort invested truly worthwhile?

    • @sanbomics
      @sanbomics  8 หลายเดือนก่อน

      Is there any point to anything in life?

    • @MrSureshbob
      @MrSureshbob 3 หลายเดือนก่อน

      This is an interesting comment. I guess you should not be watching this channel if you think so. These videos are life savers for some of the beginners like me.

  • @Sub-C-160
    @Sub-C-160 5 หลายเดือนก่อน

    Very good tutorial! I have one issue with concatenating adata objects. When I use sc.concat() I lose the mt labeling and also all the qc metrics in the adata.var object. Do you have an idea what the problem could be?

    • @sanbomics
      @sanbomics  5 หลายเดือนก่อน +1

      Yes! You can change how it is merged. Changing outer/inner changes whether things that don't exist in all the datasets are dropped or not. It should be an option in the sc.concat() function itself.
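
      For example, something along these lines (the argument choices are just one option, not the notebook's exact call):

          adata = sc.concat(
              out,                 # list of per-sample AnnData objects
              join='outer',        # keep genes even if they are missing from some samples
              merge='first',       # carry over .var columns (mt flag, QC metrics) from the first object that has them
          )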

  • @abhayrastogi590
    @abhayrastogi590 ปีที่แล้ว +1

    Hi Sam, amazing video. Do you plan to make a similar walkthrough for spatial as well? It would be very helpful.

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      I have a brief video on spatial, but it is not nearly as in depth. Maybe in the future I can do a more comprehensive one

  • @katarinavalentincic9621
    @katarinavalentincic9621 ปีที่แล้ว +1

    Great video, invaluable for coding beginners like myself! What if you have an adata which hasn't been filtered yet and you want to filter it to retain only the cells present in another AnnData object, previous_adata (because those have already been QC-filtered), based on the indices of those cells in previous_adata?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      You can filter it directly by passing the list of barcodes: adata[barcodes_to_keep], or by doing adata[adata.obs.index.isin(barcodes_to_keep)]. The latter won't reorder your data. In both cases they have to match 1:1, so if you did any concatenation you will have to remove the appended suffix.
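
      Laid out, that might look like this (assuming previous_adata holds the already-QC'd cells):

          barcodes_to_keep = previous_adata.obs.index

          # keeps adata's original cell order
          adata = adata[adata.obs.index.isin(barcodes_to_keep)].copy()

          # or, reordering adata to match the barcode list
          adata = adata[barcodes_to_keep].copy()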

  • @aydin434
    @aydin434 ปีที่แล้ว +1

    Thank you for this very helpful tutorial. I have a problem that I want to ask. When I train the data, it takes too much time since I don't have a GPU. Is there any alternative step to remove doublets?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Great question. While it isn't as robust at catching the doublets, when you filter the outliers during preprocessing you are theoretically catching some of the doublets. You can increase the cutoff for n_genes_by_counts to the top 5%. While it's not ideal, a lot of people don't end up removing doublets specifically (even though they should).
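
      One way to set that kind of cutoff, assuming sc.pp.calculate_qc_metrics has already added n_genes_by_counts to adata.obs (the 95% threshold is just the value suggested above):

          import numpy as np

          # drop the cells in the top 5% of genes detected, as a rough stand-in for doublet removal
          upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, 0.95)
          adata = adata[adata.obs.n_genes_by_counts < upper_lim].copy()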

    • @aydin434
      @aydin434 ปีที่แล้ว

      Thank you very much!

  • @mst63th
    @mst63th ปีที่แล้ว +1

    Thanks for the comprehensive video. I'm wondering if it is possible to use the output of other technologies such as Fluidigm C1, and do the same workflow you describe here?

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      I'm not sure it will work with things like C1. Those are relative qPCR values right?

    • @mst63th
      @mst63th ปีที่แล้ว

      @@sanbomics I'm not 100% sure. I want to reproduce the data from a study that was previously done with Seurat, and the RNA-seq data is publicly available on NCBI GEO.

    • @mst63th
      @mst63th ปีที่แล้ว

      the GEO id is GSE81608

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      I took a look and it should work just fine! Basically anything that can be done in seurat will also work here.

    • @mst63th
      @mst63th ปีที่แล้ว

      @@sanbomics Thanks a lot for the reply.

  • @sahilsukla
    @sahilsukla 7 หลายเดือนก่อน

    Wonderful explanation. Just one query: should we not remove MT & ribosomal genes before preparing models for identifying doublets?

    • @sanbomics
      @sanbomics  7 หลายเดือนก่อน

      No, I don't think so. Doublet removal is independent of any biological knowledge. Removing specific genes because they are MT/ribosomal wouldn't affect anything and would only remove potentially useful features for identifying doublets.

  • @user-gd9ul4wg1s
    @user-gd9ul4wg1s 5 หลายเดือนก่อน

    Do we have to get markers using scanpy in order to proceed with model.differential_expression in scVI? Also, is there a way to browse the markers generated in both dataframes, for example by generating a csv file or something to see all the genes in all clusters?

    • @sanbomics
      @sanbomics  5 หลายเดือนก่อน +1

      They are independent of each other, so you can do only one if you want. And you should be able to get a dataframe for both: sc.get.rank_genes_groups_df for scanpy, and the DE dataframe I showed in the tutorial for scvi.
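
      For example, both result sets can be dumped to csv roughly like this (the de_df name and file names are illustrative):

          # scanpy markers (after sc.tl.rank_genes_groups)
          markers = sc.get.rank_genes_groups_df(adata, group=None)   # group=None pools all clusters in recent scanpy versions
          markers.to_csv('scanpy_markers.csv')

          # scvi differential expression dataframe (as returned by model.differential_expression)
          de_df.to_csv('scvi_de.csv')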

  • @harryliu1005
    @harryliu1005 10 หลายเดือนก่อน

    This video is absolutely helpful! Thanks a lot ! However it’s my first time to learn rnaseq, where should I begin so that I can learn all stuff about rnaseq systematically? I already learned cells,dna and rna and I have some programming experience too:)

    • @sanbomics
      @sanbomics  10 หลายเดือนก่อน

      Maybe start with bulk RNAseq analysis and go from there. Learning from experience and troubleshooting is a great way to start.

  • @kitdordkhar4964
    @kitdordkhar4964 ปีที่แล้ว

    #Enjoyed watching the tutorial. #scRNAseq :-)

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Glad you liked it! :)

  • @shreyaslabhsetwar6083
    @shreyaslabhsetwar6083 ปีที่แล้ว

    Hey, when we do markers.logfoldchanges > .5, are we only including the upregulated genes? If we wish to extract markers for downregulated genes, would it be markers.logfoldchanges < -0.5 ?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Exactly. Or in your initial filtering you can do abs(data) > 0.5
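
      As a concrete (hypothetical) filter on a markers dataframe that has pvals_adj and logfoldchanges columns:

          markers = markers[(markers.pvals_adj < 0.05) & (abs(markers.logfoldchanges) > 0.5)]
          up   = markers[markers.logfoldchanges >  0.5]   # upregulated markers
          down = markers[markers.logfoldchanges < -0.5]   # downregulated markers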

    • @shreyaslabhsetwar6083
      @shreyaslabhsetwar6083 ปีที่แล้ว

      @@sanbomics Got it. Thanks!!

  • @user-yy6to5qe3o
    @user-yy6to5qe3o ปีที่แล้ว +1

    Is there a reason one uses the filtered feature matrix vs the raw feature matrix? From what I understand, the filtered feature file is already quality-controlled by the Cell Ranger software; wouldn't it be better to use the raw feature matrix to do quality control?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      The 10x raw feature matrix includes all droplets, even ones that were not considered cells by cellranger. This is the first line of defense, but the thresholds they use are simple metrics that aren't going to catch everything. There is no point in using the much larger data file for no reason when those cells are already considered garbage by cellranger. There may be very niche reasons to use it, but not for typical analysis.

    • @user-yy6to5qe3o
      @user-yy6to5qe3o ปีที่แล้ว

      @@sanbomics Got it, thank you so much for your reply! Looking forward to your future videos :)

  • @dabinjeong9560
    @dabinjeong9560 3 หลายเดือนก่อน

    Thank you so much for your video! It was very helpful. I have one question about importing ribosomal genes from broad. After I read the genes with pandas, the output showed something about copyright from broad. Do you have any advice on how to resolve this issue? Thank you!

    • @sanbomics
      @sanbomics  3 หลายเดือนก่อน

      No idea, I haven't seen this issue yet. You can skip the ribosomal part for now. I'll check it out

  • @blackmatti86
    @blackmatti86 ปีที่แล้ว +2

    This is amazing! Is there a pipeline for scATAC-seq data analysis using Python? It seems there is a lot of info about scRNA-seq but not so much for ATAC.. Thank you! 🙏

    • @sanbomics
      @sanbomics  ปีที่แล้ว +2

      I've actually been doing a decent bit of scATACseq analysis recently, but in R with seurat/signac. It can be done in scanpy too, but I haven't gotten around to trying it yet. I keep wanting to do a scATAC video, but I wasn't sure if it would get many views because scATAC is still performed far less than scRNA. Though, I will probably do one in the next month or so.

    • @ilyasimutin
      @ilyasimutin ปีที่แล้ว +2

      Episcanpy for Python usage, but in R the best option is ArchR.

    • @blackmatti86
      @blackmatti86 ปีที่แล้ว +1

      @@ilyasimutin Haven't heard good things about episcanpy tbh. Tried Signac and it was really good. Would like to give ArchR a go 👍🏼

    • @blackmatti86
      @blackmatti86 ปีที่แล้ว +2

      @@sanbomics I think you’d get many views since there’s a lot of scRNA tutorials out there but nothing for scATAC 🤷🏻‍♂️ looking forward to it!

    • @sanbomics
      @sanbomics  ปีที่แล้ว +2

      That's a good point... might just do one on signac. Maybe even a multimodal dataset. Looking forward to a better alternative in python.

  • @user-gd9ul4wg1s
    @user-gd9ul4wg1s 9 หลายเดือนก่อน

    Hey Sam, can't thank you enough for your tutorials, they're honestly life-saving. Following this one, I have trouble being "creative" with sample identifiers when integrating. I have 4 samples in H5 format, not CSV files like the samples you've used. I tried several times, but I failed to set the samples up the way you've done. Is there a way you can follow up on that?

    • @sanbomics
      @sanbomics  8 หลายเดือนก่อน

      you should be able to load the h5 files with sc.read_10x_h5
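
      A minimal sketch of that, modeled on the csv-based pp() from the video (the directory path, the filename-based Sample parsing, and the min_genes threshold are illustrative):

          import os
          import scanpy as sc

          def pp(path):
              adata = sc.read_10x_h5(path)
              adata.var_names_make_unique()
              sc.pp.filter_cells(adata, min_genes=200)
              adata.obs['Sample'] = os.path.basename(path).split('_')[0]
              return adata

          h5_dir = '/path/to/h5_files/'
          out = [pp(os.path.join(h5_dir, f)) for f in os.listdir(h5_dir) if f.endswith('.h5')]
          adata = sc.concat(out)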

    • @chrisjmolina
      @chrisjmolina 8 หลายเดือนก่อน +1

      @sanbomics Hi Sam, I'm having similar problems with h5 files. I'm on a Mac (Python 3.10) and used the code you presented here, then tried the code you shared in your scvi integration video, yet I'm still having the same problem.
      When I run this:

          def pp(path):
              adata = sc.read_10x_h5(path)
              sc.pp.filter_cells(adata, min_genes=300)
              adata.obs['Sample'] = path.split('_')[0]  # D21r3_sample_filtered_feature_bc_matrix.h5
              return adata

          out = []
          for file in os.listdir('/Users/cm/Desktop/SingleCell/'):
              out.append(pp('/Users/cm/Desktop/SingleCell/' + file))

      I get this error:

          Cell In[98], line 3
                1 out = []
                2 for file in files:
          ----> 3     out.append(pp('/Users/chrismolina/Desktop/CM1CM3Force100k/' + file))

          Cell In[91], line 2, in pp(path)
                1 def pp(path):
          ----> 2     adata = sc.read_10x_h5(path)
                3     adata.var_names_make_unique()
                5     sc.pp.filter_cells(adata, min_genes=300)

          File /opt/homebrew/Caskroom/mambaforge/base/envs/NewEnv/lib/python3.10/site-packages/scanpy/readwrite.py:179, in read_10x_h5(filename, genome, gex_only, backup_url)
              177 if not is_present:
              178     logg.debug(f'... did not find original file {filename}')
          --> 179 with h5py.File(str(filename), 'r') as f:
              180     v3 = '/matrix' in f

          OSError: Unable to open file (file signature not found)

      I'm able to open all the h5 files individually, so I don't think the files are corrupted. Do you have any advice for how to troubleshoot the code?

  • @wumutcakir
    @wumutcakir ปีที่แล้ว +1

    Thank you for this amazing video. I encountered a problem with the sc.pp.highly_variable_genes function: when I run the code, a "No module named 'skmisc'" error appears. I tried to install the required package using pip install scikit-misc, but it did not solve my problem. What is your suggestion?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Try following this thread: github.com/scverse/scanpy/issues/2073

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      If you can't get it to work, I will run a pip freeze for you on my environment... maybe it's some weird version issue or something.

    • @wumutcakir
      @wumutcakir ปีที่แล้ว +1

      Thank you very much. It solves my problem.

    • @jsm640
      @jsm640 ปีที่แล้ว

      @@wumutcakir I met the same problem a day ago. The methods suggested by Sanbomics may be helpful, but I solved it by changing the order of the package installations. I ran the following:
      conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
      conda install scvi-tools -c conda-forge
      conda install -c conda-forge scanpy python-igraph leidenalg
      It works fine now.

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Nice! Thanks for posting your solution

  • @user-eu6tf5fk7v
    @user-eu6tf5fk7v 11 หลายเดือนก่อน

    I have a problem in this step: SOLO = scvi.external.SOLO.from_scvi_model(vae). The error is: AttributeError: Can only use .str accessor with string values! I can't solve this problem. What is your suggestion? Thank you very much!

  • @shilpasy
    @shilpasy ปีที่แล้ว

    Great video, thank you so much. So many cool graphs in Python! However, there are too many issues while installing scvi-tools, and even after a successful installation there is some debugging involved at almost every step. I am using Windows; is it related to the Windows OS specifically?

    • @shilpasy
      @shilpasy ปีที่แล้ว

      Just an update on this one, I tried to install this on colab nb, after installing many dependencies individually, it finally worked and I was able to import. But during "solo.train()" step, I get this error: Monitored metric validation_loss = nan is not finite. Previous best value was inf. Signaling Trainer to stop.
      The df (solo.predict()) has NaN values.

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      Yeah probably has to do with windows. I have never had any trouble on Ubuntu 18+

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Do you get that same solo error with a different dataset?

  • @muhammadabduh8629
    @muhammadabduh8629 8 วันที่ผ่านมา

    Great video, I learned a lot. But I was wondering what machine you use? I did the same analysis but I couldn't load the model because I'm out of memory.

    • @sanbomics
      @sanbomics  3 วันที่ผ่านมา

      This computer has 128 GB of memory. But you can try the analysis with fewer samples if you still want to follow along.

  • @chrisdoan3210
    @chrisdoan3210 ปีที่แล้ว

    Hi Mark, my data is not in csv format like yours, so can I use read_10x_mtx() to read the data into Jupyter Notebook? Is that correct?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Yup! scanpy has like 5 different ways to read in the data

  • @yaseminsucu416
    @yaseminsucu416 4 หลายเดือนก่อน

    Hi Sam, thank you so much for the helpful tutorial! I am trying to replicate your analysis using the same data and following the code cell by cell, and it's been a great learning journey so far! I do have a question though: while training the models for doublet predictions, and when producing the UMAP, I got different results from yours, and the same thing happens when I re-run the code. I am assuming that since there is random initialization during training, the values are going to be slightly different each time, and that will eventually cause a different UMAP configuration, etc. I am curious though, how can I be sure that I am on the right track? 😄 Your feedback would be helpful! Cheers, and thanks so much for the tutorials!

    • @sanbomics
      @sanbomics  4 หลายเดือนก่อน

      Hi! It sounds like you are doing things right. You are correct in assuming that each time you train the model it will be a little different. I think there is an option to set a specific seed, but I am not sure if that will keep it 100% consistent. You should hopefully see high overlap (>90%) when calling doublets multiple times.

  • @user-lo3uj6lt7k
    @user-lo3uj6lt7k 7 หลายเดือนก่อน

    I really like your videos! They're really helpful for beginners like me. Thank you so much! I have a question about the scvi model. I'm a little confused about PCA, tSNE, and UMAP, which are generally done in Seurat. So if we use scvi to do dimensionality reduction, then we don't have to do tSNE, right? In your video, you use scvi to correct different covariates after integration. Does scvi also do dimensionality reduction?

    • @sanbomics
      @sanbomics  6 หลายเดือนก่อน

      Yup, scvi will give you embeddings which you then can compute the neighborhood graph from. tSNE or UMAP are still necessary if you plan to visualize the data in that way. They use the neighborhood graph. scVI --> neighbors --> UMAP/tsne. With seurat you are used to variable genes --> pca --> neighbors --> UMAP/tsne.
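
      In code, the two pipelines that reply describes look roughly like this (a sketch, not the notebook's exact calls):

          # scvi route: latent embedding -> neighbors -> UMAP/leiden
          adata.obsm['X_scVI'] = model.get_latent_representation()
          sc.pp.neighbors(adata, use_rep='X_scVI')
          sc.tl.umap(adata)
          sc.tl.leiden(adata)

          # classic Seurat-style route: variable genes -> scale -> PCA -> neighbors -> UMAP
          sc.pp.highly_variable_genes(adata, n_top_genes=2000)
          sc.pp.scale(adata)
          sc.tl.pca(adata)
          sc.pp.neighbors(adata)
          sc.tl.umap(adata)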

    • @user-lo3uj6lt7k
      @user-lo3uj6lt7k 6 หลายเดือนก่อน

      thanks!@@sanbomics

  • @CaveCrack
    @CaveCrack หลายเดือนก่อน

    Thank you very much for these tutorials. There seems to be a typo in the integration step (function pp): you are using the mouse mitochondrial prefix (mt-) instead of MT-. The same issue exists in the GitHub notebook.

    • @sanbomics
      @sanbomics  3 วันที่ผ่านมา

      hmm let me look into this. Thanks for pointing it out

  • @daehwankim4432
    @daehwankim4432 5 หลายเดือนก่อน

    This is what I exactly need at this moment! Thank you so much for sharing your knowledge! I have a quick question. Is there any good way to do this analysis on GPU? How can I apply the GPU to this analysis? Thanks!

    • @sanbomics
      @sanbomics  5 หลายเดือนก่อน

      A lot of these analyses are being sped up with a GPU already especially if you are using SCVI. There was a recent software drop converting a lot of single cell functions with scanpy to GPU, but I forget what it is called off the top of my head. Shouldn't be too hard to find though.

  • @user-id2hy1fv8e
    @user-id2hy1fv8e 11 หลายเดือนก่อน

    Would you recommend using scvi differential expression over scanpy rank gene groups for DE?

    • @sanbomics
      @sanbomics  11 หลายเดือนก่อน +1

      I would recommend lots of things over rank gene groups of scanpy or find markers of seurat. scvi, diffxpy, or pseudobulk are probably your best three options. scvi is probably the easiest

  • @laloulymounia9266
    @laloulymounia9266 หลายเดือนก่อน

    Hi, thanks again for the tutorial, would it be possible for you to make a tutorial on how to annotate the umap clusters automatically on python ? I ended up having 40 clusters on my breast cancer scRNA seq data set which includes before and after treatment data. I'm having trouble annotating it manually, with the cd4/cd8 being the benchmark for the resolution. I tried loading the data in R so I can get an idea of what Single R would do, however I think I messed up during the conversion process. I'm not sure whether you'd have any useful tutorials online to help ? Thanks

    • @sanbomics
      @sanbomics  หลายเดือนก่อน

      Actually, that will be in my next video. Sometime in the next couple weeks

  • @nourlarifi1689
    @nourlarifi1689 10 หลายเดือนก่อน

    Thank you very much for this tutorial. I have 2 questions:
    1/ Should we always apply 1e4 for data normalization, or can we pick other values?
    2/ How can I store the plot generated by this command line in a file: sc.pl.umap(adata, color = ['leiden', 'Sample'], frameon = False)

    • @sanbomics
      @sanbomics  10 หลายเดือนก่อน

      It's actually recommended to just use log1p now with no target value. So use the default normalize command with no target. There is a "save" argument you can add, eg, save = 'thing.png'

    • @nourlarifi1689
      @nourlarifi1689 10 หลายเดือนก่อน

      @@sanbomics thank you for responding
      You mean I apply directly
      sc.pp.normalize_total(adata)
      sc.pp.log1p(adata)
      without adding target_sum = 1e4 ?

    • @sanbomics
      @sanbomics  10 หลายเดือนก่อน

      exactly!

  • @yuanjizhang9753
    @yuanjizhang9753 ปีที่แล้ว

    Thanks for the excellent tutorial. I tried to follow, but couldn't install scvi-tools, probably due to some version conflicts. Can you please tell me the versions of python, scanpy, scvi-tools, and leidenalg? Thanks!

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      try making a fresh conda environment with python=3.9 or 3.8

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      conda create -n new_env python=3.9

    • @yuanjizhang9753
      @yuanjizhang9753 ปีที่แล้ว

      @@sanbomics python3.9 worked. Thanks!

    • @jarsinjars
      @jarsinjars 11 หลายเดือนก่อน

      Hey Yuanji, what versions of these packages did you use with python 3.9? I think scanpy seems to work for me but scvi-tools seems to be janky. Thank you!

  • @mehdipourrostami5206
    @mehdipourrostami5206 ปีที่แล้ว

    Thank you for your awesome videos, can you suggest any software for automated cell annotation for mouse lung cells?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Yeah sure. I have some videos already for this. For both R and python. Just be careful because if the reference populations don't line up well you will introduce error.

    • @mehdipourrostami5206
      @mehdipourrostami5206 ปีที่แล้ว

      @@sanbomics that is right, I followed one of your videos on the mouse data that I have and I got some crazy umaps that did not make sense to me. anyhow thanks for all your effort. I have learned a lot.

  • @sadrahakim1272
    @sadrahakim1272 6 หลายเดือนก่อน

    Great tutorial!! Can we do this on a count matrix? I mean, instead of having an expression matrix, we have a count matrix (rows being cells and columns being genes).

    • @sanbomics
      @sanbomics  5 หลายเดือนก่อน

      Hi! I think I may be misunderstanding the question. Those should be the same thing.

  • @daniel98carvalho
    @daniel98carvalho 10 หลายเดือนก่อน

    Quick question here. I am working with scRNA-seq data from the lung of cynomolgus monkeys and human. The idea is that my PI and I would like to integrate the data and then use this data for creation of an R Shiny app where scientists can do various bioinformatics functions on the data. The monkey data is from two studies for males and females and the human data is also stratified by male and female. Is it ideal to integrate them within species to correct for batch effects (integrate male and female cyno, and integrate male and female human), and then integrate across the species? I am doing all this work in Scanpy by the way. Any insight would be greatly appreciated!

    • @sanbomics
      @sanbomics  10 หลายเดือนก่อน +1

      I would integrate them all in the same step. Just make sure the species is a categorical variable in your model setup. You will probably want to map the gene symbols (var.index) to shared orthologs first though.
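
      A hedged sketch of that setup, assuming the concatenated object has obs columns named 'Sample' and 'Species' and a 'counts' layer as in this tutorial (the column names are made up):

          scvi.model.SCVI.setup_anndata(
              adata,
              layer='counts',
              categorical_covariate_keys=['Sample', 'Species'],
          )
          model = scvi.model.SCVI(adata)
          model.train()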

    • @daniel98carvalho
      @daniel98carvalho 10 หลายเดือนก่อน

      @@sanbomics I have a dataframe of homologous genes from BioMart that I was planning on filtering both datasets on before concatenating them. That way the genes should be the "same" for both species

    • @sanbomics
      @sanbomics  10 หลายเดือนก่อน +1

      perfect. Good luck!

    • @daniel98carvalho
      @daniel98carvalho 10 หลายเดือนก่อน

      @@sanbomics I finally finished the integrated analysis of two cyno monkey samples with 28 human samples. Geez, it wasn't very pleasant at certain times but wow-Scanpy and scVI are insanely powerful. I literally never want to go back to using Seurat. I think you're right...Python is better. Thanks so much for the tutorial! Awesome, awesome stuff. Cheers from Boston.

  • @zainziad3915
    @zainziad3915 10 หลายเดือนก่อน

    Isn't using adata = sc.concat(out) going to throw away a lot of genes since it uses an inner join by default? Shouldn't we use join='outer'?

    • @sanbomics
      @sanbomics  10 หลายเดือนก่อน

      If your data are similar, concat will only get rid of a small number of genes that are not expressed in both samples. If they are different (or if you are worried), don't filter genes until after concatenating. The latter is what I typically do now and will throw away no genes.

  • @sebastianmoreno9096
    @sebastianmoreno9096 ปีที่แล้ว

    Thanks a lot! This video is brilliant :)! Really useful! I was just wondering why you regress out after selecting highly variable genes and not before. When I regress out cell cycle after HVG I still get a cell-cycle cluster when running leiden. I don't see that kind of cluster if I regress out before HVG. Is there any reason to regress out after HVG? Thanks a lot for this inspirational video!

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Good question and interesting observation. I have only ever seen regression after variable features. I'm guessing it's two parts: 1) regression influences the finding of variable features and 2) theoretically the variable features do a better job describing your data and you shouldn't see a difference... but you do, so that is very interesting to me. What if you increase the number of variable features?

    • @sebastianmoreno9096
      @sebastianmoreno9096 ปีที่แล้ว

      @@sanbomics Hi! Thanks for your response. I have used different number of variable features always with the same results. If I regress out before HVG, I don't see any cluster related to what I'm regressing out. The problem is that the gene expression values changed and now I have negative values for some genes :/

  • @mst63th
    @mst63th ปีที่แล้ว

    If we have multiple samples, can we concatenate them and then run the preprocessing? Do you think it makes a difference compared to your approach?

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      You technically can. But, it's better to preprocess the samples individually before concatenation. For example, If you concatenate two samples, imagine one sample has a different MT% distribution than the other. If you now QC based on the combined distribution you will only remove dead cells from the one with overall higher MT%, which may just be due to technical differences. Also, the doublet removal procedure I use only works on individual samples
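
      Schematically, that per-sample-first workflow looks something like this (the loader function and the QC threshold are placeholders, not the notebook's exact values):

          out = []
          for path in sample_paths:
              a = load_one_sample(path)              # hypothetical helper, e.g. sc.read_csv(path).T or sc.read_10x_h5(path)
              # per-sample QC: assumes sc.pp.calculate_qc_metrics has added pct_counts_mt
              a = a[a.obs.pct_counts_mt < 20].copy()
              # ... per-sample doublet removal would also go here ...
              out.append(a)

          adata = sc.concat(out)                     # integrate/cluster only after per-sample QC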

    • @mst63th
      @mst63th ปีที่แล้ว +1

      @@sanbomics I got your point, sounds reasonable. Thanks

  • @Amanda-re2vt
    @Amanda-re2vt 4 วันที่ผ่านมา

    Hi Sam, do you have a video of how you’re downloading the data from NCBI (papers) because that part I don’t understand.

    • @sanbomics
      @sanbomics  3 วันที่ผ่านมา

      I don't have a video. But I tweet about it sometimes if you follow me on twitter. I may make something like this in the future. You can check out my most recent video series for another example from a different dataset

  • @MySanthush
    @MySanthush หลายเดือนก่อน

    Thank you. The best tutorial. I am new to this field. Could you please tell me how to split the UMAP by condition (after integration) to see a particular gene?

    • @sanbomics
      @sanbomics  3 วันที่ผ่านมา

      You can do something like adata[adata.obs['Condition'] == 'Sick'], with Condition being the column name in obs.
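
      For example, looping over conditions (the 'Condition' values and gene name are placeholders):

          for cond in adata.obs['Condition'].unique():
              sub = adata[adata.obs['Condition'] == cond]
              sc.pl.umap(sub, color='GENE_OF_INTEREST', title=cond, frameon=False)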

  • @jorge1869
    @jorge1869 9 หลายเดือนก่อน

    A more complete analysis from beginning to end would be interesting. In other words, from the moment you receive the raw data, until you analyze it with scanpy. In this way it would be more useful for those who start in this world. Greetings

    • @sanbomics
      @sanbomics  9 หลายเดือนก่อน

      Hi! I do have a couple of introductory videos that go over running CellRanger, for example.

  • @tarkkrloglu2406
    @tarkkrloglu2406 6 หลายเดือนก่อน

    Hi, thank you for the tutorial.
    I have one question. When I run model.train(), I get an error.
    The error: ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128])
    If I change the batch size, it works. However, the default parameters don't work. What is wrong?

    • @user-lo3uj6lt7k
      @user-lo3uj6lt7k 6 หลายเดือนก่อน

      I got the same error.

    • @sanbomics
      @sanbomics  6 หลายเดือนก่อน

      Check out this and see if this solves it: discourse.scverse.org/t/solo-scvi-train-error-related-to-batch-size/1591

  • @sergestsofack3376
    @sergestsofack3376 หลายเดือนก่อน +1

    Nice video. Where can I find the code so I can just copy and paste?

    • @mehdiraouine2979
      @mehdiraouine2979 หลายเดือนก่อน

      it's in the description, there is a link to github

  • @newyorkdiary7573
    @newyorkdiary7573 ปีที่แล้ว

    Hi, great video. Whenever I install scvi, the problems start: sometimes numpy, then h5py... too many problems. I tried for a month and couldn't complete it once. Can you help?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Ahh thats frustrating. What version of python are you using? Are you doing this in a virtual environment, like conda? What operating system are you on?

    • @newyorkdiary7573
      @newyorkdiary7573 ปีที่แล้ว

      @@sanbomics I was using Python 3.11, using Conda in PC. Windows 11

  • @efstratioskirtsios298
    @efstratioskirtsios298 10 หลายเดือนก่อน

    Any videos/help with scRNA-seq DEG analysis in R? Seems that there is not a robust consensus on what packages are the best to use :( Any opinions + recommendations? Can someone use DEseq2 through seurat directly?

    • @sanbomics
      @sanbomics  10 หลายเดือนก่อน

      You can do pseudobulk and use EdgeR or Deseq2. I think there should be a decent bit of stuff online to help you with that. Those are what I recommend. Good luck!

  • @laloulymounia9266
    @laloulymounia9266 2 หลายเดือนก่อน

    Thanks! I had trouble at first because of the .T that transposed a file I downloaded from GEO that was already in the correct format!

    • @sanbomics
      @sanbomics  หลายเดือนก่อน

      Ahh yeah, good catch! Unfortunately there is no standard, so one time you might have to transpose and another time you might not.

  • @victorassis9078
    @victorassis9078 7 หลายเดือนก่อน

    Hey! I’m running out of memory when integrating samples. How can I complete this tutorial? Is there any technique to reduce memory usage?

    • @sanbomics
      @sanbomics  6 หลายเดือนก่อน

      Yes! Make sure to convert them to sparse matrices after loading in each dataset. This will reduce the memory required a lot. But you can also load in fewer cells if that still isn't enough. You are still going to run into issues if you run anything that requires converting the sparse matrix to dense, though.
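
      The conversion itself is a one-liner per sample, e.g.:

          from scipy.sparse import csr_matrix

          adata.X = csr_matrix(adata.X)   # store counts as a sparse matrix instead of a dense array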

  • @user-be3rx6ho5z
    @user-be3rx6ho5z 5 หลายเดือนก่อน

    The video is soooo helpful, you are my life saver. However, I wanted to try diffxpy, especially the Wald test. The error 'ZeroDivisionError: float division by zero' happens when I use this code:

        res = de.test.wald(data = subset,
                           formula_loc = '~ 1 + cell_type',
                           factor_loc_totest = 'cell_type'
                           )

    I am using macOS, and on GitHub some people are running into the same problem, so I'm guessing it occurs when we use macOS.
    Do you have any other solution?

    • @sanbomics
      @sanbomics  5 หลายเดือนก่อน

      I've basically given up on diffxpy because it always seems to throw errors for no reason sometimes. I recommend doing pseudobulk instead. Check out one of my more recent pseudobulk videos.

  • @zhengguanwang4337
    @zhengguanwang4337 4 หลายเดือนก่อน

    great!!!

  • @saraalidadiani5881
    @saraalidadiani5881 9 วันที่ผ่านมา

    Thank you for the nice video. Regarding the part about making the cell type fraction plot (from that part of the code until the end of this part: adata.obs.groupby(['sample']).count()), could you also please explain how to do it in R with a Seurat object? Thanks

  • @mst63th
    @mst63th ปีที่แล้ว

    In many cases, the available data in public repositories is not separated. For example, control samples and the treatment were provided in one CSV file with no additional info about which part of the data related to which group. How should we deal with this if we want to compare the two conditions?

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      They have no identifiers at all? Hmm, that is just those people who published being lazy or not knowing better. If you look through the paper and see no way to identify them you will likely have to go back to the fastq and rerun the data. Also double check all the supplemental files in the main text.

    • @mst63th
      @mst63th ปีที่แล้ว

      @@sanbomics Yes, exactly; re-running the data from fastq requires access to an HPC system, which most of the time is not possible for home users.

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Can you link the paper you are asking about?

    • @mst63th
      @mst63th ปีที่แล้ว

      @@sanbomics pubmed.ncbi.nlm.nih.gov/27667665/
      Here is the paper.

    • @sanbomics
      @sanbomics  ปีที่แล้ว +1

      Interesting study. That is very unfortunate they don't differentiate the cells. You can always email the corresponding author

  • @mehdiraouine2979
    @mehdiraouine2979 หลายเดือนก่อน

    Is there a laptop you recommend for scRNA-seq analysis without having to use the cloud? Is an M1 Pro/Max MacBook a viable option?

    • @sanbomics
      @sanbomics  หลายเดือนก่อน

      I would say the biggest limitation is going to be RAM. Other specs will just increase processing speed. I am not sure I would recommend a laptop. But you can set up a server wherever, put it and your laptop on the same zerotier network, and then just get a macbook air IMO. E.g., running jupyter notebook over the network is identical to running it on your local machine after you initialize it.

    • @mehdiraouine2979
      @mehdiraouine2979 หลายเดือนก่อน

      @@sanbomics thx for the reply!

  • @elianegracielapilan5816
    @elianegracielapilan5816 4 หลายเดือนก่อน

    I need to analyze a GEO dataset (GSE198896). How do I read the matrix from the GSE198896_raw file? Thanks

    • @sanbomics
      @sanbomics  4 หลายเดือนก่อน +1

      What format are the raw files?

    • @elianegracielapilan5816
      @elianegracielapilan5816 3 หลายเดือนก่อน

      When downloaded, the files are compressed. When unzipping, several folders are created, one for each sample, each of which is also compressed. After unzipping, we have the following files in each sample folder: barcodes (TSV), features (TSV) and matrix (MTX). @@sanbomics

  • @chrisdoan3210
    @chrisdoan3210 ปีที่แล้ว +1

    Hi Mark. May I know which computer you used to run this analysis?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Hey! Sure: Ubuntu operating system with 128 GB RAM, 24 CPUs and an NVIDIA GPU. RAM will be the limiting factor depending on how many samples you are doing at once. Without a GPU it will just take longer.

    • @chrisdoan3210
      @chrisdoan3210 ปีที่แล้ว +1

      @@sanbomics Thank you! So most laptops and desktops can't run this analysis. I have a Linux server which has more RAM, but it is run through a command-line interface. How can I get the Jupyter Notebook interface as you did?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Great question:
      1) start your notebook on the server with the --no-browser flag (do it in tmux so you can exit terminal)
      2) on your local machine do ssh port forwarding: ssh -i path/to/key/if/you/have/one.pem -NfL 9999:localhost:8888 username@address
      3) localhost:9999 will bring up the server on your local machine
      To make it easier in the future, you can add the command as an alias in your bashrc

    • @chrisdoan3210
      @chrisdoan3210 ปีที่แล้ว

      @@sanbomics Hi Mark. I don't have root access to set up new things on the server, so I ran your code in a python script:

          python scRNA_seq.py
          Traceback (most recent call last):
            File "scRNA_seq.py", line 2, in <module>
              import scanpy as sc
            File "/home/user/.pyenv/versions/3.8.0/lib/python3.8/site-packages/scanpy/__init__.py", line 8, in <module>
              check_versions()
            File "/home/user/.pyenv/versions/3.8.0/lib/python3.8/site-packages/scanpy/_utils/__init__.py", line 47, in check_versions
              umap_version = pkg_version("umap-learn")
            File "/home/user/.pyenv/versions/3.8.0/lib/python3.8/site-packages/scanpy/_compat.py", line 33, in pkg_version
              return version.parse(v(package))
            File "/home/user/.pyenv/versions/3.8.0/lib/python3.8/site-packages/packaging/version.py", line 49, in parse
              return Version(version)
            File "/home/user/.pyenv/versions/3.8.0/lib/python3.8/site-packages/packaging/version.py", line 264, in __init__
              match = self._regex.search(version)
          TypeError: expected string or bytes-like object

      Do you know how I can fix this error? Thank you so much.

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Hi Chris. Are you doing this in a miniconda environment?

  • @henryren2790
    @henryren2790 11 หลายเดือนก่อน

    where do I find and import a list of mouse ribosome genes at 15:09?

    • @sanbomics
      @sanbomics  11 หลายเดือนก่อน +1

      You can use the human ones but you will need to change the capitalization so that instead of ABCD it is Abcd.

    • @henryren2790
      @henryren2790 11 หลายเดือนก่อน

      @@sanbomics ribo_genes = ribo_genes.applymap(lambda x: x.capitalize())

  • @mostafaismail4253
    @mostafaismail4253 ปีที่แล้ว

    What about scATAC + scDNA-seq (CNV)? There are no resources for these topics, and it would be great if you did a tutorial on them. ❤️

    • @sanbomics
      @sanbomics  ปีที่แล้ว +2

      I'll keep that in mind for some upcoming video. Definitely going to do some scATAC + RNA soon at least. So many things I want to do but so little time to actually make videos..

  • @fsh9134
    @fsh9134 6 หลายเดือนก่อน

    Thank you very much for the great video. I wonder why you used a .csv data file; in most of your other videos you used .h5 or .mtx files. As far as I know, CellRanger output does not include the kind of data file (.csv) you are using for this demo...

    • @sanbomics
      @sanbomics  5 หลายเดือนก่อน

      I am beholden to the data that is available. In this case, that is what the authors of the paper provided. Ideally, everyone would deposit h5ad files xD

  • @user-ux6wk6qt4d
    @user-ux6wk6qt4d 10 หลายเดือนก่อน

    Thank you man, you are such an inspiration, you saved my life.
    I have a question and I hope you answer me: in this script should I always use 1e4 as the value? If not, how should I modify this script; what exactly should I eliminate and keep, please:

        adata = sc.read_h5ad('/lab/user/notebooks/test_elyoum/combined_filtred.h5ad')
        adata.layers['counts'] = adata.X.copy()
        # Normalize every cell to 10,000 UMI
        sc.pp.normalize_total(adata, target_sum = 1e4)
        # convert to log counts
        sc.pp.log1p(adata)
        adata.raw = adata
        adata.obs.head()

        total_num_cells = adata.n_obs
        total_num_genes = adata.n_vars
        if total_num_genes > total_num_cells / 2:
            n_top_genes = int(0.40 * total_num_cells)
        sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, subset=True, layer='counts', flavor="seurat_v3", batch_key="Sample")

        scvi.model.SCVI.setup_anndata(adata, layer = "counts",
                                      categorical_covariate_keys=["Sample"],
                                      continuous_covariate_keys=['pct_counts_mt', 'total_counts', 'pct_counts_ribo'])
        model = scvi.model.SCVI(adata)
        model.train()  # may take a while without GPU

        # scvi clustering
        adata.obsm['X_scVI'] = model.get_latent_representation()
        adata.layers['scvi_normalized'] = model.get_normalized_expression(library_size = 1e4)

        # find neighbors
        sc.pp.neighbors(adata, use_rep = 'X_scVI')
        sc.tl.umap(adata)
        sc.tl.leiden(adata, resolution = 0.5)

    • @sanbomics
      @sanbomics  10 หลายเดือนก่อน

      I am glad that I saved your life xD. Actually, it is better to not use any target_sum. Just remove the argument. Things are always evolving in the sc-sphere

  • @user-qt5eh4xh6j
    @user-qt5eh4xh6j 5 หลายเดือนก่อน

    Hi, can you help? I can't seem to make scvi work... is there a way to get in touch?

  • @chrisdoan3210
    @chrisdoan3210 ปีที่แล้ว

    Thank you for the video! After run ! pip3 install scanpy, I got this:
    WARNING: You are using pip version 21.1.1; however, version 22.2.2 is available.
    You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.9/bin/python3.9 -m pip install --upgrade pip' command.
    Then: import scanpy as sc doesn't work even though I updated pip to 22.2.2.
    What did I miss in this case?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Are you using conda?

    • @chrisdoan3210
      @chrisdoan3210 ปีที่แล้ว

      @@sanbomics Because pip doesn't seem to work, I tried to install scanpy using conda. However, when I run:

          sc.pp.highly_variable_genes(adata, n_top_genes = 2000, subset = True, flavor = 'seurat_v3')

      I get:

          ModuleNotFoundError                       Traceback (most recent call last)
          ~/opt/anaconda3/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py in _highly_variable_genes_seurat_v3(adata, layer, n_top_genes, batch_key, check_values, span, subset, inplace)
               52 try:
          ---> 53     from skmisc.loess import loess
               54 except ImportError:
          ModuleNotFoundError: No module named 'skmisc'

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      What happens when you pip install skmisc? You can also try creating your conda environment with python 3.8. sometimes that solves some weird issues

    • @chrisdoan3210
      @chrisdoan3210 ปีที่แล้ว

      @@sanbomics This is what I got:
      Defaulting to user installation because normal site-packages is not writeable
      ERROR: Could not find a version that satisfies the requirement skmisc (from versions: none)
      ERROR: No matching distribution found for skmisc

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Try creating your environment with python=3.8

  • @ustigergirl
    @ustigergirl ปีที่แล้ว

    Your videos are amazing and helping me a lot. Could you do a video on sNuc-Seq, please? Thank you

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Hi! Thank you

    • @ustigergirl
      @ustigergirl ปีที่แล้ว

      @@sanbomics Is Integration same as batch effect cleaning?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      Pretty much the same, yup

  • @chem1kal
    @chem1kal 11 หลายเดือนก่อน

    42:39
    How would i automatically label clusters?

    • @sanbomics
      @sanbomics  11 หลายเดือนก่อน

      check out this video: th-cam.com/video/tgk-rT_R4wk/w-d-xo.html

  • @ladenhudson2458
    @ladenhudson2458 ปีที่แล้ว

    Does this routine take advantage of a multicore processor?

    • @sanbomics
      @sanbomics  ปีที่แล้ว

      The way I have it here relies on the GPU for computing. There isn't much need for multicore here unless you want to save a little time inputting a bunch of samples, which most people won't be doing.

    • @ladenhudson2458
      @ladenhudson2458 ปีที่แล้ว

      @@sanbomics Thx, I am building my workstation. it helps a lot.