Single-cell analysis with scVI machine-learning toolkit

  • Published: Jul 12, 2024
  • I show dataset integration, clustering, and differential expression in this introduction to scVI, an advanced Python machine-learning toolkit for single-cell data. It is a powerful toolkit that also lets users plug in their own downstream machine-learning workflows.
    Notebook:
    github.com/mousepixels/sanbom...
    0:00 intro
    0:23 Preparing data
    3:14 core scVI
    9:31 marker genes
    11:06 Differential expression
  • Science

Comments • 33

  • @mst63th 2 years ago

    Interesting, very well explained. It's essential to know how to manipulate the Pandas data frame in scanpy and all corresponding packages.

    • @sanbomics 2 years ago

      Yup! Pandas is one of the reasons scanpy is so great.

    • @mst63th 2 years ago

      @sanbomics That's for sure.

    • @sanbomics 2 years ago +5

      I've thought about doing a video on "scanpy pandas essentials" or something like that, going over the most basic and important pandas functions. Doing data analysis with scanpy is basically 50% manipulating pandas. However, in my experience those types of videos are less popular.

    • @mst63th 2 years ago

      @sanbomics That's a good idea. I agree that this kind of video may be less popular, but anyone who has worked with scanpy before knows how important it is to understand pandas data frames thoroughly.

    • @zhaomingwu4105 4 months ago

      @sanbomics I need that!!
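
To make the "scanpy is 50% pandas" point concrete, here is a minimal sketch of the kind of operations meant. It assumes a hypothetical per-cell metadata table shaped like an AnnData `.obs` data frame; the column names (`sample`, `condition`, `n_genes`), the barcodes, and the 200-gene cutoff are invented for illustration and do not come from the video.

```python
import pandas as pd

# Hypothetical per-cell metadata, shaped like an AnnData `.obs` table:
# one row per cell barcode, one column per annotation.
obs = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s2", "s2", "s2"],
        "condition": ["control", "control", "treated", "treated", "treated"],
        "n_genes": [1200, 150, 2300, 1800, 90],
    },
    index=["AAAC-1", "AAAG-1", "CCCA-2", "CCCT-2", "GGGA-2"],
)

# Boolean mask: keep cells with at least 200 detected genes (arbitrary cutoff).
keep = obs["n_genes"] >= 200
filtered = obs[keep]

# Count the remaining cells per sample and condition.
cells_per_group = filtered.groupby(["sample", "condition"]).size()

# Derive a new column by mapping existing labels to new values.
obs["is_treated"] = obs["condition"].map({"control": False, "treated": True})
```

With a real AnnData object the same boolean mask can subset the whole object, for example `adata[adata.obs["n_genes"] >= 200]`.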

  • @geney123 1 year ago +1

    Brilliant. All your videos are great. Thanks a lot.
    I have a dataset of 20 samples from 2 batches and 2 conditions: seq1 (5 control vs 5 treated) and seq2 (5 control vs 5 treated). So seq2 is like a replicate of seq1. Following your videos and comments, I did the analysis like this:
    1) Determine the doublets for each sample individually.
    2) Add the doublet annotation to the raw data and delete the doublets.
    3) Concatenate the samples of each group in each batch, like [seq1control, seq1treated, seq2control, seq2treated].
    4) Integrate seq1control + seq2control and seq1treated + seq2treated.
    So, at the end of this step I had all control samples and all treated samples.
    5) Integrate the control and treated groups all together.
    6) Normalize, scale and cluster.
    Note: all samples are from the same organ.
    Note: I want to continue with DE analysis.
    I would be happy to hear your thoughts. Many thanks.

    • @sanbomics 1 year ago

      It sounds like you've done a great job so far! I am not sure I quite understand steps 3 and 4. Did you do two layers of concatenation to add batch keys and then do the integration? What did you use for the categorical and batch keys in the model? I would be careful about using treated/untreated as a categorical covariate, because this could erase some of the differences that you want to find (i.e., correct only for batch, not for treatment). scVI is powerful, but it is very easy to over-correct when you don't want to.
      It sounds like you have a lot of cells. Do you have more than ~30k? If so, since you want to do DE, I would skip the highly variable genes filter. If you don't care about time, you can also bump the number of epochs up a bit to get a theoretically better-trained model. Then you can just use the model.differential_expression function.
      For step 6: what are you normalizing and scaling? You should use the latent representation from scVI for clustering. There is no need to normalize and scale unless you want to use the raw, non-scVI-normalized data for something.
      Let me know how it goes!

    • @geney123 1 year ago

      @sanbomics Thanks a lot. I used [sample, treatment, batches] as the categorical covariates.

    • @geney123 1 year ago

      @sanbomics Another thing is that I have about 200k cells in total, so training the model took ages :/

    • @sanbomics 1 year ago +1

      I would remove treatment, because that will correct for actual differences due to treatment. You can likely remove sample too.

    • @sanbomics 1 year ago +1

      Yeah, I've been there xD. Do you have a GPU at least? It's the worst when you queue it up before you go to sleep and wake up to an error /cry
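
The advice in this thread (register only the technical batch, cluster on the scVI latent space, then call model.differential_expression) can be sketched roughly as follows. This is a sketch under stated assumptions, not the notebook from the video: it assumes raw counts live in `adata.layers["counts"]`, and the column names `batch` and `condition` and the labels `treated` and `control` are hypothetical.

```python
def integrate_and_cluster(adata, batch_key="batch", max_epochs=None):
    """Train scVI correcting only for `batch_key`, then cluster on the latent space."""
    # Imported here so the heavy dependencies are only needed when the function runs.
    import scanpy as sc
    import scvi

    # Assumption: raw counts are stored in adata.layers["counts"].
    # Only the technical batch is registered. Treatment is deliberately left out
    # so the model does not correct away the biological difference of interest.
    scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key=batch_key)

    model = scvi.model.SCVI(adata)
    model.train(max_epochs=max_epochs)  # None lets scvi-tools pick its default

    # Cluster on the scVI latent representation rather than on
    # separately normalized and scaled expression values.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    sc.pp.neighbors(adata, use_rep="X_scVI")
    sc.tl.leiden(adata)
    return model


def treated_vs_control(model, condition_key="condition"):
    """Differential expression between two hypothetical condition labels."""
    return model.differential_expression(
        groupby=condition_key, group1="treated", group2="control"
    )
```

Passing a larger `max_epochs` corresponds to the "bump the epoch number up" suggestion; with around 200k cells a GPU is strongly advisable.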

  • @blackmatti86 2 years ago +1

    Great video, thank you! Question: if you are analysing single-nuclei data, would you include any specific commands here that differ from those for single-cell data?

    • @sanbomics 2 years ago +1

      Thanks! Everything I showed should work for nuclei too. (If you ran cellranger, you would have wanted to pass the --include-introns flag to cellranger count.)

  • @jianyingli3928 3 months ago

    Extremely helpful video, love it! A quick question: when I check the model prior to training it, I get "Model's adata is minified?: False". I followed your protocol, so how come yours never shows this? Thanks.

    • @sanbomics 3 months ago

      It's hard to say without looking into it. But in general, all these packages change over time and the videos can become outdated. Most of the time things still work, just a little differently.

  • @shilpasy 1 year ago

    Thank you so much for the video. Can you (or anyone) please tell me where I can get the input data files (.h5) used here?
    Thank you!

    • @sanbomics 1 year ago

      I believe I used my own unpublished data for this. However, there is lots of data available online!

  • @caspase888 5 months ago

    Great video. I was just wondering what GPU you have. Would a 1080 Ti or a 3050 GPU cut it? Thanks.

  • @calvinchen1081 8 months ago

    Thank you! How does using Tensor Cores to accelerate scVI training on single-cell data affect the results?

    • @sanbomics 7 months ago

      Any differences in the results will be trivial; training will just be a lot faster.

    • @calvinchen1081 7 months ago

      @sanbomics Thank you, Sanbomics! I have learned a lot from your videos. Now I am puzzled by cell communication. Can those tools do a proper job or not?

  • @IshwarHosamani 7 months ago

    Hey Mark, shouldn't the normalization and scaling steps be done prior to concatenation?

    • @sanbomics 6 months ago +1

      Hi! In this example the order doesn't matter at all, because scVI uses the raw counts. In many other workflows, you normalize each sample individually before integration.
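
Because scVI trains on raw counts, a common way to keep them available is to copy them into a layer before any normalization touches `.X`. A minimal sketch, assuming `.X` still holds raw counts when the function is called; the layer name `counts` and the `target_sum` value are conventions chosen for illustration, not something stated in the thread.

```python
def stash_counts_then_normalize(adata, target_sum=1e4):
    """Keep raw counts in a layer, then normalize and log-transform `.X` in place."""
    # Imported here so scanpy is only required when the function is called.
    import scanpy as sc

    # Assumption: adata.X still contains raw counts at this point.
    adata.layers["counts"] = adata.X.copy()

    # Downstream steps that expect normalized data use .X;
    # scVI can later be pointed at the untouched "counts" layer.
    sc.pp.normalize_total(adata, target_sum=target_sum)
    sc.pp.log1p(adata)
    return adata
```

With the counts stored this way, normalizing before or after concatenation does not change what scVI sees.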

  • @goddyhong 3 months ago

    Great video! But I got an error message during the adata.concatenate step: 'Some (but not all) of the AnnData's to be concatenated had no .X value.' However, I've checked both of my data sources, and they both have a .X value (adata.a.X data type: float32, adata.a.X shape: (52489, 33538); adata.b.X data type: float32, adata.b.X shape: (130761, 32864)). Could you help me with this? 😭

    • @sanbomics 3 months ago

      The modules change over time, so some of these videos require small changes. See if sc.concat([list of adata]) works better.

    • @goddyhong 2 months ago

      @sanbomics Thanks! I will try it!
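
A minimal sketch of the function-style concatenation suggested above, written here against `anndata.concat` directly. The helper name `concat_samples` and the `sample` column are hypothetical, and this is not guaranteed to resolve the specific error reported. Since the two objects in the question have different numbers of genes (33538 vs 32864), an inner join would keep only the genes present in both.

```python
def concat_samples(adatas):
    """Concatenate a dict of {sample_name: AnnData} along the cell axis."""
    # Imported here so anndata is only required when the function is called.
    import anndata as ad

    return ad.concat(
        adatas,            # mapping: its keys become the labels below
        label="sample",    # new .obs column recording each cell's origin
        join="inner",      # keep only genes present in every object
        index_unique="-",  # append the key to barcodes to avoid duplicates
    )
```

For example, `concat_samples({"a": adata_a, "b": adata_b})` would return one object whose `.obs["sample"]` is either `a` or `b`.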