Pseudo-bulk analysis for single-cell RNA-Seq data | Detailed workflow tutorial

  • Published: Aug 6, 2024
  • A detailed walk-through of the steps to perform pseudo-bulk differential expression analysis for single-cell RNA-Seq data in R. In this video I discuss what pseudo-bulk analysis is, why we take this approach, and how to perform it. In this tutorial, I demonstrate how to manipulate data and aggregate counts to the sample level using a Seurat object, followed by differential expression analysis using DESeq2 to find differentially expressed genes in a specific cell type cluster. I hope you find the video informative. I look forward to your comments in the comments section!
    1) Link to code:
    github.com/kpatel427/RUclipsT...
    2) Vignettes:
    ▸ hbctraining.github.io/scRNA-s...
    ▸ biocworkshops2019.bioconductor...
    ▸ bioconductor.org/books/3.14/OS...
    ▸ bioconductor.org/books/3.14/OS...
    3) Papers:
    ▸ www.nature.com/articles/s4146...
    ▸ bmcbioinformatics.biomedcentr...
    ▸ genomebiology.biomedcentral.c...
    Chapters:
    0:00 Intro
    0:29 WHAT is pseudo-bulk analysis?
    1:48 WHY perform pseudo-bulk analysis?
    5:25 (onwards) HOW to perform pseudo-bulk analysis?
    5:44 Fetch data from ExperimentHub
    8:36 QC and filtering
    12:11 Seurat's standard workflow steps
    14:10 Visualize data
    15:54 To use integrated or nonintegrated data?
    16:49 Aggregate counts to sample level
    21:35 Data manipulation step 1: Transpose matrix
    22:16 Data manipulation step 2: Split data frame
    24:46 Data manipulation step 3: Fix row.names and transpose again
    29:18 DESeq2 step 1: Get count matrix (corresponding to a cell type)
    30:09 Create sample-level metadata, i.e. colData
    32:23 DESeq2 step 2: Create DESeq2 dataset from matrix
    33:38 DESeq2 step 3: Run DESeq()
    33:51 Get results
    Show your support and encouragement by buying me a coffee:
    www.buymeacoffee.com/bioinfor...
    To get in touch:
    Website: bioinformagician.org/
    Github: github.com/kpatel427
    Email: khushbu_p@hotmail.com
    #bioinformagician #bioinformatics #pseudobulk #deg #seurat #integration #cca #R #genomics #beginners #tutorial #howto #omics #research #biology #ncbi #GEO #rnaseq #ngs
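
The chapters above describe summing raw counts to the sample level and then pulling out one cell type's gene-by-sample matrix for DESeq2. A minimal, language-neutral sketch of that idea in plain Python follows. The tutorial itself works in R with Seurat and DESeq2; the function names, toy gene names, sample labels, and counts below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def aggregate_to_sample_level(cells):
    """Sum raw counts over all cells sharing a (cell_type, sample) pair.

    `cells` is an iterable of (cell_type, sample, {gene: raw_count}) tuples.
    Returns {cell_type: {sample: {gene: summed_count}}}.
    """
    pseudobulk = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for cell_type, sample, counts in cells:
        for gene, count in counts.items():
            pseudobulk[cell_type][sample][gene] += count
    return pseudobulk

def count_matrix(pseudobulk, cell_type):
    """Return (genes, samples, rows) for one cell type: genes in rows, samples in columns."""
    per_sample = pseudobulk[cell_type]
    samples = sorted(per_sample)
    genes = sorted({g for s in samples for g in per_sample[s]})
    rows = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows

# Hypothetical toy data: four cells, two samples, two cell types
cells = [
    ("B cells", "ctrl_101", {"CD79A": 3, "MS4A1": 1}),
    ("B cells", "ctrl_101", {"CD79A": 2}),
    ("B cells", "stim_101", {"CD79A": 4, "MS4A1": 2}),
    ("NK cells", "ctrl_101", {"NKG7": 5}),
]
pb = aggregate_to_sample_level(cells)
genes, samples, rows = count_matrix(pb, "B cells")
```

Here `rows` holds one row per gene and one column per sample for B cells only, which is the shape of count matrix that a bulk-style differential expression tool expects.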

Comments • 73

  • @surfer101ist
    @surfer101ist 1 year ago

    Super helpful as a biologist with some CS training. Thank you!!!

  • @jimmylao349
    @jimmylao349 1 year ago

    Very good explanation, I got the final script to try. Thanks

  • @aravindsundar4968
    @aravindsundar4968 9 months ago

    Great tutorial! Thanks for sharing.

  • @bondjams8084
    @bondjams8084 1 year ago

    Thank you so much! Your videos are so good!

  • @tushardhyani3931
    @tushardhyani3931 2 years ago

    Thank you for this video !!

  • @user-mb5ld7re8m
    @user-mb5ld7re8m 2 years ago

    good video! thank you sooooooo much!!!!

  • @learningtime1367
    @learningtime1367 2 years ago +1

    Thanks so much! Can you please do a video on GO analysis/KEGG for bulk rna-seq analysis? Thanks again

    • @Bioinformagician
      @Bioinformagician 2 years ago

      Thanks for the suggestion. I have plans to make a video covering this topic. Please stay tuned :)

  • @baymin4827
    @baymin4827 10 months ago

    Your videos have been very helpful to me! What should I do if my Seurat object doesn't have an 'ind' column? I am analyzing my own dataset. Thanks in advance

  • @subhasen2611
    @subhasen2611 2 years ago +1

    Thanks for the nice tutorials. Will you be adding any tutorial for trajectory analysis/ Cell Fate Decisions?

    • @Bioinformagician
      @Bioinformagician 2 years ago +1

      Yes, I will be making a video covering these topics. Thanks for the suggestion! :)

  • @prasadchaskar8542
    @prasadchaskar8542 2 years ago +1

    Thanks a lot for the tutorial. Could you please add a tutorial on trajectory analysis?

  • @jakobhansen5477
    @jakobhansen5477 6 months ago

    Thank you for a great video! What if I have very different cell counts in the clusters I want to compare? I would expect very different expression just due to the different cell counts. Will a normalization step in DESeq2 cancel out this difference?

  • @albanaisai3429
    @albanaisai3429 2 years ago

    Hi there, great video. Do you know how to use Kallisto?

  • @Iman_1987
    @Iman_1987 1 year ago

    could you please demonstrate isoform analysis by nanopore?? thnx

  • @saraalidadiani5881
    @saraalidadiani5881 2 months ago

    Thank you for the nice video. Just a question: how do I account for two covariates, such as sex and age, in differential gene expression analysis of single-cell RNA-seq data? Thanks!

  • @pegahhejazi8399
    @pegahhejazi8399 1 year ago +3

    Hello, thank you for the super helpful tutorial. I have a question regarding my own dataset. I have 3 groups (each with 3 replicates): young, old+treatment, and old w/ treatment. Does this tutorial apply to comparing 3 groups? If not, do you have other tutorials for that kind of dataset?

    • @raghavsharma4347
      @raghavsharma4347 1 year ago

      Why do you have a young dataset? Is it meant to be a control? You will need to set your design formula as ~ age + treatment, and your contrasts will need to compare the treatment to the no treatment.

  • @mayconmarcao4554
    @mayconmarcao4554 2 years ago +1

    Graceful tutorial! I wonder which would be better as input for modeling a phenotype prediction: i) pseudobulk or ii) single-cell expression levels? Thanks for your existence =].

    • @Bioinformagician
      @Bioinformagician 2 years ago +1

      What is the outcome that you are hoping to predict? I do not have experience with statistical modeling, so I am afraid I might not have useful input.

    • @mayconmarcao4554
      @mayconmarcao4554 2 years ago

      @@Bioinformagician I think I misunderstood the pseudobulk concept. Pseudobulk turns a single-cell matrix into a patient-based matrix (as in bulk RNA-seq).
      What I thought pseudobulk was: I thought that with pseudobulk I'd be able to concatenate similar cells within a cell cluster to increase gene expression signals. But in this way pseudobulk would not represent patients but subclusters.
      Do you know if I can adapt the pseudobulk strategy to aggregate subclusters?

  • @davidepasini3807
    @davidepasini3807 2 years ago +3

    Hi, thanks for the video and the nice explanation. This video comes at the right time; I had been planning to try this kind of analysis these days. I watched and tried your tutorial, and I wondered how much the number of cells per sample matters. For example, in your case (looking at B cells) you have 864 cells for ind 1015 and 81 for ind 1039. Does this affect the analysis?

    • @Bioinformagician
      @Bioinformagician 2 years ago +1

      If I am understanding you correctly, you are asking whether the number of cells per sample affects the analysis? I would think it should not bias the comparison. We sum raw counts across all cells to the sample level rather than averaging them, so a sample with more cells does end up with larger totals, but DESeq2's size-factor normalization accounts for differences in total counts between samples.

    • @anguscampbell3020
      @anguscampbell3020 1 year ago

      @@Bioinformagician There are a number of methods which argue that the dropout in scRNA-seq data needs to be accounted for. It would be great if you could do a tutorial on MAST, which is supposed to be able to account for this and differentiate between biological and technical variability in cell-specific UMI counts.
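
On the cells-per-sample question in this thread: because counts are summed, a sample with more cells ends up with larger totals, and that difference in depth is what DESeq2's size-factor normalization is meant to absorb. The sketch below is a simplified, plain-Python illustration of the median-of-ratios idea behind that normalization, not DESeq2's actual code. The function name and the toy counts are hypothetical, and the sketch says nothing about the extra noise when only a handful of cells contribute to a sample.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples matrix of summed counts.

    Genes with a zero in any sample are skipped, since their geometric mean is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c == 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical pseudo-bulk counts: sample B has the same composition as
# sample A but twice the total, e.g. because twice as many cells were summed.
counts = [
    [100, 200],
    [30, 60],
    [8, 16],
    [0, 5],  # skipped when estimating size factors: zero in sample A
]
sf = size_factors(counts)
normalized = [[c / s for c, s in zip(row, sf)] for row in counts]
```

In this toy case the second size factor is twice the first, so after dividing by the size factors the first three genes have equal normalized counts in both samples.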

  • @user-sl9wi7tl4f
    @user-sl9wi7tl4f 2 years ago +1

    Hi, thanks for the video, this is very helpful. Will you be adding any tutorial for monocle3? Thank you again for these wonderful videos.

    • @Bioinformagician
      @Bioinformagician 2 years ago

      Yes, I definitely have plans on making videos using monocle3.

    • @user-sl9wi7tl4f
      @user-sl9wi7tl4f 2 years ago +1

      @@Bioinformagician Hi there, in my study I am facing another problem: Is it possible to compare two conditions without repetition within a certain cell type? Which analysis method or package could be used? Hope for your reply.

    • @Bioinformagician
      @Bioinformagician 2 years ago

      @@user-sl9wi7tl4f Can you explain what you mean by "compare two conditions without repetition within a certain cell type?"
      Do you mean you want to restrict the comparison between two conditions to only certain clusters?

    • @zahraabdi1613
      @zahraabdi1613 2 years ago

      @@user-sl9wi7tl4f I have the same problem. If you have found the solution, would you mind explaining it to me, please?

  • @blackmatti86
    @blackmatti86 2 years ago +2

    Your videos have been truly instrumental for me to grasp the concept of bioinformatic data analysis, especially for single cell RNA-seq. As far as I understand, scRNA-seq (or scATAC-seq) can be divided into droplet-based (e.g. 10X) and plate-based approaches, e.g. SMART-seq2. There seem to be a fair amount of help guides and instructions for the former method but not so much for the latter, I have noticed. Is there a resource that you know, that can guide a novice through a single cell (or single nucleus) RNA-seq performed using a plate approach (e.g. single cells FACS sorted into 384 WPs)? Thank you! xx

    • @Bioinformagician
      @Bioinformagician 2 years ago +3

      To get an overall idea of the pipeline, check this out: www2.stat.duke.edu/~sayan/Sta613/2018/singlecellrnaseq-170131050320.pdf
      This paper performs a comparative analysis between 10X and Smart-seq2: www.sciencedirect.com/science/article/pii/S1672022921000486#s0055
      Seurat also provides a vignette on integrating multiple datasets across different technologies (which includes Smart-seq2): satijalab.org/seurat/archive/v3.1/integration.html
      This can give you an idea of how these datasets are processed before integration.
      Hope this helps!

    • @blackmatti86
      @blackmatti86 2 years ago

      @@Bioinformagician Thank you ❤️

  • @xiaosajackxu4242
    @xiaosajackxu4242 1 year ago

    If I have 4 conditions, how do I modify the code to find DEGs that are enriched/depleted in at least one condition?

  • @SerorONG
    @SerorONG 2 years ago +1

    Hey there, great tutorial! May I just ask, how did you get so proficient with RegEx (regular expressions)? I feel that it's one of the few core skills that would help immensely and is highly transferable, especially during the initial stages of data processing. Just want to know if you could recommend any resources to learn RegEx?

    • @Bioinformagician
      @Bioinformagician 2 years ago +1

      I first learnt regex when I learnt Perl. The more I kept using regex, the more it started to make sense. I use regexr (regexr.com/) often to practice and build my regex.
      Here are a few resources that could help you practice it more -
      1. regexone.com/
      2. regexlearn.com/
      3. www.hackerrank.com/domains/regex
      Hope this helps!

  • @zahraabdi1613
    @zahraabdi1613 2 years ago +1

    It was great! Thanks so much ❤ What should I do if my Seurat object doesn't have an 'ind' column? I mean, each cell just has the information about its cluster and the condition, but not the individual information.

    • @Bioinformagician
      @Bioinformagician 2 years ago

      Can you tell me where you downloaded your data from?

  • @wanisajad785
    @wanisajad785 5 months ago

    @Bioinformagician: Are you suggesting using raw counts (slot = counts) for un-integrated data and normalized counts (slot = data) for an integrated Seurat object?

  • @maytelopez-cascales6113
    @maytelopez-cascales6113 1 year ago

    Very nice tutorial. I have a question: how could I do a differential expression analysis contrasting counts that come from different experiments? I have already done the pseudobulk with the single-cell experiments, and I want to compare them with the counts from my RNA-seq. Could I make a matrix with the data coming from two different techniques? Will you make a tutorial about that? Thanks.

    • @raghavsharma4347
      @raghavsharma4347 1 year ago

      You can add your counts from your RNA-seq as another sample then adjust your contrasts so that it is your RNA-seq data minus your single cell datasets.

  • @rosaicelalunaramirez1284
    @rosaicelalunaramirez1284 1 year ago +2

    Thank you for the great tutorials, they've helped a lot on my research. I am currently working with my own single-cell data that I obtained from 6 samples (3 controls and 3 experimental). I have tried your tutorial but I get stuck on the part where you include the ind, the individual identification. Cell ranger only gives me the cell sequence followed by a -1 so I tried that and adding the condition. It looked like this CONTROL_ACCAACAGTGCATTAC-1 but when I use the aggregate expression function it gives me 12,972 columns as if it was taking each of the cells as individual sample. How can I perform your analysis without an identification number? or how can I assign it? Thank you!!

    • @Bioinformagician
      @Bioinformagician 1 year ago

      The goal is to aggregate counts at the sample level. In my case, each sample belongs to an individual, hence counts are aggregated to the ind level. In your case, you might not need the ind information. You could simply add a 'sample' column to your metadata, merge all samples, and aggregate counts to the sample level.

    • @urmom.com629
      @urmom.com629 1 year ago

      @@Bioinformagician how do you "merge all samples"?
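
The "add a 'sample' column, merge all samples, aggregate" advice in this thread can be sketched in plain Python as below. This is an illustration of the bookkeeping only, not Seurat code: the function names, sample labels, second barcode, and counts are hypothetical. In Seurat itself, `merge()` with its `add.cell.ids` argument is the usual way to combine objects while keeping barcodes unique.

```python
def merge_samples(samples):
    """Merge per-sample barcode->counts dicts into one table with per-cell metadata.

    `samples` maps sample_id -> (condition, {barcode: {gene: raw_count}}).
    Barcodes are prefixed with the sample id so they stay unique after merging.
    """
    merged_counts, metadata = {}, {}
    for sample_id, (condition, cells) in samples.items():
        for barcode, counts in cells.items():
            cell_id = f"{sample_id}_{barcode}"
            merged_counts[cell_id] = counts
            metadata[cell_id] = {"sample": sample_id, "condition": condition}
    return merged_counts, metadata

def aggregate_by(merged_counts, metadata, column):
    """Sum raw counts over all cells that share the same value in `column`."""
    totals = {}
    for cell_id, counts in merged_counts.items():
        group = totals.setdefault(metadata[cell_id][column], {})
        for gene, count in counts.items():
            group[gene] = group.get(gene, 0) + count
    return totals

# Hypothetical input: the same barcode can occur in two different samples
samples = {
    "control_1": ("control", {"ACCAACAGTGCATTAC-1": {"GeneA": 2},
                              "TTGCATGAGGCTAGCA-1": {"GeneA": 1}}),
    "treated_1": ("treated", {"ACCAACAGTGCATTAC-1": {"GeneA": 7}}),
}
merged_counts, metadata = merge_samples(samples)
per_sample = aggregate_by(merged_counts, metadata, "sample")
```

Aggregating on the `sample` column gives one column per sample. Aggregating on a label that still contains the cell barcode gives one column per cell, which matches the 12,972-column result described above.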

  • @singhh5050
    @singhh5050 2 years ago +1

    Hi! Do you think that pseudobulk analysis or GSEA is better for downstream analysis of scRNA-seq data? Especially when considering that there may be two different conditions (experimental and control). What are the advantages and disadvantages for using each method?

    • @Bioinformagician
      @Bioinformagician 2 years ago

      Pseudobulking and GSEA are completely different methods serving different purposes. Either downstream analysis can make sense, depending on the goal of your analysis. Typically, pseudobulking is performed to find differentially expressed genes, after which we use enrichment methods to find which pathways/GO terms are enriched.

    • @singhh5050
      @singhh5050 2 years ago

      @@Bioinformagician Okay, that makes sense!! Thanks so much :)

  • @bigteeth5644
    @bigteeth5644 1 year ago

    Hey there! First of all, I'd love to express my thanks to you! Your videos are helpful for our analysis. Although I ran into some problems trying to follow your tutorial. Our dataset is the aggregated snRNAseq dataset from six samples. We performed doublet removal, SoupX, scTransform normalization and integration. Some of the assay 'RNA' values are not integer. When I was searching for a solution, I read from the DESeq2 vignette that we should use un-normalized data. Do you have any suggestions on this issue? Thank you!

    • @Bioinformagician
      @Bioinformagician 1 year ago

      Which slot in the 'RNA' assay are you referring to, i.e. the counts, data, or scale.data slot? For the demonstration here, we used the 'counts' slot, which stores un-normalized raw counts, to aggregate across samples.

    • @khr1138
      @khr1138 1 year ago

      That is because SoupX turns the raw counts into non-integer values.
      Use the round() function on the counts before passing them to DESeqDataSetFromMatrix!

  • @mischmuuu
    @mischmuuu 1 year ago

    Thank you for this great tutorial! Is it possible to do a pseudo-bulk DE analysis with only one single-cell sample per condition? How would the statistics work?

    • @akundiraghukiranvydhyanath9939
      @akundiraghukiranvydhyanath9939 1 year ago +2

      I'm afraid that won't be possible. DESeq2 requires at least 2 biological replicates. The other alternative would be edgeR, but you have to supply a dispersion value.

    • @Bioinformagician
      @Bioinformagician 1 year ago

      DESeq2 is not designed to work without replicates.

  • @abassohilebo2213
    @abassohilebo2213 2 years ago +1

    Thank you for the video
    Can you organize a workshop?

    • @Bioinformagician
      @Bioinformagician 2 years ago +1

      I haven't given any thought to organizing one yet. I shall think about it.

    • @abassohilebo2213
      @abassohilebo2213 2 years ago +1

      @@Bioinformagician please do
      People tend to love workshops more, and it will double if not triple your subscribers

  • @samuelmcallister1655
    @samuelmcallister1655 1 year ago

    Hi there! I was wondering why you are using the normalised and scaled data to generate the aggregate counts - should we not use the raw data?

    • @Bioinformagician
      @Bioinformagician 1 year ago

      I am using the "counts" slot, which stores raw counts, to generate the aggregate counts.
      cts

  • @thwoals456
    @thwoals456 2 years ago

    Hello, thank you so much for your video! I have one question. I followed your single-cell tutorial video using my own single-cell data. However, there is no 'ind' column in my Seurat object. Could you tell me how to make that column?
    Additionally, I did scRNA-seq for one control sample and two treatment samples (3 in total). Is it possible to make an 'ind' column for the control sample? And can the ratio of control to treatment samples (1:2) affect the downstream analysis?
    Sorry for my many questions.

    • @Bioinformagician
      @Bioinformagician 2 years ago

      The 'ind' column was already present in the dataset, I did not create it. Did you download the data the same way I did it in the tutorial?
      Are the two treatment samples replicates or separate samples?

    • @thwoals456
      @thwoals456 1 year ago

      ​@@Bioinformagician Instead of using the data in your video, I used my scRNA-seq data for pseudo-bulk analysis. So I asked how to make the column similar to the 'ind' column.
      And "the former" is my reply to your second question. I have two replicates of the treatment sample.

    • @Bioinformagician
      @Bioinformagician 1 year ago +1

      @@thwoals456 Oh I get it now. So basically the "ind" column is nothing but information about the samples in my dataset (ind stands for individuals). If your dataset has sample information, you could use that column to aggregate your counts to the sample level.

  • @bumpingbell
    @bumpingbell 2 years ago

    Hi, I am analyzing differentially expressed genes in a snRNA-seq dataset (GSE159812), for subsequent pathway analysis. Using FindMarkers, I get extremely small p-values for differentially expressed genes. However when I aggregate the counts by cell type & sample and perform pseudo-bulk analysis, less than 0.1% genes are significant (p

    • @Bioinformagician
      @Bioinformagician 2 years ago

      FindMarkers tends to produce artificially small p-values because each cell is treated as an independent sample, even though cells within a sample are not truly independent of each other. In pseudo-bulk analysis, counts are aggregated at the sample level instead.
      The two methods will not give you the same differentially expressed genes: single-cell methods tend to identify variation between cells, while pseudo-bulking identifies variation among samples (between populations).
      Also, single-cell methods tend to identify highly expressed genes as differentially expressed and exhibit low sensitivity for genes with low expression.
      Did you aggregate counts by samples only, or by both samples and cell types?

    • @bumpingbell
      @bumpingbell 2 years ago

      @@Bioinformagician I meant that I aggregated counts across the cells to sample level, and for each cell type I made comparisons between 8 case & 8 control samples.
      I think pseudo-bulk is closer to my expectations, but as mentioned, only few genes have significant adj p-values from this method (which is surprising to me). This makes comparing single genes barely possible. If we use pathway analysis, where we can just input the log2FC of each gene from pseudo-bulk, we may not need to care about the statistical significance of each gene, but we still need to filter with adj p-values for the input to be valid. Am I right in this sense? We’re using Ingenuity Pathway Analysis. (Sorry if this is a bit off-topic in any way)

  • @user-ol8xn5wb4s
    @user-ol8xn5wb4s 1 year ago

    Aren't we supposed to not normalize first? DESeq2 requires raw read counts. Or is the counts slot raw?

    • @raghavsharma4347
      @raghavsharma4347 1 year ago

      When she aggregates counts, the function pulls data from the raw counts. Normalized counts are only used for the Seurat pipeline, but not used for differential expression analysis.

  • @sreejas1302
    @sreejas1302 1 year ago

    Thank you so much. Is it possible to convert bulk RNA seq into Sc-RNA seq by R? Please reply.

    • @raghavsharma4347
      @raghavsharma4347 1 year ago

      Why are you trying to convert bulk data to single cell? Are you trying to compare data between two datasets?

  • @efstratioskirtsios298
    @efstratioskirtsios298 1 year ago

    thats very confusing :(

  • @urmom.com629
    @urmom.com629 1 year ago

    Can you do colData without ind?