Thank you! I spent a while trying to figure out how to do pathway analysis in R, and most guides expect you to already have some sort of GO or KEGG library to refer to; they don't go into specifics about how these libraries work or what to do when they don't. This step-by-step guide was enough to get me from DEG lists to proper pathway analysis, and I even understood what I was doing in each step and why! I am working with rat sequencing data, and some of my columns were very different from the example data here, but after rechecking specific points a few times I managed to filter and reformat all the necessary information from my data.
Great video! I learned about msigdbr and the tidyr::separate function. I just want to mention a few things.
1. The GSEA ranking metric doesn’t have to be fold-change. I use the gene-wise average moderated t-statistic from limma or the signed -log10-transformed p-value. There are a ton of ranking metrics to choose from. These two are very similar, and we can compare their density plots to get an idea of how they would alter the GSEA results.
2. Over-representation analysis is not great as a follow-up to differential analysis because of the arbitrary significance threshold that you mentioned and the fact that there may be duplicates at the gene level. Also, we lose information about the direction of change, since ORA only tells us which sets are more present in the significant group than what we expect by chance. However, it is great when genes uniquely map to discrete clusters, so it is good as a follow-up to WGCNA or K-means clustering.
3. The figures you use to introduce GSEA show the phenotype permutation approach, but most R implementations (including fgsea) use the gene permutation approach, which is much faster but has a slightly different interpretation.
4. For ORA, it may be useful to plot the ratio of the number of significant genes in the gene sets to the total number of significant genes along the x-axis and change the bars to points scaled according to the -log10(adjusted p-value). Gene sets that include all significant genes (ratio of 1) may be interesting to look at, even if their adjusted p-values are hovering near 0.05.
5. The fora function in fgsea can be used for ORA as well. Personally, I find it easier than dealing with the bulkier clusterProfiler results objects.
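To make point 1 concrete, here is a minimal base-R sketch of the two ranking metrics mentioned. The table `res` and all of its values are invented for illustration; the column names `logFC`, `t`, and `P.Value` follow the layout of limma's `topTable()` output, and the rest is an assumption rather than anything from the video.

```r
# Hypothetical differential expression results (values invented for illustration).
res <- data.frame(
  gene    = c("A", "B", "C", "D", "E"),
  logFC   = c(2.1, -1.5, 0.3, -0.2, 1.0),
  t       = c(6.0, -4.2, 0.8, -0.5, 3.1),
  P.Value = c(1e-6, 1e-4, 0.43, 0.62, 0.004)
)

# Metric 1: the moderated t-statistic, used as-is.
rank_t <- setNames(res$t, res$gene)

# Metric 2: signed -log10 p-value; the sign comes from the fold change direction.
rank_p <- setNames(sign(res$logFC) * -log10(res$P.Value), res$gene)

# Spearman correlation shows how similarly the two metrics order the genes.
rho <- cor(rank_t, rank_p, method = "spearman")

# GSEA tools generally expect a named numeric vector sorted in decreasing order.
rank_t <- sort(rank_t, decreasing = TRUE)
rank_p <- sort(rank_p, decreasing = TRUE)
```

Either sorted vector could then be passed as the `stats` argument of `fgsea::fgsea()`, and `plot(density(rank_t))` next to `plot(density(rank_p))` gives the density comparison described above.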
I really enjoyed your presentation. I learned quite a bit. Thank you!
Thank you! I was just wondering which paper to cite when performing the hypergeometric "simple" enrichment?
How can I download the gene sets directly in RStudio?
Can we do gene set enrichment analysis for rice using the same code and databases?
Hi, how can we do GSEA for DNA methylation genes? I have beta values for the samples and a logFC cutoff for the same. Thank you.
Hi, great video and clarification of the types of enrichment analyses. I have a question: what is the best way to create a ranked list of genes for 3 treatment and 3 control samples in one data frame using just normalized read counts? I want to rank all genes, not just DEGs, and then do enrichment analysis. Thank you!
I'm new to this and I'm wondering why you need to see how much your significant genes overlap with a larger or other gene set. Is that to elucidate which transcriptional regulation network controls the significant genes and/or to discover other genes similar to the genes of interest?
Very clear explanation, thanks for this amazing content! Would you have any additional bio-inf analysis tutorials?
Thanks!
I have other R workshop videos ruclips.net/p/PL_Oo8UFoIb007lGeg78awOu44Ido35zsY
with materials for those and other workshops that don't have videos at
github.com/BIGslu/workshops and github.com/hawn-lab/workshops_UW_Seattle
Thank you so much this was so helpful!
Thank you for the very clear explanations. One question: for enrichment analysis (either simple enrichment or GSEA), what type of normalization of the counts should one use? Or does it even matter? If so, how would it differ between the two methods? Thank you!
For RNA-seq GSEA, we use fold changes calculated from TMM-normalized log2 counts per million (see the limma package tutorial) or estimates output by whatever linear model we ran. In essence, whatever data normalization needs to be done for the stats should also be done before calculating fold changes for GSEA.
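As a rough sketch of that workflow, assuming the edgeR package is installed: the count matrix below is simulated, the sample names and the 3 vs 3 design are made up, and the fold change is a plain difference of group means rather than an estimate from a fitted limma model.

```r
library(edgeR)

# Simulated raw counts: 100 genes x 6 samples (3 control, 3 treatment).
# All values and names here are hypothetical.
set.seed(1)
counts <- matrix(
  rnbinom(600, mu = 100, size = 10),
  nrow = 100,
  dimnames = list(paste0("gene", 1:100),
                  c(paste0("ctrl", 1:3), paste0("trt", 1:3)))
)
group <- factor(rep(c("ctrl", "trt"), each = 3))

# TMM normalization factors, then log2 counts per million.
dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge, method = "TMM")
logcpm <- cpm(dge, log = TRUE)

# Simple log2 fold change: mean treatment log2 CPM minus mean control log2 CPM.
log2fc <- rowMeans(logcpm[, group == "trt"]) - rowMeans(logcpm[, group == "ctrl"])

# Rank all genes (not just significant ones) for GSEA.
ranked <- sort(log2fc, decreasing = TRUE)
```

With real data, a model-based estimate (for example the `logFC` column from a limma fit) would usually be preferable to the raw difference of means used here.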
For simple enrichment, it's similar. Treat the data however is best for the statistical tests. Then find the significant genes from those tests and input those gene lists into the enrichment analysis.
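For anyone curious what the simple (hypergeometric) enrichment does with that gene list, here is a minimal base-R sketch for a single gene set. The gene IDs, the universe of 1000 genes, and the set sizes are all invented for illustration.

```r
# Hypothetical universe: every gene that was tested in the differential analysis.
universe <- paste0("g", 1:1000)

# Hypothetical gene set (50 genes) and significant gene list (40 genes).
gene_set    <- paste0("g", 1:50)
significant <- paste0("g", c(1:10, 501:530))

# Restrict both lists to the universe so the counts are comparable.
gene_set    <- intersect(gene_set, universe)
significant <- intersect(significant, universe)

# Number of significant genes that fall in the gene set.
overlap <- length(intersect(significant, gene_set))

# P(overlap or more) under the hypergeometric distribution:
# draw length(significant) genes from a universe that contains
# length(gene_set) "in set" genes and the rest "not in set".
p_value <- phyper(
  q = overlap - 1,
  m = length(gene_set),
  n = length(universe) - length(gene_set),
  k = length(significant),
  lower.tail = FALSE
)

# Fraction of the significant genes that belong to this gene set.
gene_ratio <- overlap / length(significant)
```

In practice this is repeated for every gene set and the p-values are adjusted for multiple testing, for example with `p.adjust(..., method = "BH")`.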
Hello, how can we add gene count and p-value to the same histogram using the clusterProfiler package in R?
Do you mean the FDR by count histograms around 35min? You can add the total # of genes (count) to the top of each bar in a histogram with
stat_bin(geom="text", aes(label=after_stat(count)))
And to plot Pvalue, I would make a new plot with x=Pval instead of x=FDR
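A small self-contained sketch of that suggestion, assuming ggplot2 3.3.0 or later (where `after_stat(count)` is the current spelling of the older `..count..` notation). The data frame `res` and its `FDR` and `Pval` columns are simulated stand-ins, not the workshop data.

```r
library(ggplot2)

# Simulated results table; column names are hypothetical.
set.seed(42)
res <- data.frame(FDR = runif(200), Pval = runif(200))

# Histogram of FDR values with the gene count printed above each bar.
# The same number of bins is used in both layers so the labels line up.
p_fdr <- ggplot(res, aes(x = FDR)) +
  geom_histogram(bins = 20) +
  stat_bin(bins = 20, geom = "text",
           aes(label = after_stat(count)), vjust = -0.5)

# Separate plot for the unadjusted p-values.
p_pval <- ggplot(res, aes(x = Pval)) +
  geom_histogram(bins = 20) +
  stat_bin(bins = 20, geom = "text",
           aes(label = after_stat(count)), vjust = -0.5)
```

Printing `p_fdr` or `p_pval` draws the corresponding plot.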