3 minute GSEA tutorial in R | RNAseq tutorials
- Published: Aug 7, 2022
- Complete gene set enrichment analysis (GSEA) R tutorial in 3 minutes. I show you which R packages to install, how to run them on your differential expression output, and how to plot the results.
My example is DESeq2 output, but you can use this on any set of genes you can rank based on LFC, p-value, etc. You can use data from outside of R if you read in the CSV.
Notebook:
github.com/mousepixels/sanbom...
More info and examples can be found:
bioconductor.org/packages/dev...
Category: Science
Hey man, you are so effective! Everything was straightforward, spot on, no time wasted. This is usually what people need, just a quick tutorial without extra info!
Thanks!
Thank you!
Thanks Man! I need a full analysis of DESeq and this
No problem!
Thanks for your tutorial!
You are welcome!
Bro, where were you like a month ago? I struggled and struggled until I figured it out. But big thanks anyway.
Wish I did it sooner for you :(. Did you end up using clusterProfiler, or something else?
@sanbomics 😅😅😅. I used clusterProfiler. Now I'm busy with WGCNA. Can you believe, just 6 months ago, I didn't even know R syntax. I just wanna get dangerous enough in R, then I'm learning Python just like you. You're a huge inspiration🙌🏿🙌🏿🙌🏿.
It's surprising how much you learn just by struggling through things in the beginning. That is definitely the way to do it IMO. Enough R to be proficient, then learn Python to be more future-proof in this age of machine learning. Thank you for the kind words!
Hi, thank you so much for this video! I am fairly new to R, so sorry if this is a dumb question, but you put geneSetID = 1 for the example of the nuclear division. For choosing specific pathways to plot, would I look up a description and then use its ID, e.g. GO:0038065, and code like so: gseaplot(gse, geneSetID = "GO0038065") ?
I have tried the above, as I expect this pathway to be highly enriched in the downregulated genes, based on the genes that were flagged in the DEG analysis. The enrichment signature is the correct way round, indicating it is enriched in the downregulated genes, but there are barely any black lines scored for the genes... Just wondered if you had any insight into this? Thanks so much!
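On the geneSetID question above, here is a rough sketch of how a term could be looked up by its ID. The table is a made-up stand-in for as.data.frame(gse), and the set sizes are invented. As far as I know, gseaplot accepts either the row position or the exact ID string, so the colon in "GO:0038065" matters. A small setSize would also be one possible reason for seeing only a few black lines, since each line is a gene from the set that is present in the ranked list.

```r
# Hypothetical slice of a result table; with a real object,
# as.data.frame(gse) should provide ID and setSize columns like these.
gse_tbl <- data.frame(
  ID      = c("GO:0000280", "GO:0038065", "GO:0006260"),
  setSize = c(180, 12, 95),   # made-up sizes for illustration
  stringsAsFactors = FALSE
)

target <- "GO:0038065"             # note the colon; "GO0038065" would not match
idx <- match(target, gse_tbl$ID)   # row position of the term, NA if absent
idx
gse_tbl$setSize[idx]               # how many genes from the set were ranked

# With a real result object, either form should select the same term:
# gseaplot(gse, geneSetID = idx)
# gseaplot(gse, geneSetID = target)
```

If match() returns NA on the real table, the term was probably filtered out by the p-value cutoff and is not in the results.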
you are great
Thank you!
Thanks for the video! Why did you extract "stat" for your genelist, rather than lfcSE or padj?
Good question! Any of these would work. Stat takes into account the difference as well as the error. Which statistic to use is somewhat arbitrary and can always be debated. Sometimes I use lfc * -log10P. For DESeq2 output, stat seems to work pretty well.
Thanks for the video! Is there a way to filter for just protein-coding genes during GSEA? (In my case, I used enrichGO)
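For anyone who wants to try both ranking metrics, a minimal sketch in base R follows. The data frame is a toy stand-in for a DESeq2 results table with made-up values, and the column names follow DESeq2's usual output. For the default Wald test, stat is log2FoldChange divided by lfcSE. The baseMean cutoff of 50 is an arbitrary choice.

```r
# Toy stand-in for a DESeq2 results table (values are made up).
res <- data.frame(
  gene           = c("A", "B", "C", "D", "E"),
  baseMean       = c(500, 20, 150, 900, 75),
  log2FoldChange = c(2.0, 3.5, -1.2, 0.1, -2.4),
  stat           = c(6.1, 1.2, -3.3, 0.4, -5.0),
  pvalue         = c(1e-9, 0.23, 1e-3, 0.69, 1e-6),
  stringsAsFactors = FALSE
)

# Drop lowly expressed genes; 50 is an arbitrary cutoff.
res <- res[res$baseMean > 50, ]

# Option 1: rank by the test statistic.
rank_stat <- sort(setNames(res$stat, res$gene), decreasing = TRUE)

# Option 2: rank by lfc * -log10(p).
rank_combo <- sort(
  setNames(res$log2FoldChange * -log10(res$pvalue), res$gene),
  decreasing = TRUE
)

names(rank_stat)
names(rank_combo)
```

Either named, decreasingly sorted vector has the shape that gseGO expects for its geneList argument. On this toy data the two metrics give the same order, which will not always hold on real data.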
This is precisely the kind of content I'm looking for while performing bioinformatics analysis. Thank you so much! Just a quick query, what exactly does the stat parameter signify? It isn't in any way a misrepresentation of our DEGs, right?
I don't remember off the top of my head exactly how it is calculated, but it takes into account the magnitude of the change as well as the standard deviation. It will be highly correlated with lfc, and abs(stat) with the p-value. Your DE genes will have a higher abs(stat). But for GSEA it is just used for ranking, and GSEA is independent of whether a gene is a "significant" DE gene. The metric to use for ranking is still debated, but they never really differ that much.
Hi! Thanks so much for this tutorial, it was extremely useful. However, while running gseGO, I get an error: could not find function "fgseaMultilevelCpp". Any suggestion?
Hey Sanbomics, great video! With this method, is it possible to plug in Hallmark gene sets from MSigDB? Not sure where I would plug those in.
You can for sure add your own gene sets. I don't know off the top of my head, but it is probably in the documentation. I think I cover it in my Python-based GSEA if you are interested.
Hi. I am working with a non-model organism. I have done functional annotation, so I have gene names, RefSeq or GO IDs along with the DESeq data.
With these data, can I do GSEA? You are using an org database to conduct GSEA.
It depends on whether your organism has annotated gene sets. If it doesn't, you could make up your own to see if they are enriched, for example by creating homolog gene sets from a model organism. And you will probably have to use gene symbols instead of ENSEMBL IDs.
Awesome video! Can you do this with a custom set of genes?
Yup! I don't remember off the top of my head the arguments for it, but it should be relatively straightforward. If you can't figure it out here, in my Python GSEA video I think I do show how to do a custom set.
Thank you so much! I was wondering if you have the code pasted in your GitHub, I couldn't find it :(
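A possible starting point for custom sets, as a sketch only: clusterProfiler's GSEA function takes a TERM2GENE argument, a two-column table of set name and gene ID. The set names and gene IDs below are hypothetical placeholders, and the GSEA call is left commented out because it needs a real ranked list (called ranked here as an assumption) and sets that meet the minimum size.

```r
# Hypothetical custom gene sets; replace with real IDs that match
# the names of your ranked gene list.
gene_sets <- list(
  MY_SET_1 = c("A", "B", "C"),
  MY_SET_2 = c("D", "E")
)

# Reshape the named list into the two-column term/gene table.
stacked <- stack(gene_sets)   # columns: values (gene), ind (set name)
term2gene <- data.frame(
  term = as.character(stacked$ind),
  gene = stacked$values,
  stringsAsFactors = FALSE
)
term2gene

# With a real ranked vector, the call would look roughly like:
# res <- clusterProfiler::GSEA(geneList = ranked, TERM2GENE = term2gene)
```

Gene sets downloaded from a collection such as MSigDB could be reshaped into the same two-column table.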
Sorry! I didn't end up posting it because it was just a few lines of code. Here it is: github.com/mousepixels/sanbomics_scripts/blob/main/3_min_GSEA_tutorial.Rmd
Thank you so much! But I have two questions: first, why did you select only baseMean > 50? Second, you used all genes, not only the differentially expressed genes, for GSEA, right?
Genes that are very lowly expressed are noisy. That's why it is good to filter them out. 50 is arbitrary, e.g., 100 could work as well. Yup! That's what I do in the video. Better to keep all genes (except ones with very low expression).
So is the purpose of doing a DEG in your video just to filter out lowly expressed genes? But you're actually using all genes (except lowly expressed ones) in the GSEA?
@chrislee8408 He did DEGs in the video just because it's faster for the demo.
Is it possible to do a gene set enrichment analysis without doing a DEG analysis? In my lab, we have just started doing NGS and we are still setting up our QCs. What we have in mind right now for one of our QC checks is to see if we can guess which sample (there are 4 samples) came from which tissue (heart, liver, kidney, diluent) by doing a gene set enrichment analysis to identify overexpressed genes that may only be expressed in specific tissues. Do you think it's feasible? Thank you.
Yes, you just can't use this method. You can do overrepresentation analysis, which just requires a list of genes. Look at one of my GO enrichment videos. ruclips.net/video/JPwdqdo_tRg/видео.html
@sanbomics I see what you're saying. I am currently stuck on how to group my samples for the DEG analysis. I know DEG analysis mostly compares, for example, "treatment" vs. "control" samples, and there are usually 3 or more replicates for each group (treatment and control). However, suppose we were "blinded" to which sample came from which tissue and which sample is the control, and we are not just comparing treatment vs. control but instead 3 different tissue types that received the same treatment vs. one control sample, with the goal of determining which sample came from which tissue based on the genes that were overexpressed in each sample. Do you have any idea how to group or design a DEG analysis so that you can then take the differentially expressed genes from each sample and do an overrepresentation analysis (I'm guessing you said GSEA isn't the best method for this goal) to figure out which sample came from which tissue? Thank you.
I think I'm still a little confused about your ultimate goal... Maybe you could filter the common DE genes for one cell type? E.g., the common upregulated genes in muscle vs liver, muscle vs kidney, muscle vs adipose.
Hi, could you do a video using just normalized counts to make a ranked list from which GSEA could then be done? Thank you!
It should be pretty straightforward to rank a list based on this. Were you able to figure it out?
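One way this could be done, as a rough sketch rather than a recommended method: take a normalized count matrix, compute the difference in mean log2 counts between two groups, and sort. The matrix values, sample names, and group labels below are all made up. This simple log fold change ignores variance between replicates, unlike the DESeq2 stat.

```r
# Made-up normalized counts: 4 genes x 4 samples.
norm_counts <- matrix(
  c( 7,  7, 31, 31,
    15, 15, 15, 15,
    63, 63,  7,  7,
     3,  3,  7,  7),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3", "g4"),
                  c("ctrl1", "ctrl2", "trt1", "trt2"))
)
group <- c("ctrl", "ctrl", "trt", "trt")   # hypothetical group labels

# A pseudocount of 1 avoids log2(0).
log_counts <- log2(norm_counts + 1)

# Mean log2 difference, treatment minus control, per gene.
lfc <- rowMeans(log_counts[, group == "trt"]) -
       rowMeans(log_counts[, group == "ctrl"])

# Named, decreasingly sorted vector: the usual ranked-list shape for GSEA.
ranked <- sort(lfc, decreasing = TRUE)
ranked
```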
Thanks for the video; the Bioconductor page ticked me off with their overly complicated tutorial with zero explanations.
No problem! Happens for a lot of things unfortunately
2:45
How did you flip the comparison at this point? What exactly did you do?
In DESeq2 I just changed the order of the contrast: ("condition", "C", "S") to ("condition", "S", "C").
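To illustrate what the flip does to the ranking, here is a small sketch with made-up stat values. Swapping the two levels in the contrast should negate log2FoldChange and stat (assuming no shrinkage is applied), so the ranked list simply reverses. The dds object in the commented calls is assumed to be a fitted DESeq2 dataset.

```r
# With a real fitted object, the two contrasts would be requested as:
# res_cs <- results(dds, contrast = c("condition", "C", "S"))
# res_sc <- results(dds, contrast = c("condition", "S", "C"))

# Made-up stat values for the first contrast.
stat_C_vs_S <- c(geneA = 4.2, geneB = -1.5, geneC = 0.3)

# Flipping the contrast changes the sign of each value.
stat_S_vs_C <- -stat_C_vs_S

order_cs <- names(sort(stat_C_vs_S, decreasing = TRUE))
order_sc <- names(sort(stat_S_vs_C, decreasing = TRUE))
order_cs
order_sc
```

Gene sets enriched at the top of one ranking end up at the bottom of the other, so the sign of the enrichment score flips as well.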
@sanbomics Thank you a lot. I have finished this RNAseq analysis and it was very useful. Keep it up.
Hello, thank you for the insights. It has been really helpful. I have been getting this error when I run the code. I use ENTREZID. Do you have any idea why?
preparing geneSet collections...
GSEA analysis...
Error: BiocParallel errors
2 remote errors, element index: 1, 57
109 unevaluated and other errors
first remote error:
Error in fgseaMultilevelCpp(x[, ES], stats, unique(x[, size]), sampleSize, : could not find function "fgseaMultilevelCpp"
In addition: Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (18.78% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In serialize(data, node$con) :
'package:stats' may not be available when loading
3: In serialize(data, node$con) :
'package:stats' may not be available when loading
Hi, sorry, it is hard to tell just from this. Were you able to figure it out?
@sanbomics Hello! No, not yet.
Hello! I found a solution which solved the BiocParallel error.
I used this code before running your code:
library(BiocParallel)
register(DoparParam())
I am not sure what it does, but it works.
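A guess at why this helps, offered as an assumption rather than a diagnosis: register() changes the default BiocParallel backend, so the GSEA step no longer runs on the worker processes that could not find fgseaMultilevelCpp. If that is the cause, registering a serial backend might have a similar effect. A minimal sketch, guarded so it does nothing when BiocParallel is not installed:

```r
# Switch the default BiocParallel backend to serial execution.
# Assumption: the error comes from parallel workers, so running
# everything in the main R session may avoid it.
if (requireNamespace("BiocParallel", quietly = TRUE)) {
  BiocParallel::register(BiocParallel::SerialParam())
  # bpparam() returns the currently registered default backend.
  backend <- class(BiocParallel::bpparam())[1]
} else {
  backend <- NA_character_   # package not installed
}
backend
```

Serial execution is slower, but it takes the parallel workers out of the picture while debugging.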
Huh, very interesting. I have never come across this. Nice job figuring it out. What operating system are you on?