3 minute GSEA tutorial in R | RNAseq tutorials

Sanbomics

24K views

Published: Aug 7, 2022
Complete gene set enrichment analysis (GSEA) R tutorial in 3 minutes. I show you which R packages to install, how to run them on your differential expression output, and how to plot the results.
My example is DESeq2 output, but you can use this on any set of genes you can rank based on LFC, p-value, etc. You can use data from outside of R if you read in the CSV.
Notebook:
github.com/mousepixels/sanbom...
More info and examples can be found:
bioconductor.org/packages/dev...
Science

Comments • 47

@aldaszarnauskas27 1 year ago ⁺⁵
Hey man, you are so effective! Everything was straightforward, spot on, no time wasted. This is what people usually need, just a quick tutorial without extra info!
Thanks!
@sanbomics 1 year ago
Thank you!
@xelaldaero9339 1 year ago
Thanks Man! I need a full analysis of DESeq and this
@sanbomics 1 year ago
No problem!
@lst595991 1 year ago
Thanks for your tutorial!
@sanbomics 1 year ago
You are welcome!
@mocabeentrill 1 year ago ⁺¹
Bro, where were you like a month ago? I struggled and struggled until I figured it out, but big thanks anyway.
@sanbomics 1 year ago
Wish I did it sooner for you :(. Did you end up using clusterProfiler, or something else?
@mocabeentrill 1 year ago ⁺¹
@@sanbomics 😅😅😅. I used clusterProfiler. Now I'm busy with WGCNA. Can you believe, just 6 months ago, I didn't even know R syntax. I just wanna get dangerous enough in R, then I'm learning Python just like you. You're a huge inspiration🙌🏿🙌🏿🙌🏿.
@sanbomics 1 year ago ⁺¹
It's surprising how much you learn just by struggling through things in the beginning. That is definitely the way to do it IMO. Enough R to be proficient, then learn Python to be more future-proof in this age of machine learning. Thank you for the kind words!
@mariannebest6796 1 year ago ⁺¹
Hi, thank you so much for this video! I am fairly new to R, so sorry if this is a dumb question. You put geneSetID = 1 for the nuclear division example, but to plot a specific pathway, would I look up its description and then use its ID, e.g. GO:0038065, and code it like so: gseaplot(gse, geneSetID = "GO0038065") ?
I tried the above because I expect this pathway to be highly enriched in the downregulated genes, based on the genes that were flagged in the DEG analysis. The enrichment signature is the correct way round, indicating enrichment in the downregulated genes, but there are barely any black lines scored for the genes... Just wondered if you had any insight into this? Thanks so much!
@ahmedal-mammari9639 1 year ago
you are great
@sanbomics 1 year ago
Thank you!
@MrFluffster101 1 year ago ⁺²
Thanks for the video! Why did you extract "stat" for your gene list, rather than lfcSE or padj?
@sanbomics 1 year ago ⁺³
Good question! Any of these would work. Stat takes into account the difference as well as the error. Which statistic to use is somewhat arbitrary and can always be debated. Sometimes I use lfc * -log10P. For DESeq2 output, stat seems to work pretty well.
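A minimal sketch of the two ranking metrics mentioned in this reply, assuming a table with the column names DESeq2 uses (log2FoldChange, lfcSE, pvalue). The gene names and values are invented, and `make_ranked_list` is a hypothetical helper, not part of any package. For DESeq2's default Wald test, stat is the log2 fold change divided by its standard error, which is why it reflects both the difference and the error.

```r
# Toy table mimicking the columns of a DESeq2 results data frame.
# All values are invented for illustration.
res <- data.frame(
  gene           = c("GeneA", "GeneB", "GeneC", "GeneD"),
  log2FoldChange = c(2.0, -1.5, 0.3, -0.1),
  lfcSE          = c(0.4, 0.5, 0.3, 0.2),
  pvalue         = c(1e-6, 3e-3, 0.32, 0.62)
)

# Wald statistic: effect size divided by its standard error.
res$stat <- res$log2FoldChange / res$lfcSE

# Alternative metric from the reply: lfc * -log10(p).
res$lfc_logp <- res$log2FoldChange * -log10(res$pvalue)

# Hypothetical helper: build a named numeric vector sorted in decreasing
# order, the input shape that clusterProfiler's GSEA functions expect.
make_ranked_list <- function(scores, ids) {
  sort(setNames(scores, ids), decreasing = TRUE)
}

by_stat     <- make_ranked_list(res$stat, res$gene)
by_lfc_logp <- make_ranked_list(res$lfc_logp, res$gene)

print(by_stat)
print(by_lfc_logp)
```

On this toy table both metrics give the same gene order; on real data they can disagree for genes with large fold changes but noisy estimates.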
@cleo4325 4 months ago
Thanks for the video! Is there a way to filter for just protein-coding genes during GSEA? (In my case, I used enrichGO)
@niharikasingh7677 1 year ago
This is precisely the kind of content I'm looking for while performing bioinformatics analysis. Thank you so much! Just a quick query, what exactly does the stat parameter signify? It isn't in any way a misrepresentation of our DEGs, right?
@sanbomics 1 year ago
I don't remember off the top of my head exactly how it is calculated, but it takes into account the magnitude of the change as well as its standard error. It will be highly correlated with lfc, and abs(stat) with p-value. Your DE genes will have a higher abs(stat). But for GSEA it is just used for ranking, and GSEA is independent of whether a gene is a "significant" DE gene. The metric to use for ranking is still debated, but the options never really differ that much.
@giuconv7832 9 months ago
Hi! Thanks so much for this tutorial, it was extremely useful. However, while running gseGO, I get an error: could not find function "fgseaMultilevelCpp". Any suggestions?
@13attles 1 year ago
Hey Sanbomics, great video! With this method, is it possible to plug in Hallmark gene sets from MSigDB? Not sure where I would plug those in.
@sanbomics 1 year ago ⁺²
You can for sure add your own gene sets. I don't know off the top of my head, but it is probably in the documentation. I think I cover it in my Python-based GSEA video if you are interested.
@kyungwonmin7217 1 year ago ⁺¹
Hi. I am working with a non-model organism. I have done functional annotation, so I have gene names, RefSeq or GO IDs along with the DESeq data.
With these data, can I do GSEA? You are using an org database in order to conduct GSEA.
@sanbomics 1 year ago ⁺¹
It depends on whether your organism has annotated gene sets. If it doesn't, you could make up your own to see if they are enriched, for example by creating homolog gene sets from a model organism. And you will probably have to use gene symbols instead of ENSEMBL IDs.
@violetaduranlaforet5520 11 months ago
Awesome video! Can you do this with a custom set of genes?
@sanbomics 11 months ago
Yup! I don't remember off the top of my head the arguments for it, but it should be relatively straightforward. If you can't figure it out here, I think I show how to do a custom set in my Python GSEA video.
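One way this can be done in R is clusterProfiler's general `GSEA()` function, which takes gene sets as a two-column data frame through its `TERM2GENE` argument. The sketch below is not from the video: the ranked list is simulated, and the set names TOP_SET and RANDOM_SET are made up for illustration. The call only runs if clusterProfiler is installed.

```r
set.seed(1)

# Simulated ranked list: 500 hypothetical genes with random scores,
# named and sorted in decreasing order.
gene_ids  <- paste0("Gene", seq_len(500))
gene_list <- sort(setNames(rnorm(500), gene_ids), decreasing = TRUE)

# Custom gene sets as a two-column data frame: set name, then gene ID.
# TOP_SET is deliberately built from the 30 top-ranked genes,
# RANDOM_SET is a random draw of 30 genes.
term2gene <- data.frame(
  term = rep(c("TOP_SET", "RANDOM_SET"), each = 30),
  gene = c(names(gene_list)[1:30], sample(gene_ids, 30))
)

if (requireNamespace("clusterProfiler", quietly = TRUE)) {
  res <- clusterProfiler::GSEA(
    geneList     = gene_list,
    TERM2GENE    = term2gene,
    minGSSize    = 10,
    pvalueCutoff = 1  # keep every set so the toy output is not empty
  )
  print(as.data.frame(res)[, c("ID", "NES", "p.adjust")])
}
```

For a downloaded collection such as the MSigDB Hallmark sets, `clusterProfiler::read.gmt()` reads a .gmt file into the same two-column shape. The gene IDs in the sets must be the same type as the names of the ranked list.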
@gab4434 1 year ago
Thank you so much! I was wondering if you have the code pasted in your GitHub, I couldn't find it :(
@sanbomics 1 year ago
Sorry! I didn't end up posting it because it was just a few lines of code. Here it is: github.com/mousepixels/sanbomics_scripts/blob/main/3_min_GSEA_tutorial.Rmd
@yijingwang7308 1 year ago
Thank you so much! But I have two questions. First, why did you select only baseMean > 50? Second, you put all genes into GSEA, not just the differentially expressed genes, right?
@sanbomics 1 year ago ⁺¹
Genes that are very lowly expressed are noisy. That's why it is good to filter them out. 50 is arbitrary, e.g., 100 could work as well. Yup! That's what I do in the video. Better to keep all genes (except ones with very low expression).
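The filtering step described in this reply can be sketched in base R as follows. The table is invented and only mimics the baseMean and stat columns of DESeq2 output. The cutoff of 50 is the arbitrary value from the thread.

```r
# Toy results table; values are invented for illustration.
res <- data.frame(
  gene     = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
  baseMean = c(1200, 12, 75, 3, 480),
  stat     = c(4.1, 6.0, -2.2, NA, 0.3)
)

min_base_mean <- 50  # arbitrary cutoff, as noted in the reply

# Drop lowly expressed genes and genes without a statistic,
# but keep everything else, significant or not.
keep     <- !is.na(res$stat) & res$baseMean > min_base_mean
filtered <- res[keep, ]

# Named, decreasing vector of the remaining genes for GSEA.
gene_list <- sort(setNames(filtered$stat, filtered$gene), decreasing = TRUE)
print(gene_list)
```

GeneB has the largest statistic in the toy table but is removed because its mean expression falls below the cutoff, which is the point of filtering before ranking.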
@chrislee8408 1 year ago
So is the purpose of doing a DEG in your video just to filter out lowly expressed genes? But you're actually using all genes (except lowly expressed ones) in the GSEA?
@elhombreloco3680 1 year ago
@@chrislee8408 He did DEGs in the video just because it's faster for the demo
@chrislee8408 1 year ago ⁺¹
Is it possible to do a gene set enrichment analysis without doing a DEG? In my lab, we have just started doing NGS and we are still setting up our QCs. What we have in mind right now for one of our QCs is to see if we can guess which sample (there are 4 samples) came from which tissue (heart, liver, kidney, diluent) by doing a gene set enrichment analysis to see if we can identify overexpressed genes which may only be expressed in specific tissues. Do you think it's feasible? Thank you.
@sanbomics 1 year ago ⁺¹
Yes, you just can't use this method. You can do overrepresentation analysis, which just requires a list of genes. Look at one of my GO enrichment videos. ruclips.net/video/JPwdqdo_tRg/видео.html
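Overrepresentation analysis boils down to asking whether a gene list overlaps a gene set more than chance would predict, which is commonly tested with a hypergeometric (one-sided Fisher) test. The base R sketch below shows that calculation for one gene set. All four counts are invented for illustration.

```r
# Invented counts for a single gene set.
universe_size <- 10000  # background genes that could have been detected
set_size      <- 200    # genes annotated to the gene set
list_size     <- 300    # genes in the list of interest
overlap       <- 15     # genes that are in both

# Overlap expected by chance alone.
expected <- list_size * set_size / universe_size

# P(overlap >= observed) under the hypergeometric distribution.
p_value <- phyper(
  overlap - 1,
  set_size,
  universe_size - set_size,
  list_size,
  lower.tail = FALSE
)

cat("expected overlap:", expected, "\n")
cat("p-value:", signif(p_value, 3), "\n")
```

In practice this test is repeated for every gene set and the p-values are corrected for multiple testing. Unlike GSEA, no ranking is involved, only list membership.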
@chrislee8408 1 year ago
@@sanbomics I see what you're saying. I am currently stuck on how to group my samples for the DEG analysis. I know DEG mostly compares, for example, "treatment" vs. "control" samples, usually with 3 or more replicates for each group. However, suppose we are "blinded" to which sample came from which tissue and which sample is the control, and we're not just comparing treatment vs. control but 3 different tissue types that received the same treatment vs. one control sample. Our goal is to determine which sample came from which tissue based on the genes that are overexpressed in each sample. Do you have any idea how to group or design a DEG analysis so that we can take the differentially expressed genes from each sample and do an overrepresentation analysis (I'm guessing you said GSEA isn't the best method for this goal) to figure out which sample came from which tissue? Thank you.
@sanbomics 1 year ago
I think I'm still a little confused about your ultimate goal... Maybe you could filter the common DE genes for one cell type? E.g., the common upregulated genes in muscle vs liver, muscle vs kidney, muscle vs adipose.
@joeyoviedo5202 9 months ago
Hi, could you do a video using just normalized counts to make a ranked list from which GSEA could then be done? Thank you!
@sanbomics 8 months ago
It should be pretty straightforward to rank a list based on this. Were you able to figure it out?
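One simple way to rank genes from normalized counts alone is the difference in mean log2 expression between two groups. This is an assumption about what the question had in mind, not something shown in the video, and unlike the DESeq2 statistic it ignores variance between replicates. The count matrix and sample names below are invented.

```r
# Invented normalized counts: 3 genes, 2 control and 2 treated samples.
norm_counts <- matrix(
  c(100, 120, 400, 380,
     50,  55,  48,  52,
    900, 880, 200, 230),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("GeneA", "GeneB", "GeneC"),
    c("ctrl_1", "ctrl_2", "treat_1", "treat_2")
  )
)
group <- c("ctrl", "ctrl", "treat", "treat")

# A pseudocount avoids log2(0) for genes with zero counts.
pseudocount <- 1
log_counts  <- log2(norm_counts + pseudocount)

# Difference of mean log2 expression, treated minus control.
log2fc <- rowMeans(log_counts[, group == "treat"]) -
  rowMeans(log_counts[, group == "ctrl"])

# Named, decreasing vector ready to pass to a GSEA function.
gene_list <- sort(log2fc, decreasing = TRUE)
print(gene_list)
```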
@frankchen1845 1 year ago ⁺¹
Thanks for the video; the Bioconductor page ticked me off with their overly complicated tutorial with zero explanations
@sanbomics 1 year ago
No problem! Happens for a lot of things unfortunately
@0916079787 1 year ago
2:45
How did you flip the comparison at this point? What exactly did you do?
@sanbomics 1 year ago ⁺¹
In DESeq2 I just changed the order of the contrast: ("condition", "C", "S") to ("condition", "S", "C")
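A short sketch of what swapping the contrast does, assuming DESeq2 is installed. It uses DESeq2's built-in simulated dataset, whose condition levels are "A" and "B" rather than the "C" and "S" from the video, and `flip_contrast` is a hypothetical helper. In a DESeq2 contrast the second element is the numerator and the third the denominator, so swapping them should flip the sign of the fold changes and of the Wald statistic, which reverses the ranked list.

```r
# Hypothetical helper: swap numerator and denominator of a contrast.
flip_contrast <- function(contrast) {
  c(contrast[1], contrast[3], contrast[2])
}

print(flip_contrast(c("condition", "C", "S")))

if (requireNamespace("DESeq2", quietly = TRUE)) {
  set.seed(1)
  # Simulated dataset shipped with DESeq2; condition has levels "A" and "B".
  dds <- DESeq2::makeExampleDESeqDataSet(n = 200, m = 6)
  dds <- DESeq2::DESeq(dds, quiet = TRUE)

  b_vs_a <- c("condition", "B", "A")
  res_b_vs_a <- DESeq2::results(dds, contrast = b_vs_a)
  res_a_vs_b <- DESeq2::results(dds, contrast = flip_contrast(b_vs_a))

  # Check whether the statistics are mirror images of each other.
  print(all.equal(res_b_vs_a$stat, -res_a_vs_b$stat))
}
```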
@0916079787 1 year ago
@@sanbomics Thanks a lot. I have finished this RNAseq and it was very useful. Keep it up.
@meetpanjwani2752 1 year ago
Hello, thank you for the insights. It has been really helpful. I have been getting this error when I run the code. I use Entrez IDs. Do you have any idea why?
preparing geneSet collections...
GSEA analysis...
Error: BiocParallel errors
2 remote errors, element index: 1, 57
109 unevaluated and other errors
first remote error:
Error in fgseaMultilevelCpp(x[, ES], stats, unique(x[, size]), sampleSize, : could not find function "fgseaMultilevelCpp"
In addition: Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (18.78% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In serialize(data, node$con) :
'package:stats' may not be available when loading
3: In serialize(data, node$con) :
'package:stats' may not be available when loading
@sanbomics 1 year ago
Hi, sorry, it is hard to tell just from this. Were you able to figure it out?
@meetpanjwani2752 1 year ago
@@sanbomics Hello! No, not yet.
@meetpanjwani2752 1 year ago
Hello! I found a solution which solved the BiocParallel error.
I used this code before running your code.
library(BiocParallel)
register(DoparParam())
I am not sure what it does, but it works.
@sanbomics 1 year ago
Huh, very interesting. I have never come across this. Nice job figuring it out. What operating system are you on?
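The workaround above changes which BiocParallel backend is registered as the default. The thread does not establish the cause of the error. If it comes from parallel worker processes failing to find the fgsea function, which is only a guess, then registering a serial backend is another option worth trying. The sketch below uses `BiocParallel::SerialParam()` and only runs if BiocParallel is installed.

```r
serial_registered <- FALSE

if (requireNamespace("BiocParallel", quietly = TRUE)) {
  # Make single-process evaluation the default backend for this session.
  BiocParallel::register(BiocParallel::SerialParam())

  # bpparam() returns the backend that is currently the default.
  serial_registered <- class(BiocParallel::bpparam())[1] == "SerialParam"
  print(class(BiocParallel::bpparam()))
}
```

Serial evaluation is slower than a parallel backend. If the error persists, reinstalling fgsea and clusterProfiler so that their versions match may be worth checking, since a missing internal function can also point to a version mismatch.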
