Note on which genes to include: In the video I include only significant DE genes. This is not best practice, but was faster and required less memory. You SHOULD include all genes that had reasonable expression. A good threshold might be genes with a base mean of greater than 100. If you do include all genes (which you should) you will likely need to decrease the default number of permutations in the gp.prerank command by adding this argument: permutation_num = 100 See notebook for example: github.com/mousepixels/sanbomics/blob/main/GSEA_in_python.ipynb
Thanks tonz for your effort making these informative videos. Without your help, I wouldn't be able to learn such nice and easy way of making GSEA plots. I really like every of your video and every single second of it. Really appreciate your smart and clear explaination. I have one question, is there anyway to plot GSEA with all the genesets listed at once? And per se rate them in order of significance?
I need your help! :), when I run this: gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph]), the error reports says TypeError: gseaplot() got multiple values for argument 'term'
Is it possible to adjust the text font size etc., and/or remove the redundant FDR value when running on a single geneset of interest? My default text when plotting with gseapy is much blockier and often overlaps the graph in an ugly way.
I haven't tried, but you may be able to control it be changing the matplotlib default settings. Im not sure gsea returns an axis so it may not be super easy to customize after. Were you able to find a solution?
Thanks, it is super useful. But when I run gseapy.prerank(....), got the error "No gene sets passed through filtering condition", even try different min_size and permutation_num. Do you know where is wrong?
There are several things that could have gone wrong. The two most likely are either your keys don't match the ones in the database (e.g, they are case sensitive) or you don't have any enriched pathways. Its hard to tell without looking at it more closely.
@@sanbomics Thanks for the reply. Yes, the reason is that my data has no enriched GO terms. (I was stupid to use a small test dataset, and misunderstood the returned error.)
Thank you for this video and all of your other videos, they are very concise and educational. In this video, you pre-filter the genes in your ranked list. According to my understanding, this should not be done. I think that the ranked list should include all the genes in the experiment that have any evidence of expression. In other words, if it is possible to calculate a ranking metric, the gene should be included. Can you comment on the decision to use only deferentially expressed genes? I think this idea actually illustrates one of the strengths of GSEA in that it is possible to make biological interpretations in the absence of genes that pass thresholds that define differential expression. I was also wondering about your ranking metric itself. You are using log2FoldChange*-log10(adjp) which is certainly reasonable. I normally use the stat column for this which is something like log2FoldChange/StdError. Could you comment on your choice of ranking metric? Thanks!
Thanks for catching this. You raise very important points: 1) rerunning the notebook I realized why I only used the DE genes. With 1000 default permutations my 128 gb computer runs out of memory and it is very slow. It is better to include all genes like you point out. The results likely aren't much different since the non-DE genes are weighted so little, but it is better to include them all. In this case you will likely have to reduce the number of permutations. "permutation_num" argmuent in gp.prerank. I have included a correction in the video description and I will push a new notebook. 2) I'm not aware of a a definitive best metric to use to rank the genes. Stat is likely a very good choice. But p value and lfc are more universal. e.g., some single-cell programs do not give you a stat. I think both p and lfc individually may not be ideal, so I had learned to combine them through multiplication. Not sure if this is the BEST way, but seems to work well. Thanks again!
HI! Thank you for another great video! Can you also use gene sets from msigdb for this? I have them downloaded in visual code in a separate folder, I am working on a server. I wanted to use msigdb package in R with rpy2 , but I have too many problems with setting up an environment, that's why I gave up on R.
you can run any set of genes you can get into a python dictionary/list. Check the available databases first and see if it already exists. But if it doesn't-- import your data into a dictionary format, eg, gene_sets = {'term 1':['gene1', 'gene2' ...] , 'term 2':['gene1', 'gene2' ...], etc } then put the dictionary into the gene set argument in the gseapy function
Thank you for the great presentation. I have an error in the very last code no differences in front of these code term_to_graph > 'GO_Biological_Process_2021__cellular response to DNA damage stimulus (GO:0006974)' gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph]) > --------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[122], line 1 ----> 1 gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph]) TypeError: gseaplot() got multiple values for argument 'term' I cannot solve this error. if possible, could you check this?
Note on which genes to include:
In the video I include only significant DE genes. This is not best practice, but was faster and required less memory. You SHOULD include all genes that had reasonable expression. A good threshold might be genes with a base mean of greater than 100. If you do include all genes (which you should) you will likely need to decrease the default number of permutations in the gp.prerank command by adding this argument: permutation_num = 100
See notebook for example: github.com/mousepixels/sanbomics/blob/main/GSEA_in_python.ipynb
How did you managed to get the result file?
Thanks tonz for your effort making these informative videos. Without your help, I wouldn't be able to learn such nice and easy way of making GSEA plots. I really like every of your video and every single second of it. Really appreciate your smart and clear explaination.
I have one question, is there anyway to plot GSEA with all the genesets listed at once?
And per se rate them in order of significance?
You mean multiple lines on one chart? Or plot many individual plots?
@@sanbomics wow… i meant individuals originally but now also wanna know how to plot multiple, thanks for the response!
I need your help! :), when I run this: gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph]), the error reports says TypeError: gseaplot() got multiple values for argument 'term'
print term_to_graph and see what it is
it prints 'mitotic spindle organization (GO:0007052)'@@sanbomics
Try this: gseaplot(rank_metric=pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph])
wow! It works! Thank a lot !! @@csalt3689
@@csalt3689 You are a lifesaver! Thank you.
Is it possible to adjust the text font size etc., and/or remove the redundant FDR value when running on a single geneset of interest? My default text when plotting with gseapy is much blockier and often overlaps the graph in an ugly way.
I haven't tried, but you may be able to control it be changing the matplotlib default settings. Im not sure gsea returns an axis so it may not be super easy to customize after. Were you able to find a solution?
@@sanbomics A fairly crude one, for now I just save the plots as pdf and then edit fonts etc. with a pdf editor.
Thanks, it is super useful. But when I run gseapy.prerank(....), got the error "No gene sets passed through filtering condition", even try different min_size and permutation_num. Do you know where is wrong?
There are several things that could have gone wrong. The two most likely are either your keys don't match the ones in the database (e.g, they are case sensitive) or you don't have any enriched pathways. Its hard to tell without looking at it more closely.
@@sanbomics Thanks for the reply. Yes, the reason is that my data has no enriched GO terms. (I was stupid to use a small test dataset, and misunderstood the returned error.)
Thank you for this video and all of your other videos, they are very concise and educational.
In this video, you pre-filter the genes in your ranked list. According to my understanding, this should not be done. I think that the ranked list should include all the genes in the experiment that have any evidence of expression. In other words, if it is possible to calculate a ranking metric, the gene should be included. Can you comment on the decision to use only deferentially expressed genes?
I think this idea actually illustrates one of the strengths of GSEA in that it is possible to make biological interpretations in the absence of genes that pass thresholds that define differential expression.
I was also wondering about your ranking metric itself. You are using log2FoldChange*-log10(adjp) which is certainly reasonable. I normally use the stat column for this which is something like log2FoldChange/StdError. Could you comment on your choice of ranking metric? Thanks!
Thanks for catching this. You raise very important points:
1) rerunning the notebook I realized why I only used the DE genes. With 1000 default permutations my 128 gb computer runs out of memory and it is very slow. It is better to include all genes like you point out. The results likely aren't much different since the non-DE genes are weighted so little, but it is better to include them all. In this case you will likely have to reduce the number of permutations. "permutation_num" argmuent in gp.prerank. I have included a correction in the video description and I will push a new notebook.
2) I'm not aware of a a definitive best metric to use to rank the genes. Stat is likely a very good choice. But p value and lfc are more universal. e.g., some single-cell programs do not give you a stat. I think both p and lfc individually may not be ideal, so I had learned to combine them through multiplication. Not sure if this is the BEST way, but seems to work well.
Thanks again!
Great Video! Can you explain the formula for Rank? why the log10(df.padj) should be negtive?
Thanks :) probably just easier to rank on stat. It needs to be negative to make it positive.
HI! Thank you for another great video! Can you also use gene sets from msigdb for this? I have them downloaded in visual code in a separate folder, I am working on a server. I wanted to use msigdb package in R with rpy2 , but I have too many problems with setting up an environment, that's why I gave up on R.
you can run any set of genes you can get into a python dictionary/list. Check the available databases first and see if it already exists. But if it doesn't-- import your data into a dictionary format, eg, gene_sets = {'term 1':['gene1', 'gene2' ...] , 'term 2':['gene1', 'gene2' ...], etc } then put the dictionary into the gene set argument in the gseapy function
hi! Where is the formula for rank from? Also great videos! :)
You should just rank on STAT which accounts for change and variance
Thank you for the great presentation.
I have an error in the very last code
no differences in front of these code
term_to_graph
> 'GO_Biological_Process_2021__cellular response to DNA damage stimulus (GO:0006974)'
gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph])
> ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[122], line 1
----> 1 gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph])
TypeError: gseaplot() got multiple values for argument 'term'
I cannot solve this error. if possible, could you check this?
I have the exact same error over and over again no matter what I correct! Did you maybe manage to solve it?
@@ΔημητραΜποζογλου yes, i solve the problem.
could you explain you code more detail or send me your script?
@@수박바아 Yes of course! How can I contact you?
@@수박바아 Yes of course! How can I contact you?
@@수박바아 Yes of course! How can I contact you?