Simple guide to GSEA and plotting in python

Поделиться
HTML-код
  • Опубликовано: 22 янв 2025

Комментарии • 31

  • @sanbomics
    @sanbomics  2 года назад +4

    Note on which genes to include:
    In the video I include only significant DE genes. This is not best practice, but was faster and required less memory. You SHOULD include all genes that had reasonable expression. A good threshold might be genes with a base mean of greater than 100. If you do include all genes (which you should) you will likely need to decrease the default number of permutations in the gp.prerank command by adding this argument: permutation_num = 100
    See notebook for example: github.com/mousepixels/sanbomics/blob/main/GSEA_in_python.ipynb

    • @aelnabwy
      @aelnabwy 2 года назад

      How did you managed to get the result file?

  • @jhpa0308
    @jhpa0308 Год назад

    Thanks tonz for your effort making these informative videos. Without your help, I wouldn't be able to learn such nice and easy way of making GSEA plots. I really like every of your video and every single second of it. Really appreciate your smart and clear explaination.
    I have one question, is there anyway to plot GSEA with all the genesets listed at once?
    And per se rate them in order of significance?

    • @sanbomics
      @sanbomics  Год назад +1

      You mean multiple lines on one chart? Or plot many individual plots?

    • @jhpa0308
      @jhpa0308 Год назад

      @@sanbomics wow… i meant individuals originally but now also wanna know how to plot multiple, thanks for the response!

  • @harryliu1005
    @harryliu1005 Год назад

    I need your help! :), when I run this: gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph]), the error reports says TypeError: gseaplot() got multiple values for argument 'term'

    • @sanbomics
      @sanbomics  Год назад +2

      print term_to_graph and see what it is

    • @harryliu1005
      @harryliu1005 Год назад

      it prints 'mitotic spindle organization (GO:0007052)'@@sanbomics

    • @csalt3689
      @csalt3689 Год назад +4

      Try this: gseaplot(rank_metric=pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph])

    • @harryliu1005
      @harryliu1005 Год назад +1

      wow! It works! Thank a lot !! @@csalt3689

    • @JohnDorien-b2z
      @JohnDorien-b2z 10 месяцев назад

      @@csalt3689 You are a lifesaver! Thank you.

  • @EamonCoughlan-rd5kk
    @EamonCoughlan-rd5kk 9 месяцев назад

    Is it possible to adjust the text font size etc., and/or remove the redundant FDR value when running on a single geneset of interest? My default text when plotting with gseapy is much blockier and often overlaps the graph in an ugly way.

    • @sanbomics
      @sanbomics  7 месяцев назад

      I haven't tried, but you may be able to control it be changing the matplotlib default settings. Im not sure gsea returns an axis so it may not be super easy to customize after. Were you able to find a solution?

    • @EamonCoughlan-rd5kk
      @EamonCoughlan-rd5kk 7 месяцев назад

      @@sanbomics A fairly crude one, for now I just save the plots as pdf and then edit fonts etc. with a pdf editor.

  • @cherhan9817
    @cherhan9817 2 года назад +1

    Thanks, it is super useful. But when I run gseapy.prerank(....), got the error "No gene sets passed through filtering condition", even try different min_size and permutation_num. Do you know where is wrong?

    • @sanbomics
      @sanbomics  2 года назад

      There are several things that could have gone wrong. The two most likely are either your keys don't match the ones in the database (e.g, they are case sensitive) or you don't have any enriched pathways. Its hard to tell without looking at it more closely.

    • @cherhan9817
      @cherhan9817 2 года назад

      @@sanbomics Thanks for the reply. Yes, the reason is that my data has no enriched GO terms. (I was stupid to use a small test dataset, and misunderstood the returned error.)

  • @charliewhittaker817
    @charliewhittaker817 2 года назад +2

    Thank you for this video and all of your other videos, they are very concise and educational.
    In this video, you pre-filter the genes in your ranked list. According to my understanding, this should not be done. I think that the ranked list should include all the genes in the experiment that have any evidence of expression. In other words, if it is possible to calculate a ranking metric, the gene should be included. Can you comment on the decision to use only deferentially expressed genes?
    I think this idea actually illustrates one of the strengths of GSEA in that it is possible to make biological interpretations in the absence of genes that pass thresholds that define differential expression.
    I was also wondering about your ranking metric itself. You are using log2FoldChange*-log10(adjp) which is certainly reasonable. I normally use the stat column for this which is something like log2FoldChange/StdError. Could you comment on your choice of ranking metric? Thanks!

    • @sanbomics
      @sanbomics  2 года назад +2

      Thanks for catching this. You raise very important points:
      1) rerunning the notebook I realized why I only used the DE genes. With 1000 default permutations my 128 gb computer runs out of memory and it is very slow. It is better to include all genes like you point out. The results likely aren't much different since the non-DE genes are weighted so little, but it is better to include them all. In this case you will likely have to reduce the number of permutations. "permutation_num" argmuent in gp.prerank. I have included a correction in the video description and I will push a new notebook.
      2) I'm not aware of a a definitive best metric to use to rank the genes. Stat is likely a very good choice. But p value and lfc are more universal. e.g., some single-cell programs do not give you a stat. I think both p and lfc individually may not be ideal, so I had learned to combine them through multiplication. Not sure if this is the BEST way, but seems to work well.
      Thanks again!

  • @harryliu1005
    @harryliu1005 Год назад

    Great Video! Can you explain the formula for Rank? why the log10(df.padj) should be negtive?

    • @sanbomics
      @sanbomics  Год назад

      Thanks :) probably just easier to rank on stat. It needs to be negative to make it positive.

  • @katarinavalentincic9621
    @katarinavalentincic9621 Год назад

    HI! Thank you for another great video! Can you also use gene sets from msigdb for this? I have them downloaded in visual code in a separate folder, I am working on a server. I wanted to use msigdb package in R with rpy2 , but I have too many problems with setting up an environment, that's why I gave up on R.

    • @sanbomics
      @sanbomics  Год назад +1

      you can run any set of genes you can get into a python dictionary/list. Check the available databases first and see if it already exists. But if it doesn't-- import your data into a dictionary format, eg, gene_sets = {'term 1':['gene1', 'gene2' ...] , 'term 2':['gene1', 'gene2' ...], etc } then put the dictionary into the gene set argument in the gseapy function

  • @katarinavalentincic9621
    @katarinavalentincic9621 5 месяцев назад

    hi! Where is the formula for rank from? Also great videos! :)

    • @sanbomics
      @sanbomics  2 месяца назад

      You should just rank on STAT which accounts for change and variance

  • @수박바아
    @수박바아 11 месяцев назад +1

    Thank you for the great presentation.
    I have an error in the very last code
    no differences in front of these code
    term_to_graph
    > 'GO_Biological_Process_2021__cellular response to DNA damage stimulus (GO:0006974)'
    gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph])
    > ---------------------------------------------------------------------------
    TypeError Traceback (most recent call last)
    Cell In[122], line 1
    ----> 1 gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph])
    TypeError: gseaplot() got multiple values for argument 'term'
    I cannot solve this error. if possible, could you check this?