name	tooluniverse-gene-enrichment
description	Perform comprehensive gene enrichment and pathway analysis using gseapy (ORA and GSEA), PANTHER, STRING, Reactome, and 40+ ToolUniverse tools. Supports GO enrichment (BP, MF, CC), KEGG, Reactome, WikiPathways, MSigDB Hallmark, and 220+ Enrichr libraries. Handles multiple ID types (gene symbols, Ensembl, Entrez, UniProt), multiple organisms (human, mouse, rat, fly, worm, yeast), customizable backgrounds, and multiple testing correction (BH, Bonferroni). Use when users ask about gene enrichment, pathway analysis, GO term enrichment, KEGG pathway analysis, GSEA, over-representation analysis, functional annotation, or gene set analysis.

name

tooluniverse-gene-enrichment

description

Perform comprehensive gene enrichment and pathway analysis using gseapy (ORA and GSEA), PANTHER, STRING, Reactome, and 40+ ToolUniverse tools. Supports GO enrichment (BP, MF, CC), KEGG, Reactome, WikiPathways, MSigDB Hallmark, and 220+ Enrichr libraries. Handles multiple ID types (gene symbols, Ensembl, Entrez, UniProt), multiple organisms (human, mouse, rat, fly, worm, yeast), customizable backgrounds, and multiple testing correction (BH, Bonferroni). Use when users ask about gene enrichment, pathway analysis, GO term enrichment, KEGG pathway analysis, GSEA, over-representation analysis, functional annotation, or gene set analysis.

Gene Enrichment and Pathway Analysis Pipeline

Perform comprehensive gene enrichment analysis including Gene Ontology (GO), KEGG, Reactome, WikiPathways, and MSigDB enrichment using both Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA). Integrates local computation via gseapy with ToolUniverse pathway databases for cross-validated, publication-ready results.

IMPORTANT: Always use English terms in tool calls (gene names, pathway names, organism names), even if the user writes in another language. Only try original-language terms as a fallback if English returns no results. Respond in the user's language.

When to Use This Skill

Apply when users:

Ask about gene enrichment analysis (GO, KEGG, Reactome, etc.)
Have a gene list from differential expression, clustering, or any experiment
Want to know which biological processes, molecular functions, or cellular components are enriched
Need KEGG or Reactome pathway enrichment analysis
Ask about GSEA (Gene Set Enrichment Analysis) with ranked gene lists
Want over-representation analysis (ORA) with Fisher's exact test
Need multiple testing correction (Benjamini-Hochberg, Bonferroni)
Ask about enrichGO, gseapy, clusterProfiler-style analyses
Want to compare enrichment across multiple gene lists
Need to convert between gene ID types before enrichment
Ask specific BixBench-style questions about enrichment p-values, adjusted p-values, top terms

NOT for (use other skills instead):

Network pharmacology / drug repurposing -> Use tooluniverse-network-pharmacology
Disease characterization -> Use tooluniverse-multiomic-disease-characterization
Single gene function lookup -> Use tooluniverse-disease-research
Spatial omics analysis -> Use tooluniverse-spatial-omics-analysis
Protein-protein interaction analysis only -> Use tooluniverse-protein-interactions

Input Parameters

Parameter	Required	Description	Example
gene_list	Yes	List of gene symbols, Ensembl IDs, or Entrez IDs	`["TP53", "BRCA1", "EGFR"]`
organism	No	Organism (default: human). Supported: human, mouse, rat, fly, worm, yeast, zebrafish	`human`
analysis_type	No	`ORA` (default) or `GSEA`	`ORA`
enrichment_databases	No	Which databases to query. Default: all applicable	`["GO_BP", "GO_MF", "GO_CC", "KEGG", "Reactome"]`
gene_id_type	No	Input ID type: `symbol`, `ensembl`, `entrez`, `uniprot` (auto-detected if omitted)	`symbol`
p_value_cutoff	No	Significance threshold (default: 0.05)	`0.05`
correction_method	No	Multiple testing: `BH` (Benjamini-Hochberg, default), `bonferroni`, `fdr`	`BH`
background_genes	No	Custom background gene set (default: genome-wide)	`["GENE1", "GENE2", ...]`
ranked_gene_list	No	For GSEA: gene-to-score mapping (e.g., log2FC)	`{"TP53": 2.5, "BRCA1": -1.3, ...}`
min_gene_set_size	No	Minimum genes in a gene set (default: 5)	`5`
max_gene_set_size	No	Maximum genes in a gene set (default: 500)	`500`
enrichr_libraries	No	Specific Enrichr/gseapy library names	`["GO_Biological_Process_2021"]`

Supported Enrichment Databases

Gene Ontology (GO)

Category	gseapy Library	PANTHER Dataset	STRING Category
Biological Process (BP)	`GO_Biological_Process_2021/2023/2025`	`GO:0008150`	`Process`
Molecular Function (MF)	`GO_Molecular_Function_2021/2023/2025`	`GO:0003674`	`Function`
Cellular Component (CC)	`GO_Cellular_Component_2021/2023/2025`	`GO:0005575`	`Component`

Pathway Databases

Database	gseapy Library	ToolUniverse Tool	Notes
KEGG	`KEGG_2021_Human`, `KEGG_2026`	`STRING_functional_enrichment(category='KEGG')`	KEGG pathways
Reactome	`Reactome_2022`, `Reactome_Pathways_2024`	`ReactomeAnalysis_pathway_enrichment`	Curated pathways
WikiPathways	`WikiPathways_2024_Human`	`WikiPathways_search`	Community pathways
MSigDB Hallmark	`MSigDB_Hallmark_2020`	-	Cancer hallmarks
BioPlanet	`BioPlanet_2019`	-	NCI pathways
BioCarta	`BioCarta_2016`	-	Signal transduction
PANTHER	-	`PANTHER_enrichment`	Protein classification

Other Libraries (via gseapy)

Transcription Factors: ChEA_2022, ENCODE_TF_ChIP-seq_2015
Diseases: DisGeNET, OMIM_Disease, ClinVar_2025
Drugs: DGIdb_Drug_Targets_2024, DrugMatrix
Cell Types: CellMarker_2024, Azimuth_2023
Tissues: GTEx_Tissues_V8_2023, ARCHS4_Tissues

Total: 225+ Enrichr libraries available via gseapy.get_library_name().

Organism Support

Organism	Taxonomy ID	gseapy	PANTHER	STRING	Reactome
Human	9606	Yes	Yes	Yes	Yes
Mouse	10090	Yes (`*_Mouse`)	Yes	Yes	Yes (projection)
Rat	10116	Limited	Yes	Yes	Yes (projection)
Fly (Drosophila)	7227	Limited	Yes	Yes	Yes (projection)
Worm (C. elegans)	6239	Limited	Yes	Yes	Yes (projection)
Yeast (S. cerevisiae)	4932	Limited	Yes	Yes	Yes
Zebrafish	7955	Limited	Yes	Yes	Yes (projection)

KEY PRINCIPLES

Report-first approach - Create report file FIRST, then populate progressively
ID disambiguation FIRST - Detect and convert gene IDs before ANY enrichment
Multi-source validation - Run enrichment on at least 2 independent tools, cross-validate
Exact p-values - Report raw p-values AND adjusted p-values with correction method
Multiple testing correction - ALWAYS apply Benjamini-Hochberg unless user specifies otherwise
Gene set size filtering - Filter by min/max gene set size to avoid trivial/overly broad terms
Redundancy reduction - Group similar GO terms, report parent terms where appropriate
Background specification - Warn if no background provided; genome-wide is default
Evidence grading - Grade enrichment sources T1-T4
Negative results documented - "No significant enrichment" is a valid finding
Source references - Every enrichment result must cite the tool/database/library used
Completeness checklist - Mandatory section at end showing analysis coverage

Evidence Grading System

Tier	Symbol	Criteria	Examples
T1	[T1]	Curated/experimental enrichment	PANTHER enrichment (curated GO), Reactome (curated pathways)
T2	[T2]	Computational enrichment, well-validated	gseapy ORA/GSEA, STRING functional enrichment
T3	[T3]	Text-mining/predicted enrichment	Enrichr non-curated libraries, predicted functions
T4	[T4]	Single-source annotation	Individual gene GO annotations from QuickGO

Complete Workflow

Phase 0: Input Validation and ID Conversion

Step 0.1: Create the report file immediately.

report_path = "[analysis_name]_enrichment_report.md"
# Write header and placeholder sections

Step 0.2: Detect gene ID type and convert if needed.

from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()

# === ID TYPE DETECTION ===
# Ensembl IDs start with "ENSG" (human), "ENSMUSG" (mouse), etc.
# Entrez IDs are numeric strings ("7157", "672")
# UniProt IDs match pattern like "P04637", "Q9Y243"
# Gene symbols are alphabetic (TP53, BRCA1)

gene_list = ["TP53", "BRCA1", "EGFR", "MYC", "AKT1", "PTEN"]

# Detect ID type
sample = gene_list[0]
if sample.startswith("ENSG") or sample.startswith("ENSMUS"):
    id_type = "ensembl"
elif sample.replace("-", "").isdigit():
    id_type = "entrez"
elif len(sample) == 6 and sample[0].isalpha() and sample[1:].replace("_","").isalnum():
    id_type = "uniprot"  # rough heuristic
else:
    id_type = "symbol"

# === ID CONVERSION (if needed) ===
# Convert Ensembl -> Symbol using MyGene
if id_type == "ensembl":
    batch_result = tu.tools.MyGene_batch_query(
        gene_ids=gene_list,
        fields="symbol,entrezgene,ensembl.gene"
    )
    # Returns: {results: [{query, _id, symbol, entrezgene, ensembl: {gene}}]}
    symbol_map = {}
    for hit in batch_result.get('results', batch_result.get('data', {}).get('results', [])):
        symbol_map[hit['query']] = hit.get('symbol', hit['query'])
    gene_symbols = [symbol_map.get(g, g) for g in gene_list]
elif id_type == "entrez":
    batch_result = tu.tools.MyGene_batch_query(
        gene_ids=gene_list,
        fields="symbol,entrezgene,ensembl.gene"
    )
    symbol_map = {}
    for hit in batch_result.get('results', batch_result.get('data', {}).get('results', [])):
        symbol_map[hit['query']] = hit.get('symbol', hit['query'])
    gene_symbols = [symbol_map.get(g, g) for g in gene_list]
else:
    gene_symbols = gene_list  # Already symbols

# === VALIDATE GENE SYMBOLS ===
# Verify genes map correctly with STRING
string_mapped = tu.tools.STRING_map_identifiers(
    protein_ids=gene_symbols[:20],  # sample for validation
    species=9606
)
# Returns: {status, data: [{queryItem, preferredName, stringId, ...}]}
# Use preferredName for canonical symbol

Step 0.3: Determine organism and taxonomy ID.

# Map organism name to taxonomy ID
ORGANISM_MAP = {
    'human': 9606, 'homo_sapiens': 9606,
    'mouse': 10090, 'mus_musculus': 10090,
    'rat': 10116, 'rattus_norvegicus': 10116,
    'fly': 7227, 'drosophila_melanogaster': 7227,
    'worm': 6239, 'caenorhabditis_elegans': 6239,
    'yeast': 4932, 'saccharomyces_cerevisiae': 4932,
    'zebrafish': 7955, 'danio_rerio': 7955,
}
taxon_id = ORGANISM_MAP.get(organism.lower(), 9606)

Phase 1: Over-Representation Analysis (ORA) with gseapy

This is the PRIMARY enrichment method. Uses Fisher's exact test with Benjamini-Hochberg correction.

Step 1.1: GO Biological Process enrichment.

import gseapy
import pandas as pd

# GO Biological Process ORA
go_bp_result = gseapy.enrichr(
    gene_list=gene_symbols,
    gene_sets='GO_Biological_Process_2021',  # or 2023, 2025
    organism='human',  # 'human', 'mouse', 'fly', 'yeast', 'worm', 'fish', 'rat'
    outdir=None,        # None = no file output
    no_plot=True,       # suppress plots for programmatic use
    background=None,    # None = Enrichr default background
)

# Result is Enrichr object with .results DataFrame
# Columns: Gene_set, Term, Overlap, P-value, Adjusted P-value,
#          Old P-value, Old Adjusted P-value, Odds Ratio, Combined Score, Genes
go_bp_df = go_bp_result.results

# Filter by significance
go_bp_sig = go_bp_df[go_bp_df['Adjusted P-value'] < 0.05].copy()

# Extract key info
for _, row in go_bp_sig.head(10).iterrows():
    term = row['Term']
    pval = row['P-value']
    adj_pval = row['Adjusted P-value']
    overlap = row['Overlap']
    genes = row['Genes']
    odds_ratio = row['Odds Ratio']
    combined_score = row['Combined Score']
    print(f"{term}: adj_p={adj_pval:.2e}, overlap={overlap}, genes={genes}")

Step 1.2: GO Molecular Function and Cellular Component.

# GO Molecular Function
go_mf_result = gseapy.enrichr(
    gene_list=gene_symbols,
    gene_sets='GO_Molecular_Function_2021',
    organism='human',
    outdir=None,
    no_plot=True,
)
go_mf_df = go_mf_result.results

# GO Cellular Component
go_cc_result = gseapy.enrichr(
    gene_list=gene_symbols,
    gene_sets='GO_Cellular_Component_2021',
    organism='human',
    outdir=None,
    no_plot=True,
)
go_cc_df = go_cc_result.results

Step 1.3: KEGG pathway enrichment.

# KEGG via gseapy
kegg_result = gseapy.enrichr(
    gene_list=gene_symbols,
    gene_sets='KEGG_2021_Human',  # or KEGG_2026
    organism='human',
    outdir=None,
    no_plot=True,
)
kegg_df = kegg_result.results
kegg_sig = kegg_df[kegg_df['Adjusted P-value'] < 0.05]

Step 1.4: Reactome pathway enrichment.

# Reactome via gseapy
reactome_gseapy = gseapy.enrichr(
    gene_list=gene_symbols,
    gene_sets='Reactome_Pathways_2024',  # or Reactome_2022
    organism='human',
    outdir=None,
    no_plot=True,
)
reactome_df = reactome_gseapy.results

Step 1.5: MSigDB Hallmark and other libraries.

# MSigDB Hallmark
hallmark_result = gseapy.enrichr(
    gene_list=gene_symbols,
    gene_sets='MSigDB_Hallmark_2020',
    organism='human',
    outdir=None,
    no_plot=True,
)
hallmark_df = hallmark_result.results

# WikiPathways
wp_result = gseapy.enrichr(
    gene_list=gene_symbols,
    gene_sets='WikiPathways_2024_Human',
    organism='human',
    outdir=None,
    no_plot=True,
)
wp_df = wp_result.results

# Multiple libraries at once
multi_result = gseapy.enrichr(
    gene_list=gene_symbols,
    gene_sets=['GO_Biological_Process_2021', 'KEGG_2021_Human', 'Reactome_2022'],
    organism='human',
    outdir=None,
    no_plot=True,
)
# Results combined in single DataFrame with Gene_set column distinguishing libraries

Step 1.6: Custom background gene set.

# With custom background (e.g., all expressed genes)
background_genes = ["GENE1", "GENE2", ...]  # user-provided

# gseapy enrichr with background
result_with_bg = gseapy.enrichr(
    gene_list=gene_symbols,
    gene_sets='GO_Biological_Process_2021',
    organism='human',
    outdir=None,
    no_plot=True,
    background=background_genes,  # custom background
)
# Note: background changes p-value calculation significantly

Phase 2: Gene Set Enrichment Analysis (GSEA)

Use when you have a RANKED gene list (e.g., by log2 fold-change, p-value, or signal-to-noise ratio).

Step 2.1: Prepare ranked gene list.

import pandas as pd
import numpy as np

# Input: dict of gene -> score (e.g., log2FC from DESeq2)
ranked_dict = {"TP53": 3.2, "BRCA1": 2.8, "EGFR": -1.5, "MYC": 4.1, ...}

# Convert to Series, sorted by score (descending)
ranked_series = pd.Series(ranked_dict).sort_values(ascending=False)
# OR from a DataFrame:
# ranked_series = df.set_index('gene_symbol')['log2FoldChange'].sort_values(ascending=False)

Step 2.2: Run GSEA preranked.

# GSEA with GO Biological Process
gsea_bp = gseapy.prerank(
    rnk=ranked_series,
    gene_sets='GO_Biological_Process_2021',
    outdir=None,
    no_plot=True,
    seed=42,
    min_size=5,         # minimum gene set size
    max_size=500,       # maximum gene set size
    permutation_num=1000,  # number of permutations (1000 is standard)
)

# Result columns: Name, Term, ES, NES, NOM p-val, FDR q-val, FWER p-val, Tag %, Gene %, Lead_genes
gsea_bp_df = gsea_bp.res2d

# Filter significant
gsea_sig = gsea_bp_df[gsea_bp_df['FDR q-val'].astype(float) < 0.25]  # GSEA uses 0.25 FDR cutoff

# Key metrics:
# NES (Normalized Enrichment Score): positive = enriched in top of list, negative = enriched in bottom
# NOM p-val: nominal p-value (unadjusted)
# FDR q-val: false discovery rate (adjusted)
# FWER p-val: family-wise error rate (Bonferroni-like)
# Lead_genes: core genes driving enrichment

Step 2.3: GSEA with KEGG and Reactome.

# KEGG GSEA
gsea_kegg = gseapy.prerank(
    rnk=ranked_series,
    gene_sets='KEGG_2021_Human',
    outdir=None,
    no_plot=True,
    seed=42,
    min_size=5,
    max_size=500,
    permutation_num=1000,
)

# Reactome GSEA
gsea_reactome = gseapy.prerank(
    rnk=ranked_series,
    gene_sets='Reactome_Pathways_2024',
    outdir=None,
    no_plot=True,
    seed=42,
    min_size=5,
    max_size=500,
    permutation_num=1000,
)

Step 2.4: Multiple gene set libraries.

# Run GSEA across multiple libraries
gsea_multi = gseapy.prerank(
    rnk=ranked_series,
    gene_sets=['GO_Biological_Process_2021', 'KEGG_2021_Human', 'MSigDB_Hallmark_2020'],
    outdir=None,
    no_plot=True,
    seed=42,
    min_size=5,
    max_size=500,
    permutation_num=1000,
)

Phase 3: Cross-Validation with ToolUniverse Enrichment Tools

Run enrichment on at least one additional independent source and cross-validate results.

Step 3.1: PANTHER enrichment (curated GO enrichment).

# PANTHER ORA - GO Biological Process
panther_bp = tu.tools.PANTHER_enrichment(
    gene_list=','.join(gene_symbols),  # comma-separated string
    organism=9606,                      # NCBI taxonomy ID
    annotation_dataset='GO:0008150'     # biological_process
)
# Returns: {data: {gene_list, organism, annotation_dataset, result_count, enriched_terms: [
#   {term_id, term_label, number_in_list, number_in_reference, expected, fold_enrichment, pvalue, fdr, direction}
# ]}}
panther_terms = panther_bp.get('data', {}).get('enriched_terms', [])

# PANTHER - GO Molecular Function
panther_mf = tu.tools.PANTHER_enrichment(
    gene_list=','.join(gene_symbols),
    organism=9606,
    annotation_dataset='GO:0003674'  # molecular_function
)

# PANTHER - GO Cellular Component
panther_cc = tu.tools.PANTHER_enrichment(
    gene_list=','.join(gene_symbols),
    organism=9606,
    annotation_dataset='GO:0005575'  # cellular_component
)

# PANTHER - PANTHER Pathways
panther_pw = tu.tools.PANTHER_enrichment(
    gene_list=','.join(gene_symbols),
    organism=9606,
    annotation_dataset='ANNOT_TYPE_ID_PANTHER_PATHWAY'
)

# PANTHER - Reactome Pathways
panther_reactome = tu.tools.PANTHER_enrichment(
    gene_list=','.join(gene_symbols),
    organism=9606,
    annotation_dataset='ANNOT_TYPE_ID_PANTHER_REACTOME_PATHWAY'
)

Step 3.2: STRING functional enrichment.

# STRING enrichment - returns ALL categories at once
# Categories: Process, Function, Component, KEGG, Reactome, COMPARTMENTS, DISEASES, etc.
string_enrichment = tu.tools.STRING_functional_enrichment(
    protein_ids=gene_symbols,
    species=9606
)
# Returns: {status: "success", data: [
#   {category, term, number_of_genes, number_of_genes_in_background, inputGenes, preferredNames, p_value, fdr, description}
# ]}

# NOTE: STRING returns ALL categories regardless of 'category' parameter
# Filter by category:
string_data = string_enrichment.get('data', [])
if isinstance(string_data, list):
    string_go_bp = [d for d in string_data if d.get('category') == 'Process']
    string_go_mf = [d for d in string_data if d.get('category') == 'Function']
    string_go_cc = [d for d in string_data if d.get('category') == 'Component']
    string_kegg = [d for d in string_data if d.get('category') == 'KEGG']
    string_reactome = [d for d in string_data if d.get('category') == 'Reactome']
    string_wp = [d for d in string_data if d.get('category') == 'WikiPathways']

Step 3.3: Reactome Analysis Service enrichment.

# Reactome pathway overrepresentation analysis
reactome_result = tu.tools.ReactomeAnalysis_pathway_enrichment(
    identifiers=' '.join(gene_symbols),  # space-separated, NOT array
    page_size=50,                         # max pathways to return
    include_disease=True,                 # include disease pathways
    projection=True                       # project to human (for cross-species)
)
# Returns: {data: {token, analysis_type, identifiers_not_found, pathways_found, pathways: [
#   {pathway_id, name, species, is_disease, is_lowest_level, entities_found, entities_total,
#    entities_ratio, p_value, fdr, reactions_found, reactions_total}
# ]}}

reactome_pathways = reactome_result.get('data', {}).get('pathways', [])
# Filter significant
reactome_sig = [p for p in reactome_pathways if p.get('fdr', 1) < 0.05]

Step 3.4: Cross-validate results.

# Compare top terms across tools
# 1. Collect top GO BP terms from each source
gseapy_top_bp = set(go_bp_sig.head(20)['Term'].str.extract(r'\(GO:\d+\)')[0].dropna())
panther_top_bp = set(t['term_id'] for t in panther_terms[:20] if t.get('fdr', 1) < 0.05)
string_top_bp = set(d['term'] for d in string_go_bp[:20] if d.get('fdr', 1) < 0.05)

# Find consensus terms (present in 2+ sources)
all_go_terms = gseapy_top_bp | panther_top_bp | string_top_bp
consensus_terms = []
for term in all_go_terms:
    sources = []
    if term in gseapy_top_bp: sources.append('gseapy')
    if term in panther_top_bp: sources.append('PANTHER')
    if term in string_top_bp: sources.append('STRING')
    if len(sources) >= 2:
        consensus_terms.append((term, sources))

# 2. Compare KEGG results
# gseapy KEGG vs STRING KEGG vs Reactome
# Cross-reference pathway IDs and names

Phase 4: Additional Multiple Testing Correction

Apply user-specified or additional correction methods.

Step 4.1: Apply Benjamini-Hochberg (default).

from scipy import stats
import statsmodels.stats.multitest as mt
import numpy as np

# gseapy already applies BH correction (Adjusted P-value column)
# For custom correction on raw p-values:

raw_pvals = go_bp_df['P-value'].values

# Benjamini-Hochberg (FDR)
reject_bh, pvals_bh, _, _ = mt.multipletests(raw_pvals, alpha=0.05, method='fdr_bh')

# Bonferroni
reject_bonf, pvals_bonf, _, _ = mt.multipletests(raw_pvals, alpha=0.05, method='bonferroni')

# Benjamini-Yekutieli (more conservative FDR)
reject_by, pvals_by, _, _ = mt.multipletests(raw_pvals, alpha=0.05, method='fdr_by')

# Holm-Sidak
reject_hs, pvals_hs, _, _ = mt.multipletests(raw_pvals, alpha=0.05, method='holm-sidak')

# Add to DataFrame
go_bp_df['BH_adjusted'] = pvals_bh
go_bp_df['Bonferroni_adjusted'] = pvals_bonf

Step 4.2: Compare correction methods.

# Count significant terms by method
n_bh = sum(reject_bh)
n_bonf = sum(reject_bonf)
n_by = sum(reject_by)

print(f"Significant terms:")
print(f"  BH (FDR): {n_bh}")
print(f"  Bonferroni: {n_bonf}")
print(f"  BY: {n_by}")

Phase 5: Pathway Context from ToolUniverse

Enrich top findings with additional biological context from ToolUniverse pathway databases.

Step 5.1: Get pathway details from Reactome.

# For top enriched Reactome pathways, get details
for pathway in reactome_sig[:5]:
    pathway_id = pathway['pathway_id']  # e.g., "R-HSA-212436"

    # Get full pathway info
    pathway_detail = tu.tools.Reactome_get_pathway(pathway_id=pathway_id)

    # Get pathway hierarchy
    pathway_hier = tu.tools.Reactome_get_pathway_hierarchy(pathway_id=pathway_id)

    # Get pathway reactions
    pathway_rxns = tu.tools.Reactome_get_pathway_reactions(pathway_id=pathway_id)

Step 5.2: Get KEGG pathway details.

# For top KEGG pathways
for _, row in kegg_sig.head(5).iterrows():
    term = row['Term']  # e.g., "Cell cycle"
    # Search KEGG for pathway details
    kegg_info = tu.tools.kegg_search_pathway(query=term)
    # Get specific pathway info if ID known
    # kegg_detail = tu.tools.kegg_get_pathway_info(pathway_id="hsa04110")

Step 5.3: WikiPathways context.

# For top WikiPathways results
for _, row in wp_df[wp_df['Adjusted P-value'] < 0.05].head(5).iterrows():
    term = row['Term']
    # Search WikiPathways
    wp_detail = tu.tools.WikiPathways_search(query=term, organism='Homo sapiens')
    # Get specific pathway
    # wp_full = tu.tools.WikiPathways_get_pathway(pathway_id="WP179")

Step 5.4: GO term details for top enriched terms.

# Get details for top enriched GO terms
for _, row in go_bp_sig.head(5).iterrows():
    term_str = row['Term']
    # Extract GO ID from term string: "regulation of cell cycle (GO:0051726)"
    import re
    go_match = re.search(r'(GO:\d+)', term_str)
    if go_match:
        go_id = go_match.group(1)
        # Get GO term details
        go_detail = tu.tools.GO_get_term_by_id(go_id=go_id)
        # Get GO term hierarchy
        go_children = tu.tools.QuickGO_get_term_children(go_id=go_id)

Step 5.5: Gene-level GO annotations for key genes.

# For genes driving top enrichment, get individual annotations
for gene in gene_symbols[:10]:
    # Get GO annotations from GO API
    go_annot = tu.tools.GO_get_annotations_for_gene(gene_id=gene)

    # For detailed annotations with evidence codes, use QuickGO with UniProt ID
    # First get UniProt ID
    mygene_result = tu.tools.MyGene_query_genes(query=gene)
    if mygene_result.get('hits'):
        hit = mygene_result['hits'][0]
        uniprot_id = hit.get('uniprot', {}).get('Swiss-Prot', '')
        if uniprot_id:
            quickgo = tu.tools.QuickGO_annotations_by_gene(
                gene_product_id=f"UniProtKB:{uniprot_id}",
                limit=50
            )

Phase 6: Network Context and PPI Enrichment

Connect enrichment results to protein interaction networks.

Step 6.1: STRING PPI enrichment test.

# Test if gene list has more interactions than expected
ppi_enrichment = tu.tools.STRING_ppi_enrichment(
    protein_ids=gene_symbols,
    species=9606
)
# Returns: PPI enrichment p-value (significant = more connected than random)

Step 6.2: Build interaction network for enriched gene sets.

# Get interactions for genes in top enriched pathway
top_pathway_genes = go_bp_sig.iloc[0]['Genes'].split(';')

string_network = tu.tools.STRING_get_interaction_partners(
    protein_ids=top_pathway_genes,
    species=9606,
    limit=50
)
# Returns: {status: "success", data: [{stringId_A, stringId_B, preferredName_A, preferredName_B, score}]}

Phase 7: Comparative Enrichment (Multiple Gene Lists)

When users have multiple gene lists (e.g., up-regulated vs down-regulated, cluster 1 vs cluster 2).

Step 7.1: Run enrichment on each list.

# Example: up-regulated and down-regulated genes
up_genes = ["TP53", "BRCA1", "MYC", ...]
down_genes = ["EGFR", "AKT1", "PTEN", ...]

up_result = gseapy.enrichr(
    gene_list=up_genes,
    gene_sets='GO_Biological_Process_2021',
    organism='human', outdir=None, no_plot=True,
)

down_result = gseapy.enrichr(
    gene_list=down_genes,
    gene_sets='GO_Biological_Process_2021',
    organism='human', outdir=None, no_plot=True,
)

Step 7.2: Compare enrichment results.

# Find unique and shared enriched terms
up_terms = set(up_result.results[up_result.results['Adjusted P-value'] < 0.05]['Term'])
down_terms = set(down_result.results[down_result.results['Adjusted P-value'] < 0.05]['Term'])

shared = up_terms & down_terms
up_unique = up_terms - down_terms
down_unique = down_terms - up_terms

print(f"Up-regulated specific: {len(up_unique)} terms")
print(f"Down-regulated specific: {len(down_unique)} terms")
print(f"Shared: {len(shared)} terms")

Phase 8: Report Generation

Step 8.1: Compile comprehensive enrichment report.

# Gene Enrichment Analysis Report

## Executive Summary
[2-3 sentence summary: number of input genes, organism, top findings across databases]

## Input Summary
- **Gene list**: N genes
- **Organism**: [organism]
- **Gene ID type**: [symbol/ensembl/entrez]
- **Analysis type**: [ORA/GSEA]
- **Background**: [genome-wide/custom (N genes)]
- **Significance threshold**: p < [cutoff]
- **Correction method**: [BH/Bonferroni]
- **Genes not mapped**: [list or count]

## 1. GO Biological Process Enrichment

### Top Enriched Terms (gseapy ORA)
| Rank | GO Term | Description | P-value | Adj. P-value | Overlap | Genes |
|------|---------|-------------|---------|-------------|---------|-------|
| 1 | GO:XXXXXXX | [description] | X.XXe-XX | X.XXe-XX | X/Y | [genes] |

### Cross-Validation
| GO Term | gseapy FDR | PANTHER FDR | STRING FDR | Consensus |
|---------|-----------|-------------|-----------|-----------|
| [term] | [fdr] | [fdr] | [fdr] | [2/3 or 3/3] |

## 2. GO Molecular Function Enrichment
[Same format as BP]

## 3. GO Cellular Component Enrichment
[Same format as BP]

## 4. KEGG Pathway Enrichment
| Rank | Pathway | P-value | Adj. P-value | Overlap | Genes |
|------|---------|---------|-------------|---------|-------|
| 1 | [pathway] | X.XXe-XX | X.XXe-XX | X/Y | [genes] |

## 5. Reactome Pathway Enrichment
| Rank | Pathway ID | Name | P-value | FDR | Entities Found | Entities Total |
|------|-----------|------|---------|-----|---------------|----------------|
| 1 | R-HSA-XXXXX | [name] | X.XXe-XX | X.XXe-XX | X | Y |

## 6. Additional Enrichment Libraries
### MSigDB Hallmark
[If applicable]

### WikiPathways
[If applicable]

## 7. GSEA Results (if applicable)
| Term | NES | NOM p-val | FDR q-val | Lead Genes |
|------|-----|-----------|-----------|------------|

## 8. Multiple Testing Correction Comparison
| Method | Significant Terms (BP) | Significant Terms (KEGG) |
|--------|----------------------|------------------------|
| BH (FDR) | N | N |
| Bonferroni | N | N |
| BY | N | N |

## 9. Network Context
- **PPI enrichment p-value**: [value]
- **Network interpretation**: [connected/random]

## 10. Consensus Findings
[List of findings confirmed by 2+ independent sources]

## Evidence Summary Table
| Finding | Source | Evidence Grade | P-value | FDR |
|---------|--------|---------------|---------|-----|
| [finding] | gseapy/PANTHER/STRING/Reactome | [T1-T4] | [pval] | [fdr] |

## Completeness Checklist
| Phase | Status | Tools Used | Key Findings |
|-------|--------|------------|--------------|
| ID Conversion | Done/Partial/Failed | MyGene, STRING_map | [summary] |
| GO BP | Done/Partial/Failed | gseapy, PANTHER, STRING | [N significant terms] |
| GO MF | Done/Partial/Failed | gseapy, PANTHER, STRING | [N significant terms] |
| GO CC | Done/Partial/Failed | gseapy, PANTHER, STRING | [N significant terms] |
| KEGG | Done/Partial/Failed | gseapy, STRING | [N significant terms] |
| Reactome | Done/Partial/Failed | gseapy, Reactome API | [N significant terms] |
| WikiPathways | Done/Partial/Failed | gseapy | [N significant terms] |
| MSigDB Hallmark | Done/Partial/Failed | gseapy | [N significant terms] |
| GSEA | Done/Skipped/Failed | gseapy prerank | [N significant terms] |
| Cross-validation | Done/Partial/Failed | comparison | [N consensus terms] |
| Network Context | Done/Partial/Failed | STRING PPI | [summary] |
| Report | Done | - | Complete |

Tool Parameter Reference (Verified)

Primary Enrichment Tools

Tool	Key Parameters	Response Structure
`gseapy.enrichr()`	`gene_list: list[str]`, `gene_sets: str/list`, `organism: str`, `outdir: None`, `no_plot: True`, `background: list/None`	`.results` DataFrame: `Gene_set, Term, Overlap, P-value, Adjusted P-value, Odds Ratio, Combined Score, Genes`
`gseapy.prerank()`	`rnk: pd.Series`, `gene_sets: str/list`, `outdir: None`, `no_plot: True`, `seed: int`, `min_size: int`, `max_size: int`, `permutation_num: int`	`.res2d` DataFrame: `Name, Term, ES, NES, NOM p-val, FDR q-val, FWER p-val, Tag %, Gene %, Lead_genes`
`PANTHER_enrichment`	`gene_list: str` (comma-separated), `organism: int` (9606), `annotation_dataset: str`	`{data: {enriched_terms: [{term_id, term_label, fold_enrichment, pvalue, fdr, direction}]}}`
`STRING_functional_enrichment`	`protein_ids: list[str]`, `species: int` (9606), `category: str`	`{status, data: [{category, term, p_value, fdr, description, inputGenes}]}`
`ReactomeAnalysis_pathway_enrichment`	`identifiers: str` (space-separated), `page_size: int`, `include_disease: bool`, `projection: bool`	`{data: {pathways: [{pathway_id, name, p_value, fdr, entities_found, entities_total}]}}`

ID Conversion Tools

Tool	Key Parameters	Response Structure
`MyGene_batch_query`	`gene_ids: list[str]`, `fields: str`	`{results: [{query, _id, symbol, entrezgene, ensembl: {gene}}]}`
`MyGene_query_genes`	`query: str`	`{hits: [{_id, symbol, entrezgene, ensembl: {gene}}]}`
`STRING_map_identifiers`	`protein_ids: list[str]`, `species: int`	`[{queryItem, preferredName, stringId}]`
`UniProtIDMap_convert_ids`	`from_db: str`, `to_db: str`, `ids: str` (comma-separated)	ID mapping results

GO Term Tools

Tool	Key Parameters	Response Structure
`GO_get_term_by_id`	`go_id: str`	GO term details
`GO_get_term_details`	`go_id: str`	Detailed GO term info
`GO_search_terms`	`query: str`	Search results
`GO_get_annotations_for_gene`	`gene_id: str`	List of GO annotations
`GO_get_genes_for_term`	`go_id: str`	List of genes for GO term
`QuickGO_annotations_by_gene`	`gene_product_id: str` (format: "UniProtKB:P04637")	Detailed annotations with evidence
`QuickGO_get_term_detail`	`go_id: str`	GO term detail
`QuickGO_get_term_children`	`go_id: str`	Child terms

Pathway Context Tools

Tool	Key Parameters	Response Structure
`Reactome_get_pathway`	`pathway_id: str`	Pathway details
`Reactome_get_pathway_hierarchy`	`pathway_id: str`	Hierarchy
`Reactome_get_pathway_reactions`	`pathway_id: str`	Reactions list
`Reactome_map_uniprot_to_pathways`	`id: str` (UniProt)	Plain list of pathways
`WikiPathways_search`	`query: str`, `organism: str`	`{status, data: {result: [{id, name, species}]}}`
`WikiPathways_get_pathway`	`pathway_id: str`	Pathway detail
`kegg_search_pathway`	`query: str`	KEGG pathway matches
`kegg_get_pathway_info`	`pathway_id: str`	Pathway detail
`kegg_get_gene_info`	`gene_id: str`	Gene in KEGG

Network Context Tools

Tool	Key Parameters	Response Structure
`STRING_ppi_enrichment`	`protein_ids: list[str]`, `species: int`	PPI enrichment statistics
`STRING_get_interaction_partners`	`protein_ids: list[str]`, `species: int`, `limit: int`	`{status, data: [{preferredName_A, preferredName_B, score}]}`
`STRING_get_network`	`protein_ids: list[str]`, `species: int`	Network data

Response Format Notes

gseapy enrichr results: DataFrame with columns Gene_set, Term, Overlap, P-value, Adjusted P-value, Old P-value, Old Adjusted P-value, Odds Ratio, Combined Score, Genes. Access via .results.

gseapy prerank results: DataFrame with columns Name, Term, ES, NES, NOM p-val, FDR q-val, FWER p-val, Tag %, Gene %, Lead_genes. Access via .res2d.

PANTHER enrichment: Returns {data: {enriched_terms: [...]}}. Each term has term_id, term_label, pvalue, fdr, fold_enrichment, direction.

STRING functional enrichment: Returns ALL categories in one call regardless of category parameter. Filter by d['category'] after receiving results. Categories: Process, Function, Component, KEGG, Reactome, WikiPathways, COMPARTMENTS, DISEASES, Keyword, PMID, HPO, TISSUES, NetworkNeighborAL, RCTM.

ReactomeAnalysis_pathway_enrichment: Takes space-separated identifiers string, NOT array. Returns {data: {pathways: [...]}}.

enrichr_gene_enrichment_analysis (ToolUniverse): Returns connected_paths JSON string (NOT standard enrichment). DO NOT USE for standard enrichment analysis -- use gseapy.enrichr() directly instead.

MyGene_batch_query: Parameter is gene_ids (NOT ids). Returns {results: [...]} or {data: {results: [...]}}.

Fallback Strategies

Phase	Primary Tool	Fallback 1	Fallback 2
GO BP enrichment	gseapy enrichr	PANTHER enrichment	STRING functional enrichment
GO MF enrichment	gseapy enrichr	PANTHER enrichment	STRING functional enrichment
GO CC enrichment	gseapy enrichr	PANTHER enrichment	STRING functional enrichment
KEGG enrichment	gseapy enrichr	STRING functional enrichment	kegg_search_pathway (manual)
Reactome enrichment	gseapy enrichr	ReactomeAnalysis_pathway_enrichment	Reactome_map_uniprot_to_pathways
WikiPathways	gseapy enrichr	WikiPathways_search	-
GSEA	gseapy prerank	-	ORA as approximation
ID conversion	MyGene_batch_query	STRING_map_identifiers	UniProtIDMap_convert_ids
GO term details	GO_get_term_by_id	QuickGO_get_term_detail	GO_search_terms
PPI context	STRING_ppi_enrichment	STRING_get_interaction_partners	-
Pathway details	Reactome_get_pathway	kegg_get_pathway_info	WikiPathways_get_pathway

Common Use Patterns

Pattern 1: Standard ORA from DEG List

Input: List of differentially expressed gene symbols
Organism: human

Flow:
1. Validate gene symbols (STRING_map_identifiers)
2. Run gseapy.enrichr with GO_BP, GO_MF, GO_CC, KEGG, Reactome
3. Cross-validate with PANTHER and STRING
4. Apply BH correction
5. Report top enriched terms per category

Pattern 2: GSEA from Ranked List

Input: Gene-to-log2FC mapping from DESeq2/edgeR
Organism: human

Flow:
1. Convert to ranked Series (sort by log2FC descending)
2. Run gseapy.prerank with GO_BP, KEGG, MSigDB_Hallmark
3. Filter by FDR q-val < 0.25
4. Report NES, lead genes for top terms
5. Compare positive vs negative NES (up- vs down-regulated pathways)

Pattern 3: BixBench Enrichment Question

Input: Specific question about enrichment (e.g., "What is the adjusted p-val for neutrophil activation?")
Context: Gene list and specific library/method

Flow:
1. Parse question for: gene list, library name, GO term of interest
2. Run gseapy.enrichr with exact library specified
3. Find the specific term in results
4. Report exact p-value and adjusted p-value
5. If enrichGO mentioned, use gseapy ORA (closest equivalent)

Pattern 4: Multi-Organism Enrichment

Input: Gene list from mouse experiment
Organism: mouse

Flow:
1. Use gseapy with organism='mouse' and mouse-specific libraries
2. Use PANTHER with organism=10090
3. Use STRING with species=10090
4. Use Reactome with projection=True for human pathway mapping

Pattern 5: Custom Background Enrichment

Input: DEGs from RNA-seq (detected genes as background)
Background: All expressed genes (e.g., 12000 genes)

Flow:
1. Run gseapy.enrichr with background=expressed_genes
2. Compare with genome-wide background
3. Report both results; custom background is more conservative
4. Note: PANTHER also supports custom background via reference list

Available gseapy Enrichr Libraries (225 total)

GO Libraries

GO_Biological_Process_2021, GO_Biological_Process_2023, GO_Biological_Process_2025
GO_Molecular_Function_2021, GO_Molecular_Function_2023, GO_Molecular_Function_2025
GO_Cellular_Component_2021, GO_Cellular_Component_2023, GO_Cellular_Component_2025

Pathway Libraries

KEGG_2021_Human, KEGG_2019_Mouse, KEGG_2026
Reactome_2022, Reactome_Pathways_2024
WikiPathways_2024_Human, WikiPathways_2024_Mouse
BioCarta_2016, BioPlanet_2019
MSigDB_Hallmark_2020, MSigDB_Computational, MSigDB_Oncogenic_Signatures
Elsevier_Pathway_Collection

Disease/Phenotype Libraries

DisGeNET, OMIM_Disease, OMIM_Expanded
ClinVar_2025
Rare_Diseases_GeneRIF_Gene_Lists

Drug/Target Libraries

DGIdb_Drug_Targets_2024
DrugMatrix
Drug_Perturbations_from_GEO_down, Drug_Perturbations_from_GEO_up

Cell/Tissue Libraries

CellMarker_2024, Azimuth_2023
GTEx_Tissues_V8_2023
ARCHS4_Tissues
Descartes_Cell_Types_and_Tissue_2021
Allen_Brain_Atlas_10x_scRNA_2021

Transcription Factor Libraries

ChEA_2022
ENCODE_TF_ChIP-seq_2015
ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X

Epigenomics Libraries

ENCODE_Histone_Modifications_2015
Epigenomics_Roadmap_HM_ChIP-seq

Other

SynGO_2022, SynGO_2024 (synaptic GO)
CORUM (protein complexes)
COMPARTMENTS_Curated_2025, COMPARTMENTS_Experimental_2025
DepMap_CRISPR_GeneDependency_CellLines_2023

Full list: Run gseapy.get_library_name() for all 225 libraries.

PANTHER Annotation Datasets

Dataset ID	Description
`GO:0008150`	GO Biological Process
`GO:0003674`	GO Molecular Function
`GO:0005575`	GO Cellular Component
`ANNOT_TYPE_ID_PANTHER_PATHWAY`	PANTHER Pathway
`ANNOT_TYPE_ID_PANTHER_REACTOME_PATHWAY`	Reactome Pathways (via PANTHER)
`ANNOT_TYPE_ID_PANTHER_GO_SLIM_MF`	GO Slim Molecular Function
`ANNOT_TYPE_ID_PANTHER_GO_SLIM_BP`	GO Slim Biological Process
`ANNOT_TYPE_ID_PANTHER_GO_SLIM_CC`	GO Slim Cellular Component
`ANNOT_TYPE_ID_PANTHER_PC`	PANTHER Protein Class

Edge Cases

Small Gene Lists (<5 genes)

ORA may have insufficient power. Warn user.
Use PANTHER and STRING which handle small lists better.
Consider gene-level annotation (GO_get_annotations_for_gene) instead of enrichment.
Lower min_size for GSEA to 3.

Very Large Gene Lists (>500 genes)

Enrichment becomes less specific. Consider stricter cutoff.
Recommend custom background (e.g., expressed genes only).
Focus on most specific (lowest-level) GO terms.
Use GO Slim for high-level overview.

Non-Human Organisms

gseapy supports: human, mouse, fly, yeast, worm, fish, rat.
PANTHER supports many organisms via taxonomy ID.
STRING supports most model organisms.
Reactome can project non-human genes to human pathways (projection=True).
For unsupported organisms, use ortholog mapping first.

Mixed ID Types

Detect most common ID type in the list.
Convert all to gene symbols using MyGene_batch_query.
Report any unmapped genes in the output.
Remove duplicates after conversion.

No Significant Results

Report as valid finding ("no significant enrichment").
Try relaxing cutoff (e.g., p < 0.1).
Try different library version (2021 vs 2023 vs 2025).
Consider GSEA as alternative (may detect weak but consistent signals).
Check if gene names are correctly formatted.

Duplicate/Obsolete GO Terms

gseapy uses curated Enrichr libraries (already filtered).
PANTHER uses official GO (current).
STRING uses current GO.
Note any discrepancies between library versions.

Troubleshooting

"No significant enrichment found":

Check gene symbols are valid (try STRING_map_identifiers)
Try different library versions (2021 vs 2023 vs 2025)
Try relaxing significance cutoff
Try GSEA if ORA fails (different statistical framework)
Verify organism is correctly specified

"Gene not found" errors:

Check ID type (Ensembl vs Entrez vs Symbol)
Convert IDs using MyGene_batch_query
Some aliases may not map; try official HGNC symbols
Remove version suffixes from Ensembl IDs (ENSG00000141510.16 -> ENSG00000141510)

"gseapy library not found":

Run gseapy.get_library_name() to see available libraries
Library names are case-sensitive
Some libraries are organism-specific (e.g., KEGG_2021_Human vs KEGG_2019_Mouse)

"PANTHER returns empty results":

Check gene_list format: comma-separated string, NOT array
Check organism taxonomy ID (9606 for human)
Check annotation_dataset is valid

"STRING returns all categories":

This is expected behavior; STRING always returns all categories
Filter by category field after receiving results
Use d['category'] == 'Process' for GO BP, etc.

"Reactome identifiers_not_found > 0":

Some gene symbols may not map to Reactome entities
Use official gene symbols (not aliases)
Try UniProt IDs or Ensembl IDs as alternatives

"enrichr_gene_enrichment_analysis returns connected_paths":

This is the ToolUniverse enrichr tool, which returns path analysis, NOT standard enrichment
Use gseapy.enrichr() directly for standard ORA enrichment
The ToolUniverse enrichr tool is NOT suitable for standard enrichment questions

Resources

For network-level analysis: tooluniverse-network-pharmacology For disease characterization: tooluniverse-multiomic-disease-characterization For spatial omics: tooluniverse-spatial-omics-analysis For protein interactions: tooluniverse-protein-interactions

gseapy documentation: https://gseapy.readthedocs.io/ PANTHER API: http://pantherdb.org/services/oai/pantherdb/ STRING API: https://string-db.org/cgi/help?sessionId=&subpage=api Reactome Analysis: https://reactome.org/AnalysisService/

FilesExpand file tree

SKILL_OLD.md

Latest commit

History