A toolkit for exporting MSigDB gene sets and generating static HTML pages for each gene set.
This project provides three main tools for working with MSigDB gene sets:
- export_genesets.py - Exports gene sets from MSigDB SQLite databases to YAML files
- export_genesets_xml.py - Exports gene sets from MSigDB XML files to YAML files (alternative to SQLite-based export)
- generate_pages.py - Generates static HTML pages from YAML gene set files
The typical workflow is:
- Export gene sets from either SQLite databases or XML files to YAML format
- Generate HTML pages from the YAML files for web presentation
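
The two workflow steps can be chained from a small driver script. The following is a minimal sketch, assuming both scripts sit in the current working directory; `build_pipeline_commands` and `run_pipeline` are hypothetical helpers that are not part of the toolkit, and they only assemble flags documented in this README.

```python
import subprocess
import sys


def build_pipeline_commands(species=None, limit=None,
                            yaml_dir="outputs/", html_dir="msigdb/"):
    """Return the export and page-generation commands as argument lists.

    Hypothetical helper: it only combines flags described in this README.
    """
    export_cmd = [sys.executable, "export_genesets.py", "--output", yaml_dir]
    pages_cmd = [sys.executable, "generate_pages.py",
                 "--input", yaml_dir, "--output", html_dir, "--index"]
    for cmd in (export_cmd, pages_cmd):
        if species in ("human", "mouse"):
            cmd.append(f"--{species}")      # restrict both steps to one species
        if limit is not None:
            cmd += ["--limit", str(limit)]  # cap the number of gene sets or pages
    return export_cmd, pages_cmd


def run_pipeline(**kwargs):
    """Run the export step, then page generation, stopping on the first failure."""
    for cmd in build_pipeline_commands(**kwargs):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(species="mouse", limit=10)` would export up to 10 mouse gene sets and then build their pages along with the index pages.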

Requirements:

- Python 3.9 or higher
- PyYAML 6.0+
- Jinja2 3.0+
- lxml 4.6+ (optional, for better XML error handling)

Install the dependencies:

    pip install -r requirements.txt

export_genesets.py exports MSigDB gene sets from SQLite databases to structured YAML files. This is the primary export method for working with MSigDB database files.
This script reads gene set data from MSigDB SQLite database files and version history from XML files, then exports each gene set as a separate YAML file. The YAML files include comprehensive metadata such as gene members, descriptions, source publications, related gene sets, and version history.
    python export_genesets.py [OPTIONS]

- --human - Export only human gene sets (from Hs database)
- --mouse - Export only mouse gene sets (from Mm database)
- If neither flag is specified, both human and mouse gene sets are exported
- --limit N - Limit the total number of gene sets to export across all species
  - Example: --limit 100 exports up to 100 gene sets total
- --resume - Skip generating YAML files that already exist on disk
  - Useful for resuming interrupted exports or updating only new gene sets
- --output PATH - Custom output directory for YAML files
  - Default: outputs/
  - YAML files are created in {output}/human/ and {output}/mouse/ subdirectories
- --input PATH - Custom input directory for database and XML files
  - Default: inputs/
- --hs-db PATH - Override path to human database file
  - Default: inputs/msigdb_FULL_v2025.1.Hs.db
- --mm-db PATH - Override path to mouse database file
  - Default: inputs/msigdb_FULL_v2025.1.Mm.db
- --hs-xml PATH - Override path to human XML history file
  - Default: inputs/msigdb_history_v2025.1.Hs.xml
- --mm-xml PATH - Override path to mouse XML history file
  - Default: inputs/msigdb_history_v2025.1.Mm.xml
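
The --resume behavior (skipping YAML files that already exist on disk) can be pictured with a short sketch. This is an illustration only, not the script's actual code: `pending_gene_sets` is a hypothetical helper, and it assumes each gene set is written to a file named after the gene set with a .yaml extension, as the output layout in this README describes.

```python
from pathlib import Path


def pending_gene_sets(names, species_dir):
    """Return the gene set names that have no YAML file yet in species_dir.

    Hypothetical illustration of --resume: existing files are left alone,
    so only the missing gene sets remain to be exported.
    """
    species_dir = Path(species_dir)
    return [name for name in names
            if not (species_dir / f"{name}.yaml").exists()]
```

How --limit interacts with skipped files (whether skipped gene sets count toward the limit) is not stated here, so the sketch leaves that out.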
Export all gene sets (human and mouse):

    python export_genesets.py

Export only human gene sets:

    python export_genesets.py --human

Export only mouse gene sets:

    python export_genesets.py --mouse

Export the first 500 gene sets:

    python export_genesets.py --limit 500

Resume a previous export (skip existing files):

    python export_genesets.py --resume

Export to a custom directory:

    python export_genesets.py --output /path/to/output

Export with custom database paths:

    python export_genesets.py --hs-db /path/to/human.db --mm-db /path/to/mouse.db

Export only 1000 mouse gene sets, resuming from where you left off:

    python export_genesets.py --mouse --limit 1000 --resume

Each gene set is exported as a YAML file in outputs/{species}/{GENE_SET_NAME}.yaml containing:
- Standard and systematic names
- Brief and full descriptions
- Collection information
- Source species and publication details
- Authors and contributor information
- Related gene sets (from same publication and from same authors)
- Gene members with symbols and NCBI IDs
- Version history
- Dataset references
- External links (PubMed, etc.)
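
After an export, a quick way to confirm that files landed where expected is to count the YAML files per species directory. The sketch below assumes the default outputs/human/ and outputs/mouse/ layout described above; `count_exported` is a hypothetical helper and does not inspect the contents of the files.

```python
from pathlib import Path


def count_exported(output_dir="outputs/", species=("human", "mouse")):
    """Count the .yaml files found in each species subdirectory.

    A missing subdirectory is reported as 0 rather than raising an error.
    """
    root = Path(output_dir)
    counts = {}
    for name in species:
        directory = root / name
        if directory.is_dir():
            counts[name] = sum(1 for _ in directory.glob("*.yaml"))
        else:
            counts[name] = 0
    return counts
```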
export_genesets_xml.py exports MSigDB gene sets from XML files to structured YAML files. It is an alternative to export_genesets.py that works directly with XML source files instead of SQLite databases.
This script parses MSigDB XML files and exports each gene set as a separate YAML file. It includes advanced XML sanitization to handle malformed XML content, including invalid UTF-8 sequences and control characters. The script can process both the main gene set XML files and version history XML files.
    python export_genesets_xml.py [OPTIONS]

- --human - Export only human gene sets (from Hs XML)
- --mouse - Export only mouse gene sets (from Mm XML)
- If neither flag is specified, both human and mouse gene sets are exported
- --limit N - Limit the total number of gene sets to export across all species
  - Example: --limit 100 exports up to 100 gene sets total
- --resume - Skip generating YAML files that already exist on disk
  - Useful for resuming interrupted exports or updating only new gene sets
- --output PATH - Custom output directory for YAML files
  - Default: outputs/
  - YAML files are created in {output}/human-xml/ and {output}/mouse-xml/ subdirectories
- --input PATH - Custom input directory for XML files
  - Default: inputs/
- --hs-xml PATH - Override path to human gene set XML file
  - Default: inputs/msigdb_v2025.1.Hs.xml
- --mm-xml PATH - Override path to mouse gene set XML file
  - Default: inputs/msigdb_v2025.1.Mm.xml
- --hs-history-xml PATH - Override path to human XML history file
  - Default: inputs/msigdb_history_v2025.1.Hs.xml
- --mm-history-xml PATH - Override path to mouse XML history file
  - Default: inputs/msigdb_history_v2025.1.Mm.xml
Export all gene sets (human and mouse):

    python export_genesets_xml.py

Export only human gene sets:

    python export_genesets_xml.py --human

Export only mouse gene sets:

    python export_genesets_xml.py --mouse

Export the first 500 gene sets:

    python export_genesets_xml.py --limit 500

Resume a previous export (skip existing files):

    python export_genesets_xml.py --resume

Export to a custom directory:

    python export_genesets_xml.py --output /path/to/output

Export with custom XML paths:

    python export_genesets_xml.py --hs-xml /path/to/human.xml --mm-xml /path/to/mouse.xml

Export only 1000 mouse gene sets with custom paths:

    python export_genesets_xml.py --mouse --limit 1000 --mm-xml /path/to/mouse.xml

Each gene set is exported as a YAML file in outputs/{species}-xml/{GENE_SET_NAME}.yaml with the same structure as export_genesets.py:
- Standard and systematic names
- Brief and full descriptions
- Collection information
- Source species and publication details
- Authors and contributor information
- Related gene sets
- Gene members with symbols and NCBI IDs
- Version history
- Dataset references
- External links
The script includes robust XML sanitization that:
- Handles invalid UTF-8 sequences
- Removes invalid XML control characters
- Processes malformed attribute values
- Creates sanitized temporary files when necessary
- Provides detailed error reporting
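
To make the first two sanitization steps concrete, here is a minimal sketch using only the standard library. It is not the script's actual implementation: `sanitize_xml_bytes` is a hypothetical function, and the choice to replace undecodable bytes with U+FFFD and to delete disallowed characters outright is an assumption. The character ranges come from the XML 1.0 definition of a legal character.

```python
import re
import xml.etree.ElementTree as ET

# Matches any character outside the ranges XML 1.0 allows:
# tab, newline, carriage return, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_xml_bytes(raw: bytes) -> str:
    """Decode raw bytes leniently and drop characters XML 1.0 forbids."""
    # Invalid UTF-8 sequences become U+FFFD instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="replace")
    # Control characters such as NUL or vertical tab are removed.
    return _INVALID_XML_CHARS.sub("", text)


# The cleaned text can then be handed to a regular XML parser.
element = ET.fromstring(sanitize_xml_bytes(b"<name>HALLMARK\x00_APOPTOSIS</name>"))
```

Deleting characters is the simplest policy; a real exporter might prefer to log each removal so that damaged records can be traced back to the source file.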
generate_pages.py generates static HTML pages from YAML gene set files. It creates a complete website with an individual page for each gene set plus collection index pages.
This script reads YAML gene set files (generated by either export_genesets.py or export_genesets_xml.py) and creates static HTML pages using Jinja2 templates. Each gene set gets its own HTML page with complete metadata, gene lists, related gene sets, and version history. Index pages are also generated to browse gene sets by collection.
Note: Index pages and links between gene sets use relative paths, so they work correctly regardless of where the files are deployed (subdirectory, different domain, etc.).
    python generate_pages.py [OPTIONS]

- --human - Generate only human gene set pages
- --mouse - Generate only mouse gene set pages
- If neither flag is specified, both human and mouse pages are generated
- --limit N - Limit the total number of HTML pages to generate across all species
  - Example: --limit 100 generates up to 100 pages total
- --resume - Skip generating HTML files that already exist on disk
  - Useful for resuming interrupted generation or updating only new pages
- --geneset NAME - Generate a specific gene set by name
  - Example: --geneset ZNF320_TARGET_GENES
  - Useful for testing or regenerating a single page
- --index - Generate index pages
  - When included, the overall index and all collection index pages are generated
  - When omitted, only individual gene set pages are generated (no indices)
  - Example: --index to generate all index pages
- --input PATH - Custom input directory containing YAML files
  - Default: outputs/
  - The script expects subdirectories {input}/human/ and {input}/mouse/
  - Example: --input outputs/2025.1/
- --output PATH - Custom output directory for HTML files
  - Default: msigdb/
  - HTML files are created in {output}/human/geneset/ and {output}/mouse/geneset/ subdirectories
- --link-prefix PREFIX - Prefix for external links (JSP pages, compendia, etc.)
  - Default: `` (empty string)
  - Example: --link-prefix https://www.gsea-msigdb.org/
  - Note: This only affects external links; internal links between gene sets and index pages always use relative paths
- --version TAG - Version tag to display on pages
  - Example: --version v2025.1.Hs
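
For readers who want to see how these options fit together, the sketch below declares them with argparse. It is a hypothetical reconstruction from the descriptions above, not the script's source: `build_parser` is an invented name, and treating --index, --resume, --human, and --mouse as valueless boolean flags is an assumption based on the "when included / when omitted" wording.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented generate_pages.py options."""
    parser = argparse.ArgumentParser(prog="generate_pages.py")
    # Boolean switches: present means True, absent means False.
    parser.add_argument("--human", action="store_true", help="only human pages")
    parser.add_argument("--mouse", action="store_true", help="only mouse pages")
    parser.add_argument("--resume", action="store_true", help="skip existing HTML files")
    parser.add_argument("--index", action="store_true", help="also generate index pages")
    # Options that take a value.
    parser.add_argument("--limit", type=int, default=None, metavar="N")
    parser.add_argument("--geneset", default=None, metavar="NAME")
    parser.add_argument("--input", default="outputs/", metavar="PATH")
    parser.add_argument("--output", default="msigdb/", metavar="PATH")
    parser.add_argument("--link-prefix", default="", metavar="PREFIX")
    parser.add_argument("--version", default=None, metavar="TAG")
    return parser


args = build_parser().parse_args(["--mouse", "--limit", "500", "--resume"])
```

With this declaration, `args.index` is False unless --index appears on the command line, and `args.link_prefix` falls back to the empty string.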
Generate all gene set pages (human and mouse):

    python generate_pages.py

Generate from a custom input directory:

    python generate_pages.py --input outputs/2025.1/

Generate only human gene set pages:

    python generate_pages.py --human

Generate only mouse gene set pages:

    python generate_pages.py --mouse

Generate the first 100 pages:

    python generate_pages.py --limit 100

Resume page generation (skip existing files):

    python generate_pages.py --resume

Generate to a custom directory:

    python generate_pages.py --output /var/www/html

Generate pages with link prefix for external resources:

    python generate_pages.py --link-prefix https://www.gsea-msigdb.org/

Generate a single specific gene set:

    python generate_pages.py --geneset HALLMARK_APOPTOSIS

Generate 500 mouse pages, resuming from where you left off:

    python generate_pages.py --mouse --limit 500 --resume

Generate with custom input, output, and version tag:
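
    python generate_pages.py \
        --input outputs/2025.1/ \
        --output msigdb/2025.1/ \
        --version v2025.1.Hs

Generate without index pages (gene set pages only). As described for the --index option, index pages are generated only when --index is included, so omit it:

    python generate_pages.py

Generate gene set pages together with the overall and collection index pages:

    python generate_pages.py --index

HTML pages are generated in the following structure:
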
    {output}/
    ├── index.html                   # Overall index (relative links)
    ├── human/
    │   ├── collection_C1.html       # Collection indexes (relative links)
    │   ├── collection_C2.html
    │   ├── collection_C2_CGP.html
    │   ├── collection_C2_CP.html
    │   └── geneset/
    │       ├── GENE_SET_1.html      # Individual gene set pages
    │       ├── GENE_SET_2.html
    │       └── ...
    └── mouse/
        ├── collection_M1.html
        ├── collection_M2.html
        └── geneset/
            ├── GENE_SET_1.html
            └── ...

Each HTML page includes:
- Gene set name and description
- Collection and source information
- Complete gene member list with NCBI links
- Related gene sets (with relative links)
- Version history
- External links
- Formatted metadata tables
Relative Links: Links between gene sets and from index pages to gene sets use relative paths. This means the generated HTML files can be deployed to any location (subdirectory, CDN, etc.) and the internal navigation will continue to work correctly.
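
The idea behind relative links can be shown with the standard library. The sketch below is illustrative only and assumes the output layout described in this README; `relative_link` is a hypothetical helper, and the script may compute its links differently. Both arguments are paths relative to the output root, written with forward slashes.

```python
import posixpath


def relative_link(from_page, to_page):
    """Return the href that reaches to_page from the directory of from_page."""
    # A page at the site root has an empty dirname, so fall back to ".".
    start = posixpath.dirname(from_page) or "."
    return posixpath.relpath(to_page, start=start)


# From a collection index to a gene set page in the same species.
link = relative_link("human/collection_C2.html", "human/geneset/GENE_SET_1.html")
```

Because the result never contains the deployment root, the same HTML works whether the site is served from the domain root or from a subdirectory.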
The script expects YAML files to exist in:
- {input}/human/ for human gene sets (default: outputs/human/)
- {input}/mouse/ for mouse gene sets (default: outputs/mouse/)
Run export_genesets.py or export_genesets_xml.py first to generate these YAML files.