A Monarch Initiative ingest pipeline for phenopacket data from the phenopacket-store. This pipeline downloads phenopacket data, extracts it, and transforms it into Biolink-compatible entities for inclusion in Monarch knowledge graphs.
The phenopacket-ingest pipeline processes phenopacket data through several steps:
- Download: Retrieves the latest phenopacket data from the phenopacket-store GitHub releases
- Extract: Parses phenopacket data into a structured JSONL format
- Transform: Converts the structured data into Biolink model entities for knowledge graph integration
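The Extract step can be pictured with a minimal sketch. It is not the pipeline's actual implementation: it assumes the release is a ZIP archive of JSON files grouped in cohort folders, and the function name `extract_to_jsonl` and the output keys `cohort` and `phenopacket` are hypothetical.

```python
import io
import json
import zipfile
from pathlib import PurePosixPath


def extract_to_jsonl(zip_bytes: bytes) -> str:
    """Return one JSON line per phenopacket file found in the archive."""
    lines = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".json"):
                continue  # skip directory entries and non-JSON files
            packet = json.loads(archive.read(name))
            # Assumption: the immediate parent folder is the cohort name.
            cohort = PurePosixPath(name).parent.name
            record = {"cohort": cohort, "phenopacket": packet}
            lines.append(json.dumps(record, sort_keys=True))
    # JSONL: one record per line, newline-terminated when non-empty.
    return "\n".join(lines) + ("\n" if lines else "")
```

Keeping the cohort next to each record matters later, because the Case ID is built from both the cohort and the phenopacket ID.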
This ingest relies on phenopacket data from the phenopacket-store repository, which contains structured phenotypic data about rare disease cases in the GA4GH Phenopacket format.
- phenopacket-store releases: ZIP archive containing JSON phenopacket files organized by cohort (gene folder)
- Each phenopacket contains standardized data about an individual case, including:
  - Subject information (ID, sex, age)
  - Phenotypic features (HPO terms)
  - Disease information (MONDO terms)
  - Genetic findings (variants and genes)
  - Interpretations (causality assessments)
  - Metadata (references, provenance)
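To make the contents listed above concrete, here is a small sketch that pulls a few of those fields out of a phenopacket already loaded as a Python dict. It assumes the camelCase JSON field names of the GA4GH Phenopacket schema (`subject`, `phenotypicFeatures`, `diseases`, `interpretations`). The helper `summarize_phenopacket` is hypothetical and not part of this project.

```python
def summarize_phenopacket(packet: dict) -> dict:
    """Collect a few commonly used fields from a phenopacket dict."""
    subject = packet.get("subject", {})
    return {
        "subject_id": subject.get("id"),
        "sex": subject.get("sex"),
        # Each phenotypic feature carries an ontology class under "type".
        "phenotype_ids": [
            feature["type"]["id"]
            for feature in packet.get("phenotypicFeatures", [])
            if "type" in feature
        ],
        # Each disease carries an ontology class under "term".
        "disease_ids": [
            disease["term"]["id"]
            for disease in packet.get("diseases", [])
            if "term" in disease
        ],
        "n_interpretations": len(packet.get("interpretations", [])),
    }


# Illustrative values only; this is not a record from phenopacket-store.
example = {
    "id": "example_case",
    "subject": {"id": "patient1", "sex": "FEMALE"},
    "phenotypicFeatures": [{"type": {"id": "HP:0001250", "label": "Seizure"}}],
    "diseases": [{"term": {"id": "MONDO:0000001"}}],
    "interpretations": [],
}
summary = summarize_phenopacket(example)
```

Missing sections fall back to empty values, so a sparse phenopacket does not raise an error.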
Phenopacket Case entities are assigned IDs that include the cohort (gene folder) name for proper URI resolution:
phenopacket.store:{cohort}.{phenopacket_id}
For example:
phenopacket.store:POGZ.PMID_34133408_case
phenopacket.store:KCNT1.PMID_30566666_patient1
phenopacket.store:11q_terminal_deletion.PMID_15266616_35
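The ID pattern can be written as a one-line formatter. This is a sketch only: the function name `make_case_id` is hypothetical and the validation is an assumption about what a caller might want, not the project's code.

```python
PREFIX = "phenopacket.store"


def make_case_id(cohort: str, phenopacket_id: str) -> str:
    """Build a Case CURIE of the form phenopacket.store:{cohort}.{phenopacket_id}."""
    if not cohort or not phenopacket_id:
        raise ValueError("both cohort and phenopacket_id are required")
    return f"{PREFIX}:{cohort}.{phenopacket_id}"


case_id = make_case_id("POGZ", "PMID_34133408_case")
```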
The dot separator is URL-safe (per GA4GH recommendations and RFC 3986), avoiding routing issues that occur with / separators. The Monarch API expands this CURIE to the correct GitHub URL:
https://github.com/monarch-initiative/phenopacket-store/blob/main/notebooks/{cohort}/phenopackets/{phenopacket_id}.json
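The expansion can be sketched as follows. This is not the Monarch API's implementation. It assumes cohort names never contain a dot, so the first dot after the prefix separates the cohort from the phenopacket ID, and the name `expand_case_id` is hypothetical.

```python
PREFIX = "phenopacket.store:"
URL_TEMPLATE = (
    "https://github.com/monarch-initiative/phenopacket-store/blob/main/"
    "notebooks/{cohort}/phenopackets/{phenopacket_id}.json"
)


def expand_case_id(curie: str) -> str:
    """Expand a phenopacket.store CURIE into its GitHub URL."""
    if not curie.startswith(PREFIX):
        raise ValueError(f"not a phenopacket.store CURIE: {curie!r}")
    local_id = curie[len(PREFIX):]
    # Assumption: the cohort has no dots, so split on the first dot only.
    cohort, sep, phenopacket_id = local_id.partition(".")
    if not sep or not cohort or not phenopacket_id:
        raise ValueError(f"expected '{{cohort}}.{{phenopacket_id}}': {curie!r}")
    return URL_TEMPLATE.format(cohort=cohort, phenopacket_id=phenopacket_id)


url = expand_case_id("phenopacket.store:KCNT1.PMID_30566666_patient1")
```

Splitting on the first dot only means a phenopacket ID that itself contains dots still round-trips.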
- Python >= 3.10
- Poetry
- phenopackets library (optional, but recommended)
- phenopacket-store-toolkit (optional, but recommended)
cd phenopacket-ingest
make install
# or
poetry install
To run the complete pipeline (download, extract, transform) in one step:
poetry run phenopacket_ingest pipeline
For more granular control, you can run each step individually:
- Download the data:
poetry run phenopacket_ingest download
- Extract phenopacket data to JSONL format:
poetry run phenopacket_ingest extract
- Transform the data to Biolink entities:
poetry run phenopacket_ingest transform
Each command has various options. To see them:
poetry run phenopacket_ingest [command] --help
For example:
poetry run phenopacket_ingest download --help
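If the individual steps need to be driven from a script, one option is a thin wrapper around the commands above. This is a sketch under the assumption that `poetry` is on the PATH and the project is installed. The helper names `step_command` and `run_steps` are hypothetical, and only the subcommands named in this README are used.

```python
import subprocess

STEPS = ("download", "extract", "transform")


def step_command(step: str) -> list[str]:
    """Return the argv list for one pipeline step."""
    if step not in STEPS:
        raise ValueError(f"unknown step: {step!r}")
    return ["poetry", "run", "phenopacket_ingest", step]


def run_steps(steps=STEPS, runner=subprocess.run) -> None:
    """Run the steps in order, stopping at the first failure."""
    for step in steps:
        # check=True makes subprocess.run raise CalledProcessError
        # on a non-zero exit code, so later steps are skipped.
        runner(step_command(step), check=True)
```

Calling `run_steps()` runs download, extract, and transform in sequence. The `runner` parameter exists so the wrapper can be exercised without invoking the real CLI.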
The test suite covers:
- Model validation and conversion
- Transformation to Biolink entities
- Registry functionality (downloading and extraction)
Run the tests with:
make test
# or
python -m pytest
To contribute to this project:
- Clone the repository
- Install development dependencies:
make install-dev
# or
poetry install --with dev
- Run the tests:
make test
This project was generated using monarch-initiative/cookiecutter-monarch-ingest.