Skip to content

monarch-initiative/phenopacket-ingest

Repository files navigation

phenopacket-ingest

A Monarch Initiative ingest pipeline for phenopacket data from the phenopacket-store. This pipeline downloads phenopacket data, extracts it, and transforms it into Biolink-compatible entities for inclusion in Monarch knowledge graphs.

Overview

The phenopacket-ingest pipeline processes phenopacket data through several steps:

  1. Download: Retrieves the latest phenopacket data from the phenopacket-store GitHub releases
  2. Extract: Parses phenopacket data into a structured JSONL format
  3. Transform: Converts the structured data into Biolink model entities for knowledge graph integration

Data Sources

This ingest relies on phenopacket data from the phenopacket-store repository, which contains structured phenotypic data about rare disease cases in the GA4GH Phenopacket format.

Source Files

  • phenopacket-store releases: ZIP archive containing JSON phenopacket files organized by cohort (gene folder)
  • Each phenopacket contains standardized data about an individual case including:
    • Subject information (ID, sex, age)
    • Phenotypic features (HPO terms)
    • Disease information (MONDO terms)
    • Genetic findings (variants and genes)
    • Interpretations (causality assessments)
    • Metadata (references, provenance)

Entity ID Format

Phenopacket Case entities are assigned IDs that include the cohort (gene folder) name for proper URI resolution:

phenopacket.store:{cohort}.{phenopacket_id}

For example:

  • phenopacket.store:POGZ.PMID_34133408_case
  • phenopacket.store:KCNT1.PMID_30566666_patient1
  • phenopacket.store:11q_terminal_deletion.PMID_15266616_35

The dot separator is URL-safe (per GA4GH recommendations and RFC 3986), avoiding routing issues that occur with / separators. The Monarch API expands this CURIE to the correct GitHub URL:

https://github.com/monarch-initiative/phenopacket-store/blob/main/notebooks/{cohort}/phenopackets/{phenopacket_id}.json

Requirements

Installation

cd phenopacket-ingest
make install
# or
poetry install

Usage

Full Pipeline

To run the complete pipeline (download, extract, transform) in one step:

poetry run phenopacket_ingest pipeline

Individual Steps

For more granular control, you can run each step individually:

  1. Download the data:
poetry run phenopacket_ingest download
  1. Extract phenopacket data to JSONL format:
poetry run phenopacket_ingest extract
  1. Transform the data to Biolink entities:
poetry run phenopacket_ingest transform

Command Options

Each command has various options. To see them:

poetry run phenopacket_ingest [command] --help

For example:

poetry run phenopacket_ingest download --help

Testing

The test suite covers:

  • Model validation and conversion
  • Transformation to Biolink entities
  • Registry functionality (downloading and extraction)

Run the tests with:

make test
# or
python -m pytest

Development

To contribute to this project:

  1. Clone the repository
  2. Install development dependencies:
    make install-dev
    # or
    poetry install --with dev
  3. Run the tests:
    make test

This project was generated using monarch-initiative/cookiecutter-monarch-ingest.

About

This is a modular ingest for phenopackets into the Monarch KG

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •