Peter's Parse and Processing of Prenatal Particulars via Pandas
A simple, extensible CLI for downloading the Human Phenotype Ontology (HPO), parsing genotype/phenotype Excel workbooks, and producing GA4GH Phenopackets that follow the GA4GH Phenopacket Schema.
- Features
- Prerequisites
- Installation
- Quickstart
- CLI Reference
- Development & Testing
- Contributing
- License
- Contact
- Download: fetch the latest or a specific hp.json release from GitHub
- Parse: autodetect genotype vs phenotype sheets in any Excel workbook
- Normalize: clean up column names, HPO IDs, timestamps, and data types
- Generate: emit individual Phenopacket files, one per record (will change the file extension later)
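
The Normalize step can be pictured with a small sketch. The helper names and the exact cleaning rules below are hypothetical and are not taken from the P6 source; the only fixed fact assumed is that an HPO ID has the form `HP:` followed by seven digits.

```python
import re

# Hypothetical helpers: names and rules are illustrative, not P6's own code.
_HPO_DIGITS = re.compile(r"^\s*(?:HP[:_])?\s*(\d{1,7})\s*$", re.IGNORECASE)


def normalize_column_name(name: str) -> str:
    """Lower-case a header and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def normalize_hpo_id(raw: str) -> str:
    """Return an HPO ID in the canonical 'HP:' plus seven digits form."""
    match = _HPO_DIGITS.match(raw)
    if match is None:
        raise ValueError(f"not an HPO ID: {raw!r}")
    # Zero-pad the numeric part so 'hp:1250' and 'HP_0001250' agree.
    return f"HP:{int(match.group(1)):07d}"
```

With these assumptions, a header such as `" HPO ID "` becomes `hpo_id`, and `hp:1250`, `HP_0001250`, and `1250` all map to `HP:0001250`; anything without a numeric part raises `ValueError` rather than being guessed at.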
- Clone the repo:
git clone https://github.com/VarenyaJ/P6.git
cd P6
- (Recommended) Create a virtual environment (venv or Conda):
python3 -m venv .venv
source .venv/bin/activate
Or, with Conda:
conda env create -f requirements/environment.yml -y
conda activate P6
- Install via pip:
python3 -m pip install -r requirements/requirements.txt .
- Verify the installation:
p6 --help
You should see something like:
Usage: p6 [OPTIONS] COMMAND [ARGS]...

  P6: Peter's Parse and Processing of Prenatal Particulars via Pandas.

Options:
  --help  Show this message and exit.

Commands:
  download     Download a specific or the latest HPO JSON release into...
  parse-excel  Read each sheet, check column order, then: - Identify as a...

Fetch the latest release into tests/data/ (the default directory):
p6 download
After running, you’ll have tests/data/hp.json.
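
To sanity-check the downloaded file, a short script can read it back. This is a sketch that assumes hp.json follows the OBO Graphs JSON layout (a top-level `graphs` list whose `nodes` carry an `id` IRI and an `lbl` label); the function name `load_hpo_labels` is hypothetical and not part of P6.

```python
import json
from pathlib import Path

# Assumed IRI prefix for HPO terms in OBO Graphs JSON.
OBO_PREFIX = "http://purl.obolibrary.org/obo/HP_"


def load_hpo_labels(hp_json: Path) -> dict[str, str]:
    """Map 'HP:0001250'-style IDs to labels from an OBO Graphs JSON file."""
    with hp_json.open(encoding="utf-8") as handle:
        document = json.load(handle)
    labels: dict[str, str] = {}
    for graph in document.get("graphs", []):
        for node in graph.get("nodes", []):
            node_id = node.get("id", "")
            # Skip non-HPO nodes and nodes without a label.
            if node_id.startswith(OBO_PREFIX) and "lbl" in node:
                labels["HP:" + node_id[len(OBO_PREFIX):]] = node["lbl"]
    return labels
```

Calling `load_hpo_labels(Path("tests/data/hp.json"))` after `p6 download` should then return a dictionary with one entry per labelled HPO term, which is a quick way to confirm that the file is complete and parseable.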
With your HPO JSON in place at tests/data/hp.json, run:
p6 parse-excel -e tests/data/Sydney_Python_transformation.xlsx
The resulting Phenopacket files are written under a timestamped directory:
phenopacket_from_excel/$(date "+%Y-%m-%d_%H-%M-%S")/phenopackets/
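
The shell expression in that path is only a notation for "the time the command ran". As a sketch, the same path can be built in Python; the function name `phenopacket_output_dir` is hypothetical, and the format string simply mirrors the `date` pattern shown above.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def phenopacket_output_dir(now: Optional[datetime] = None) -> Path:
    """Build the timestamped output path described above."""
    # Same pattern as: date "+%Y-%m-%d_%H-%M-%S"
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    return Path("phenopacket_from_excel") / stamp / "phenopackets"
```

Because the timestamp fields run from year to second, sorting the directories under `phenopacket_from_excel/` by name also sorts them chronologically, so the last entry is the most recent run.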
Quickly check each sheet in an Excel file for header normalization, sheet classification, and presence of required variant columns.
p6 audit-excel -e tests/data/Sydney_Python_transformation.xlsx
By default you get a table; use -r to print the report as JSON to the console instead.
p6 audit-excel -e tests/data/Sydney_Python_transformation.xlsx -r

Usage: p6 download [OPTIONS]
Options:
-d, --data-path PATH where to save HPO JSON (default: tests/data)
-v, --hpo-version TEXT exact HPO release tag (e.g. 2025-03-03 or v2025-03-03)
--help Show this help message and exit.
Examples:
Fetch a specific release tag (e.g. v2025-03-03 or 2025-03-03) into tests/data/ (the default directory):
p6 download -v 2025-03-03
p6 download --hpo-version 2025-03-03
Fetch a specific release tag (e.g. v2025-03-03 or 2025-03-03) into a custom directory:
p6 download -d src/P6 -v 2025-03-03
p6 download --data-path src/P6 --hpo-version 2025-03-03

Read an Excel workbook, classify sheets, normalize fields, and emit Phenopacket protocol buffers.
Usage: p6 parse-excel [OPTIONS] EXCEL_FILE
Options:
-e, --excel-path FILE path to the Excel workbook [required]
-hpo, --custom-hpo FILE path to a custom HPO JSON file (defaults to `tests/data/hp.json`)
--help Show this message and exit.
Example:
Explicitly point at a custom HPO file:
p6 parse-excel -e tests/data/Sydney_Python_transformation.xlsx -hpo src/P6/hp.json

Run a lightweight audit on each sheet in an Excel workbook, reporting header counts, sheet classification, and missing variant-column checks.
Usage: p6 audit-excel [OPTIONS] EXCEL_FILE
Options:
-e, --excel-path FILE path to the Excel workbook [required]
-r, --report-json output audit report as JSON instead of table
--help Show this message and exit.

Install dev requirements:
python3 -m pip install -r requirements/requirements.txt -r requirements/requirements_test.txt .
This installs P6 along with the dependencies needed for development.
Run the full test suite:
pytest -q
Lint & type-check (via ruff and built-in assertions):
ruff check .
ruff format .

- Fork the repo & create a feature branch
- Make your changes & add tests
- Ensure all tests pass & lint is clean
- Submit a pull request against main
- Please follow the AGPL-3.0 code of conduct.
This project is licensed under the AGPL-3.0. See LICENSE for details.
Varenya Jain
varenyajj@gmail.com
GitHub: @VarenyaJ