Feature/expand phenopackets #24

@VarenyaJ

Completed PRs/Issue

PR Title: PR #19 – Feature/expand phenopackets (Branch: feature/expand-phenopackets → develop)
PR Type: Feature
Status: Completed

Background

The pipeline initially produced Phenopackets containing only genotype and
phenotype blocks. To cover more of the GA4GH Phenopacket Schema v2, it needed
to support diseases, measurements, and biosamples as well. This PR also aligned
the CLI and DefaultMapper so that these richer data structures are serialized
into each output file.


Scope

Outline

Extend the mapper and CLI to capture diseases, measurements, and biosamples in
Phenopacket JSON outputs.

Included/Required

  • Added DiseaseRecord, MeasurementRecord, and BiosampleRecord dataclasses.
  • Extended loader (RENAME_MAP) to recognize new columns.
  • Updated DefaultMapper.apply_mapping to detect and map disease/measurement/
    biosample tables.
  • Updated the CLI parse-excel command to serialize these blocks into
    phenopacket JSON.
  • Grouped records by patient across all five record types (genotype,
    phenotype, disease, measurement, biosample).
  • Added integration test test_full_features_parse_creates_all_blocks.
  • Updated README with new CLI commands and audit-excel reference.
  • Refactored DefaultMapper into modular row-level helpers.
  • Added audit improvements and verbose reporting.
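
For orientation, here is a minimal sketch of how a new record type and a column rename map could fit together. The field names, the spreadsheet headers, and the helper functions normalize_header and row_to_disease are assumptions made for illustration; the actual DiseaseRecord and RENAME_MAP in src/P6 may look different.

```python
from dataclasses import dataclass

# Hypothetical rename map: normalized spreadsheet header -> canonical field name.
# The real RENAME_MAP in the loader may use different keys and values.
RENAME_MAP = {
    "patient id": "patient_id",
    "disease term": "disease_term",
    "disease label": "disease_label",
}


@dataclass(frozen=True)
class DiseaseRecord:
    """Illustrative stand-in for the DiseaseRecord dataclass (fields assumed)."""

    patient_id: str
    disease_term: str  # ontology identifier (CURIE)
    disease_label: str


def normalize_header(header: str) -> str:
    """Trim, lower-case, and collapse separators, then apply the rename map."""
    key = " ".join(header.replace("_", " ").strip().lower().split())
    return RENAME_MAP.get(key, key.replace(" ", "_"))


def row_to_disease(row: dict) -> DiseaseRecord:
    """Build a DiseaseRecord from one spreadsheet row given as a dict."""
    normalized = {normalize_header(k): v for k, v in row.items()}
    return DiseaseRecord(
        patient_id=str(normalized["patient_id"]),
        disease_term=str(normalized["disease_term"]),
        disease_label=str(normalized["disease_label"]),
    )
```

Normalizing headers before the lookup means that variants such as "Patient_ID" and "patient id" resolve to the same field, which keeps the map itself small.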

Optional

  • Graceful CLI tolerance for chromosome input (chr16 vs 16).
  • Canonical HGVS emitted without redundant chr prefix.
  • Tests for alias-based sheet selection (variants, hpo, labs).
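
The chromosome tolerance can be pictured as a small normalization step. This is a sketch only: the function name normalize_chromosome and the choice to upper-case the result are assumptions, not the PR's actual code.

```python
def normalize_chromosome(raw: str) -> str:
    """Return a chromosome name without a leading 'chr' prefix.

    Hypothetical helper: accepts 'chr16', 'Chr16', or '16' and returns '16'.
    """
    value = raw.strip()
    if value.lower().startswith("chr"):
        value = value[3:]
    # Upper-case so that 'chrx' and 'X' compare equal after normalization.
    return value.upper()
```

Applying the same normalization to both user input and stored values lets the two forms be compared directly.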

Not included

  • VariationDescriptor gene_context and HGVS expression integration (left as
    TODO).
  • Visualization/dashboard layer.

Technical Plan / Implementation Details

  • New files: src/P6/disease.py, src/P6/measurement.py, src/P6/biosample.py.
  • Loader extended with mappings for disease, measurement, biosample fields.
  • DefaultMapper.apply_mapping now returns list[Phenopacket] instead of tuples.
  • Row-level parsing split into _map_genotype_table, _map_phenotype_table,
    _map_diseases_table, _map_measurements_table, _map_biosamples_table.
  • _group_records_by_patient aggregates all record types before serialization.
  • CLI integration:
    • p6 parse-excel → writes phenopackets with all supported blocks.
    • p6 audit-excel → improved audit with header normalization, sheet
      classification, and variant checks.
  • Tests added:
    • tests/test_full_features.py
    • tests/test_mapper_* (row parsing, required column checks, HGVS consistency).
    • Tests for utility helpers and audit/preprocess validation.
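
The grouping step described above can be sketched as follows. The Record class and the keyword-argument interface are assumptions for illustration; the real _group_records_by_patient works on the project's own record types and may be structured differently.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class Record:
    """Hypothetical minimal record; the real record types carry more fields."""

    patient_id: str
    payload: Any


def group_records_by_patient(
    **record_lists: Iterable[Record],
) -> dict[str, dict[str, list[Record]]]:
    """Group records of several kinds under each patient ID.

    Every patient gets an entry for every kind, so later serialization
    can rely on all keys being present even when a list is empty.
    """
    grouped: dict[str, dict[str, list[Record]]] = {}
    for kind, records in record_lists.items():
        for record in records:
            bucket = grouped.setdefault(
                record.patient_id, {k: [] for k in record_lists}
            )
            bucket[kind].append(record)
    return grouped
```

Pre-filling an empty list for each record kind avoids special-casing patients who appear in only some of the input tables.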

Validation & Testing

  • Integration tests confirmed that diseases, measurements, and biosamples appear
    in output phenopacket JSON.
  • Unit tests for row-level mapping of genotype, phenotype, disease, measurement,
    and biosample tables.
  • CLI tested with both table and JSON audit outputs.
  • Network calls mocked in test_download_mock.py for HPO fetch.
  • Mapper tested on strict vs non-strict HGVS consistency.
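
A check of the kind the integration test performs might look like the sketch below. The helper name missing_blocks is hypothetical; the key names follow the top-level Phenopacket v2 JSON fields, and the real test in tests/test_full_features.py may assert differently.

```python
import json

# Top-level Phenopacket v2 JSON keys for the blocks added in this PR.
EXPECTED_BLOCKS = ("diseases", "measurements", "biosamples")


def missing_blocks(phenopacket_json: str, expected=EXPECTED_BLOCKS) -> list[str]:
    """Return the expected top-level blocks that are absent or empty."""
    data = json.loads(phenopacket_json)
    return [key for key in expected if not data.get(key)]
```

A test can then assert that missing_blocks returns an empty list for each file written by p6 parse-excel.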

Milestones

  • Add disease/measurement/biosample dataclasses.
  • Extend loader and RENAME_MAP for new fields.
  • Update DefaultMapper to map new sheet types.
  • Update CLI parse-excel to emit expanded phenopackets.
  • Add full integration tests.
  • Refactor DefaultMapper into modular components.

Outcome

  • Phenopackets now support diseases, measurements, and biosamples alongside
    genotypes and phenotypes.
  • CLI users can parse richer Excel inputs and produce GA4GH-compliant JSON.
  • Expanded tests and refactoring increased maintainability of the mapping layer.
