
Feature/expand phenopackets - Builds on top of feature/preprocessing-columns#6

Merged
VarenyaJ merged 39 commits into develop from feature/expand-phenopackets
Aug 15, 2025

Conversation

Owner

@VarenyaJ VarenyaJ commented Aug 7, 2025

(phenopacket): Expand mapping to include diseases, measurements and biosamples

  • Add DiseaseRecord, MeasurementRecord and BiosampleRecord dataclasses
  • Extend loader and RENAME_MAP to recognize measurement and biosample columns
  • Update DefaultMapper.apply_mapping to detect disease, measurement and biosample sheets by key columns
  • Wire up CLI parse-excel to serialize diseases, measurements and biosamples blocks into each phenopacket JSON
  • Add integration test test_full_features_parse_creates_all_blocks
  • Clean up ruff warnings: remove unused records assignments in mapper stubs (for now)

VarenyaJ and others added 20 commits August 5, 2025 12:21
…if both are present, these notations need to say the same thing; then add verbose audit output and configurable strictness for variant checks:

- Introduce `--verbose` flag in `parse-excel` to show preprocessing audit steps.
- Colorize audit messages (info=cyan, warn=yellow, error=red) and indent them for better readability.
- Add `--strict-variants` / `--no-strict-variants` option (default: warn) to treat raw↔HGVS mismatches as errors when enabled.
- Pass `strict_variants` through to `DefaultMapper` and update `_check_hgvs_consistency` to honor the new strictness setting.
- **New CLI command** `audit-excel` in `src/P6/__main__.py`
  - Options:
    - `-e, --excel-path <FILE>` (required) to specify the workbook
    - `-r, --report-json` flag to toggle JSON output
  - Reads sheets via existing `_read_sheets()`
  - Calls `preprocess()` to collect audit entries
  - Renders either:
    - Human-readable table (columns: SHEET, STEP, LEVEL, MESSAGE; two spaces between columns for reliable parsing)
    - JSON array of `{ step, sheet, level, message }` objects with `--report-json`
- **Tests for `audit-excel`** in `tests/test_cli_audit_excel.py`
  - `test_audit_excel_table_output`:
    - Invokes `audit-excel -e <sample.xlsx>`
    - Asserts exit code `0`
    - Verifies header line begins with `SHEET`
    - Splits each subsequent line on `\s{2,}` and checks for ≥4 columns
  - `test_audit_excel_json_output`:
    - Invokes with `-r`
    - Parses output as JSON list of dicts
    - Asserts each dict contains keys `step`, `sheet`, `level`, `message`
- **Unit tests for `preprocess()`** in `tests/test_preprocess.py`
  - Fixture `simple_workbook`: generates a minimal valid genotype sheet in a temp Excel file
  - `test_preprocess_detects_variant_sheet`:
    - Loads tables via `load_sheets_as_tables()`
    - Runs `preprocess()`
    - Asserts presence of a `classify-sheet` entry for “geno”
    - Asserts no error-level `variant-check` entry on valid input
- **Non-functional changes**
  - Imported `json` at top of `__main__.py`
  - No modifications to existing `parse-excel` or phenopacket serialization logic
@VarenyaJ VarenyaJ requested a review from ielis August 7, 2025 13:37
Comment thread tests/test_full_features.py Outdated
)


def test_full_features_parse_creates_all_blocks(decompressed_hpo):
Collaborator


The test receives a path to a decompressed HPO JSON file. However, HPO toolkit will decompress hp.json.gz on the fly (it picks up the .gz suffix).

So, you do not need to do the decompression, unless you really want to do it for some other reason.

Comment thread src/P6/mapper.py Outdated
Comment on lines +105 to +107
def apply_mapping(
self, tables: dict[str, pd.DataFrame], notepad: Notepad
) -> tuple[list[Genotype], list[Phenotype]]:
) -> tuple[
Collaborator

@ielis ielis Aug 7, 2025


I see a lint in my IDE - the method does not override the abstract method in the correct fashion.

It should return a typing.Sequence[Phenopacket], but it returns a tuple of other things.

We should either change the TableMappers signature or return a sequence of phenopackets from the overriding method (this method, preferred solution).


A second point. This function involves a lot of guesswork about which sheet we are dealing with, which brings a lot of unnecessary complexity. I think it would be better to make this mapper work with a dict of data frames where the dict keys either include known values (genotype, phenotype, etc.) or we bail out. I do not think we should try to make this class work for any set of data frames imaginable. That is just too hard to test and reason about.

Working with known table names allows us to delegate parsing of the genotype and phenotype into separate mappers, which we can test. The parsing can be followed with aggregation into phenopackets.

So, I would recommend simplifying the logic by basically asking for "standardized" table names.


A third point. I think this function does a lot of stuff and it may be worth to try to split it into smaller components, each responsible for doing one thing. For instance, in case of _check_hgvs_consistency, we can make it into a separate function that checks a single item instead of a frame of items. Something like:

def check_hgvs_consistency(
    item: pd.Series,
    notepad: Notepad,
):
    ...

where `item` is a row with a variant to check.

Then you can write tests for all sorts of weird variants and know that the errors will be reported. Using the function from DefaultMapper will also be trivial; we can call it within a loop.

Similarly, we can isolate the logic for parsing genotypes (self._map_genotype) and phenotypes (self._map_phenotype) into their own components. The logic is anything but trivial. The same drill applies - make it work with a single data item, test it, and let the mapper orchestrate their application on the provided tables.
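A per-row check along the lines the reviewer suggests might look like the sketch below; the column names (`hgvsg`, `chromosome`, `position`) and returning a list of messages instead of writing to a `Notepad` are my simplifications:

```python
import re


def check_hgvs_consistency(item, strict: bool = False) -> list[str]:
    """Compare one row's genomic HGVS string against its raw coordinate
    fields. `item` can be a dict or a pandas row; both support `.get`."""
    issues: list[str] = []
    hgvs = str(item.get("hgvsg", "")).strip()
    # Accept both 'chr16:g.100A>G' and '16:g.100A>G' (simple SNVs only).
    m = re.match(r"(?:chr)?(\w+):g\.(\d+)([ACGT]+)>([ACGT]+)$", hgvs)
    if not m:
        issues.append(f"unparseable HGVS expression: {hgvs!r}")
        return issues
    chrom, pos, ref, alt = m.groups()
    if str(item.get("chromosome", "")).removeprefix("chr") != chrom:
        issues.append("chromosome mismatch")
    if int(item.get("position", -1)) != int(pos):
        level = "error" if strict else "warning"
        issues.append(f"{level}: position mismatch")
    return issues
```

The mapper can then apply it trivially in a loop, one row at a time, and individual weird variants become easy to unit-test.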

Collaborator

@ielis ielis left a comment


Hello Varenya,

I had a look at the code and I do not see any principal issues with the code logic. I would, however, recommend restructuring it. The DefaultMapper is the most important pipeline component, and it should be split into several smaller components. Each component should deal with a particular task and be well tested.

I see that you have some top-level tests of the CLI, and those are great to ensure that the CLI works as intended. It is hard, however, to test all sorts of weird situations that can arise when parsing variants or phenotypes (e.g. a non-existing HPO term, such as HP:9999999). If you split the mapper into smaller components (classes or functions), you will be able to test them thoroughly and increase confidence in the code.

I added several comments below (or above?).

VarenyaJ and others added 13 commits August 8, 2025 15:37
…testable components and align interface

- **Align with `TableMapper` interface**
  - `apply_mapping` now returns `list[Phenopacket]` (previously returned tuples of record lists)
  - Pipeline steps:
    1. Select and validate tables via `_choose_named_tables`
    2. Map rows to domain records using `_map_*_table` wrappers
    3. Group records per patient
    4. Build one `Phenopacket` per patient
    5. Return the list of packets

- **Explicit table selection**
  - Added `KNOWN_SHEET_ALIASES` for common name variants:
    - `genotype`: {"genotype", "variants", "variant", "geno"}
    - `phenotype`: {"phenotype", "hpo", "pheno"}
    - `diseases`: {"disease", "diseases"}
    - `measurements`: {"measurement", "measurements", "labs"}
    - `biosamples`: {"biosample", "biosamples", "samples"}
  - Removed column-based guessing; require explicit/aliased names
  - Emit error if both genotype and phenotype sheets are missing

- **Row-level parsers**
  - `parse_genotype_row`:
    - Splits slash-separated zygosity/inheritance
    - Validates against `ZYGOSITY_MAP` / `INHERITANCE_MAP`
    - Normalizes email and parses numeric positions
    - Returns `(genotype_records, [])`
  - `parse_phenotype_row`:
    - Extracts HPO ID + optional label
    - Normalizes `date_of_observation`
    - Uses `_to_bool` for `status` (fixes tuple bug from `bool(row["status"],)`)
    - Runs ontology lookups and warns on obsolete/mismatched labels
    - Returns `(phenotype_records, term_ids)`

- **Validation utilities**
  - `_normalize_time_like`: handles `NaN`, `NaT`, blanks; adds `T` prefix to numeric dates
  - `_to_bool`: robust bool parsing from `1/0`, `yes/no`, `true/false`, etc.
  - `check_hgvs_consistency`: compares HGVS string to raw coordinate/allele fields; normalizes case and `chr` prefix; warns/errors based on `strict_variants`

- **Sheet-level wrappers**
  - `_map_genotype_table`: normalizes index, checks required `GENOTYPE_KEY_COLUMNS`, runs HGVS consistency check, delegates to `_map_genotype`
  - `_map_phenotype_table`: checks `PHENOTYPE_KEY_COLUMNS`, delegates to `_map_phenotype` with batch HPO validators
  - `_map_diseases_table`, `_map_measurements_table`, `_map_biosamples_table`: normalize index, check required columns, delegate to row mappers

- **Phenopacket assembly**
  - `_group_records_by_patient`: aggregates all record types by patient ID
  - `construct_phenopacket_for_patient`:
    - Sets subject id
    - Adds phenotypic features (sets `excluded` if status is false)
    - Adds genotype interpretations with HGVS expression and optional location/allele fields
    - Adds diseases, measurements, and biosamples minimally

- **Fixes**
  - Corrected phenotype `status` parsing to `_to_bool` (prevents always-true bug)
  - Centralized column existence checks to fail fast on missing required fields

- **Non-functional**
  - Improved docstrings and inline comments for clarity
  - Left optional raw/HGVS fallback logic commented for possible future re-enablement
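The explicit table selection described above can be sketched as follows; the alias sets come from the commit message, while the function body is my guess at the behavior (generic `object` values stand in for data frames):

```python
KNOWN_SHEET_ALIASES: dict[str, set[str]] = {
    "genotype": {"genotype", "variants", "variant", "geno"},
    "phenotype": {"phenotype", "hpo", "pheno"},
    "diseases": {"disease", "diseases"},
    "measurements": {"measurement", "measurements", "labs"},
    "biosamples": {"biosample", "biosamples", "samples"},
}


def choose_named_tables(tables: dict[str, object]) -> dict[str, object]:
    """Map sheet names to canonical roles via aliases; fail if neither
    a genotype nor a phenotype sheet is present."""
    chosen: dict[str, object] = {}
    for name, frame in tables.items():
        key = name.strip().lower()
        for role, aliases in KNOWN_SHEET_ALIASES.items():
            if key in aliases:
                chosen[role] = frame
                break
    if "genotype" not in chosen and "phenotype" not in chosen:
        raise ValueError("no genotype or phenotype sheet found")
    return chosen
```

Requiring known names keeps the mapper deterministic and lets the caller decide how to report an unrecognized workbook, instead of guessing from column contents.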
…rve test output

- **DefaultMapper**
  - Added `self.stats` to `apply_mapping()` for back-compat record counts
    (`genotypes`, `phenotypes`, `diseases`, `measurements`, `biosamples`, `patients`)
  - Count values set immediately before returning assembled `Phenopacket`s
  - Ensures CLI/tests can still print "Created N Genotype objects" / "Created N Phenotype objects"
    without re-grouping data in the CLI

- **CLI (`parse_excel`)**
  - Removed legacy Steps 6–9 that rebuilt `records_by_patient` from raw record lists
    (variables no longer exist in refactored mapper API)
  - Now:
    1. Apply mapping → receive full `Phenopacket` list
    2. Prepare output dir via `_prepare_output_dir()`
    3. Serialize phenopackets directly to numbered `.json` files
    4. Print summary:
       - "Wrote {patients} phenopacket files to {output_dir}"
       - "Created N Genotype objects" / "Created N Phenotype objects" from `mapper.stats`
  - Preserves exact output strings and order expected by existing tests
  - Fixed syntax error in `out_f.write(MessageToJson(pkt))` (missing `)`)

- **Other**
  - Retained warning/error reporting via `_report_issues(notepad)`
  - No functional change to mapping logic or file formats
…dding a tiny normalization in `DefaultMapper.parse_genotype_row` before constructing the `Genotype`
- When constructing Phenopackets, strip a leading 'chr' from HGVS g. expressions
  so output matches canonical form expected by tests (e.g., '16:g.100A>G').
- Keeps validation tolerant of either form; only serialization is normalized.
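The normalization amounts to a small helper; a minimal sketch:

```python
def canonical_hgvs_g(expr: str) -> str:
    """Strip a leading 'chr' from a genomic HGVS expression so that
    'chr16:g.100A>G' and '16:g.100A>G' both serialize as '16:g.100A>G'."""
    return expr[3:] if expr.lower().startswith("chr") else expr
```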
…S output; cleanup

- Mapper unit tests:
  - parse_phenotype_row:
    - Happy-path parsing with numeric date normalization.
    - Handles NAD entries gracefully, skipping with warning.
    - Emits label/name mismatch warnings when ontology term labels differ.
  - parse_genotype_row:
    - Correctly pairs multi-token zygosity and inheritance fields.
    - Errors on invalid or unrecognized zygosity/inheritance codes.
  - check_hgvs_consistency:
    - Accepts both chr-prefixed and bare chromosome identifiers.
    - Supports BED-like SNV formatting.
    - In strict mode, errors when genomic positions disagree.
  - utils:
    - _normalize_time_like correctly handles multiple timestamp formats.
    - _to_bool covers a truth table of truthy/falsy string inputs.
  - alias selection:
    - _choose_named_tables resolves variant, HPO, and laboratory sheet aliases.
  - table wrappers:
    - Enforces required columns for genotype, phenotype, diseases, measurements, and biosamples.
  - apply_mapping (end-to-end):
    - Builds a complete Phenopacket for a patient.
    - Normalizes IDs (e.g., HP:0000510) and assigns canonical HGVS expression.
    - Updates mapper.stats counters correctly for genotypes, phenotypes, and patients.

- CLI/main helper unit tests:
  - _prepare_output_dir creates timestamped directories.
  - _report_issues outputs WARN and ERROR blocks with clean bullet formatting.
  - download command runs with mocked network, verifying correct output.

- mapper: serialize HGVS expressions without the optional 'chr' prefix so output matches
  canonical form `16:g.100A>G` expected by tests, while keeping validation tolerant to
  either variant.
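A `_to_bool`-style parser covering that truth table might look like this sketch (the exact set of accepted tokens is my assumption):

```python
_TRUTHY = {"1", "true", "yes", "y", "t"}
_FALSY = {"0", "false", "no", "n", "f"}


def to_bool(value) -> bool:
    """Parse 1/0, yes/no, true/false (any case) into a bool; raise on
    anything unrecognized rather than guessing."""
    if isinstance(value, bool):
        return value
    token = str(value).strip().lower()
    if token in _TRUTHY:
        return True
    if token in _FALSY:
        return False
    raise ValueError(f"cannot interpret {value!r} as a boolean")
```

Raising on unknown tokens is what prevents the always-true bug: `bool("no")` is `True`, so naive coercion silently marks every phenotype as observed.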
…ers; canonicalize HGVS output; tidy tests

- mapper:
  - Broke down `construct_phenopacket_for_patient` into `_add_phenotypic_features`,
    `_add_genotype_interpretations`, `_add_diseases`, `_add_measurements`,
    and `_add_biosamples` to reduce cyclomatic complexity (C901) without altering behavior.
  - Canonicalize `hgvsg` values by stripping optional 'chr' prefix before serializing into
    `VariationDescriptor.expressions[0].value` so output matches expected form
    (`16:g.100A>G`) while still accepting either format during parsing.

- tests:
  - Removed leftover debug `print` calls in `test_mapper_apply_construct`.
  - Removed unused `m` variables in `test_mapper_check_hgvs` flagged by Ruff (F841).
  - All 43 tests pass with no Ruff errors.

Result: Cleaner mapper internals, consistent HGVS serialization, and tests free of style violations.
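The per-patient aggregation feeding `construct_phenopacket_for_patient` can be sketched with a `defaultdict` (record shape simplified to dicts):

```python
from collections import defaultdict


def group_records_by_patient(records: list[dict]) -> dict[str, list[dict]]:
    """Aggregate heterogeneous records under their patient ID so one
    Phenopacket can be assembled per patient."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        grouped[rec["patient_id"]].append(rec)
    return dict(grouped)
```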
…kets

Refactor/mapper-returns-phenopackets
Add description from `Projektliste_Robinson`
@VarenyaJ VarenyaJ mentioned this pull request Aug 15, 2025
@VarenyaJ VarenyaJ merged commit a084a9a into develop Aug 15, 2025
2 checks passed
@VarenyaJ VarenyaJ mentioned this pull request Sep 22, 2025
6 tasks
