Develop by VarenyaJ · Pull Request #16 · VarenyaJ/P6

VarenyaJ · 2025-08-15T12:57:00Z

Pull latest features into main

…I subpackages - enhance regex to extract both free‐text label and numeric HPO ID - use hpotk.get_term() to look up the ontology term and compare its name - emit a warning if the provided label doesn’t match the ontology’s term name - allow sheets without labels to continue working (backwards compatibility) - fixes missing Phenotype records on both Python_headers_phenocopy_transformation.xlsx and Sydney_Python_transformation.xlsx

- Single method, unified regex for HPO label+ID parsing - Properly handle unparsable HPO cells, missing ontology terms, obsolete terms, and label mismatches - Wrap access to `VariationDescriptor.Expression` and `.location`/alleles in `try/except AttributeError` so CLI won’t crash on protobufs lacking those fields - Eliminate mutable default dict in `Genotype`; rely on `zygosity_code` property and class‐level mapping - All existing tests (genotype, mapper, CLI) now pass, deprecation warning only remains from protobuf internals

Feature/extend map phenotype

…DO for Expression.HGVS

…if both then these notations needs to say the same thing, then add verbose audit output and configurable strictness for variant checks: - Introduce `--verbose` flag in `parse-excel` to show preprocessing audit steps. - Colorize audit messages (info=cyan, warn=yellow, error=red) and indent them for better readability. - Add `--strict-variants` / `--no-strict-variants` option (default: warn) to treat raw↔HGVS mismatches as errors when enabled. - Pass `strict_variants` through to `DefaultMapper` and update `_check_hgvs_consistency` to honor the new strictness setting.

- **New CLI command** `audit-excel` in `src/P6/__main__.py` - Options: - `-e, --excel-path <FILE>` (required) to specify the workbook - `-r, --report-json` flag to toggle JSON output - Reads sheets via existing `_read_sheets()` - Calls `preprocess()` to collect audit entries - Renders either: - Human-readable table (columns: SHEET, STEP, LEVEL, MESSAGE; two spaces between columns for reliable parsing) - JSON array of `{ step, sheet, level, message }` objects with `--report-json` - **Tests for `audit-excel`** in `tests/test_cli_audit_excel.py` - `test_audit_excel_table_output`: - Invokes `audit-excel -e <sample.xlsx>` - Asserts exit code `0` - Verifies header line begins with `SHEET` - Splits each subsequent line on `\s{2,}` and checks for ≥4 columns - `test_audit_excel_json_output`: - Invokes with `-r` - Parses output as JSON list of dicts - Asserts each dict contains keys `step`, `sheet`, `level`, `message` - **Unit tests for `preprocess()`** in `tests/test_preprocess.py` - Fixture `simple_workbook`: generates a minimal valid genotype sheet in a temp Excel file - `test_preprocess_detects_variant_sheet`: - Loads tables via `load_sheets_as_tables()` - Runs `preprocess()` - Asserts presence of a `classify-sheet` entry for “geno” - Asserts no error‐level `variant-check` entry on valid input - **Non-functional changes** - Imported `json` at top of `__main__.py` - No modifications to existing `parse-excel` or phenopacket serialization logic

… outputted phenopackets

…ntations

…ferred design)

…preferred design)

…testable components and align interface - **Align with `TableMapper` interface** - `apply_mapping` now returns `list[Phenopacket]` (previously returned tuples of record lists) - Pipeline steps: 1. Select and validate tables via `_choose_named_tables` 2. Map rows to domain records using `_map_*_table` wrappers 3. Group records per patient 4. Build one `Phenopacket` per patient 5. Return the list of packets - **Explicit table selection** - Added `KNOWN_SHEET_ALIASES` for common name variants: - `genotype`: {"genotype", "variants", "variant", "geno"} - `phenotype`: {"phenotype", "hpo", "pheno"} - `diseases`: {"disease", "diseases"} - `measurements`: {"measurement", "measurements", "labs"} - `biosamples`: {"biosample", "biosamples", "samples"} - Removed column-based guessing; require explicit/aliased names - Emit error if both genotype and phenotype sheets are missing - **Row-level parsers** - `parse_genotype_row`: - Splits slash-separated zygosity/inheritance - Validates against `ZYGOSITY_MAP` / `INHERITANCE_MAP` - Normalizes email and parses numeric positions - Returns `(genotype_records, [])` - `parse_phenotype_row`: - Extracts HPO ID + optional label - Normalizes `date_of_observation` - Uses `_to_bool` for `status` (fixes tuple bug from `bool(row["status"],)`) - Runs ontology lookups and warns on obsolete/mismatched labels - Returns `(phenotype_records, term_ids)` - **Validation utilities** - `_normalize_time_like`: handles `NaN`, `NaT`, blanks; adds `T` prefix to numeric dates - `_to_bool`: robust bool parsing from `1/0`, `yes/no`, `true/false`, etc. - `check_hgvs_consistency`: compares HGVS string to raw coordinate/allele fields; normalizes case and `chr` prefix; warns/errors based on `strict_variants` - **Sheet-level wrappers** - `_map_genotype_table`: normalizes index, checks required `GENOTYPE_KEY_COLUMNS`, runs HGVS consistency check, delegates to `_map_genotype` - `_map_phenotype_table`: checks `PHENOTYPE_KEY_COLUMNS`, delegates to `_map_phenotype` with batch HPO validators - `_map_diseases_table`, `_map_measurements_table`, `_map_biosamples_table`: normalize index, check required columns, delegate to row mappers - **Phenopacket assembly** - `_group_records_by_patient`: aggregates all record types by patient ID - `construct_phenopacket_for_patient`: - Sets subject id - Adds phenotypic features (sets `excluded` if status is false) - Adds genotype interpretations with HGVS expression and optional location/allele fields - Adds diseases, measurements, and biosamples minimally - **Fixes** - Corrected phenotype `status` parsing to `_to_bool` (prevents always-true bug) - Centralized column existence checks to fail fast on missing required fields - **Non-functional** - Improved docstrings and inline comments for clarity - Left optional raw/HGVS fallback logic commented for possible future re-enablement

…rve test output - **DefaultMapper** - Added `self.stats` to `apply_mapping()` for back-compat record counts (`genotypes`, `phenotypes`, `diseases`, `measurements`, `biosamples`, `patients`) - Count values set immediately before returning assembled `Phenopacket`s - Ensures CLI/tests can still print "Created N Genotype objects" / "Created N Phenotype objects" without re-grouping data in the CLI - **CLI (`parse_excel`)** - Removed legacy Steps 6–9 that rebuilt `records_by_patient` from raw record lists (variables no longer exist in refactored mapper API) - Now: 1. Apply mapping → receive full `Phenopacket` list 2. Prepare output dir via `_prepare_output_dir()` 3. Serialize phenopackets directly to numbered `.json` files 4. Print summary: - "Wrote {patients} phenopacket files to {output_dir}" - "Created N Genotype objects" / "Created N Phenotype objects" from `mapper.stats` - Preserves exact output strings and order expected by existing tests - Fixed syntax error in `out_f.write(MessageToJson(pkt))` (missing `)`) - **Other** - Retained warning/error reporting via `_report_issues(notepad)` - No functional change to mapping logic or file formats

…dding a tiny normalization in `DefaultMapper.parse_genotype_row` before constructing the `Genotype`

- When constructing Phenopackets, strip a leading 'chr' from HGVS g. expressions so output matches canonical form expected by tests (e.g., '16:g.100A>G'). - Keeps validation tolerant of either form; only serialization is normalized.

…S output; cleanup - Mapper unit tests: - parse_phenotype_row: - Happy-path parsing with numeric date normalization. - Handles NAD entries gracefully, skipping with warning. - Emits label/name mismatch warnings when ontology term labels differ. - parse_genotype_row: - Correctly pairs multi-token zygosity and inheritance fields. - Errors on invalid or unrecognized zygosity/inheritance codes. - check_hgvs_consistency: - Accepts both chr-prefixed and bare chromosome identifiers. - Supports BED-like SNV formatting. - In strict mode, errors when genomic positions disagree. - utils: - _normalize_time_like correctly handles multiple timestamp formats. - _to_bool covers a truth table of truthy/falsy string inputs. - alias selection: - _choose_named_tables resolves variant, HPO, and laboratory sheet aliases. - table wrappers: - Enforces required columns for genotype, phenotype, diseases, measurements, and biosamples. - apply_mapping (end-to-end): - Builds a complete Phenopacket for a patient. - Normalizes IDs (e.g., HP:0000510) and assigns canonical HGVS expression. - Updates mapper.stats counters correctly for genotypes, phenotypes, and patients. - CLI/main helper unit tests: - _prepare_output_dir creates timestamped directories. - _report_issues outputs WARN and ERROR blocks with clean bullet formatting. - download command runs with mocked network, verifying correct output. - mapper: serialize HGVS expressions without the optional 'chr' prefix so output matches canonical form `16:g.100A>G` expected by tests, while keeping validation tolerant to either variant.

…ers; canonicalize HGVS output; tidy tests - mapper: - Broke down `construct_phenopacket_for_patient` into `_add_phenotypic_features`, `_add_genotype_interpretations`, `_add_diseases`, `_add_measurements`, and `_add_biosamples` to reduce cyclomatic complexity (C901) without altering behavior. - Canonicalize `hgvsg` values by stripping optional 'chr' prefix before serializing into `VariationDescriptor.expressions[0].value` so output matches expected form (`16:g.100A>G`) while still accepting either format during parsing. - tests: - Removed leftover debug `print` calls in `test_mapper_apply_construct`. - Removed unused `m` variables in `test_mapper_check_hgvs` flagged by Ruff (F841). - All 43 tests pass with no Ruff errors. Result: Cleaner mapper internals, consistent HGVS serialization, and tests free of style violations.

…kets Refactor/mapper-returns-phenopackets

Add description from `Projektliste_Robinson`

Feature/preprocessing columns

Feature/expand phenopackets - Builds on top of `feature/preprocessing-columns`

VarenyaJ and others added 30 commits August 4, 2025 14:35

Ruff formatting

0119b29

Add reference for Zygosity codes

28341e7

Revise example files

2546071

Ruff formatting

e52e635

Merge pull request #3 from VarenyaJ/feature/extend_map_phenotype

234ccc6

Feature/extend map phenotype

Clean up TODO asking to write phenopackets to a folder

cae7f1e

Clean TODO for downloading an HPO JSON file from the Monarch repo

01be17c

Clean up TODO for VariationDescriptor and gene_context, and remove TO…

206e414

…DO for Expression.HGVS

Rearrange the mapper

021b19f

Fix typo and ruff format

308511c

Ruff formatting

7e6cf2e

Augment README

e85b4b3

Create new files for optional categories to fill out Phenopacket Schema

9744ee9

Start to implement parsing of optional sheets to further fill out the…

bb73625

… outputted phenopackets

Write a test for the new disease, measurements, and biosample impleme…

55a9c5c

…ntations

Fix indentation errors :(

30081c8

unpack & group all five lists in __main__.py

15d45b2

Correct typos

4a471b9

Adjust indentation of the for loop

b42ee16

Adjust indentation

94d3424

Adjust fixture since the CLI is expecting a PATH as a string

3834c43

Add TODO

86028e0

Adjust TODOs and initiate Ruff formatting;

86761a1

Remove manual decompression of HPO JSON files - leave that for hpotk

33695a0

ruff checking

bfeb879

Start to refactor to have apply_mapping return list[Phenopacket] (pre…

c262f18

…ferred design)

VarenyaJ and others added 18 commits August 8, 2025 13:33

ruff format

b8ff573

Continue to refactor to have apply_mapping return list[Phenopacket] (…

4a764e2

…preferred design)

Clarify input by standardized sheet names via TypedTables

95e861f

Attempt to split apply_mapping into small row-by-row helpers

ca759ae

Split into a row-level helper and call it in the table wrapper

e37930f

Ruff check and formatting

d4f5dd6

Have the CLI be forgiving (accept 16 and normalize to chr16) by a…

5c79279

…dding a tiny normalization in `DefaultMapper.parse_genotype_row` before constructing the `Genotype`

Revise the CLI forgiveness for chromosome entry

9478eae

Merge pull request #10 from VarenyaJ/refactor/mapper-returns-phenopac…

3fd26fa

…kets Refactor/mapper-returns-phenopackets

Add description from

6bdcd26

Merge pull request #14 from VarenyaJ/feature/short-readme-description

a3540a8

Add description from `Projektliste_Robinson`

Merge pull request #5 from VarenyaJ/feature/preprocessing-columns

0592880

Feature/preprocessing columns

Merge pull request #6 from VarenyaJ/feature/expand-phenopackets

a084a9a

Feature/expand phenopackets - Builds on top of `feature/preprocessing-columns`

VarenyaJ marked this pull request as ready for review August 15, 2025 13:00

VarenyaJ merged commit de056c2 into main Aug 15, 2025
2 checks passed

VarenyaJ mentioned this pull request Sep 22, 2025

Consolidation / Release #28

Closed

7 tasks

VarenyaJ linked an issue Sep 22, 2025 that may be closed by this pull request

Consolidation / Release #28

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Develop#16

Develop#16
VarenyaJ merged 48 commits intomainfrom
develop

VarenyaJ commented Aug 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

VarenyaJ commented Aug 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant