Skip to content

hv-lint: cleanup and migration from HV repo #312

@amc-corey-cox

Description

@amc-corey-cox

Context

PR RTIInternational/NHLBI-BDC-DMC-HV#560 migrated hv-lint from the control center into the HV repo as a self-contained package (20 Python scripts, 47 validation rules across 4 phases, committed dbGaP indexes). This landed the tooling in a working state — CI passes, local runs work — but there's cleanup to do before it moves to dm-bip or its own repo.

Parent tracking issue: #187
Related: #303 (post-harmonization QC — separate tool, but shares the QC umbrella)

Cleanup items

1. No tests

~10K lines of validation logic with no automated tests. The 47 rules are currently validated only by running against the YAML corpus. At minimum, unit tests for the core rule logic (Finding generation, severity thresholds, CURIE validation, duplicate detection, etc.) and a handful of integration tests with fixture YAML files.

2. Finding dataclass and shared logic duplicated across phases

The Finding dataclass (terminal_line(), gh_annotation(), severity ranking), SEVERITY_RANK, KNOWN_ISSUES patterns, and file discovery logic are copy-pasted into 6+ scripts. A shared _common.py module would reduce drift risk.

3. COHORTS list hardcoded in multiple files

The COHORTS list (["ARIC", "CARDIA", "CHS", ...]) appears independently in 6+ scripts despite cohorts.yaml already existing as a single source of truth. Adding a new cohort currently requires updating cohorts.yaml AND multiple Python files.

4. TRANSFORM_DIR resolved at module import time

Several scripts call find_transform_dir() at module top level, so the module fails to load (even for --help) if _paths.py can't find the HV root. Should be deferred to runtime.

5. update_data.py duplicates standalone index builders

update_data.py (745 lines) duplicates substantial logic from build_phv_index.py and build_phv_detail_index.py. Either the standalone builders should be removed or update_data.py should delegate to them.

6. CI workflow: unconditional fetch-depth: 0

Full clone history is fetched for all trigger types, but only needed for pull_request cohort detection. A conditional or shallower depth would be more efficient.

Migration considerations

  • Eventually this tooling moves to dm-bip or its own repo
  • The dual-mode path resolution (_paths.py) already supports both HV repo and control center layouts
  • Cleanup should happen as part of the migration, not necessarily before

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions