Context
PR RTIInternational/NHLBI-BDC-DMC-HV#560 migrated hv-lint from the control center into the HV repo as a self-contained package (20 Python scripts, 47 validation rules across 4 phases, committed dbGaP indexes). This landed the tooling in a working state — CI passes, local runs work — but there's cleanup to do before it moves to dm-bip or its own repo.
Parent tracking issue: #187
Related: #303 (post-harmonization QC — separate tool, but shares the QC umbrella)
Cleanup items
1. No tests
~10K lines of validation logic with no automated tests. The 47 rules are currently validated only by running against the YAML corpus. At minimum, unit tests for the core rule logic (Finding generation, severity thresholds, CURIE validation, duplicate detection, etc.) and a handful of integration tests with fixture YAML files.
2. Finding dataclass and shared logic duplicated across phases
The Finding dataclass (terminal_line(), gh_annotation(), severity ranking), SEVERITY_RANK, KNOWN_ISSUES patterns, and file discovery logic are copy-pasted into 6+ scripts. A shared _common.py module would reduce drift risk.
3. COHORTS list hardcoded in multiple files
The COHORTS list (["ARIC", "CARDIA", "CHS", ...]) appears independently in 6+ scripts despite cohorts.yaml already existing as a single source of truth. Adding a new cohort currently requires updating cohorts.yaml AND multiple Python files.
4. TRANSFORM_DIR resolved at module import time
Several scripts call find_transform_dir() at module top level, so the module fails to load (even for --help) if _paths.py can't find the HV root. Should be deferred to runtime.
5. update_data.py duplicates standalone index builders
update_data.py (745 lines) duplicates substantial logic from build_phv_index.py and build_phv_detail_index.py. Either the standalone builders should be removed or update_data.py should delegate to them.
6. CI workflow: unconditional fetch-depth: 0
Full clone history is fetched for all trigger types, but only needed for pull_request cohort detection. A conditional or shallower depth would be more efficient.
Migration considerations
- Eventually this tooling moves to dm-bip or its own repo
- The dual-mode path resolution (
_paths.py) already supports both HV repo and control center layouts
- Cleanup should happen as part of the migration, not necessarily before
Context
PR RTIInternational/NHLBI-BDC-DMC-HV#560 migrated hv-lint from the control center into the HV repo as a self-contained package (20 Python scripts, 47 validation rules across 4 phases, committed dbGaP indexes). This landed the tooling in a working state — CI passes, local runs work — but there's cleanup to do before it moves to dm-bip or its own repo.
Parent tracking issue: #187
Related: #303 (post-harmonization QC — separate tool, but shares the QC umbrella)
Cleanup items
1. No tests
~10K lines of validation logic with no automated tests. The 47 rules are currently validated only by running against the YAML corpus. At minimum, unit tests for the core rule logic (Finding generation, severity thresholds, CURIE validation, duplicate detection, etc.) and a handful of integration tests with fixture YAML files.
2.
Findingdataclass and shared logic duplicated across phasesThe
Findingdataclass (terminal_line(),gh_annotation(), severity ranking),SEVERITY_RANK,KNOWN_ISSUESpatterns, and file discovery logic are copy-pasted into 6+ scripts. A shared_common.pymodule would reduce drift risk.3.
COHORTSlist hardcoded in multiple filesThe COHORTS list (
["ARIC", "CARDIA", "CHS", ...]) appears independently in 6+ scripts despitecohorts.yamlalready existing as a single source of truth. Adding a new cohort currently requires updatingcohorts.yamlAND multiple Python files.4.
TRANSFORM_DIRresolved at module import timeSeveral scripts call
find_transform_dir()at module top level, so the module fails to load (even for--help) if_paths.pycan't find the HV root. Should be deferred to runtime.5.
update_data.pyduplicates standalone index buildersupdate_data.py(745 lines) duplicates substantial logic frombuild_phv_index.pyandbuild_phv_detail_index.py. Either the standalone builders should be removed orupdate_data.pyshould delegate to them.6. CI workflow: unconditional
fetch-depth: 0Full clone history is fetched for all trigger types, but only needed for
pull_requestcohort detection. A conditional or shallower depth would be more efficient.Migration considerations
_paths.py) already supports both HV repo and control center layouts