Build automated tooling to produce QC reports comparing harmonized output against reference data, based on the grading methodology developed in #196.
## Inputs
- Harmonized output TSVs (from pipeline)
- Reference data summary statistics (e.g., TOPMed DCC aggregates or the raw data)
- Variable mapping configuration (transform specs defining which harmonized variables to compare)
## Outputs
- Machine-readable scores (JSON or YAML) — per-variable grades, sample counts, summary statistics, overall cohort grade summary
- Visual HTML report — human-readable summary with per-variable detail, distribution comparisons, and methodological notes
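A minimal sketch of how the machine-readable payload could be assembled, assuming a per-variable results dict keyed by harmonized variable name. The field names, helper name (`build_score_report`), and example values are illustrative assumptions, not a finalized schema:

```python
import json
from collections import Counter

def build_score_report(cohort, variable_results):
    """Assemble the machine-readable QC payload (illustrative schema).

    variable_results: {name: {"grade": str, "n": int, "stats": dict}}
    """
    grade_counts = Counter(v["grade"] for v in variable_results.values())
    return {
        "cohort": cohort,
        "variables": variable_results,
        "summary": {
            "n_variables": len(variable_results),
            "grade_counts": dict(grade_counts),  # overall cohort grade summary
        },
    }

# Dummy values for illustration only.
report = build_score_report(
    "COPDGene",
    {
        "age_at_enrollment": {"grade": "A", "n": 1000, "stats": {"mean": 60.0}},
        "smoking_status": {"grade": "B", "n": 998, "stats": {"n_categories": 3}},
    },
)
print(json.dumps(report, indent=2))
```

Because the payload stays at the level of grades and summary statistics, the same dict can feed both the JSON/YAML artifact and the HTML report renderer.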
## Scoring
Uses the grading rubric defined in #196 (A+ through D for both continuous and categorical variables).
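To show the shape of an automated grader for continuous variables: the sketch below bins the absolute standardized mean difference between harmonized and reference summaries into letter grades. The thresholds here are placeholders; the actual cutoffs, and the categorical-variable rubric, are defined in #196:

```python
def grade_continuous(harmonized_mean, reference_mean, reference_sd):
    """Grade a continuous variable by standardized mean difference.

    Thresholds are illustrative placeholders; real cutoffs live in #196.
    """
    if reference_sd == 0:
        return "A+" if harmonized_mean == reference_mean else "D"
    delta = abs(harmonized_mean - reference_mean) / reference_sd
    for grade, cutoff in [("A+", 0.01), ("A", 0.05), ("B", 0.15), ("C", 0.30)]:
        if delta <= cutoff:
            return grade
    return "D"
```

Note that a grader in this form needs only summary statistics from both sides, which keeps it compatible with the enclave constraint below.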
## Approach
- Start with one cohort (COPDGene — structurally simplest, established baseline)
- Reports should work as pipeline artifacts (generated alongside harmonized data)
- Summary statistics only — no participant-level data leaves the enclave
## Context
Chris Siege developed and validated this comparison methodology manually across 9 cohorts. This issue is about automating it as reusable tooling. See #196 for the scoring rubric and framework details.