Post-harmonization QC reporting tool #303

@amc-corey-cox

Description

Build automated tooling to produce QC reports comparing harmonized output against reference data, based on the grading methodology developed in #196.

Inputs

  • Harmonized output TSVs (from pipeline)
  • Reference data summary statistics (e.g., TOPMed DCC aggregates or the raw data)
  • Variable mapping configuration (transform-specs indicating which harmonized variables to compare)

Outputs

  • Machine-readable scores (JSON or YAML) — per-variable grades, sample counts, summary statistics, overall cohort grade summary
  • Visual HTML report — human-readable summary with per-variable detail, distribution comparisons, and methodological notes
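One possible shape for the machine-readable output is sketched below. This is only an illustration: the field names (`variable`, `kind`, `n_harmonized`, `n_reference`, `summary`, `grade_summary`), the `VariableScore` class, and the `build_report` helper are all assumptions, not an agreed schema, and the sample values are made up.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class VariableScore:
    """Per-variable result. All field names here are illustrative assumptions."""
    variable: str
    kind: str          # "continuous" or "categorical"
    grade: str         # a grade from the #196 rubric, e.g. "A+" or "D"
    n_harmonized: int  # sample count in the harmonized output
    n_reference: int   # sample count in the reference data
    summary: dict = field(default_factory=dict)  # aggregate statistics only


def build_report(cohort: str, scores: list[VariableScore]) -> dict:
    """Combine per-variable scores with a count of variables per grade."""
    grade_counts: dict[str, int] = {}
    for score in scores:
        grade_counts[score.grade] = grade_counts.get(score.grade, 0) + 1
    return {
        "cohort": cohort,
        "grade_summary": grade_counts,
        "variables": [asdict(score) for score in scores],
    }


# Made-up example values, not real results.
example = build_report(
    "COPDGene",
    [VariableScore("bmi", "continuous", "A+", 100, 100, {"mean": 27.5})],
)
print(json.dumps(example, indent=2))
```

Keeping the report a plain dictionary means the same structure could be written as YAML instead of JSON and could feed the HTML report without a second data model.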

Scoring

Uses the grading rubric defined in #196 (A+ through D for both continuous and categorical variables).
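To make the grading step concrete, here is a rough sketch of how a rubric lookup could be wired up. The thresholds, the set of intermediate grades, and the distance measures (relative difference in means for continuous variables, largest gap in category proportions for categorical ones) are placeholders invented for illustration. The real cut-offs and metrics are whatever #196 defines and would replace `HYPOTHETICAL_THRESHOLDS`.

```python
# PLACEHOLDER ladder: these grades and limits are NOT taken from #196.
# Each entry is (grade, largest allowed difference for that grade).
HYPOTHETICAL_THRESHOLDS = [("A+", 0.01), ("A", 0.05), ("B", 0.10), ("C", 0.20)]


def grade_from_difference(difference: float) -> str:
    """Return the first grade whose limit covers the difference, else "D"."""
    for grade, limit in HYPOTHETICAL_THRESHOLDS:
        if difference <= limit:
            return grade
    return "D"


def grade_continuous(harmonized_mean: float, reference_mean: float) -> str:
    """Grade a continuous variable by relative difference in means (assumed metric)."""
    if reference_mean == 0:
        raise ValueError("relative difference is undefined for a zero reference mean")
    relative = abs(harmonized_mean - reference_mean) / abs(reference_mean)
    return grade_from_difference(relative)


def grade_categorical(harmonized_counts: dict, reference_counts: dict) -> str:
    """Grade a categorical variable by its largest proportion gap (assumed metric)."""
    h_total = sum(harmonized_counts.values())
    r_total = sum(reference_counts.values())
    if h_total == 0 or r_total == 0:
        raise ValueError("both count tables must be non-empty")
    categories = set(harmonized_counts) | set(reference_counts)
    gap = max(
        abs(harmonized_counts.get(c, 0) / h_total - reference_counts.get(c, 0) / r_total)
        for c in categories
    )
    return grade_from_difference(gap)
```

Both functions take aggregates only (means and category counts), so they can run on summary statistics without touching participant-level rows.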

Approach

  • Start with one cohort (COPDGene — structurally simplest, established baseline)
  • Reports should work as pipeline artifacts (generated alongside harmonized data)
  • Summary statistics only — no participant-level data leaves the enclave
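The "summary statistics only" constraint could be enforced by reducing each harmonized column to aggregates inside the enclave before anything is written to a report. The sketch below assumes a tab-separated file with a header row and treats empty cells and `NA` as missing. The column names and the missing-value convention are assumptions, not facts about the pipeline output.

```python
import csv
import io
import statistics


def summarize_column(tsv_file, column: str) -> dict:
    """Reduce one numeric TSV column to aggregates; no row-level values are returned."""
    values = []
    n_missing = 0
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        raw = (row.get(column) or "").strip()
        if raw == "" or raw.upper() == "NA":  # assumed missing-value markers
            n_missing += 1
            continue
        values.append(float(raw))
    if not values:
        raise ValueError(f"no numeric values found in column {column!r}")
    return {
        "n": len(values),
        "n_missing": n_missing,
        "mean": statistics.fmean(values),
        # The sample standard deviation needs at least two observations.
        "sd": statistics.stdev(values) if len(values) > 1 else None,
        "min": min(values),
        "max": max(values),
    }


# Made-up rows with hypothetical column names.
sample = io.StringIO("participant_id\tbmi\nP1\t20\nP2\t30\nP3\tNA\n")
print(summarize_column(sample, "bmi"))
```

Minimum and maximum are single observed values, so a real implementation might need to drop or mask them for small groups, depending on the enclave's disclosure rules.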

Context

Chris Siege developed and validated this comparison methodology manually across 9 cohorts. This issue is about automating it as reusable tooling. See #196 for the scoring rubric and framework details.

Metadata

Labels

Quality Control: Procedures and workflows for quality control
