Problem
Real-world data dictionaries (DDs) from existing studies do not conform to the canonical format being designed in #191, and many never will. Even after substantial cleanup, the upstream landscape is messy:
A survey of the dbGaP variable digests for the 11 BDC harmonization-cohort studies (ARIC, CARDIA, CHS, COPDGene, FHS, HCHS, JHS, LTRC, MESA, SPIROMICS, WHI; ~190K variables total) shows the shape of the mess that schema-automator must be able to consume even after the canonical format from #191 has fully landed:
Empty types: 48% of variables have no declared type. FHS alone is 95% empty.
Partial coverage: codes are present on 46% of variables, units on 22%, and collection timing on 21%. Most fields are only partially populated.
Per-study idiosyncrasies: each study has its own type vocabulary preferences and field conventions. There is no single upstream DD; there are N idiosyncratic ones.
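The coverage figures above are the kind of number a simple per-field tally produces. The following is a minimal sketch of such a tally, not the script used for the survey. The field names (`type`, `codes`, `units`, `timing`) and the sample records are assumptions for illustration; real dbGaP digests may use different keys.

```python
from collections import Counter

# Hypothetical field names; the real digest keys may differ.
FIELDS = ("type", "codes", "units", "timing")


def field_coverage(variables):
    """Return, per field, the fraction of variables with a non-empty value."""
    total = len(variables)
    if total == 0:
        return {field: 0.0 for field in FIELDS}
    filled = Counter()
    for var in variables:
        for field in FIELDS:
            value = var.get(field)
            # Whitespace-only strings count as missing, as do None, "",
            # and empty containers.
            if isinstance(value, str):
                value = value.strip()
            if value:
                filled[field] += 1
    return {field: filled[field] / total for field in FIELDS}


# Made-up records, only to exercise the function.
sample = [
    {"name": "AGE", "type": "integer", "units": "years"},
    {"name": "SEX", "type": "", "codes": {"1": "Male", "2": "Female"}},
    {"name": "VISIT", "type": None},
    {"name": "BMI", "type": "decimal", "units": "kg/m2"},
]
coverage = field_coverage(sample)
```

Running the same tally per study would surface outliers such as a study whose type field is almost entirely empty.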
These existing studies (and any future study that can't or won't produce a canonical-format DD) still need to flow through the enrichment pipeline.
Scope
This issue covers schema-automator's behavior when the DD it receives is not the canonical format from #191. That breaks down into three problem dimensions:
Vocabulary normalization: mapping freeform type strings (and other declared values) onto the canonical vocabulary. In the dbGaP corpus, 36 type strings collapse to ~5–6 canonical types, plus composite handling.
Partial declarations: handling DDs that declare some fields but not others (e.g., codes present but type empty; description present but no type or codes). The merged schema needs a defensible representation of "we have this much info, no more."
Conflicts between declared and inferred: when the DD says X and the data says Y, what does the output schema express? The answer depends on a merge policy (which side wins by default), an override mechanism (the user can pin a side), and a representation that surfaces the disagreement. #193 (Reconciliation report: inferred schema vs. declared data dictionary) reports the disagreement; this issue determines how it lives in the schema.
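To make the three dimensions concrete, here is a minimal sketch of how they could fit together for a single variable's type. Everything in it is an assumption for illustration: the synonym table, the canonical type names, and the `normalize_type`, `merge_type`, and `MergedType` names are hypothetical and are not part of schema-automator or the format in #191.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical synonym table. The actual dbGaP type strings and the canonical
# vocabulary are not listed in this issue, so these entries are placeholders.
TYPE_SYNONYMS = {
    "integer": "integer", "int": "integer",
    "decimal": "decimal", "float": "decimal", "numeric": "decimal",
    "string": "string", "text": "string", "char": "string",
    "encoded value": "enum", "enum": "enum",
}


def normalize_type(raw: Optional[str]) -> Optional[str]:
    """Map a freeform declared type to a canonical one; None if empty or unknown."""
    if raw is None:
        return None
    key = " ".join(raw.strip().lower().split())  # trim, lowercase, collapse spaces
    return TYPE_SYNONYMS.get(key)


@dataclass
class MergedType:
    value: Optional[str]             # type written to the output schema
    source: str                      # "declared", "inferred", or "none"
    conflict: bool = False           # True when both sides exist and disagree
    declared: Optional[str] = None   # normalized declared type, kept for reporting
    inferred: Optional[str] = None   # type inferred from the data


def merge_type(declared_raw, inferred, prefer="declared", pin=None) -> MergedType:
    """Merge one variable's declared and inferred types.

    `prefer` is the default merge policy; `pin` is a per-variable user override.
    Both take "declared" or "inferred".
    """
    declared = normalize_type(declared_raw)
    conflict = None not in (declared, inferred) and declared != inferred
    side = pin or prefer
    if declared is None and inferred is None:
        value, source = None, "none"
    elif declared is None:
        value, source = inferred, "inferred"   # partial declaration: fall back
    elif inferred is None or side == "declared":
        value, source = declared, "declared"
    else:
        value, source = inferred, "inferred"
    return MergedType(value, source, conflict, declared, inferred)


agree = merge_type("INT", "integer")
partial = merge_type("", "decimal")
clash = merge_type("string", "integer")
pinned = merge_type("string", "integer", pin="inferred")
```

In this sketch the losing side is kept on the result rather than discarded, so a downstream report can still show the disagreement. Whether that record belongs in the schema itself or only in the report is the open question this issue raises.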
Position relative to other issues
Outputs into the canonical format from #191 (Define structured data dictionary input format): vocabulary normalization is "messy → canonical." Once the canonical format is defined, this issue's normalization layer produces it.
Out of scope
Sub-issue of #190.