Handle non-canonical and legacy data dictionaries #200

@amc-corey-cox

Problem

Real-world data dictionaries from existing studies do not conform to the canonical format being designed in #191, and many never will. Even after substantial cleanup, the upstream landscape is messy.

A survey of the dbGaP variable digests for the 11 BDC harmonization-cohort studies (ARIC, CARDIA, CHS, COPDGene, FHS, HCHS, JHS, LTRC, MESA, SPIROMICS, WHI; ~190K variables total) shows the kind of input that schema-automator must be able to consume even once the canonical format from #191 has fully landed:

  • Type vocabulary: 36 distinct `type` strings across studies — synonyms (`numeric` / `integer` / `decimal` / `num`), encoded variants (`encoded`, `encoded value`, `encoded values`, `enumerated integer`), composite types (`decimal, encoded`, `string, encoded value`), and typos (`sting`, `strin`, `e`, `1`).
  • Empty types: 48% of variables have no declared type. FHS alone is 95% empty.
  • Partial coverage: codes present on 46% of variables; units on 22%; collection timing on 21%. Most fields are partially populated.
  • Per-study idiosyncrasies: each study has its own type vocabulary preferences and field conventions. There is no single upstream DD; there are N idiosyncratic ones.
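To make the type-vocabulary problem concrete, here is a minimal sketch of how raw `type` strings like those in the survey might be normalized. Everything named here is an assumption for illustration: the canonical names (`integer`, `decimal`, `string`, `enum`) stand in for whatever #191 actually defines, the choice to map `numeric` and `num` to `decimal` is arbitrary, and `normalize_type` and `NormalizedType` are hypothetical helpers, not existing schema-automator API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

# Hypothetical canonical names; the real vocabulary is whatever #191 defines.
SYNONYMS = {
    "integer": ("integer",),
    "decimal": ("decimal",),
    "numeric": ("decimal",),       # assumption: treat generic numerics as decimal
    "num": ("decimal",),
    "string": ("string",),
    "sting": ("string",),          # typo seen in the survey
    "strin": ("string",),          # typo seen in the survey
    "encoded": ("enum",),
    "encoded value": ("enum",),
    "encoded values": ("enum",),
    "enumerated integer": ("enum", "integer"),
}


@dataclass(frozen=True)
class NormalizedType:
    types: FrozenSet[str]          # canonical types recognized in the raw string
    unrecognized: Tuple[str, ...]  # parts that matched nothing, kept for reporting


def normalize_type(raw: Optional[str]) -> NormalizedType:
    """Split a raw type string on commas and map each part to canonical types.

    An empty `types` set with no `unrecognized` parts means the type was
    simply not declared.
    """
    types = set()
    unrecognized = []
    for part in (raw or "").split(","):
        key = part.strip().lower()
        if not key:
            continue  # empty or whitespace-only: nothing declared
        if key in SYNONYMS:
            types.update(SYNONYMS[key])
        else:
            unrecognized.append(key)
    return NormalizedType(frozenset(types), tuple(unrecognized))
```

Under these assumptions, a composite such as `decimal, encoded` yields both `decimal` and `enum`, while junk values such as `e` or `1` end up in `unrecognized` rather than being silently dropped, so they can still be surfaced to the user.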

These existing studies (and any future study that can't or won't produce a canonical-format DD) still need to flow through the enrichment pipeline.

Scope

This issue covers schema-automator's behavior when the DD it receives is not the canonical format from #191. That breaks down into three problem dimensions:

  1. Vocabulary normalization: mapping freeform type strings (and other declared values) onto the canonical vocabulary. The dbGaP corpus shows 36 type strings collapsing to ~5–6 canonical types, plus composite handling.
  2. Partial declarations: handling DDs that declare some fields but not others (e.g., codes present but type empty; description present but no type or codes). The merged schema needs a defensible representation of "we have this much info, no more."
  3. Conflicts between declared and inferred: when the DD says X and the data says Y, what does the output schema express? This depends on a merge policy (which side wins by default), an override mechanism (the user can pin a side), and a representation that surfaces the disagreement. The reconciliation report in #193 ("Reconciliation report: inferred schema vs. declared data dictionary") reports the disagreement; this issue determines how it lives in the schema.
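The second and third dimensions can be sketched together as a per-field merge. This is only an illustration under stated assumptions: `Source`, `MergedValue`, and `merge_field` are hypothetical names, and the declared-wins default is a placeholder, since the actual merge policy is exactly what this issue has yet to decide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    DECLARED = "declared"   # value came from the data dictionary
    INFERRED = "inferred"   # value came from inspecting the data


@dataclass(frozen=True)
class MergedValue:
    value: Optional[str]            # what the output schema would express
    source: Optional[Source]        # which side supplied it; None if neither did
    conflict: bool = False          # True when both sides spoke and disagreed
    declared: Optional[str] = None  # both originals are retained so the
    inferred: Optional[str] = None  # disagreement stays visible downstream


def merge_field(
    declared: Optional[str],
    inferred: Optional[str],
    default: Source = Source.DECLARED,   # assumption: placeholder default policy
    pin: Optional[Source] = None,        # user override for this field
) -> MergedValue:
    """Merge one field's declared and inferred values."""
    if declared is None and inferred is None:
        return MergedValue(None, None)   # partial declaration: nothing known
    if declared is None:
        return MergedValue(inferred, Source.INFERRED, inferred=inferred)
    if inferred is None:
        return MergedValue(declared, Source.DECLARED, declared=declared)
    # Both sides present: a pin beats the default policy.
    winner = pin if pin is not None else default
    value = declared if winner is Source.DECLARED else inferred
    return MergedValue(
        value,
        winner,
        conflict=(declared != inferred),
        declared=declared,
        inferred=inferred,
    )
```

For example, `merge_field("string", "integer")` keeps `string` under the assumed default but sets `conflict=True` and retains both originals, while passing `pin=Source.INFERRED` flips the winner without losing the record of the disagreement. A pin has no effect when the pinned side has no value; the side that does have one is used.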

Position relative to other issues

Out of scope

Sub-issue of #190.
