Handle non-canonical and legacy data dictionaries

## Problem

Real-world data dictionaries (DDs) from existing studies do not conform to the canonical format being designed in #191, and many never will. Even after substantial cleanup, the upstream landscape is messy.

A survey of the dbGaP variable digests for the 11 BDC harmonization-cohort studies (ARIC, CARDIA, CHS, COPDGene, FHS, HCHS, JHS, LTRC, MESA, SPIROMICS, WHI; ~190K variables total) shows the kind of input that schema-automator must be able to consume even once the canonical format from #191 has fully landed:

- **Type vocabulary:** 36 distinct `type` strings across studies, including synonyms (`numeric` / `integer` / `decimal` / `num`), encoded variants (`encoded`, `encoded value`, `encoded values`, `enumerated integer`), composite types (`decimal, encoded`, `string, encoded value`), and typos (`sting`, `strin`, `e`, `1`).
- **Empty types:** 48% of variables have no declared type. FHS alone is 95% empty.
- **Partial coverage:** codes are present on 46% of variables, units on 22%, and collection timing on 21%. Most fields are only partially populated.
- **Per-study idiosyncrasies:** each study has its own type vocabulary preferences and field conventions. There is no single upstream DD; there are N idiosyncratic ones.
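
To make the type-vocabulary problem concrete, here is a minimal sketch of how the strings listed above might be collapsed. The canonical names (`decimal`, `integer`, `string`), the choice to map `numeric` and `num` to `decimal`, and the function `normalize_type` are all assumptions for illustration; the real vocabulary is whatever #191 defines.

```python
from typing import Optional

# Hypothetical canonical vocabulary; the real one is defined in #191.
SYNONYMS = {
    "numeric": "decimal",
    "num": "decimal",
    "decimal": "decimal",
    "integer": "integer",
    "enumerated integer": "integer",
    "string": "string",
    "sting": "string",  # typo observed in the corpus
    "strin": "string",  # typo observed in the corpus
}
# Tokens that signal "this variable has coded values".
ENCODED_MARKERS = {"encoded", "encoded value", "encoded values", "enumerated integer"}


def normalize_type(raw: Optional[str]) -> dict:
    """Split a freeform type string into a base type and an 'encoded' flag.

    Tokens that cannot be mapped are kept in 'unrecognized' rather than guessed.
    """
    result = {"base": None, "encoded": False, "unrecognized": []}
    if raw is None or not raw.strip():
        return result  # empty type: nothing was declared
    # Composite types such as "decimal, encoded" are comma-separated.
    for part in raw.lower().split(","):
        token = part.strip()
        if not token:
            continue
        if token in ENCODED_MARKERS:
            result["encoded"] = True
        if token in SYNONYMS:
            if result["base"] is None:  # first recognized base type wins
                result["base"] = SYNONYMS[token]
        elif token not in ENCODED_MARKERS:
            result["unrecognized"].append(token)
    return result
```

Under these assumptions, `normalize_type("decimal, encoded")` yields a `decimal` base with the encoded flag set, while an opaque value like `e` lands in `unrecognized` instead of being silently coerced.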

These existing studies (and any future study that can't or won't produce a canonical-format DD) still need to flow through the enrichment pipeline.

## Scope

This issue covers schema-automator's behavior when the DD it receives is *not* the canonical format from #191. That breaks down into three problem dimensions:

1. **Vocabulary normalization:** mapping freeform type strings (and other declared values) onto the canonical vocabulary. The dbGaP corpus shows 36 type strings collapsing to ~5–6 canonical types, plus composite handling.
2. **Partial declarations:** handling DDs that declare some fields but not others (e.g., codes present but type empty; description present but no type or codes). The merged schema needs a defensible representation of "we have *this* much info, no more."
3. **Conflicts between declared and inferred:** when the DD says X and the data say Y, what does the output schema express? This depends on a merge policy (which side wins by default), an override mechanism (the user can pin a side), and a representation that surfaces the disagreement (#193 reports it; this issue determines how it lives in the schema).
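
One possible shape for dimensions 2 and 3 is sketched below: a per-field merge that records which side supplied the value and whether the two sides disagreed. The names `MergedField` and `merge_field`, the `prefer` and `pin` parameters, and the default of preferring the declared side are all hypothetical; the actual policy is exactly what this issue needs to decide.

```python
from dataclasses import dataclass
from typing import Any, Optional

SIDES = ("declared", "inferred")


@dataclass
class MergedField:
    value: Optional[Any]            # value chosen for the output schema
    source: str                     # "declared", "inferred", or "none"
    conflict: bool = False          # both sides had a value and they differed
    declared: Optional[Any] = None  # what the DD said, if anything
    inferred: Optional[Any] = None  # what the data suggested, if anything


def merge_field(declared: Optional[Any], inferred: Optional[Any],
                prefer: str = "declared", pin: Optional[str] = None) -> MergedField:
    """Merge one field; `pin` overrides `prefer` when both sides have a value."""
    if prefer not in SIDES:
        raise ValueError(f"prefer must be one of {SIDES}")
    if pin is not None and pin not in SIDES:
        raise ValueError(f"pin must be None or one of {SIDES}")

    # Partial declarations: use whichever side exists, and say so.
    if declared is None and inferred is None:
        return MergedField(None, "none")
    if declared is None:
        return MergedField(inferred, "inferred", False, None, inferred)
    if inferred is None:
        return MergedField(declared, "declared", False, declared, None)

    # Both sides present: apply the policy and keep both values for reporting.
    side = pin or prefer
    value = declared if side == "declared" else inferred
    return MergedField(value, side, declared != inferred, declared, inferred)
```

Keeping both original values on the result means a reconciliation report could be built from the merged schema alone, without re-running inference.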

## Position relative to other issues

- **Outputs into the canonical format from #191:** vocabulary normalization is "messy → canonical." Once the canonical format is defined, this issue's normalization layer produces it.
- **Extends #192** (Ingest structured DD): the happy path in #192 assumes canonical input; this issue makes ingestion robust to partial/non-canonical input.
- **Feeds #193** (Reconciliation report): disagreements detected during merge are what the reconciliation report surfaces.
- **Pluggable normalizer interface** is likely warranted: per-source quirks (dbGaP XML, study-specific spreadsheets) keep arriving, and we shouldn't bake every source's vocabulary into core. LLM-assisted "messy DD → canonical" is one normalizer option; hand-written per-source scripts are another.
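
As a rough illustration of what a pluggable interface could look like, the sketch below uses a `typing.Protocol` plus a small registry keyed by source name. Nothing here exists in schema-automator today: `Normalizer`, `register`, `get_normalizer`, the `"dbgap"` key, and the toy `LowercaseTypeNormalizer` are all hypothetical names chosen for the example.

```python
from typing import Dict, Protocol


class Normalizer(Protocol):
    """Anything with a normalize() method that maps one raw variable to a cleaned one."""

    def normalize(self, raw_variable: dict) -> dict: ...


# Registry lives outside core logic so downstream packages can add sources.
_REGISTRY: Dict[str, Normalizer] = {}


def register(source: str, normalizer: Normalizer) -> None:
    """Register (or replace) the normalizer used for a given source name."""
    _REGISTRY[source] = normalizer


def get_normalizer(source: str) -> Normalizer:
    """Look up a normalizer, failing loudly for unknown sources."""
    try:
        return _REGISTRY[source]
    except KeyError:
        raise LookupError(f"no normalizer registered for source {source!r}") from None


class LowercaseTypeNormalizer:
    """Toy normalizer: trims and lowercases the declared type, leaving other keys alone."""

    def normalize(self, raw_variable: dict) -> dict:
        out = dict(raw_variable)  # do not mutate the caller's dict
        declared = out.get("type")
        if isinstance(declared, str) and declared.strip():
            out["type"] = declared.strip().lower()
        else:
            out["type"] = None  # empty or missing type stays explicitly undeclared
        return out


# Usage: a source-specific package registers itself, core only does the lookup.
register("dbgap", LowercaseTypeNormalizer())
cleaned = get_normalizer("dbgap").normalize({"name": "AGE", "type": " Integer "})
```

A hand-written script and an LLM-backed normalizer would both satisfy the same protocol, so core code would not need to know which one it is calling.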

## Out of scope

- Defining the canonical format (#191).
- Designing the reconciliation report's output (#193).
- Building specific normalizers for specific upstream sources (those live in their own packages or downstream projects like dm-bip).

Sub-issue of #190.

Handle non-canonical and legacy data dictionaries #200
