Schema enrichment pipeline: from raw data to validated, semantically-rich schemas #190

## Vision

Schema Automator currently infers **structural** schemas from data (ranges, enums, optionality). The next step is **enrichment** — producing schemas that encode real-world semantics: units, coded value meanings, constraints, and ontology mappings.

Every constraint embedded in the schema is work you *don't* have to encode in downstream transformation logic. If we detect and embed units in the schema, we don't need to transform metadata into data by specifying it in transformation specs. Richer schemas → better validation → simpler transforms.

**Note:** Validation itself is handled by the LinkML toolchain, not schema-automator. Our job is to produce the richest, most accurate schema possible so that downstream validation and transformation tools have good material to work with.

## The pipeline (LLM-independent)

All core functionality works without LLMs:

1. **Infer** structure from data (current: ranges, enums, optionality, mixed types; planned: units, patterns, better booleans/identifiers)
2. **Ingest** a structured data dictionary alongside data files (#191, #192) — handling for non-canonical / partial / legacy DDs is tracked in #200
3. **Enrich** the inferred schema with the dictionary's declared metadata
4. **Reconcile** — surface discrepancies between what the data shows and what the dictionary declares (#193)
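
As a rough illustration of steps 3 and 4, here is a minimal sketch that merges declared metadata into an inferred schema and collects discrepancies along the way. Everything in it is hypothetical: the `enrich_and_reconcile` name, the plain-dict representation of slots, and the "declared value wins" merge rule are assumptions for illustration only (the real conflict policy is tracked in #200), and none of this reflects existing schema-automator code.

```python
from typing import Any

SlotMap = dict[str, dict[str, Any]]


def enrich_and_reconcile(inferred: SlotMap, declared: SlotMap) -> tuple[SlotMap, list[tuple[str, str]]]:
    """Merge declared slot metadata into inferred slots and report mismatches.

    Hypothetical helper: slots are plain dicts keyed by slot name.
    """
    enriched: SlotMap = {}
    discrepancies: list[tuple[str, str]] = []
    for name in sorted(set(inferred) | set(declared)):
        inf = inferred.get(name)
        dec = declared.get(name)
        if inf is None:
            # The dictionary declares a field the data never showed.
            discrepancies.append((name, "declared but absent from data"))
            enriched[name] = dict(dec)
            continue
        if dec is None:
            # The data contains a field the dictionary does not mention.
            discrepancies.append((name, "in data but not declared"))
            enriched[name] = dict(inf)
            continue
        merged = dict(inf)
        for key, value in dec.items():
            if key in inf and inf[key] != value:
                discrepancies.append(
                    (name, f"{key}: inferred {inf[key]!r}, declared {value!r}")
                )
            # Assumption: the declared value wins; the real policy is an open question.
            merged[key] = value
        enriched[name] = merged
    return enriched, discrepancies
```

Keeping the discrepancy list separate from the merged schema means the reconciliation report (#193) can be produced without changing how enrichment itself behaves.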

## Two-layer data dictionary strategy

The data dictionary path has two layers:

- **Canonical target format (#191)** — the prescriptive, opinionated format we ask new studies to produce. Ingested directly by #192.
- **Non-canonical input handling (#200)** — vocabulary normalization, partial-declaration handling, and conflict policy for legacy / messy / partial DDs that don't (or can't) conform to the canonical format. Outputs into the canonical format from #191.

LLM-assisted transformation of messy DDs into the canonical format is one option that lives in the #200 layer, alongside hand-written per-source normalizers. Schema-automator's core ingestion path remains LLM-independent.

## Sub-issues

### Data dictionary path
- #191 — Define canonical data dictionary input format (target spec for new studies)
- #200 — Handle non-canonical and legacy data dictionaries (normalization + partial/conflict handling)
- #192 — Ingest structured data dictionary to enrich inferred schemas
- #193 — Reconciliation report: inferred schema vs. declared data dictionary
- #176 — Evaluate incorporating SchemaSheets

### Improved inference
- #93 — Richer boolean detection (0/1, Yes/No, True/False)
- #109 — Fix false-positive identifier heuristic
- #194 — Infer units from data values using quantulum3
- #195 — Infer pattern constraints from consistent value formats
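
To make #93 and #195 more concrete, the sketch below shows one possible heuristic for each: treating a column as boolean only when both members of a known pair are observed, and deriving a regex when every value shares the same character-class shape. The function names, the vocabulary list, and the shape encoding are all assumptions made for illustration; they are not existing schema-automator behaviour.

```python
import re
import string
from itertools import groupby

# Assumed vocabulary, taken from the pairs named in #93.
BOOLEAN_PAIRS = [{"0", "1"}, {"yes", "no"}, {"true", "false"}]

CLASS_REGEX = {"9": r"\d", "A": "[A-Z]", "a": "[a-z]"}


def looks_boolean(values):
    """True if the non-empty values are exactly one known boolean pair."""
    seen = {str(v).strip().lower() for v in values if v is not None}
    seen.discard("")
    # Conservative choice: require both members, so an all-"1" column is not flagged.
    return any(seen == pair for pair in BOOLEAN_PAIRS)


def value_shape(value):
    """Map ASCII digits to '9', uppercase to 'A', lowercase to 'a'; keep the rest."""
    out = []
    for ch in value:
        if ch in string.digits:
            out.append("9")
        elif ch in string.ascii_uppercase:
            out.append("A")
        elif ch in string.ascii_lowercase:
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


def infer_pattern(values):
    """Return an anchored regex if all values share one shape, else None."""
    shapes = {value_shape(v) for v in values}
    if len(shapes) != 1:
        return None
    parts = []
    for symbol, run in groupby(shapes.pop()):
        n = len(list(run))
        if symbol in CLASS_REGEX:
            parts.append(CLASS_REGEX[symbol] + (f"{{{n}}}" if n > 1 else ""))
        else:
            parts.append(re.escape(symbol) * n)
    return "^" + "".join(parts) + "$"
```

For example, `infer_pattern(["AB-1234", "XY-0001"])` yields a pattern that matches two uppercase letters, a hyphen, and four digits, while a column with inconsistent formats returns `None` and gets no pattern constraint.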

## Related work

- Existing `--annotator` flag and `SchemaAnnotator` (OAK-based ontology annotation)
- Existing `llm_annotator.py` (basic LLM description generation)
- Existing `--data-dictionary-row-count` feature (minimal data dictionary support)
- Recent inference flags: `--infer-optional`, `--infer-mixed-types`, `--infer-enum-from-integers`
- Cross-reference: linkml/dm-bip#103, linkml/dm-bip#306, linkml/dm-bip#307
