Schema enrichment pipeline: from raw data to validated, semantically-rich schemas #190

@amc-corey-cox

Vision

Schema Automator currently infers structural schemas from data (ranges, enums, optionality). The next step is enrichment — producing schemas that encode real-world semantics: units, coded value meanings, constraints, and ontology mappings.

Every constraint embedded in the schema is work you don't have to encode in downstream transformation logic. If we detect units and embed them in the schema, transformation specs no longer have to carry that metadata themselves. Richer schemas → better validation → simpler transforms.

Note: Validation itself is handled by the LinkML toolchain, not schema-automator. Our job is to produce the richest, most accurate schema possible so that downstream validation and transformation tools have good material to work with.

The pipeline (LLM-independent)

All core functionality works without LLMs:

  1. Infer structure from data (current: ranges, enums, optionality, mixed types; planned: units, patterns, better booleans/identifiers)
  2. Ingest a structured data dictionary (DD) alongside data files (#191 "Define structured data dictionary input format", #192 "Ingest structured data dictionary to enrich inferred schemas"); handling of non-canonical, partial, and legacy DDs is tracked in #200 "Handle non-canonical and legacy data dictionaries"
  3. Enrich the inferred schema with the dictionary's declared metadata
  4. Reconcile: surface discrepancies between what the data shows and what the dictionary declares (#193 "Reconciliation report: inferred schema vs. declared data dictionary")
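
To make the four steps concrete, here is a minimal Python sketch of the flow, using no LLM. It is not schema-automator's actual API. `SlotInfo`, `infer_schema`, `enrich`, `reconcile`, and the dictionary layout (`range`, `unit`, `description` keys) are all hypothetical names chosen for illustration, and the real canonical dictionary format is still to be defined in #191.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional


@dataclass
class SlotInfo:
    """Hypothetical, simplified stand-in for an inferred schema slot."""
    name: str
    range: str = "string"
    required: bool = True
    unit: Optional[str] = None
    description: Optional[str] = None


def is_missing(value: Any) -> bool:
    return value is None or value == ""


def infer_range(values: list) -> str:
    """Pick the narrowest range that parses every non-missing value."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return "string"
    for range_name, cast in (("integer", int), ("float", float)):
        try:
            for v in present:
                cast(str(v))
            return range_name
        except ValueError:
            continue
    return "string"


def infer_schema(rows: list) -> dict:
    """Step 1: infer range and optionality for each column."""
    names = list(dict.fromkeys(key for row in rows for key in row))
    schema = {}
    for name in names:
        values = [row.get(name) for row in rows]
        schema[name] = SlotInfo(
            name=name,
            range=infer_range(values),
            required=not any(is_missing(v) for v in values),
        )
    return schema


def enrich(schema: dict, dictionary: dict) -> dict:
    """Step 3: copy declared metadata onto the inferred slots."""
    enriched = {}
    for name, slot in schema.items():
        declared = dictionary.get(name, {})
        enriched[name] = replace(
            slot,
            unit=declared.get("unit"),
            description=declared.get("description"),
        )
    return enriched


def reconcile(schema: dict, dictionary: dict) -> list:
    """Step 4: report where the data and the dictionary disagree."""
    issues = []
    for name, declared in dictionary.items():
        if name not in schema:
            issues.append(f"{name}: declared in dictionary but absent from data")
        elif "range" in declared and declared["range"] != schema[name].range:
            issues.append(
                f"{name}: dictionary declares {declared['range']}, "
                f"data suggests {schema[name].range}"
            )
    for name in schema:
        if name not in dictionary:
            issues.append(f"{name}: present in data but not declared in dictionary")
    return issues


# Step 2 (ingestion) is represented here by an already-parsed dictionary.
rows = [
    {"id": "p1", "height": "172.5", "age": "34"},
    {"id": "p2", "height": "", "age": "41"},
]
dictionary = {
    "height": {"range": "float", "unit": "cm"},
    "age": {"range": "string"},
    "weight": {"range": "float", "unit": "kg"},
}

schema = infer_schema(rows)
enriched = enrich(schema, dictionary)
report = reconcile(schema, dictionary)
for line in report:
    print(line)
```

In this sketch, reconciliation runs against the inferred schema rather than the enriched one, so the report reflects what the data shows independently of what the dictionary claims.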

Two-layer data dictionary strategy

The data dictionary path has two layers:

LLM-assisted transformation of messy DDs into the canonical format is one option that lives in the #200 layer, alongside hand-written per-source normalizers. Schema-automator's core ingestion path remains LLM-independent.
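
As an illustration of the hand-written normalizer option, the sketch below registers one per-source function that maps a legacy dictionary layout onto a canonical shape. Everything here is an assumption made for the example: the legacy column names (`Variable`, `Type`, `Units`), the type codes, the canonical keys (`range`, `unit`), and the `NORMALIZERS` registry. None of it is taken from #191 or #200.

```python
from typing import Callable

# Hypothetical mapping from one legacy source's type codes to schema ranges.
LEGACY_TYPE_MAP = {"num": "float", "int": "integer", "char": "string"}


def normalize_legacy_csv_dd(rows: list) -> dict:
    """Normalize a legacy DD whose rows have 'Variable', 'Type', 'Units' columns."""
    canonical = {}
    for row in rows:
        name = row["Variable"].strip()
        type_code = row.get("Type", "").strip().lower()
        entry = {"range": LEGACY_TYPE_MAP.get(type_code, "string")}
        unit = row.get("Units", "").strip()
        if unit:
            entry["unit"] = unit
        canonical[name] = entry
    return canonical


# One normalizer per known source; an LLM-assisted normalizer could be
# registered here too without touching the core ingestion path.
NORMALIZERS: dict = {"legacy_csv": normalize_legacy_csv_dd}


def to_canonical(source_kind: str, rows: list) -> dict:
    """Dispatch to the normalizer registered for this source kind."""
    try:
        normalizer: Callable = NORMALIZERS[source_kind]
    except KeyError:
        raise ValueError(f"no normalizer registered for {source_kind!r}") from None
    return normalizer(rows)


legacy_rows = [
    {"Variable": " height ", "Type": "NUM", "Units": "cm"},
    {"Variable": "sex", "Type": "char", "Units": ""},
]
canonical_dd = to_canonical("legacy_csv", legacy_rows)
print(canonical_dd)
```

Keeping normalizers behind a single dispatch point means the core ingestion step only ever sees the canonical shape, whichever route produced it.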

Sub-issues

Data dictionary path

Improved inference

Related work
