Vision
Schema Automator currently infers structural schemas from data (ranges, enums, optionality). The next step is enrichment, i.e. producing schemas that encode real-world semantics: units, coded value meanings, constraints, and ontology mappings.
Every constraint embedded in the schema is work you don't have to encode in downstream transformation logic. If we detect and embed units in the schema, we don't need to transform metadata into data by specifying it in transformation specs. Richer schemas → better validation → simpler transforms.
Note: Validation itself is handled by the linkml toolchain, not schema-automator. Our job is to produce the richest, most accurate schema possible so that downstream validation and transformation tools have good material to work with.
The pipeline (LLM-independent)
All core functionality works without LLMs:
Infer structure from data (current: ranges, enums, optionality, mixed types; planned: units, patterns, better booleans/identifiers)
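To make the LLM-free structural inference concrete, here is a minimal sketch of inferring ranges, optionality, and enum candidates from tabular rows. It is an illustration only, not schema-automator's actual implementation: the function names `infer_range` and `infer_slots`, the `max_enum_size` threshold, and the output dictionary shape are all assumptions, and every cell is assumed to arrive as a string (as it would from a CSV or TSV file).

```python
from typing import Any


def infer_range(values: list[str]) -> str:
    """Return the narrowest type name that fits every non-missing value.

    Assumes values are strings, so int("3.5") fails and falls through to float.
    """
    if not values:
        return "string"  # nothing observed, so make no stronger claim

    def fits(cast) -> bool:
        try:
            for value in values:
                cast(value)
            return True
        except (TypeError, ValueError):
            return False

    if fits(int):
        return "integer"
    if fits(float):
        return "float"
    return "string"


def infer_slots(rows: list[dict[str, str]], max_enum_size: int = 5) -> dict[str, dict[str, Any]]:
    """Infer a hypothetical per-column summary: range, required, and enum candidates."""
    slots: dict[str, dict[str, Any]] = {}
    columns = {key for row in rows for key in row}
    for col in sorted(columns):
        # Treat absent keys, None, and empty strings as missing values.
        present = [row[col] for row in rows if row.get(col) not in (None, "")]
        slot: dict[str, Any] = {
            "range": infer_range(present),
            "required": len(present) == len(rows),
        }
        distinct = sorted(set(present))
        # A string column with few, repeated values is a candidate enum.
        is_small = 0 < len(distinct) <= max_enum_size
        has_repeats = len(distinct) < len(present)
        if slot["range"] == "string" and is_small and has_repeats:
            slot["enum"] = distinct
        slots[col] = slot
    return slots
```

Called on a small table with an integer `age` column, a partly empty `weight` column, and a repeated `sex` code, this would report `age` as a required integer, `weight` as an optional float, and `sex` as a required string with an enum candidate. Planned features such as unit and pattern detection would need additional heuristics that this sketch does not attempt.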
Two-layer data dictionary strategy
The data dictionary path has two layers:
LLM-assisted transformation of messy DDs into the canonical format is one option that lives in the #200 layer, alongside hand-written per-source normalizers. Schema-automator's core ingestion path remains LLM-independent.
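As an illustration of what a hand-written per-source normalizer might look like, the sketch below maps one source's data dictionary rows into a common shape. Everything specific here is hypothetical: the source column names (`Variable`, `Label`, `Codes`), the `1=Male; 2=Female` coding convention, and the output keys are invented for the example and are not the canonical format itself, which this issue does not spell out.

```python
def normalize_source_a(raw_rows: list[dict[str, str]]) -> list[dict[str, object]]:
    """Hypothetical normalizer for one source's data dictionary layout.

    Assumes each row has 'Variable', optionally 'Label', and optionally
    'Codes' written as 'code=meaning' pairs separated by semicolons.
    """
    canonical = []
    for raw in raw_rows:
        codes: dict[str, str] = {}
        for pair in raw.get("Codes", "").split(";"):
            if "=" in pair:
                code, meaning = pair.split("=", 1)
                codes[code.strip()] = meaning.strip()
        canonical.append(
            {
                # Output keys are illustrative, not the real canonical format.
                "name": raw["Variable"].strip().lower(),
                "description": raw.get("Label", "").strip(),
                "permissible_values": codes,
            }
        )
    return canonical
```

Because each normalizer is plain, deterministic code for a single source, it can sit next to an LLM-assisted option in the same layer without making the core ingestion path depend on an LLM.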
Sub-issues
Data dictionary path
Improved inference
Related work