extract-dd: project a LinkML schema into the canonical data dictionary format #209

@amc-corey-cox

Scope

Build a utility that projects an arbitrary LinkML schema into the canonical data dictionary format defined in #191. Each class becomes a `DataDictionary`; each slot becomes a `DataDictionaryEntry`; ranges → DD types; enums → `PermissibleValueDefinition` lists; per-slot `pattern` / `minimum_value` / `maximum_value` / `unit` carry through where they map.

Bridges the importer family (XSD, JSON Schema, OWL, RDFS, SQL DDL, EML — anything producing a LinkML schema) into the DD-shaped consumers (schema enrichment in #190, ingestion in #192, reconciliation in #193). After this lands, anyone with a LinkML schema can get a canonical DD out of it without writing format-specific glue.

Motivation

The repo currently has two parallel patterns for converting foreign formats:

  1. Importers (`importers/` directory, e.g. `XsdImportEngine`): foreign format → LinkML schema. Used when the input describes dataset structure (XSD, JSON Schema, OWL, RDFS, SQL DDL).
  2. Adapters (`adapters/` directory, #202 umbrella): foreign format → canonical DD. Used when the input is a data dictionary (Frictionless #203, dbGaP #206, REDCap #204).

The split is appropriate, since schema-shaped inputs and DD-shaped inputs are different, but it leaves a gap: DD consumers can't reach importer outputs. A user who imports an EML document or an XSD ends up with a LinkML schema, while the DD enrichment workflow needs a DD, and nothing bridges the two.

This utility is that bridge.

CLI

```
schemauto extract-dd <schema.yaml> [--class ] [-o ] [--tsv|--yaml]
```

  • If `--class` is given, project just that class. Otherwise, project all top-level classes; emit one DD per class in batch mode (use `..dd.{yaml,tsv}` filenames when `-o` is a directory).
  • `--tsv` / `--yaml` mirror the other adapter CLI conventions.
  • Standard parent-dir creation for `-o` (matches `adapt-frictionless` and `adapt-dbgap`).

Python API

```python
from schema_automator.utils.extract_dd import schema_to_dd

dd = schema_to_dd(schemaview, class_name="MyClass")
```

Mapping rules (sketch)

  • Class slots → DD entries. Use `SchemaView.class_induced_slots` to walk inherited + local slots.
  • Slot ranges → DD type vocabulary. LinkML's built-in types (`string`, `integer`, `decimal`, `boolean`, `date`, `datetime`, `time`, `uri`, `curie`) map 1-1 to the DD's canonical vocabulary. Enum ranges → `permissible_values` with the enum's `permissible_values` dict projected into `PermissibleValueDefinition` records.
  • Slot URI → DD `uri`. `slot.slot_uri` carries through.
  • `description`, `pattern`, `required`, `multivalued` → matching DD slots.
  • Numeric bounds. `minimum_value` / `maximum_value` → DD `min` / `max`.
  • `unit`. LinkML's `unit` slot uses a UCUM-flavored object; the DD uses a freeform string. Best-effort: take `unit.symbol` if present, fall back to `unit.ucum_code` or `unit.descriptive_name`.
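
The rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the proposed implementation: slots and enums are plain dicts standing in for LinkML's `SlotDefinition` / `EnumDefinition` objects, `project_slot` and `unit_to_string` are hypothetical helper names, and the DD-side keys `name`, `type`, and the `text` / `description` fields of a `PermissibleValueDefinition` are guesses at the #191 format. A slot with no `range` is treated as `string` here, which is also an assumption.

```python
from typing import Optional

# LinkML built-in type names that the rules above map 1-1 onto the DD vocabulary.
BUILTIN_TYPES = {
    "string", "integer", "decimal", "boolean",
    "date", "datetime", "time", "uri", "curie",
}

# LinkML slot attribute -> DD entry key (DD key names follow the rules above).
CARRY_THROUGH = [
    ("slot_uri", "uri"),
    ("description", "description"),
    ("pattern", "pattern"),
    ("required", "required"),
    ("multivalued", "multivalued"),
    ("minimum_value", "min"),
    ("maximum_value", "max"),
]


def unit_to_string(unit: Optional[dict]) -> Optional[str]:
    """Best-effort unit string: symbol, then ucum_code, then descriptive_name."""
    if not unit:
        return None
    for key in ("symbol", "ucum_code", "descriptive_name"):
        if unit.get(key):
            return unit[key]
    return None


def project_slot(slot: dict, enums: dict) -> dict:
    """Project one induced slot (plain-dict stand-in) into a DD-entry-shaped dict."""
    entry = {"name": slot["name"]}  # "name" is an assumed DD key
    rng = slot.get("range", "string")  # assumed default when no range is given
    if rng in enums:
        # Enum range: project each permissible value into a small record.
        pvs = enums[rng].get("permissible_values", {})
        entry["permissible_values"] = [
            {"text": text, "description": (pv or {}).get("description")}
            for text, pv in pvs.items()
        ]
    elif rng in BUILTIN_TYPES:
        entry["type"] = rng
    # Copy attributes that are explicitly set; 0 and False count as set.
    for src, dst in CARRY_THROUGH:
        if slot.get(src) is not None:
            entry[dst] = slot[src]
    unit = unit_to_string(slot.get("unit"))
    if unit is not None:
        entry["unit"] = unit
    return entry
```

Class-ranged slots fall through both branches and get no `type` in this sketch; how to treat them is one of the open questions listed under "Things to handle thoughtfully".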

Things to handle thoughtfully

  • Abstract classes and mixins. Probably skip in the default projection (a class you can't instantiate isn't a usable DD), but offer `--include-abstract` for anyone who wants them.
  • Inlined references (slot with class range). A slot whose range is another class isn't a column descriptor in the DD sense. Project as `type: string` with a `description` noting the reference target, or skip entirely (configurable).
  • Multivalued slots with class ranges. Same case, multivalued. The DD has a `multivalued` Spec B field — apply it.
  • Identifiers. Slots marked `identifier: true` are valid DD entries; no special handling needed beyond standard projection.
  • Tree-root annotations. If the schema has a single `tree_root: true` class, default `--class` to that.
  • Per-slot `examples`. LinkML examples are richer than DD's `example_values`; project the `value` field of each example into the multivalued `example_values` list.
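
A few of these cases are small enough to sketch. The snippet below is one possible reading, not a settled design: classes and slots are again plain dicts standing in for LinkML definitions, the helper names (`select_classes`, `reference_entry`, `example_values`) are hypothetical, and the wording of the reference note is made up for illustration. It does not try to define what "top-level" means.

```python
from typing import Optional


def select_classes(
    classes: dict,
    class_name: Optional[str] = None,
    include_abstract: bool = False,
) -> list:
    """Pick which classes to project.

    An explicit class wins; otherwise a single tree_root class is the default;
    otherwise all classes, skipping abstract classes and mixins unless asked.
    """
    if class_name is not None:
        return [class_name]
    roots = [name for name, cls in classes.items() if cls.get("tree_root")]
    if len(roots) == 1:
        return roots
    return [
        name
        for name, cls in classes.items()
        if include_abstract or not (cls.get("abstract") or cls.get("mixin"))
    ]


def reference_entry(slot: dict, skip: bool = False) -> Optional[dict]:
    """Project a class-ranged slot as a string entry, or drop it when skip=True."""
    if skip:
        return None
    note = f"References class {slot['range']}."  # illustrative wording only
    description = " ".join(p for p in (slot.get("description"), note) if p)
    entry = {"name": slot["name"], "type": "string", "description": description}
    if slot.get("multivalued"):
        entry["multivalued"] = True
    return entry


def example_values(slot: dict) -> list:
    """Keep only the `value` of each LinkML example, dropping examples without one."""
    return [
        ex["value"]
        for ex in slot.get("examples", [])
        if ex.get("value") is not None
    ]
```

Making `skip` a parameter mirrors the "configurable" note on inlined references; whether it surfaces as a CLI flag is left open here.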

Lossy direction

This is intentionally a one-way projection — class hierarchies, slot inheritance graphs, multi-class relationships, mixins, abstract definitions, structural constraints (rules, classification rules), domain/range cross-references — all flatten or drop. The DD format has none of those; that's the deliberate tradeoff of #191.

Why this is the right home for the bridge

Lives in `schema_automator/utils/extract_dd.py` (or similar) rather than the adapter or importer trees, because it doesn't belong to either pattern — it's a general utility over LinkML schemas. Any importer benefits; any future tool producing a LinkML schema benefits.
