EML (Ecological Metadata Language) importer #208

@amc-corey-cox

Scope

Import Ecological Metadata Language (EML) XML documents into LinkML schemas. Lives in the importer family alongside `XsdImportEngine` (#153), `JsonSchemaImportEngine`, `OwlImportEngine`, etc. Output is a full LinkML schema describing the dataset(s) the EML document describes; the canonical-DD-shaped view is reachable via the `extract-dd` projection utility (#209).

Supersedes linkml/linkml#2168 (the original tracking issue, upstream rather than in this repo) and the never-finished draft PR #138.

Why importer rather than adapter

EML is not a data dictionary format — it's a metadata document describing a dataset. The DD layer is one facet alongside:

  • Bibliographic and provenance information
  • Geographic / temporal / taxonomic coverage
  • Sampling methods and protocols
  • Multi-table relationships
  • Pointers to data files
  • Unit definitions

Treating EML as a DD adapter (#202 family) would throw away the schema-shaped richness EML actually provides. Treating it as a schema importer preserves the structural information; the DD-shaped use case is served downstream by #209's projection utility.

The split that drops out:

Deliverables

LinkML representation of EML format

Use `XsdImportEngine` on EML's published XSD (https://eml.ecoinformatics.org/eml-schema) to bootstrap a LinkML representation of EML itself. Then hand-trim to the data-relevant subset — `DataTable`, `Attribute`, `MeasurementScale`, `EnumeratedDomain`, `StandardUnit`/`CustomUnit`, etc. — dropping the broader metadata wrapper (bibliographic, coverage, methods) unless we want to model them too.

This auto-generation step saves manual transcription and ensures our representation matches the upstream EML spec rather than just what's in our sample documents.

Import engine

`schema_automator/importers/eml_import_engine.py` — `EMLImportEngine` subclassing `ImportEngine`. Walks an EML document, producing a LinkML schema where:

  • Each `<dataTable>` → a class.
  • Each `<attribute>` → a slot on that class.
  • `<measurementScale>` and storage type → slot range (one of LinkML's built-in types, or an enum class for `<enumeratedDomain>`).
  • `<codeDefinition>` → `PermissibleValue` entries on the enum.
  • `<standardUnit>` / `<customUnit>` → slot `unit`.
  • Description, units, examples carry through.
  • Multi-table linkage (when expressed in EML) preserved via slot ranges or inlined references.
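The mapping above can be sketched with the standard library alone. This is a minimal illustration, not the proposed `EMLImportEngine`: it emits a plain dict shaped like a LinkML schema instead of going through `ImportEngine` or `linkml_runtime`, the sample document is hand-written (not one of the fixtures from linkml/linkml#2168), and the helper names `pascal_case` and `eml_to_schema_dict`, the enum naming scheme, and the choice of `descriptive_name` for the unit are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample; only the EML root element is namespace-qualified.
SAMPLE_EML = """
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
         packageId="demo.1.1" system="demo">
  <dataset>
    <title>Demo dataset</title>
    <dataTable>
      <entityName>plot_survey</entityName>
      <attributeList>
        <attribute>
          <attributeName>habitat</attributeName>
          <attributeDefinition>Habitat type</attributeDefinition>
          <measurementScale><nominal><nonNumericDomain><enumeratedDomain>
            <codeDefinition><code>F</code><definition>Forest</definition></codeDefinition>
            <codeDefinition><code>G</code><definition>Grassland</definition></codeDefinition>
          </enumeratedDomain></nonNumericDomain></nominal></measurementScale>
        </attribute>
        <attribute>
          <attributeName>canopy_height</attributeName>
          <attributeDefinition>Mean canopy height</attributeDefinition>
          <measurementScale><ratio>
            <unit><standardUnit>meter</standardUnit></unit>
            <numericDomain><numberType>real</numberType></numericDomain>
          </ratio></measurementScale>
        </attribute>
      </attributeList>
    </dataTable>
  </dataset>
</eml:eml>
"""


def pascal_case(name):
    """Hypothetical helper: 'plot_survey' -> 'PlotSurvey'."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", name) if p]
    return "".join(p[:1].upper() + p[1:] for p in parts)


def eml_to_schema_dict(xml_text, schema_name="example"):
    """Walk an EML document and build a LinkML-shaped dict (illustrative only)."""
    root = ET.fromstring(xml_text.strip())
    schema = {"name": schema_name, "classes": {}, "enums": {}}
    for table in root.iter("dataTable"):  # one class per data table
        class_name = pascal_case(table.findtext("entityName", default="DataTable"))
        attributes = {}
        for attr in table.iter("attribute"):  # one slot per attribute
            slot_name = attr.findtext("attributeName").strip()
            slot = {"description": attr.findtext("attributeDefinition")}
            codes = attr.findall(".//enumeratedDomain/codeDefinition")
            if codes:
                # Enumerated domain: emit an enum and point the slot at it.
                enum_name = class_name + pascal_case(slot_name) + "Enum"
                schema["enums"][enum_name] = {
                    "permissible_values": {
                        c.findtext("code"): {"description": c.findtext("definition")}
                        for c in codes
                    }
                }
                slot["range"] = enum_name
            elif attr.find(".//numericDomain") is not None:
                slot["range"] = "float"  # simplification; ignores numberType
            else:
                slot["range"] = "string"
            unit = attr.findtext(".//unit/standardUnit") or attr.findtext(".//unit/customUnit")
            if unit:
                slot["unit"] = {"descriptive_name": unit}
            attributes[slot_name] = slot
        schema["classes"][class_name] = {"attributes": attributes}
    return schema


schema = eml_to_schema_dict(SAMPLE_EML)
```

A real engine would return a `SchemaDefinition` and handle multi-table linkage, which this sketch ignores.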

CLI

```
schemauto import-eml <document.eml> [-o ] [-n ] [-I ]
```

Matches the existing importer CLI shape (`import-xsd`, `import-jsonschema`, etc.).

Tests

Real-world fixtures from the EML samples referenced in linkml/linkml#2168:

Reaching the DD-shaped consumer

The dm-bip-style DD-enrichment use case is served by chaining:

```
schemauto import-eml dataset.eml -o dataset-schema.yaml
schemauto extract-dd dataset-schema.yaml -o dataset-dd.yaml
```

No EML-specific DD adapter needed.
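To illustrate why the second step needs no EML knowledge, here is a rough sketch of a projection that reads only the schema. It is not the #209 design: the function name `project_to_dd`, the row keys (`table`, `variable`, `type`, `permissible_values`), and the dict-shaped input are assumptions chosen for the example.

```python
def project_to_dd(schema):
    """Flatten a LinkML-shaped schema dict into one row per class attribute.

    Hypothetical stand-in for the extract-dd projection; the row layout
    is illustrative, not the canonical DD shape.
    """
    enums = schema.get("enums", {})
    rows = []
    for class_name, cls in schema.get("classes", {}).items():
        for slot_name, slot in cls.get("attributes", {}).items():
            slot_range = slot.get("range", "string")
            enum = enums.get(slot_range)
            rows.append({
                "table": class_name,
                "variable": slot_name,
                "description": slot.get("description"),
                # Enum-ranged slots are reported as "enum" with their codes listed.
                "type": "enum" if enum else slot_range,
                "permissible_values": sorted(enum["permissible_values"]) if enum else [],
            })
    return rows


# Hand-written schema dict standing in for import-eml output.
example_schema = {
    "classes": {
        "PlotSurvey": {
            "attributes": {
                "habitat": {"description": "Habitat type", "range": "HabitatEnum"},
                "canopy_height": {"description": "Mean canopy height", "range": "float"},
            }
        }
    },
    "enums": {"HabitatEnum": {"permissible_values": {"G": {}, "F": {}}}},
}

dd_rows = project_to_dd(example_schema)
```

Nothing in the function refers to EML, so the same projection would apply to the output of any other importer.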

Things to handle thoughtfully

  • Multi-table EML documents. An EML document can describe several `<dataTable>` blocks. Emit one class per data table (similar to dbGaP's one-DD-per-pht).
  • Measurement scale richness. EML's `nominal` / `ordinal` / `interval` / `ratio` distinction is finer than LinkML's built-in types. Codes-bearing nominal/ordinal → enum class. Ordinal-without-codes is interesting; defer.
  • Unit handling. EML's `<standardUnit>` references the EML unit dictionary; `<customUnit>` defines new units. LinkML's slot `unit` is richer than the canonical DD's freeform string but still simpler than EML's machinery. Best-effort: extract the unit name/symbol; preserve the URI when available.
  • Metadata-only documents. An EML document without data tables should produce a schema covering whatever non-data classes are appropriate, or just emit a near-empty schema with a clear warning.
  • Domain-specific metadata. Geographic coverage, taxonomic coverage, sampling methods — these have structural shape but don't fit the data-table pattern. Skip in v1; could be future enrichment.
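One way the measurement-scale decision could look, as a sketch under stated assumptions: the function name `resolve_range`, the `numberType`-to-range table, the `datetime` choice for EML's `dateTime` scale, and the string fallback (with a warning) for ordinal-without-codes are illustrative choices, not decisions made in this issue.

```python
import warnings

# Assumed mapping from EML numberType values to LinkML built-in types.
NUMBER_TYPE_TO_RANGE = {
    "natural": "integer",
    "whole": "integer",
    "integer": "integer",
    "real": "float",
}


def resolve_range(scale, enum_name=None, number_type=None):
    """Pick a slot range from an EML measurement scale (illustrative only).

    scale       -- 'nominal', 'ordinal', 'interval', 'ratio' or 'dateTime'
    enum_name   -- name of the enum built from code definitions, if any
    number_type -- EML numberType for numeric scales, if present
    """
    if scale in ("nominal", "ordinal"):
        if enum_name:
            return enum_name  # codes-bearing nominal/ordinal -> enum class
        if scale == "ordinal":
            # Deferred case: the ordering is lost in this fallback.
            warnings.warn("ordinal attribute without codes; using 'string'")
        return "string"
    if scale in ("interval", "ratio"):
        return NUMBER_TYPE_TO_RANGE.get(number_type, "float")
    if scale == "dateTime":
        return "datetime"
    raise ValueError(f"unknown measurement scale: {scale!r}")
```

For example, `resolve_range("ratio", number_type="whole")` returns `"integer"`, while `resolve_range("nominal", enum_name="HabitatEnum")` returns the enum name unchanged.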

What's not salvageable from PR #138

The 2024 draft has a stub `EMLImportEngine` whose `convert()` returns an empty schema; it also references a nonexistent `schema_automator.metamodels.eml` module and tries to load XML via `json_loader`. The branch is 126 commits behind main. The file location (`importers/`) was right, but nothing in the implementation is worth keeping.

Related: linkml/linkml#2168 (upstream tracking issue), #87 / #153 (XSD importer that this builds on), #209 (extract-dd projection utility, the bridge from importer output to DD-shaped consumers).
