Add canonical data dictionary format spec (closes #191) by amc-corey-cox · Pull Request #201 · linkml/schema-automator

amc-corey-cox · 2026-05-05T15:57:32Z

Summary

Defines schema-automator's canonical, opinionated data dictionary format — the forward-looking target spec we ask new studies to produce when onboarding to the pipeline. Closes #191.

This is a spec-only PR: the deliverable is documentation + a LinkML schema + worked examples. Implementation of ingestion (#192), reconciliation (#193), and handling of non-canonical/legacy DDs (#200) is tracked separately and lands in subsequent PRs.

What's in the spec

Two-spec structure. Spec A is the recommended field set for every entry: `name`, `type`, `description`, plus type-conditional `codes` (when permissible_values), `unit`, `min`, and `max` (when numeric). Spec B is a set of optional columns researchers may adopt independently à la carte: `label`, `multivalued`, `required`, `pattern`, `uri`, `see_also`.

Type vocabulary fixed at 10 researcher-comprehensible values (`string`, `integer`, `decimal`, `boolean`, `date`, `datetime`, `time`, `uri`, `curie`, `permissible_values`). Curated subset of LinkML's built-in types — technical primitives (`ncname`, `jsonpath`, etc.) are excluded.

Codes encoded REDCap-style as `code, label | code, label | ...` with bareword shorthand for value-equals-meaning cases. Comma separator (not `=`) avoids collisions with code values containing operators like `>=` or `<=`.

`none` token is the explicit "not applicable" value for unit/min/max — distinct from an empty cell, which means "the author has not declared this field" and is reported as a conformance issue.

Negative requirements on `description`: must not contain code lists, units, ranges, or example values — those have dedicated fields. Stated as a hard rule; enforcement is a separate content-quality lint.

No document-level metadata in v1. Every productive option (header that breaks CSV, sidecar coupling, filename encoding) has worse trade-offs than dropping it; the spec is intentionally substrate-only.

Files

`schema_automator/metamodels/data_dictionary.yaml` — the normative LinkML schema (10-value type vocabulary, conditional-required rules, slot definitions for Spec A and Spec B).
`docs/data_dictionary_format.rst` — prose spec covering audience, conformance modes, codes encoding, type vocabulary, and deferred-to-future revisions.
`docs/examples/dd_example_minimal.tsv` — Spec A example covering each type case.
`docs/examples/dd_example_with_optional.tsv` — Spec A plus selected Spec B columns.
`docs/examples/dd_example_minimal.yaml` — YAML equivalent.
`docs/index.rst` — toctree entry for the new spec doc.

Conformance modes

Default: missing best-practice/conditional fields are reported as warnings; processing continues.
Strict: any missing best-practice/conditional field, invalid type vocabulary value, or malformed codes encoding fails. Both modes use the same LinkML schema.

Validation

The LinkML schema validates the example YAML cleanly via `linkml-validate`. Type vocabulary violations are caught (verified with a synthetic invalid-type example).

Known LinkML quirks (worth filing upstream)

The `min` and `max` slots use slot-level `any_of` (decimal OR `equals_string: none`) without a top-level `range:`. This is because:

`gen-json-schema` emits malformed JSON Schema when a slot has both `range:` and `any_of:` — top-level `type: string` and `anyOf` conflict, with the type winning. Validation rejects bare numeric values in YAML.
`linkml-lint`'s `no_undeclared_ranges` rule doesn't recognize `any_of` as a range declaration, so it false-positives on these slots.

Both reproduce on linkml/linkml-runtime 1.10.0 (latest). A schema comment explains the workaround. Upstream issues to be filed after merge.

Test plan

Review spec prose for clarity and accuracy.
Confirm the LinkML schema captures the conformance contract correctly.
Sanity-check examples cover the type-case matrix (string, integer/decimal with units, decimal unbounded, boolean, date, permissible_values with full codes, permissible_values with bareword, uri, curie).
Verify TSV examples render correctly in your TSV viewer of choice.
Decide on filing the two upstream linkml issues (separate task).

Defines a forward-looking, prescriptive data dictionary format for new studies onboarding to schema-automator's enrichment pipeline. Includes: - LinkML schema at schema_automator/metamodels/data_dictionary.yaml (the normative machine-readable definition) - Prose spec doc at docs/data_dictionary_format.rst with Spec A (recommended fields) and Spec B (optional, independently-adoptable columns) - Worked examples in docs/examples/ covering each type case in both TSV and YAML form Type vocabulary is fixed at 10 researcher-comprehensible values (string, integer, decimal, boolean, date, datetime, time, uri, curie, permissible_values). Codes encoded as REDCap-style "code, label | ..." with bareword shorthand. Document-level metadata explicitly out of scope for v1.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Adds a `parse-digests` CLI command that reads cached data_dict.xml files for a cohort and writes one TSV per data table in the schema-automator canonical data dictionary format (linkml/schema-automator#201). Outputs land at `output/<cohort>/dd/<phs>.<pht>.dd.tsv` with all Spec A columns plus `uri` from Spec B. dbGaP types are translated to the canonical 10-value vocabulary; encoded values are rendered REDCap-style (`code, label | code, label`); each variable's `uri` carries the dbGaP phv accession as a CURIE for traceability. `unit`, `min`, and `max` are emitted empty pending richer var_report parsing. Refs #204

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

) Position the spec relative to Frictionless Table Schema, REDCap, and SchemaSheets: small extension to the universal common ground (name, description, partial type info), implementing the core of those formats, with adapters planned. Explicitly state SchemaSheets supersession. Reorganize 'Future revisions' to cover only additive in-format extensions; move use cases that need a different artifact entirely (hierarchical data, multi-table relationships, variable versioning, cross-column constraints, domain extensions) to a new 'Out of scope' section.

Electronic Health Record acronym used as a permissible-values code example in the data dictionary spec. Codespell flags 'EHR' as a typo for 'HER'.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Add example_values to Spec B (multivalued, pipe-separated in TSV). Closes a self-contradiction where description rules forbade examples but pointed to a Spec B field that didn't exist. - Add backslash-escaping for codes ('\,', '\|', '\'). Codes can now contain commas, pipes, and backslashes. Replaces the v1 'no commas at all' limitation. Tighten the codes regex to enforce the escape grammar and reject malformed strings (multi-comma in code position, empty tokens, A | | B, etc.). - Add LinkML rules forbidding 'codes' on non-permissible_values rows and forbidding 'unit'/'min'/'max' on non-numeric rows. linkml-validate treats multi-slot postconditions as AND, so the numeric rules are split into three separate rules (one per slot) to fire on any single violation. - Document multivalued TSV serialization (pipe-separated within a cell) for see_also and example_values. - Add 'time' and 'datetime' rows to the example files so all 10 type values are demonstrated. - Demonstrate codes-with-escaping in a race_ethnicity row across all three example files. - Document fractional-bounds-on-integer as a content-quality lint case (schema cannot cleanly enforce; lint pass catches).

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

- Remove stale 'Escaping for | and , in code values' bullet from Future revisions; escaping is shipped in v1. - Rephrase the codes-escaping intro to avoid the awkward ',', '|', or '\' literal markup that confuses Sphinx; spell it out as prose, with the bullets below showing the escapes. - Add 'Multivalued TSV cells' subsection specifying that whitespace around the | separator is trimmed and pipes within values must be escaped as \|. Same convention as codes encoding. Spec B entries for see_also and example_values now reference this rule. - Rewrite both example TSVs with uniform column counts (minimal: 7 fields per row, optional: 11). Earlier hand-edited rows had inconsistent tab counts that put values in wrong columns and could mis-parse in strict TSV readers.

amc-corey-cox requested a review from Copilot May 5, 2026 16:52

Copilot AI reviewed May 5, 2026

amc-corey-cox requested a review from Copilot May 5, 2026 17:45

Copilot started reviewing on behalf of amc-corey-cox May 5, 2026 17:45 View session

amc-corey-cox mentioned this pull request May 5, 2026

Add dbGaP digest fetcher and adapt-digests pipeline stage linkml/dm-bip#320

Merged

5 tasks

Copilot AI reviewed May 5, 2026

amc-corey-cox requested a review from Copilot May 5, 2026 19:54

Copilot started reviewing on behalf of amc-corey-cox May 5, 2026 19:55 View session

Add 'ehr' to codespell ignore-words-list

ec9db3b

Electronic Health Record acronym used as a permissible-values code example in the data dictionary spec. Codespell flags 'EHR' as a typo for 'HER'.

Copilot AI reviewed May 5, 2026

View reviewed changes

amc-corey-cox added 2 commits May 5, 2026 15:13

Fix codespell flag: 'requireds' -> 'required-fields'

d49b087

amc-corey-cox requested a review from Copilot May 6, 2026 13:37

Copilot started reviewing on behalf of amc-corey-cox May 6, 2026 13:37 View session

Copilot AI reviewed May 6, 2026

View reviewed changes

Comment thread docs/data_dictionary_format.rst Outdated

Comment thread docs/data_dictionary_format.rst Outdated

Comment thread docs/data_dictionary_format.rst Outdated

Comment thread docs/examples/dd_example_with_optional.tsv Outdated

Comment thread docs/examples/dd_example_minimal.tsv Outdated

amc-corey-cox requested a review from Copilot May 6, 2026 16:25

Copilot started reviewing on behalf of amc-corey-cox May 6, 2026 16:26 View session

amc-corey-cox merged commit 7fce840 into main May 6, 2026
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add canonical data dictionary format spec (closes #191)#201

Add canonical data dictionary format spec (closes #191)#201
amc-corey-cox merged 6 commits into
mainfrom
dd-format-spec

amc-corey-cox commented May 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

amc-corey-cox commented May 5, 2026

Summary

What's in the spec

Files

Conformance modes

Validation

Known LinkML quirks (worth filing upstream)

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants