Skip to content

Add canonical data dictionary format spec (closes #191)#201

Merged
amc-corey-cox merged 6 commits into
mainfrom
dd-format-spec
May 6, 2026
Merged

Add canonical data dictionary format spec (closes #191)#201
amc-corey-cox merged 6 commits into
mainfrom
dd-format-spec

Conversation

@amc-corey-cox
Copy link
Copy Markdown
Contributor

Summary

Defines schema-automator's canonical, opinionated data dictionary format — the forward-looking target spec we ask new studies to produce when onboarding to the pipeline. Closes #191.

This is a spec-only PR: the deliverable is documentation + a LinkML schema + worked examples. Implementation of ingestion (#192), reconciliation (#193), and handling of non-canonical/legacy DDs (#200) is tracked separately and lands in subsequent PRs.

What's in the spec

Two-spec structure. Spec A is the recommended field set for every entry: `name`, `type`, `description`, plus type-conditional `codes` (when permissible_values), `unit`, `min`, and `max` (when numeric). Spec B is a set of optional columns researchers may adopt independently à la carte: `label`, `multivalued`, `required`, `pattern`, `uri`, `see_also`.

Type vocabulary fixed at 10 researcher-comprehensible values (`string`, `integer`, `decimal`, `boolean`, `date`, `datetime`, `time`, `uri`, `curie`, `permissible_values`). Curated subset of LinkML's built-in types — technical primitives (`ncname`, `jsonpath`, etc.) are excluded.

Codes encoded REDCap-style as `code, label | code, label | ...` with bareword shorthand for value-equals-meaning cases. Comma separator (not `=`) avoids collisions with code values containing operators like `>=` or `<=`.

`none` token is the explicit "not applicable" value for unit/min/max — distinct from an empty cell, which means "the author has not declared this field" and is reported as a conformance issue.

Negative requirements on `description`: must not contain code lists, units, ranges, or example values — those have dedicated fields. Stated as a hard rule; enforcement is a separate content-quality lint.

No document-level metadata in v1. Every productive option (header that breaks CSV, sidecar coupling, filename encoding) has worse trade-offs than dropping it; the spec is intentionally substrate-only.

Files

  • `schema_automator/metamodels/data_dictionary.yaml` — the normative LinkML schema (10-value type vocabulary, conditional-required rules, slot definitions for Spec A and Spec B).
  • `docs/data_dictionary_format.rst` — prose spec covering audience, conformance modes, codes encoding, type vocabulary, and deferred-to-future revisions.
  • `docs/examples/dd_example_minimal.tsv` — Spec A example covering each type case.
  • `docs/examples/dd_example_with_optional.tsv` — Spec A plus selected Spec B columns.
  • `docs/examples/dd_example_minimal.yaml` — YAML equivalent.
  • `docs/index.rst` — toctree entry for the new spec doc.

Conformance modes

  • Default: missing best-practice/conditional fields are reported as warnings; processing continues.
  • Strict: any missing best-practice/conditional field, invalid type vocabulary value, or malformed codes encoding fails. Both modes use the same LinkML schema.

Validation

The LinkML schema validates the example YAML cleanly via `linkml-validate`. Type vocabulary violations are caught (verified with a synthetic invalid-type example).

Known LinkML quirks (worth filing upstream)

The `min` and `max` slots use slot-level `any_of` (decimal OR `equals_string: none`) without a top-level `range:`. This is because:

  1. `gen-json-schema` emits malformed JSON Schema when a slot has both `range:` and `any_of:` — top-level `type: string` and `anyOf` conflict, with the type winning. Validation rejects bare numeric values in YAML.
  2. `linkml-lint`'s `no_undeclared_ranges` rule doesn't recognize `any_of` as a range declaration, so it false-positives on these slots.

Both reproduce on linkml/linkml-runtime 1.10.0 (latest). A schema comment explains the workaround. Upstream issues to be filed after merge.

Test plan

  • Review spec prose for clarity and accuracy.
  • Confirm the LinkML schema captures the conformance contract correctly.
  • Sanity-check examples cover the type-case matrix (string, integer/decimal with units, decimal unbounded, boolean, date, permissible_values with full codes, permissible_values with bareword, uri, curie).
  • Verify TSV examples render correctly in your TSV viewer of choice.
  • Decide on filing the two upstream linkml issues (separate task).

Defines a forward-looking, prescriptive data dictionary format for new
studies onboarding to schema-automator's enrichment pipeline. Includes:

- LinkML schema at schema_automator/metamodels/data_dictionary.yaml
  (the normative machine-readable definition)
- Prose spec doc at docs/data_dictionary_format.rst with Spec A
  (recommended fields) and Spec B (optional, independently-adoptable
  columns)
- Worked examples in docs/examples/ covering each type case in both
  TSV and YAML form

Type vocabulary is fixed at 10 researcher-comprehensible values
(string, integer, decimal, boolean, date, datetime, time, uri, curie,
permissible_values). Codes encoded as REDCap-style "code, label | ..."
with bareword shorthand. Document-level metadata explicitly out of
scope for v1.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@amc-corey-cox amc-corey-cox requested a review from Copilot May 5, 2026 17:45
amc-corey-cox added a commit to linkml/dm-bip that referenced this pull request May 5, 2026
Adds a `parse-digests` CLI command that reads cached data_dict.xml files
for a cohort and writes one TSV per data table in the schema-automator
canonical data dictionary format (linkml/schema-automator#201).

Outputs land at `output/<cohort>/dd/<phs>.<pht>.dd.tsv` with all Spec A
columns plus `uri` from Spec B. dbGaP types are translated to the
canonical 10-value vocabulary; encoded values are rendered REDCap-style
(`code, label | code, label`); each variable's `uri` carries the dbGaP
phv accession as a CURIE for traceability. `unit`, `min`, and `max` are
emitted empty pending richer var_report parsing.

Refs #204
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

)

Position the spec relative to Frictionless Table Schema, REDCap, and
SchemaSheets: small extension to the universal common ground (name,
description, partial type info), implementing the core of those formats,
with adapters planned. Explicitly state SchemaSheets supersession.

Reorganize 'Future revisions' to cover only additive in-format
extensions; move use cases that need a different artifact entirely
(hierarchical data, multi-table relationships, variable versioning,
cross-column constraints, domain extensions) to a new 'Out of scope'
section.
Electronic Health Record acronym used as a permissible-values code
example in the data dictionary spec. Codespell flags 'EHR' as a typo
for 'HER'.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread schema_automator/metamodels/data_dictionary.yaml Outdated
Comment thread schema_automator/metamodels/data_dictionary.yaml
Comment thread schema_automator/metamodels/data_dictionary.yaml
Comment thread schema_automator/metamodels/data_dictionary.yaml
Comment thread docs/data_dictionary_format.rst Outdated
Comment thread docs/data_dictionary_format.rst
Comment thread docs/data_dictionary_format.rst Outdated
Comment thread schema_automator/metamodels/data_dictionary.yaml Outdated
Comment thread schema_automator/metamodels/data_dictionary.yaml Outdated
Comment thread docs/data_dictionary_format.rst
- Add example_values to Spec B (multivalued, pipe-separated in TSV).
  Closes a self-contradiction where description rules forbade examples
  but pointed to a Spec B field that didn't exist.

- Add backslash-escaping for codes ('\,', '\|', '\'). Codes can now
  contain commas, pipes, and backslashes. Replaces the v1 'no commas
  at all' limitation. Tighten the codes regex to enforce the escape
  grammar and reject malformed strings (multi-comma in code position,
  empty tokens, A | | B, etc.).

- Add LinkML rules forbidding 'codes' on non-permissible_values rows
  and forbidding 'unit'/'min'/'max' on non-numeric rows. linkml-validate
  treats multi-slot postconditions as AND, so the numeric rules are
  split into three separate rules (one per slot) to fire on any single
  violation.

- Document multivalued TSV serialization (pipe-separated within a cell)
  for see_also and example_values.

- Add 'time' and 'datetime' rows to the example files so all 10 type
  values are demonstrated.

- Demonstrate codes-with-escaping in a race_ethnicity row across all
  three example files.

- Document fractional-bounds-on-integer as a content-quality lint case
  (schema cannot cleanly enforce; lint pass catches).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Comment thread docs/data_dictionary_format.rst Outdated
Comment thread docs/data_dictionary_format.rst Outdated
Comment thread docs/data_dictionary_format.rst Outdated
Comment thread docs/examples/dd_example_with_optional.tsv Outdated
Comment thread docs/examples/dd_example_minimal.tsv Outdated
- Remove stale 'Escaping for | and , in code values' bullet from
  Future revisions; escaping is shipped in v1.

- Rephrase the codes-escaping intro to avoid the awkward
  ',', '|', or '\' literal markup that confuses Sphinx; spell
  it out as prose, with the bullets below showing the escapes.

- Add 'Multivalued TSV cells' subsection specifying that whitespace
  around the | separator is trimmed and pipes within values must be
  escaped as \|. Same convention as codes encoding. Spec B entries
  for see_also and example_values now reference this rule.

- Rewrite both example TSVs with uniform column counts (minimal: 7
  fields per row, optional: 11). Earlier hand-edited rows had
  inconsistent tab counts that put values in wrong columns and
  could mis-parse in strict TSV readers.
@amc-corey-cox amc-corey-cox requested a review from Copilot May 6, 2026 16:25
@amc-corey-cox amc-corey-cox merged commit 7fce840 into main May 6, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Define structured data dictionary input format

2 participants