Add canonical data dictionary format spec (closes #191)#201
Merged
Conversation
Defines a forward-looking, prescriptive data dictionary format for new studies onboarding to schema-automator's enrichment pipeline. Includes: - LinkML schema at schema_automator/metamodels/data_dictionary.yaml (the normative machine-readable definition) - Prose spec doc at docs/data_dictionary_format.rst with Spec A (recommended fields) and Spec B (optional, independently-adoptable columns) - Worked examples in docs/examples/ covering each type case in both TSV and YAML form Type vocabulary is fixed at 10 researcher-comprehensible values (string, integer, decimal, boolean, date, datetime, time, uri, curie, permissible_values). Codes encoded as REDCap-style "code, label | ..." with bareword shorthand. Document-level metadata explicitly out of scope for v1.
This was referenced May 5, 2026
amc-corey-cox
added a commit
to linkml/dm-bip
that referenced
this pull request
May 5, 2026
Adds a `parse-digests` CLI command that reads cached data_dict.xml files for a cohort and writes one TSV per data table in the schema-automator canonical data dictionary format (linkml/schema-automator#201). Outputs land at `output/<cohort>/dd/<phs>.<pht>.dd.tsv` with all Spec A columns plus `uri` from Spec B. dbGaP types are translated to the canonical 10-value vocabulary; encoded values are rendered REDCap-style (`code, label | code, label`); each variable's `uri` carries the dbGaP phv accession as a CURIE for traceability. `unit`, `min`, and `max` are emitted empty pending richer var_report parsing. Refs #204
5 tasks
) Position the spec relative to Frictionless Table Schema, REDCap, and SchemaSheets: small extension to the universal common ground (name, description, partial type info), implementing the core of those formats, with adapters planned. Explicitly state SchemaSheets supersession. Reorganize 'Future revisions' to cover only additive in-format extensions; move use cases that need a different artifact entirely (hierarchical data, multi-table relationships, variable versioning, cross-column constraints, domain extensions) to a new 'Out of scope' section.
Electronic Health Record acronym used as a permissible-values code example in the data dictionary spec. Codespell flags 'EHR' as a typo for 'HER'.
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Add example_values to Spec B (multivalued, pipe-separated in TSV).
Closes a self-contradiction where description rules forbade examples
but pointed to a Spec B field that didn't exist.
- Add backslash-escaping for codes ('\,', '\|', '\'). Codes can now
contain commas, pipes, and backslashes. Replaces the v1 'no commas
at all' limitation. Tighten the codes regex to enforce the escape
grammar and reject malformed strings (multi-comma in code position,
empty tokens, A | | B, etc.).
- Add LinkML rules forbidding 'codes' on non-permissible_values rows
and forbidding 'unit'/'min'/'max' on non-numeric rows. linkml-validate
treats multi-slot postconditions as AND, so the numeric rules are
split into three separate rules (one per slot) to fire on any single
violation.
- Document multivalued TSV serialization (pipe-separated within a cell)
for see_also and example_values.
- Add 'time' and 'datetime' rows to the example files so all 10 type
values are demonstrated.
- Demonstrate codes-with-escaping in a race_ethnicity row across all
three example files.
- Document fractional-bounds-on-integer as a content-quality lint case
(schema cannot cleanly enforce; lint pass catches).
- Remove stale 'Escaping for | and , in code values' bullet from Future revisions; escaping is shipped in v1. - Rephrase the codes-escaping intro to avoid the awkward ',', '|', or '\' literal markup that confuses Sphinx; spell it out as prose, with the bullets below showing the escapes. - Add 'Multivalued TSV cells' subsection specifying that whitespace around the | separator is trimmed and pipes within values must be escaped as \|. Same convention as codes encoding. Spec B entries for see_also and example_values now reference this rule. - Rewrite both example TSVs with uniform column counts (minimal: 7 fields per row, optional: 11). Earlier hand-edited rows had inconsistent tab counts that put values in wrong columns and could mis-parse in strict TSV readers.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Defines schema-automator's canonical, opinionated data dictionary format — the forward-looking target spec we ask new studies to produce when onboarding to the pipeline. Closes #191.
This is a spec-only PR: the deliverable is documentation + a LinkML schema + worked examples. Implementation of ingestion (#192), reconciliation (#193), and handling of non-canonical/legacy DDs (#200) is tracked separately and lands in subsequent PRs.
What's in the spec
Two-spec structure. Spec A is the recommended field set for every entry: `name`, `type`, `description`, plus type-conditional `codes` (when permissible_values), `unit`, `min`, and `max` (when numeric). Spec B is a set of optional columns researchers may adopt independently à la carte: `label`, `multivalued`, `required`, `pattern`, `uri`, `see_also`.
Type vocabulary fixed at 10 researcher-comprehensible values (`string`, `integer`, `decimal`, `boolean`, `date`, `datetime`, `time`, `uri`, `curie`, `permissible_values`). Curated subset of LinkML's built-in types — technical primitives (`ncname`, `jsonpath`, etc.) are excluded.
Codes encoded REDCap-style as `code, label | code, label | ...` with bareword shorthand for value-equals-meaning cases. Comma separator (not `=`) avoids collisions with code values containing operators like `>=` or `<=`.
`none` token is the explicit "not applicable" value for unit/min/max — distinct from an empty cell, which means "the author has not declared this field" and is reported as a conformance issue.
Negative requirements on `description`: must not contain code lists, units, ranges, or example values — those have dedicated fields. Stated as a hard rule; enforcement is a separate content-quality lint.
No document-level metadata in v1. Every productive option (header that breaks CSV, sidecar coupling, filename encoding) has worse trade-offs than dropping it; the spec is intentionally substrate-only.
Files
Conformance modes
Validation
The LinkML schema validates the example YAML cleanly via `linkml-validate`. Type vocabulary violations are caught (verified with a synthetic invalid-type example).
Known LinkML quirks (worth filing upstream)
The `min` and `max` slots use slot-level `any_of` (decimal OR `equals_string: none`) without a top-level `range:`. This is because:
Both reproduce on linkml/linkml-runtime 1.10.0 (latest). A schema comment explains the workaround. Upstream issues to be filed after merge.
Test plan