Commit 36c5280

Add 'Relationship to existing formats' and 'Out of scope' sections (#191)
Position the spec relative to Frictionless Table Schema, REDCap, and SchemaSheets: small extension to the universal common ground (name, description, partial type info), implementing the core of those formats, with adapters planned. Explicitly state SchemaSheets supersession. Reorganize 'Future revisions' to cover only additive in-format extensions; move use cases that need a different artifact entirely (hierarchical data, multi-table relationships, variable versioning, cross-column constraints, domain extensions) to a new 'Out of scope' section.
1 parent 0fb52ae commit 36c5280

1 file changed

Lines changed: 93 additions & 4 deletions

docs/data_dictionary_format.rst

@@ -8,7 +8,8 @@ This is a forward-looking, prescriptive target spec. The audience is a data
88
owner at a *new* study, preparing their data for sharing, before any
99
inconsistent local conventions have set in. Handling of legacy or messy
1010
data dictionaries is a separate concern; those are normalized into this
11-
format by an upstream layer (see :ref:`linkml/schema-automator#200`).
11+
format by an upstream layer (see `linkml/schema-automator#200
12+
<https://github.com/linkml/schema-automator/issues/200>`_).
1213

1314
Goals
1415
-----
@@ -244,19 +245,107 @@ adds coupling, filename encoding that's fragile to parse) has worse
244245
trade-offs than dropping document-level metadata entirely. If a future
245246
revision adds it, the substrate decision will need to be revisited.
246247

248+
Relationship to existing formats
249+
--------------------------------
250+
251+
Across communities, real-world data dictionaries converge on a small core:
252+
**name, description, and (sometimes, partially) type information**. That
253+
common ground is what nearly every author actually writes, regardless of
254+
which format they nominally produce. This spec is a deliberately small
255+
extension of that core — adding the structure needed to validate and
256+
enrich, while staying close to what people already draft in the wild.
257+
258+
The flat-file shape is a deliberate alignment, not an oversight. Every
259+
format below (with the partial exception of XML codebook standards) starts
260+
from row-per-variable, columns-of-attributes. That's how data dictionaries
261+
naturally exist in researcher hands — a spreadsheet, a CSV, a few columns
262+
deep. Adding hierarchy, sidecar files, or required tooling would make
263+
authoring harder *and* make conversion from existing formats harder. We
264+
chose simplicity at this layer in exchange for keeping the
265+
existing-format-to-this-format adapters tractable.
266+
267+
We are implementing the core of the formats listed below. If our spec ever
268+
diverges substantially from that core, that's a signal we've over-rotated
269+
on novelty and should re-examine the divergence rather than ship it.
270+
271+
**Frictionless Table Schema** (data.gov, OpenML, the ``datasets`` package).
272+
Our type vocabulary is a researcher-comprehensible subset of Frictionless's;
273+
``pattern`` and per-column structure align directly. Where we diverge: we
274+
treat enumerated values as a primary type (``permissible_values``) rather
275+
than a string-with-enum-constraint, matching how researchers think about
276+
multiple-choice columns. We intend to ship a Frictionless adapter
277+
(translator in both directions).
278+
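The enum divergence in the Frictionless paragraph can be illustrated with a minimal adapter sketch. It assumes a hypothetical in-memory column record with ``name``, ``description``, ``type``, and ``codes`` keys (these key names are illustrative and not taken from the spec). On the Frictionless side it relies only on Table Schema's ``constraints.enum`` on a ``string`` field. Other types are passed through unchanged, on the assumption that their names coincide.

```python
def to_frictionless(column: dict) -> dict:
    """Translate a (hypothetical) column record into a Table Schema field."""
    field = {"name": column["name"], "description": column.get("description", "")}
    if column["type"] == "permissible_values":
        # Frictionless models enumerations as a constraint on a base type.
        field["type"] = "string"
        field["constraints"] = {"enum": list(column["codes"])}
    else:
        # Assumption: remaining type names are shared with Table Schema.
        field["type"] = column["type"]
    return field


def from_frictionless(field: dict) -> dict:
    """Translate a Table Schema field back into the hypothetical record."""
    column = {"name": field["name"], "description": field.get("description", "")}
    enum = field.get("constraints", {}).get("enum")
    if field.get("type", "string") == "string" and enum is not None:
        # A string with an enum constraint becomes a first-class enumerated type.
        column["type"] = "permissible_values"
        column["codes"] = list(enum)
    else:
        column["type"] = field.get("type", "string")
    return column


smoker = {
    "name": "smoker",
    "description": "Smoking status",
    "type": "permissible_values",
    "codes": ["never", "former", "current"],
}
round_tripped = from_frictionless(to_frictionless(smoker))
```

In this sketch the round trip is lossless for enumerated string columns; a real adapter would also need to decide how to treat enum constraints on non-string fields.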
279+
**REDCap data dictionary** (de facto standard in clinical research). Our
280+
codes encoding (``code, label | code, label | ...`` with bareword
281+
shorthand) is taken directly from REDCap. Our type-as-major-axis approach
282+
mirrors REDCap's Field Type. We diverge by collapsing REDCap's UI-flavored
283+
types (``radio``, ``dropdown``, ``checkbox``) into the single semantic
284+
type ``permissible_values`` — a data dictionary describes data, not the
285+
form widget that produced it. We intend to ship a REDCap adapter.
286+
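As a rough sketch of the codes encoding named in the REDCap paragraph, the hypothetical ``parse_codes`` helper below splits a ``code, label | code, label | ...`` string into a mapping. Two assumptions are made here that the text does not spell out: the "bareword shorthand" is read as an entry with no comma, whose code doubles as its label, and only the first comma in an entry separates code from label. No escaping of ``|`` or ``,`` is attempted.

```python
def parse_codes(spec: str) -> dict[str, str]:
    """Parse 'code, label | code, label | ...' into {code: label}."""
    choices: dict[str, str] = {}
    for item in spec.split("|"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing or doubled separator
        code, sep, label = item.partition(",")
        code = code.strip()
        # Assumed bareword shorthand: no comma means the code is its own label.
        choices[code] = label.strip() if sep else code
    return choices


choices = parse_codes("1, Yes | 2, No | unknown")
```

Codes are kept as strings rather than coerced to integers, so that ``"01"`` and ``"1"`` stay distinct.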
287+
**SchemaSheets** (LinkML community). This format supersedes SchemaSheets
288+
for the data-dictionary use case. Where SchemaSheets requires authors to
289+
work with LinkML conventions (slot/class distinctions, slot URIs, mixin
290+
semantics), we present a researcher-comprehensible flat surface.
291+
SchemaSheets capabilities not directly expressible here will be addressed
292+
during the migration period. A SchemaSheets adapter is part of the
293+
planned tooling.
294+
295+
**DDI / CDISC** (heavyweight institutional standards). These describe a
296+
different scope — full study lifecycle, regulatory submission. Not
297+
displaced; complementary in cases where they apply.
298+
299+
Out of scope
300+
------------
301+
302+
This format intentionally covers a single, flat tabular dataset described
303+
by a single data dictionary. The following use cases need a higher-order
304+
artifact on top of, or alongside, this format, and are explicitly out of
305+
scope:
306+
307+
- **Hierarchical or nested data.** Columns containing JSON-shaped
308+
structures, repeated groups, document trees. Authoring as a flat
309+
row-per-column descriptor doesn't fit. Use a structured data model
310+
(LinkML class definitions, JSON Schema, Avro) for the nesting and
311+
reference this format for leaf-level columns where applicable.
312+
313+
- **Multi-table datasets with relationships.** Several related tables
314+
sharing keys (e.g., dbGaP's phs/pht/phv structure, relational
315+
databases). Each table can have its own data dictionary in this format,
316+
but the relational structure between them — foreign keys, joins, shared
317+
dimensions — is the job of a higher-order schema artifact. A simple
318+
manifest pattern (a list of data dictionaries plus a relations file)
319+
would address this without burdening per-DD authoring.
320+
321+
- **Variable versioning and lineage.** "This variable was renamed from
322+
``old_name`` in v3," "valid from 2020-01 to 2023-06," provenance chains.
323+
Versioning concerns belong at the dataset/schema level, not per-row.
324+
325+
- **Cross-column constraints.** Conditional requireds, dependencies
326+
("field A is required only if field B is X"), value relationships
327+
across columns. The general case needs a real expression language, not
328+
a small format extension.
329+
330+
- **Domain-specific metadata.** Clinical (codeset versions, encounter
331+
types), survey (question wording, branching logic), environmental
332+
(sensor calibration, measurement uncertainty). The format is
333+
intentionally domain-naive. Domain extensions can live in optional
334+
Spec B columns by convention, but defining them is the responsibility
335+
of domain communities, not this spec.
336+
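The manifest pattern mentioned in the multi-table bullet is not defined by this spec. Purely as an illustration, the sketch below assumes a manifest that lists each data dictionary's column names plus a list of ``from``/``to`` relations written as ``table.column``. The structure, the key names, and the ``dangling_relations`` helper are all hypothetical.

```python
# Hypothetical manifest: per-table column names plus cross-table relations.
manifest = {
    "dictionaries": {
        "subjects": ["subject_id", "age"],
        "visits": ["visit_id", "subject_id", "visit_date"],
    },
    "relations": [
        {"from": "visits.subject_id", "to": "subjects.subject_id"},
    ],
}


def dangling_relations(manifest: dict) -> list[dict]:
    """Return relations that point at a column no listed dictionary declares."""
    known = {
        f"{table}.{column}"
        for table, columns in manifest["dictionaries"].items()
        for column in columns
    }
    return [
        relation
        for relation in manifest["relations"]
        if relation["from"] not in known or relation["to"] not in known
    ]


problems = dangling_relations(manifest)
```

The point of the sketch is that relational checks live entirely in the manifest layer, so each per-table data dictionary stays flat.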
247337
Future revisions
248338
----------------
249339

250-
The following are explicitly deferred from v1:
340+
The following are deferred from v1 but may be added as small additive
341+
extensions to *this* format in future revisions:
251342

252343
- Named, reusable enum definitions (e.g., declaring an enum once and
253344
referencing it from multiple ``permissible_values`` columns).
254345
- A ``format`` field for refining types (``string`` + ``format: email``,
255346
date format strings, etc.).
256347
- Escaping for ``|`` and ``,`` in code values.
257-
- Cross-column constraints (conditional requireds, dependencies).
258348
- Document-level metadata and file-level conventions.
259-
- Per-variable provenance, lineage, and versioning.
260349

261350
Examples
262351
--------
