@@ -8,7 +8,8 @@ This is a forward-looking, prescriptive target spec. The audience is a data
88owner at a *new * study, preparing their data for sharing, before any
99inconsistent local conventions have set in. Handling of legacy or messy
1010data dictionaries is a separate concern; those are normalized into this
11- format by an upstream layer (see :ref: `linkml/schema-automator#200 `).
11+ format by an upstream layer (see `linkml/schema-automator#200
12+ <https://github.com/linkml/schema-automator/issues/200> `_).
1213
1314Goals
1415-----
@@ -244,19 +245,107 @@ adds coupling, filename encoding that's fragile to parse) has worse
244245trade-offs than dropping document-level metadata entirely. If a future
245246revision adds it, the substrate decision will need to be revisited.
246247
248+ Relationship to existing formats
249+ --------------------------------
250+
251+ Across communities, real-world data dictionaries converge on a small core:
252+ **name, description, and (sometimes, partially) type information **. That
253+ common ground is what nearly every author actually writes, regardless of
254+ which format they nominally produce. This spec is a deliberately small
255+ extension of that core — adding the structure needed to validate and
256+ enrich, while staying close to what people already draft in the wild.
257+
258+ The flat-file shape is a deliberate alignment, not an oversight. Every
259+ format below (with the partial exception of XML codebook standards) starts
260+ from row-per-variable, columns-of-attributes. That's how data dictionaries
261+ naturally exist in researcher hands — a spreadsheet, a CSV, a few columns
262+ deep. Adding hierarchy, sidecar files, or required tooling would make
263+ authoring harder *and * make conversion from existing formats harder. We
264+ chose simplicity at this layer in exchange for keeping the
265+ existing-format-to-this-format adapters tractable.
266+
267+ We are implementing the core of the formats listed below. If our spec ever
268+ diverges substantially from that core, that's a signal we've over-rotated
269+ on novelty and should re-examine the divergence rather than ship it.
270+
271+ **Frictionless Table Schema ** (data.gov, OpenML, the ``datasets `` package).
272+ Our type vocabulary is a researcher-comprehensible subset of Frictionless's;
273+ ``pattern `` and per-column structure align directly. Where we diverge: we
274+ treat enumerated values as a primary type (``permissible_values ``) rather
275+ than a string-with-enum-constraint, matching how researchers think about
276+ multiple-choice columns. We intend to ship a Frictionless adapter
277+ (translator in both directions).
278+
279+ **REDCap data dictionary ** (de facto standard in clinical research). Our
280+ codes encoding (``code, label | code, label | ... `` with bareword
281+ shorthand) is taken directly from REDCap. Our type-as-major-axis approach
282+ mirrors REDCap's Field Type. We diverge by collapsing REDCap's UI-flavored
283+ types (``radio ``, ``dropdown ``, ``checkbox ``) into the single semantic
284+ type ``permissible_values `` — a data dictionary describes data, not the
285+ form widget that produced it. We intend to ship a REDCap adapter.
286+
287+ **SchemaSheets ** (LinkML community). This format supersedes SchemaSheets
288+ for the data-dictionary use case. Where SchemaSheets requires authors to
289+ work with LinkML conventions (slot/class distinctions, slot URIs, mixin
290+ semantics), we present a researcher-comprehensible flat surface.
291+ SchemaSheets capabilities not directly expressible here will be addressed
292+ during the migration period. A SchemaSheets adapter is part of the
293+ planned tooling.
294+
295+ **DDI / CDISC ** (heavyweight institutional standards). These describe a
296+ different scope — full study lifecycle, regulatory submission. Not
297+ displaced; complementary in cases where they apply.
298+
299+ Out of scope
300+ ------------
301+
302+ This format intentionally covers a single, flat tabular dataset described
303+ by a single data dictionary. The following use cases need a higher-order
304+ artifact on top of, or alongside, this format, and are explicitly out of
305+ scope:
306+
307+ - **Hierarchical or nested data. ** Columns containing JSON-shaped
308+ structures, repeated groups, document trees. Authoring as a flat
309+ row-per-column descriptor doesn't fit. Use a structured data model
310+ (LinkML class definitions, JSON Schema, Avro) for the nesting and
311+ reference this format for leaf-level columns where applicable.
312+
313+ - **Multi-table datasets with relationships. ** Several related tables
314+ sharing keys (e.g., dbGaP's phs/pht/phv structure, relational
315+ databases). Each table can have its own data dictionary in this format,
316+ but the relational structure between them — foreign keys, joins, shared
317+ dimensions — is the job of a higher-order schema artifact. A simple
318+ manifest pattern (a list of data dictionaries plus a relations file)
319+ would address this without burdening per-DD authoring.
320+
321+ - **Variable versioning and lineage. ** "This variable was renamed from
322+ ``old_name `` in v3," "valid from 2020-01 to 2023-06," provenance chains.
323+ Versioning concerns belong at the dataset/schema level, not per-row.
324+
325+ - **Cross-column constraints. ** Conditional requireds, dependencies
326+ ("field A is required only if field B is X"), value relationships
327+ across columns. The general case needs a real expression language, not
328+ a small format extension.
329+
330+ - **Domain-specific metadata. ** Clinical (codeset versions, encounter
331+ types), survey (question wording, branching logic), environmental
332+ (sensor calibration, measurement uncertainty). The format is
333+ intentionally domain-naive. Domain extensions can live in optional
334+ Spec B columns by convention, but defining them is the responsibility
335+ of domain communities, not this spec.
336+
247337Future revisions
248338----------------
249339
250- The following are explicitly deferred from v1:
340+ The following are deferred from v1 but may be added as small additive
341+ extensions to *this * format in future revisions:
251342
252343- Named, reusable enum definitions (e.g., declaring an enum once and
253344 referencing it from multiple ``permissible_values `` columns).
254345- A ``format `` field for refining types (``string `` + ``format: email ``,
255346 date format strings, etc.).
256347- Escaping for ``| `` and ``, `` in code values.
257- - Cross-column constraints (conditional requireds, dependencies).
258348- Document-level metadata and file-level conventions.
259- - Per-variable provenance, lineage, and versioning.
260349
261350Examples
262351--------
0 commit comments