dbGaP variable digest adapter for canonical DD format #206

@amc-corey-cox

Scope

Adapter in the linkml-map-driven adapter ecosystem (#202 umbrella). Translates dbGaP variable digest XML files (`.data_dict.xml` and `.var_report.xml`) into the canonical schema-automator data dictionary format from #191.

Sibling of #203 (Frictionless, merged via #205) and #204 (REDCap, open). This adapter is prioritized over #204 because in-progress dm-bip work depends on it.

Companion to dm-bip PR #320, which lands the fetcher layer (cohort definitions, FTP download, local cache, `dm-bip fetch-digests` CLI). #320 currently ships an inline ad-hoc adapter; this issue replaces it with a proper `linkml-map`-based adapter in schema-automator. After this lands, dm-bip's `fetch_digests.py` drops its inline parse, translate, and write-canonical logic and calls into schema-automator instead.

Deliverables

Source-format LinkML schema

LinkML schema describing dbGaP variable digest XML:

  • `DataTable` (root of data_dict.xml): `id` (pht), `study_id` (phs), `participant_set`, `date_created`, `description`, list of `Variable`.
  • `Variable`: `id` (phv), `name`, `description`, `type`, list of `Value` (encoded values).
  • `Value`: `code` attribute, label text content.
  • `VarReport` (root of var_report.xml): mirrors the `DataTable` attributes and adds per-variable stats blocks.
  • `ReportVariable`: `id`, `var_name`, `calculated_type`, `reported_type`, `description`, `stats` (with empirical min/max/n/nulls plus per-code enum counts).
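
To make the shape of the data_dict.xml side concrete, here is a rough Python sketch of the three classes listed above. It is a stand-in for discussion only, not the LinkML schema itself; the field names follow the bullets, while the optionality of each field and the list attribute names (`values`, `variables`) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Value:
    code: str   # the `code` attribute
    label: str  # the element's text content


@dataclass
class Variable:
    id: str                            # phv accession
    name: str
    description: Optional[str] = None  # optionality is an assumption
    type: Optional[str] = None         # may be empty in real files
    values: List[Value] = field(default_factory=list)


@dataclass
class DataTable:
    id: str                            # pht accession
    study_id: str                      # phs accession
    participant_set: Optional[str] = None
    date_created: Optional[str] = None
    description: Optional[str] = None
    variables: List[Variable] = field(default_factory=list)
```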

Trans-spec(s)

  • `dbgap_to_dd.transform.yaml` — primary. Maps data_dict.xml → canonical DD; lossy collapses for dbGaP's 36-variant type vocabulary (catalogued in the spec-design corpus survey).
  • Optional second trans-spec or merge step for incorporating var_report.xml enrichments (empirical min/max for numerics).

Python wrapper

Thin XML→dict layer that parses the two XML files via `defusedxml`, optionally merges them, and feeds linkml-map. Mirrors the `adapter.py` pattern from the Frictionless adapter.
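
A minimal sketch of what the XML→dict step for data_dict.xml could look like. The element and attribute names (`data_table`, `variable`, `value`, `code`) are assumptions based on the schema outline above, the accessions in `SAMPLE` are made up, and `parse_data_dict` is a hypothetical name. The fallback to the standard-library parser exists only so the sketch runs without `defusedxml` installed; the real wrapper would require `defusedxml`.

```python
try:
    from defusedxml.ElementTree import fromstring
except ImportError:
    # Sketch-only fallback; the actual adapter should depend on defusedxml.
    from xml.etree.ElementTree import fromstring

# Made-up example document; real digests carry many more variables.
SAMPLE = """
<data_table id="pht000001.v1" study_id="phs000001.v1" participant_set="1">
  <description>Example subject table</description>
  <variable id="phv00000001.v1">
    <name>SEX</name>
    <description>Sex of participant</description>
    <type>encoded value</type>
    <value code="1">Male</value>
    <value code="2">Female</value>
  </variable>
</data_table>
"""


def parse_data_dict(xml_text):
    """Parse data_dict.xml text into a plain dict ready for linkml-map."""
    root = fromstring(xml_text)
    return {
        "id": root.get("id"),
        "study_id": root.get("study_id"),
        "participant_set": root.get("participant_set"),
        "date_created": root.get("date_created"),
        "description": root.findtext("description"),
        "variables": [
            {
                "id": var.get("id"),
                "name": var.findtext("name"),
                "description": var.findtext("description"),
                "type": var.findtext("type"),
                "values": [
                    {"code": v.get("code"), "label": (v.text or "").strip()}
                    for v in var.findall("value")
                ],
            }
            for var in root.findall("variable")
        ],
    }


table = parse_data_dict(SAMPLE)
```

Keeping this layer free of any type or label interpretation leaves all mapping decisions in the trans-spec, which matches the "thin" intent described above.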

CLI

`schemauto adapt-dbgap <data_dict.xml> [--var-report <var_report.xml>] [--output ]`

Naming convention follows dm-bip's `..dd.tsv` pattern when output isn't specified.

Test fixtures

Mirror dm-bip's JHS_Subject `data_dict.xml` + `var_report.xml` pair (and possibly one or two more from the cohort survey done during spec design) into `tests/resources/`.

Mappings carried over from dm-bip PR #320 (worth preserving)

  • Type vocabulary mapping: `string` → `string`, `integer` → `integer`, `decimal` → `decimal`, `encoded value` → `permissible_values`, `date` → `date`, `datetime` → `datetime`, `time` → `time`, `boolean` → `boolean`.
  • Per-variable URI: `uri: dbgap:` (e.g., `dbgap:phv00124545.v4`) — retains dbGaP variable identity in the canonical DD.
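
For reference, the two carried-over mappings could be written in Python roughly as follows. The names `DBGAP_TYPE_MAP` and `variable_uri` are hypothetical, and the URI shape is inferred from the `dbgap:phv00124545.v4` example only; in the final adapter these would presumably live in the trans-spec rather than in Python.

```python
# Type vocabulary carried over from dm-bip PR #320 (dbGaP type -> canonical DD type).
DBGAP_TYPE_MAP = {
    "string": "string",
    "integer": "integer",
    "decimal": "decimal",
    "encoded value": "permissible_values",
    "date": "date",
    "datetime": "datetime",
    "time": "time",
    "boolean": "boolean",
}


def variable_uri(phv_id):
    """Build the per-variable URI from a phv accession (shape assumed from the example)."""
    return f"dbgap:{phv_id}"
```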

Things to handle beyond dm-bip's PR #320 (per "best on this pass")

  • Composite type strings (`decimal, encoded`, `encoded, string`, etc.) — design decision needed.
  • Typos in dbGaP type field (`sting`, `strin`, etc.) — collapse to `string` with a warning.
  • Empty `` — fall back to var_report's `calculated_type` if present, else infer from codes presence.
  • Use our codes serialization utility's proper backslash-escaping rather than the lossy character replacement in #320's `_sanitize_label`.
  • Use var_report's empirical min/max when available (deferred if too involved).
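
One possible shape for the type-resolution rules in the list above, as a sketch. `resolve_type` and `TYPO_FIXES` are hypothetical names; only the two typos named here are covered; composite strings pass through unchanged because that design decision is still open; and returning `string` when there are no codes and nothing else to go on is an assumption, not something this issue specifies.

```python
import warnings

# Only the typos named in this issue; the real list would come from the corpus survey.
TYPO_FIXES = {"sting": "string", "strin": "string"}


def resolve_type(declared, calculated_type=None, has_codes=False):
    """Resolve a dbGaP type string before it reaches the type vocabulary mapping."""
    raw = (declared or "").strip().lower()
    if raw in TYPO_FIXES:
        fixed = TYPO_FIXES[raw]
        warnings.warn(f"dbGaP type {declared!r} looks like a typo; using {fixed!r}")
        return fixed
    if raw:
        # Includes composite strings such as "decimal, encoded" (left as-is for now).
        return raw
    if calculated_type:
        # Empty declared type: fall back to var_report's calculated_type.
        return calculated_type.strip().lower()
    # Last resort: infer from whether the variable has encoded values.
    return "encoded value" if has_codes else "string"
```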

dm-bip follow-up (not in this issue)

Once this adapter ships, dm-bip's `fetch_digests.py` is simplified to: fetch → call `schemauto.adapters.dbgap.dbgap_to_dd()` → write. Tracked separately on dm-bip side.

Sub-issue of #202.
