Scope
Adapter in the linkml-map-driven adapter ecosystem (#202 umbrella). Translates dbGaP variable digest XML files (`.data_dict.xml` and `.var_report.xml`) into the canonical schema-automator data dictionary format from #191.
Sibling of #203 (Frictionless, merged via #205) and #204 (REDCap, open). This adapter is prioritized over #204 because in-progress dm-bip work depends on it.
Companion to dm-bip PR #320, which lands the fetcher layer (cohort definitions, FTP download, local cache, `dm-bip fetch-digests` CLI). #320 currently ships an inline ad-hoc adapter; this issue replaces it with a proper `linkml-map`-based adapter in schema-automator. After this lands, dm-bip's `fetch_digests.py` drops its inline parse + translate + write-canonical logic and calls into schema-automator instead.
Deliverables
Source-format LinkML schema
LinkML schema describing dbGaP variable digest XML:
- `DataTable` (root of data_dict.xml): `id` (pht), `study_id` (phs), `participant_set`, `date_created`, `description`, list of `Variable`.
- `Variable`: `id` (phv), `name`, `description`, `type`, list of `Value` (encoded values).
- `Value`: `code` attribute, label text content.
- `VarReport` (root of var_report.xml): mirror of data_table attrs plus per-variable stats blocks.
- `ReportVariable`: `id`, `var_name`, `calculated_type`, `reported_type`, `description`, `stats` (with empirical min/max/n/nulls plus per-code enum counts).
Trans-spec(s)
- `dbgap_to_dd.transform.yaml` (primary): maps data_dict.xml → canonical DD, lossily collapsing dbGaP's 36-variant type vocabulary (catalogued in the spec-design corpus survey).
- Optional second trans-spec or merge step for incorporating var_report.xml enrichments (empirical min/max for numerics).
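If the var_report enrichment ends up as a Python merge step rather than a second trans-spec, it could look roughly like the sketch below. Everything here is an assumption for illustration: the function name `merge_var_report`, the output keys `empirical_min` / `empirical_max`, the `min` / `max` keys inside `stats`, and the choice to match on the `phv` accession prefix in case the version suffixes differ between the two files.

```python
def merge_var_report(variables, report_variables):
    """Attach empirical min/max from var_report entries to data_dict variables.

    Assumption: both inputs are lists of dicts, matched on the phv accession
    prefix of `id` (the text before the first "."), because the version
    suffixes in the two files may not be identical.
    """
    def accession(var_id):
        return var_id.split(".", 1)[0]

    stats_by_acc = {
        accession(r["id"]): (r.get("stats") or {}) for r in report_variables
    }

    merged = []
    for var in variables:
        stats = stats_by_acc.get(accession(var["id"]), {})
        enriched = dict(var)  # copy so the input is not mutated
        for key in ("min", "max"):
            if stats.get(key) is not None:
                # Hypothetical output key names, not part of the canonical DD.
                enriched[f"empirical_{key}"] = stats[key]
        merged.append(enriched)
    return merged


# Made-up identifiers, for illustration only.
example_variables = [
    {"id": "phv00000001.v1", "name": "AGE"},
    {"id": "phv00000002.v1", "name": "SEX"},
]
example_report = [
    {"id": "phv00000001.v1.p1", "stats": {"min": "21", "max": "84"}},
]
example_merged = merge_var_report(example_variables, example_report)
```

Variables with no matching report entry pass through unchanged, which keeps the merge optional.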
Python wrapper
Thin XML→dict layer that parses the two XML files via `defusedxml`, optionally merges them, and feeds linkml-map. Mirrors the `adapter.py` pattern from the Frictionless adapter.
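A minimal sketch of the XML→dict step for `data_dict.xml`, under stated assumptions: the element and attribute names (`variable`, `value`, `name`, `type`, and so on) are inferred from the schema outline above rather than checked against a real digest, the function name `parse_data_dict` is hypothetical, and the sample identifiers are made up. The handoff to linkml-map is not shown.

```python
try:
    from defusedxml.ElementTree import fromstring
except ImportError:
    # Fallback only so this sketch runs without defusedxml installed;
    # the standard-library parser is not hardened against malicious XML.
    from xml.etree.ElementTree import fromstring


def parse_data_dict(xml_text: str) -> dict:
    """Convert data_dict.xml content into a plain dict."""
    root = fromstring(xml_text)
    variables = []
    for var in root.findall("variable"):
        variables.append({
            "id": var.get("id"),
            "name": var.findtext("name"),
            "description": var.findtext("description"),
            "type": var.findtext("type"),
            # Encoded values: `code` attribute plus label text content.
            "values": [
                {"code": v.get("code"), "label": (v.text or "").strip()}
                for v in var.findall("value")
            ],
        })
    return {
        "id": root.get("id"),
        "study_id": root.get("study_id"),
        "participant_set": root.get("participant_set"),
        "date_created": root.get("date_created"),
        "description": root.findtext("description"),
        "variables": variables,
    }


# Made-up sample document, for illustration only.
SAMPLE = """
<data_table id="pht000001.v1" study_id="phs000001.v1" participant_set="1">
  <description>Example table</description>
  <variable id="phv00000001.v1">
    <name>SMOKER</name>
    <description>Current smoker</description>
    <type>encoded value</type>
    <value code="0">No</value>
    <value code="1">Yes</value>
  </variable>
</data_table>
"""
parsed = parse_data_dict(SAMPLE)
```

Missing attributes or child elements come back as `None` here, which leaves the decision about defaults to the trans-spec.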
CLI
`schemauto adapt-dbgap <data_dict.xml> [--var-report <var_report.xml>] [--output ]`
Naming convention follows dm-bip's `..dd.tsv` pattern when output isn't specified.
Test fixtures
Mirror dm-bip's JHS_Subject `data_dict.xml` + `var_report.xml` pair (and possibly one or two more from the cohort survey done during spec design) into `tests/resources/`.
Mappings carried over from dm-bip PR #320 (worth preserving)
- Type vocabulary mapping: `string` → `string`, `integer` → `integer`, `decimal` → `decimal`, `encoded value` → `permissible_values`, `date` → `date`, `datetime` → `datetime`, `time` → `time`, `boolean` → `boolean`.
- Per-variable URI: `uri: dbgap:` (e.g., `dbgap:phv00124545.v4`) — retains dbGaP variable identity in the canonical DD.
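For reference, the two carried-over mappings can be written as a lookup table and a one-line helper. This is a sketch, not the code from #320: the names `DBGAP_TYPE_MAP`, `map_type`, and `variable_uri` are hypothetical, the case-insensitive lookup is an assumption, and the URI shape is inferred from the `dbgap:phv00124545.v4` example above.

```python
# Type vocabulary carried over from dm-bip PR #320 (dbGaP type -> canonical DD type).
DBGAP_TYPE_MAP = {
    "string": "string",
    "integer": "integer",
    "decimal": "decimal",
    "encoded value": "permissible_values",
    "date": "date",
    "datetime": "datetime",
    "time": "time",
    "boolean": "boolean",
}


def map_type(dbgap_type):
    """Return the canonical type, or None if the dbGaP type is not in the table."""
    return DBGAP_TYPE_MAP.get(dbgap_type.strip().lower())


def variable_uri(phv_id):
    """Build the per-variable URI from a versioned phv identifier."""
    return f"dbgap:{phv_id}"
```

For example, `variable_uri("phv00124545.v4")` yields `dbgap:phv00124545.v4`, and any type outside the eight listed entries returns `None` so the caller can decide how to handle it.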
Things to handle beyond dm-bip's PR #320 (per "best on this pass")
- Composite type strings (`decimal, encoded`, `encoded, string`, etc.) — design decision needed.
- Typos in dbGaP type field (`sting`, `strin`, etc.) — collapse to `string` with a warning.
- Empty type field: fall back to var_report's `calculated_type` if present, else infer the type from whether codes are present.
- Use our codes serialization utility's proper backslash-escaping rather than the lossy character replacement in #320's `_sanitize_label`.
- Use var_report's empirical min/max when available (deferred if too involved).
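The type-cleanup items in this list could be combined into a single pre-mapping step, sketched below. The function name `resolve_type`, the `TYPO_FIXES` table, and the return conventions are all hypothetical. Two behaviors are placeholders rather than decisions: composite types are returned unchanged with a warning (the policy is still open), and an empty type with no fallback and no codes returns `None`.

```python
import warnings

# Known misspellings of dbGaP types; extend as more turn up in the corpus.
TYPO_FIXES = {"sting": "string", "strin": "string"}


def resolve_type(raw_type, calculated_type=None, has_codes=False):
    """Clean a dbGaP type string before it goes through the type mapping."""
    cleaned = (raw_type or "").strip().lower()

    if not cleaned:
        # Empty type: prefer var_report's calculated_type, else infer from codes.
        fallback = (calculated_type or "").strip().lower()
        if fallback:
            return fallback
        return "encoded value" if has_codes else None

    if cleaned in TYPO_FIXES:
        fixed = TYPO_FIXES[cleaned]
        warnings.warn(f"dbGaP type {raw_type!r} treated as {fixed!r}")
        return fixed

    if "," in cleaned:
        # Composite type such as "decimal, encoded": design decision pending,
        # so pass it through and flag it rather than guess.
        warnings.warn(f"composite dbGaP type {raw_type!r} left unresolved")

    return cleaned
```

Using `warnings.warn` keeps the adapter non-fatal on dirty input while still letting callers escalate warnings to errors in tests.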
dm-bip follow-up (not in this issue)
Once this adapter ships, dm-bip's `fetch_digests.py` is simplified to: fetch → call `schemauto.adapters.dbgap.dbgap_to_dd()` → write. Tracked separately on dm-bip side.
Sub-issue of #202.