Add dbGaP digest fetcher and adapt-digests pipeline stage#320
Add dbGaP digest fetcher and adapt-digests pipeline stage#320amc-corey-cox wants to merge 6 commits into
Conversation
Introduces src/dm_bip/prepare_study/fetch_digests.py and a CLI command (`dm-bip fetch-digests`) for fetching data_dict.xml + var_report.xml files from dbGaP's public FTP, with local caching and dataclass-based parsers for both file types. Cohort version pins are sourced from upstream NHLBI-BDC-DMC-HV's hv-lint/cohorts.yaml — partial structure brought over ahead of the full hv-lint migration tracked in #312. Refs #204
Adds a `parse-digests` CLI command that reads cached data_dict.xml files for a cohort and writes one TSV per data table in the schema-automator canonical data dictionary format (linkml/schema-automator#201). Outputs land at `output/<cohort>/dd/<phs>.<pht>.dd.tsv` with all Spec A columns plus `uri` from Spec B. dbGaP types are translated to the canonical 10-value vocabulary; encoded values are rendered REDCap-style (`code, label | code, label`); each variable's `uri` carries the dbGaP phv accession as a CURIE for traceability. `unit`, `min`, and `max` are emitted empty pending richer var_report parsing. Refs #204
|
Follow-up commit (`68731b0`) adds the consumable endpoint we discussed: `dm-bip parse-digests `. What's new:
Why canonical DD output: gives the rest of the pipeline (#307 schema enrichment, eventually schema-automator's #192 ingestion) a real artifact to consume on disk, not just in-memory dataclasses. Additional smoke test for the test plan:
#204 still stays open: this gives us a usable endpoint and probably enough for a first pass at #307, but BDC bucket fetching, the legacy-format adapter for `prepare_metadata.py`, and any "richer corpus" work remain. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #320 +/- ##
==========================================
+ Coverage 75.52% 79.32% +3.80%
==========================================
Files 9 10 +1
Lines 625 740 +115
==========================================
+ Hits 472 587 +115
Misses 153 153 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds a foundational “fetch + parse” layer for dbGaP variable digest files (supporting #204) by introducing a new digest module, new CLI commands to fetch/parse into a canonical TSV shape, and fixtures/tests to validate the XML parsers.
Changes:
- Added
prepare_study.fetch_digestsmodule with cohort registry loading (from upstreamcohorts.yaml), XML parsing fordata_dict.xml/var_report.xml, and cached fetching from NCBI. - Added new CLI commands
dm-bip fetch-digestsanddm-bip parse-digests. - Added parser unit tests plus real XML fixtures; added
defusedxmldependency and ignored.dbgap-cache/.
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/dm_bip/prepare_study/fetch_digests.py |
Implements cohort loading, digest fetching with local cache, XML parsers, and canonical TSV writer. |
src/dm_bip/cli.py |
Exposes fetch-digests and parse-digests commands via Typer. |
tests/unit/test_fetch_digests.py |
Adds unit tests covering XML parsing and canonical TSV output. |
tests/input/dbgap_digests/JHS_Subject.data_dict.xml |
Real-world fixture for data_dict.xml parser tests. |
tests/input/dbgap_digests/JHS_Subject.var_report.xml |
Real-world fixture for var_report.xml parser tests. |
pyproject.toml |
Adds direct dependency on defusedxml. |
uv.lock |
Locks defusedxml into the environment. |
.gitignore |
Ignores the local .dbgap-cache/ directory created by the new CLI. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
csiege
left a comment
There was a problem hiding this comment.
Thanks for the useful foundation here. I found three correctness issues that look worth addressing before this becomes a dependable canonical-DD path:
-
Incorrect type translation for many real dbGaP dictionaries
src/dm_bip/prepare_study/fetch_digests.pyonly maps a small set of exact dbGaP type strings, then silently defaults unknown types tostring. In the local dbGaP cache, common real values includenumeric,encoded,decimal, encoded,num,integer, encoded,continuous decimal,enumerated integer, andencoded values. Those would all becomestring, soparse-digestswould emit materially wrong canonical data dictionaries: numeric fields lose numeric type, encoded fields losepermissible_values, and downstream schema generation would be incorrect. -
Codes encoding mutates labels instead of following the canonical-DD escaping rules
_sanitize_label()replaces commas with semicolons and pipes with slashes inside labels. The referenced schema-automator spec says labels may contain unescaped commas after the first separator comma, and literal pipe/backslash/comma in code values should be escaped with backslashes. The current implementation changes source metadata content, and the new test locks in that incorrect behavior by asserting comma removal. Labels likeBlack, non-Hispanicshould stay as written, not becomeBlack; non-Hispanic. -
Generated numeric rows are knowingly non-conformant for
unit,min, andmax_dd_row()always emits emptyunit,min, andmax. The schema-automator canonical spec distinguishes empty cells from the explicitnonetoken, and says empty cells are conformance issues when these fields apply. This may be acceptable as an interim draft output, but the CLI description says it converts digests into canonical-DD TSVs, so consumers may reasonably expect valid canonical rows. At minimum, numeric fields should probably emitnonewhere genuinely unitless/unbounded, or the command should make clear that the output is incomplete/non-strict.
…mator Earlier commits on this branch shipped an inline ad-hoc adapter: XML parsers, a type-vocab translator, and a canonical-DD TSV writer. That work now lives upstream in linkml/schema-automator#207 (merged), via a LinkML schema + declarative linkml-map trans-spec + schemauto adapt-dbgap CLI. dm-bip now provides: - fetch_digests.py: cohort registry loader (sourced from upstream NHLBI-BDC-DMC-HV/hv-lint/cohorts.yaml), FTP fetcher with local cache. - Pair discovery + digest_pairs.mk emission, because dbGaP's data_dict.xml and var_report.xml filenames don't share a stem (var_report has an extra .p<participant_set> segment). - pipeline.Makefile targets fetch-digests and adapt-digests. adapt-digests uses .SECONDEXPANSION and the included digest_pairs.mk to dispatch one schemauto adapt-dbgap call per pair. Dropped: parse_data_dict, parse_var_report, _translate_type, _DBGAP_TYPE_MAP, _sanitize_label, _encode_codes, _dd_row, write_canonical_dd, _dd_output_filename, parse_cached_digests, DD_TSV_COLUMNS, plus their dataclasses and the parse-digests CLI command. defusedxml is no longer a direct dependency. Still pending: a schema-automator release that includes the dbgap adapter, after which the pin in pyproject.toml gets bumped. Refs #204
- Tighten _DIGEST_FILENAME_RE to reject path separators in scraped hrefs (defense-in-depth against directory traversal from the FTP listing). - Add tests for list_digest_files, fetch_digests (cached vs refresh), pair_digests (with and without .p<N> participant-set segment, unmatched case logs a warning), and write_pairs_mk output format.
Caught by CI's `ruff format --check`: a multi-line implicit string concat in test_returns_sorted_unique_filenames fits in 120 chars on one line.
Summary
Adds the dm-bip side of dbGaP variable digest ingestion: a fetcher that pulls paired
*.data_dict.xml+*.var_report.xmlfiles from dbGaP's public FTP into a local cache, plus Make targets that drive per-pair adaptation via SchemaAutomator'sschemauto adapt-dbgapCLI.After fetch + adapt, downstream stages (
schema-create,validate-data,map-data) work against the canonical-DD TSVs without further dm-bip changes.Refs #204.
Closes #204.
Architecture
What's in this PR
src/dm_bip/prepare_study/fetch_digests.py— cohort registry loader (sourced from upstreamRTIInternational/NHLBI-BDC-DMC-HV/hv-lint/cohorts.yaml), FTP fetcher with local cache, directory-listing scraper. No parsing — that lives in schema-automator.dm-bip fetch-digests <cohort-key>CLI:--list,--refresh,--cache-dir.pipeline.Makefile:fetch-digests— populates the cache forDM_COHORT.adapt-digests— pattern rule that callsschemauto adapt-dbgapper pair, writesoutput/<cohort>/dd/<phs>.<pht>.dd.tsv.History — how we got here
Earlier commits on this branch shipped an inline ad-hoc adapter: parsers for
data_dict.xml/var_report.xml, a translator from dbGaP's type vocabulary to schema-automator's canonical 10-type system, and a TSV writer for the canonical DD format. During review we decided that work belonged upstream in the linkml-map-driven adapter ecosystem rather than as bespoke dm-bip code.That parser/translator/writer chunk now lives in linkml/schema-automator#207 (closes linkml/schema-automator#206), which:
linkml-maptrans-spec doing the type-vocab translation — more comprehensive than what we had (handles dbGaP typos, composite types, empty-type fallback tocalculated_type).schemauto adapt-dbgapCLI consumed by our new Make target.This dm-bip PR has been rewritten to be just the fetcher + orchestration glue. The fetch + Make split also matches the rest of
pipeline.Makefile(each stage is a Make target with explicit inputs/outputs).Blocked on
schema-automatorpin inpyproject.tomlcan be bumped.Things deliberately deferred
#204; tracked separately.adapt-digestsidempotency under repeated runs — depends on schema-automator's output being byte-stable, which in turn depends onlinkml-mapnot introducing nondeterministic ordering. Out of scope here; revisit if Make rebuild churn becomes a problem.Test plan
make test— fetcher-layer unit tests pass.uv run ruff check .— lint clean.dm-bip fetch-digests --list— fetches upstreamcohorts.yaml, prints all cohorts.make fetch-digests DM_COHORT=jhs— populates the cache.make adapt-digests DM_COHORT=jhs— produces TSVs underoutput/jhs/dd/.