Add dbGaP digest fetcher and adapt-digests pipeline stage by amc-corey-cox · Pull Request #320 · linkml/dm-bip

amc-corey-cox · 2026-05-04T20:43:11Z

Summary

Adds the dm-bip side of dbGaP variable digest ingestion: a fetcher that pulls paired *.data_dict.xml + *.var_report.xml files from dbGaP's public FTP into a local cache, plus Make targets that drive per-pair adaptation via SchemaAutomator's schemauto adapt-dbgap CLI.

After fetch + adapt, downstream stages (schema-create, validate-data, map-data) work against the canonical-DD TSVs without further dm-bip changes.

Refs #204.

Architecture

[upstream cohorts.yaml] → load_cohorts (Python)
                              ↓
[dbGaP FTP] ────fetch───→ .dbgap-cache/<cohort>/<phs.v>/pheno_variable_summaries/
                              │  paired *.data_dict.xml + *.var_report.xml
                              ↓
                 schemauto adapt-dbgap  (per pair, via Make)
                              ↓
                  output/<cohort>/dd/<phs>.<pht>.dd.tsv
                              ↓
                 schema-create / map-data / ...

What's in this PR

src/dm_bip/prepare_study/fetch_digests.py — cohort registry loader (sourced from upstream RTIInternational/NHLBI-BDC-DMC-HV/hv-lint/cohorts.yaml), FTP fetcher with local cache, directory-listing scraper. No parsing — that lives in schema-automator.
dm-bip fetch-digests <cohort-key> CLI: --list, --refresh, --cache-dir.
New Make targets in pipeline.Makefile:
- fetch-digests — populates the cache for DM_COHORT.
- adapt-digests — pattern rule that calls schemauto adapt-dbgap per pair, writes output/<cohort>/dd/<phs>.<pht>.dd.tsv.
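The cache behaviour behind `fetch-digests` and `--refresh` can be pictured with a minimal sketch. The names `cached_path`, `fetch_one`, and the injected `download` callable are hypothetical and are not the functions in `fetch_digests.py`; only the directory layout is taken from the Architecture diagram above.

```python
from pathlib import Path
from typing import Callable


def cached_path(cache_dir: Path, cohort: str, phs_version: str, filename: str) -> Path:
    """Build the cache location shown in the Architecture diagram."""
    return cache_dir / cohort / phs_version / "pheno_variable_summaries" / filename


def fetch_one(dest: Path, download: Callable[[], bytes], refresh: bool = False) -> bool:
    """Write dest from download() unless a cached copy exists.

    Returns True when a download happened, False on a cache hit.
    """
    if dest.exists() and not refresh:
        return False  # cache hit: skip the network entirely
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(download())
    return True
```

Injecting `download` keeps the cache decision testable without touching the FTP server; a real fetcher would pass a function that retrieves the file from dbGaP.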

History — how we got here

Earlier commits on this branch shipped an inline ad-hoc adapter: parsers for data_dict.xml / var_report.xml, a translator from dbGaP's type vocabulary to schema-automator's canonical 10-type system, and a TSV writer for the canonical DD format. During review we decided that work belonged upstream in the linkml-map-driven adapter ecosystem rather than as bespoke dm-bip code.

That parser/translator/writer chunk now lives in linkml/schema-automator#207 (closes linkml/schema-automator#206), which:

Adds a LinkML schema describing the dbGaP digest XML form.
Adds a declarative linkml-map trans-spec doing the type-vocab translation — more comprehensive than what we had (handles dbGaP typos, composite types, empty-type fallback to calculated_type).
Adds the schemauto adapt-dbgap CLI consumed by our new Make target.

This dm-bip PR has been rewritten to be just the fetcher + orchestration glue. The fetch + Make split also matches the rest of pipeline.Makefile (each stage is a Make target with explicit inputs/outputs).

Blocked on

linkml/schema-automator#207 ("dbGaP variable digest adapter (closes #206)") merging.
A schema-automator release that includes the dbgap adapter, so the schema-automator pin in pyproject.toml can be bumped.

Things deliberately deferred

BDC bucket fetching — auth complexity; deferred.
Variable corpus assembly — the downstream consumer of #204; tracked separately.
adapt-digests idempotency under repeated runs — depends on schema-automator's output being byte-stable, which in turn depends on linkml-map not introducing nondeterministic ordering. Out of scope here; revisit if Make rebuild churn becomes a problem.

Test plan

make test — fetcher-layer unit tests pass.
uv run ruff check . — lint clean.
dm-bip fetch-digests --list — fetches upstream cohorts.yaml, prints all cohorts.
make fetch-digests DM_COHORT=jhs — populates the cache.
After SA release: make adapt-digests DM_COHORT=jhs — produces TSVs under output/jhs/dd/.

Introduces src/dm_bip/prepare_study/fetch_digests.py and a CLI command (`dm-bip fetch-digests`) for fetching data_dict.xml + var_report.xml files from dbGaP's public FTP, with local caching and dataclass-based parsers for both file types. Cohort version pins are sourced from upstream NHLBI-BDC-DMC-HV's hv-lint/cohorts.yaml — partial structure brought over ahead of the full hv-lint migration tracked in #312. Refs #204

Adds a `parse-digests` CLI command that reads cached data_dict.xml files for a cohort and writes one TSV per data table in the schema-automator canonical data dictionary format (linkml/schema-automator#201). Outputs land at `output/<cohort>/dd/<phs>.<pht>.dd.tsv` with all Spec A columns plus `uri` from Spec B. dbGaP types are translated to the canonical 10-value vocabulary; encoded values are rendered REDCap-style (`code, label | code, label`); each variable's `uri` carries the dbGaP phv accession as a CURIE for traceability. `unit`, `min`, and `max` are emitted empty pending richer var_report parsing. Refs #204

amc-corey-cox · 2026-05-05T17:51:43Z

Follow-up commit (`68731b0`) adds the consumable endpoint we discussed: `dm-bip parse-digests <cohort-key>`.

What's new:

- Reads cached `data_dict.xml` files (output of `fetch-digests`)
- Writes one TSV per data table in schema-automator canonical DD format: `output/<cohort>/dd/<phs>.<pht>.dd.tsv`
- All Spec A columns (`name`, `type`, `description`, `codes`, `unit`, `min`, `max`) plus `uri` from Spec B
- dbGaP types translated to canonical vocabulary (`String`/`string` → `string`, `encoded value` → `permissible_values`, etc.); unknowns default to `string`
- Encoded values rendered REDCap-style (`code, label | code, label`) with label sanitization for separator characters
- `uri` column carries the dbGaP phv accession as a `dbgap:` CURIE for traceability
- `unit`, `min`, `max` emitted empty for now (deferred until richer var_report parsing)
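As a rough illustration of the output shape described above, the following sketch writes rows in that column order with Python's `csv` module. `DD_COLUMNS` and `write_dd_tsv` are hypothetical names, not the code from this commit, and the sketch says nothing about how the real writer quotes or escapes cells.

```python
import csv
from typing import TextIO

# Column order as listed above: the Spec A columns plus `uri` from Spec B.
DD_COLUMNS = ["name", "type", "description", "codes", "unit", "min", "max", "uri"]


def write_dd_tsv(rows: list[dict[str, str]], out: TextIO) -> None:
    """Write rows as a tab-separated table; fields missing from a row become empty cells."""
    writer = csv.DictWriter(
        out,
        fieldnames=DD_COLUMNS,
        delimiter="\t",
        restval="",  # unit/min/max are left empty when a row does not provide them
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
```

Passing a row that only sets `name` and `type` yields a line with six trailing empty cells, which matches the "emitted empty for now" behaviour for `unit`, `min`, and `max`.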

Why canonical DD output: gives the rest of the pipeline (#307 schema enrichment, eventually schema-automator's #192 ingestion) a real artifact to consume on disk, not just in-memory dataclasses.

Additional smoke test for the test plan:

`dm-bip parse-digests jhs` — produces `output/jhs/dd/phs000286.pht*.dd.tsv` files; spot-check one TSV opens cleanly in a spreadsheet/editor

#204 still stays open: this gives us a usable endpoint and probably enough for a first pass at #307, but BDC bucket fetching, the legacy-format adapter for `prepare_metadata.py`, and any "richer corpus" work remain.

codecov-commenter · 2026-05-06T17:08:03Z

Codecov Report

❌ Patch coverage is 76.31579% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.51%. Comparing base (5486f82) to head (d8d7756).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/dm_bip/cli.py | 0.00% | 18 Missing ⚠️ |
| src/dm_bip/prepare_study/fetch_digests.py | 90.62% | 9 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
- Coverage   83.18%   82.51%   -0.68%     
==========================================
  Files          13       14       +1     
  Lines        1041     1155     +114     
==========================================
+ Hits          866      953      +87     
- Misses        175      202      +27

Copilot

Pull request overview

Adds a foundational “fetch + parse” layer for dbGaP variable digest files (supporting #204) by introducing a new digest module, new CLI commands to fetch/parse into a canonical TSV shape, and fixtures/tests to validate the XML parsers.

Changes:

Added prepare_study.fetch_digests module with cohort registry loading (from upstream cohorts.yaml), XML parsing for data_dict.xml / var_report.xml, and cached fetching from NCBI.
Added new CLI commands dm-bip fetch-digests and dm-bip parse-digests.
Added parser unit tests plus real XML fixtures; added defusedxml dependency and ignored .dbgap-cache/.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `src/dm_bip/prepare_study/fetch_digests.py` | Implements cohort loading, digest fetching with local cache, XML parsers, and canonical TSV writer. |
| `src/dm_bip/cli.py` | Exposes `fetch-digests` and `parse-digests` commands via Typer. |
| `tests/unit/test_fetch_digests.py` | Adds unit tests covering XML parsing and canonical TSV output. |
| `tests/input/dbgap_digests/JHS_Subject.data_dict.xml` | Real-world fixture for `data_dict.xml` parser tests. |
| `tests/input/dbgap_digests/JHS_Subject.var_report.xml` | Real-world fixture for `var_report.xml` parser tests. |
| `pyproject.toml` | Adds direct dependency on `defusedxml`. |
| `uv.lock` | Locks `defusedxml` into the environment. |
| `.gitignore` | Ignores the local `.dbgap-cache/` directory created by the new CLI. |

csiege

Thanks for the useful foundation here. I found three correctness issues that look worth addressing before this becomes a dependable canonical-DD path:

1. Incorrect type translation for many real dbGaP dictionaries

   `src/dm_bip/prepare_study/fetch_digests.py` only maps a small set of exact dbGaP type strings, then silently defaults unknown types to `string`. In the local dbGaP cache, common real values include numeric, encoded, decimal, encoded, num, integer, encoded, continuous decimal, enumerated integer, and encoded values. Those would all become `string`, so `parse-digests` would emit materially wrong canonical data dictionaries: numeric fields lose numeric type, encoded fields lose `permissible_values`, and downstream schema generation would be incorrect.

2. Codes encoding mutates labels instead of following the canonical-DD escaping rules

   `_sanitize_label()` replaces commas with semicolons and pipes with slashes inside labels. The referenced schema-automator spec says labels may contain unescaped commas after the first separator comma, and literal pipe/backslash/comma in code values should be escaped with backslashes. The current implementation changes source metadata content, and the new test locks in that incorrect behavior by asserting comma removal. Labels like `Black, non-Hispanic` should stay as written, not become `Black; non-Hispanic`.

3. Generated numeric rows are knowingly non-conformant for `unit`, `min`, and `max`

   `_dd_row()` always emits empty `unit`, `min`, and `max`. The schema-automator canonical spec distinguishes empty cells from the explicit `none` token, and says empty cells are conformance issues when these fields apply. This may be acceptable as an interim draft output, but the CLI description says it converts digests into canonical-DD TSVs, so consumers may reasonably expect valid canonical rows. At minimum, numeric fields should probably emit `none` where genuinely unitless/unbounded, or the command should make clear that the output is incomplete/non-strict.
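To make the second point concrete, here is a minimal sketch of escaping instead of rewriting labels. It reflects one reading of the rules quoted above, not the spec text itself: codes escape backslash, comma, and pipe; labels keep their commas and escape only backslash and pipe, which is an assumption on my part. `encode_codes` and `_escape` are hypothetical names.

```python
def _escape(text: str, specials: str) -> str:
    """Prefix a backslash to every backslash and to every character listed in `specials`."""
    out = []
    for ch in text:
        if ch == "\\" or ch in specials:
            out.append("\\")
        out.append(ch)
    return "".join(out)


def encode_codes(pairs: list[tuple[str, str]]) -> str:
    """Render (code, label) pairs as `code, label | code, label` without altering label text."""
    entries = []
    for code, label in pairs:
        # Codes escape comma and pipe; labels keep commas and escape only the pipe (assumption).
        entries.append(f"{_escape(code, ',|')}, {_escape(label, '|')}")
    return " | ".join(entries)
```

With this approach `("1", "Black, non-Hispanic")` renders as `1, Black, non-Hispanic`, so the source label survives unchanged.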

…mator

Earlier commits on this branch shipped an inline ad-hoc adapter: XML parsers, a type-vocab translator, and a canonical-DD TSV writer. That work now lives upstream in linkml/schema-automator#207 (merged), via a LinkML schema + declarative linkml-map trans-spec + schemauto adapt-dbgap CLI.

dm-bip now provides:

- fetch_digests.py: cohort registry loader (sourced from upstream NHLBI-BDC-DMC-HV/hv-lint/cohorts.yaml), FTP fetcher with local cache.
- Pair discovery + digest_pairs.mk emission, because dbGaP's data_dict.xml and var_report.xml filenames don't share a stem (var_report has an extra .p<participant_set> segment).
- pipeline.Makefile targets fetch-digests and adapt-digests. adapt-digests uses .SECONDEXPANSION and the included digest_pairs.mk to dispatch one schemauto adapt-dbgap call per pair.

Dropped: parse_data_dict, parse_var_report, _translate_type, _DBGAP_TYPE_MAP, _sanitize_label, _encode_codes, _dd_row, write_canonical_dd, _dd_output_filename, parse_cached_digests, DD_TSV_COLUMNS, plus their dataclasses and the parse-digests CLI command. defusedxml is no longer a direct dependency.

Still pending: a schema-automator release that includes the dbgap adapter, after which the pin in pyproject.toml gets bumped.

Refs #204
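The pairing problem described in this commit message (a `var_report.xml` name carrying an extra `.p<N>` segment that the matching `data_dict.xml` name lacks) can be sketched as follows. This is an illustration under stated assumptions, not the `pair_digests` implementation in this PR: the regex, the return shape, and the handling of unmatched files are all hypothetical.

```python
import re

DD_SUFFIX = ".data_dict.xml"
VR_SUFFIX = ".var_report.xml"
# Matches one `.p<N>` segment. `.phs...` and `.pht...` do not match,
# because a digit must directly follow the `p`.
_PSET_RE = re.compile(r"\.p\d+(?=\.|$)")


def pair_digests(filenames: list[str]) -> tuple[list[tuple[str, str]], list[str]]:
    """Return (pairs, unmatched_data_dicts), matching stems after dropping `.p<N>`."""
    var_reports: dict[str, str] = {}
    for name in filenames:
        if name.endswith(VR_SUFFIX):
            stem = name[: -len(VR_SUFFIX)]
            var_reports[_PSET_RE.sub("", stem, count=1)] = name

    pairs: list[tuple[str, str]] = []
    unmatched: list[str] = []
    for name in sorted(filenames):
        if name.endswith(DD_SUFFIX):
            stem = name[: -len(DD_SUFFIX)]
            if stem in var_reports:
                pairs.append((name, var_reports[stem]))
            else:
                unmatched.append(name)  # a caller could log a warning here
    return pairs, unmatched
```

Returning the unmatched data dictionaries separately lets the caller warn rather than fail, which is the behaviour the tests in the follow-up commit describe.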

- Tighten _DIGEST_FILENAME_RE to reject path separators in scraped hrefs (defense-in-depth against directory traversal from the FTP listing).
- Add tests for list_digest_files, fetch_digests (cached vs refresh), pair_digests (with and without .p<N> participant-set segment, unmatched case logs a warning), and write_pairs_mk output format.
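A filename filter of the kind described here might look like the sketch below. The pattern is a hypothetical stand-in for `_DIGEST_FILENAME_RE`, whose actual definition is not shown in this thread; `safe_digest_names` is likewise an invented helper.

```python
import re

# Hypothetical pattern: a bare filename ending in one of the two digest suffixes,
# with no "/" or "\" anywhere, so "../"-style hrefs can never match.
DIGEST_FILENAME_RE = re.compile(r"[^/\\]+\.(?:data_dict|var_report)\.xml")


def safe_digest_names(hrefs: list[str]) -> list[str]:
    """Keep only hrefs that are plain digest filenames, sorted and de-duplicated."""
    return sorted({h for h in hrefs if DIGEST_FILENAME_RE.fullmatch(h)})
```

Using `fullmatch` means the whole href must be a single path component, so anything containing a directory part is dropped before it can be joined onto the cache directory.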

Caught by CI's `ruff format --check`: a multi-line implicit string concat in test_returns_sorted_unique_filenames fits in 120 chars on one line.

madanucd

LGTM - approving.

Verified locally against real dbGaP data:

dm-bip fetch-digests --list correctly loads all 11 cohorts from upstream cohorts.yaml
dm-bip fetch-digests jhs and copdgene fetched and cached XML files correctly under .dbgap-cache/
.p<N> pairing logic works correctly on real filenames - data_dict and var_report paired despite differing stems
Unmatched tables handled gracefully with warnings, not errors (3 ESP/genotype tables in COPDGene have var_report files under different phs IDs - phs000296, phs000765 - so they don't pair; expected behavior, known limitation)
digest_pairs.mk generated correctly with explicit DBGAP_DD_ / DBGAP_VR_ vars per pair
python -m pytest tests/unit/test_fetch_digests.py -v passes locally
All 3 CI checks passing

The three correctness issues from csiege's earlier review (type translation, label escaping, min/max conformance) are fully addressed upstream in schema-automator#207 (merged).

adapt-digests not smoke-tested locally since schemauto adapt-dbgap requires a schema-automator release that hasn't been cut yet - but the Make rule is straightforward orchestration with no dm-bip logic to review.

Remaining follow-up: schema-automator release + pyproject.toml pin bump before adapt-digests is usable end-to-end.

amc-corey-cox added 2 commits May 4, 2026 15:42

Apply ruff format

826f69e

amc-corey-cox requested review from Copilot, csiege and madanucd May 7, 2026 19:28

Copilot started reviewing on behalf of amc-corey-cox May 8, 2026 12:36

Copilot AI reviewed May 8, 2026

Comment thread src/dm_bip/prepare_study/fetch_digests.py

Comment thread src/dm_bip/prepare_study/fetch_digests.py

csiege reviewed May 12, 2026

This was referenced May 12, 2026

dbGaP variable digest adapter for canonical DD format linkml/schema-automator#206

Closed

dbGaP variable digest adapter (closes #206) linkml/schema-automator#207

Merged

amc-corey-cox changed the title ~~Add dbGaP variable digest fetcher and parser~~ Add dbGaP digest fetcher and adapt-digests pipeline stage May 12, 2026

amc-corey-cox marked this pull request as ready for review May 12, 2026 19:15

amc-corey-cox added 2 commits May 12, 2026 14:40

Apply ruff format

dcf5b3b

amc-corey-cox added the staged label ("Work ready or in progress, waiting on upstream release") May 13, 2026

amc-corey-cox assigned madanucd Jun 5, 2026

madanucd and others added 2 commits June 5, 2026 15:04

Merge branch 'main' into fetch-variable-digests

ef38f30

Apply ruff format

d8d7756

madanucd approved these changes Jun 5, 2026

amc-corey-cox merged commit fed713d into main Jun 10, 2026
3 checks passed

amc-corey-cox deleted the fetch-variable-digests branch June 10, 2026 13:48

amc-corey-cox merged 8 commits into main from fetch-variable-digests

5 participants
