Skip to content

Add dbGaP digest fetcher and adapt-digests pipeline stage#320

Open
amc-corey-cox wants to merge 6 commits into
mainfrom
fetch-variable-digests
Open

Add dbGaP digest fetcher and adapt-digests pipeline stage#320
amc-corey-cox wants to merge 6 commits into
mainfrom
fetch-variable-digests

Conversation

@amc-corey-cox
Copy link
Copy Markdown
Collaborator

@amc-corey-cox amc-corey-cox commented May 4, 2026

Summary

Adds the dm-bip side of dbGaP variable digest ingestion: a fetcher that pulls paired *.data_dict.xml + *.var_report.xml files from dbGaP's public FTP into a local cache, plus Make targets that drive per-pair adaptation via SchemaAutomator's schemauto adapt-dbgap CLI.

After fetch + adapt, downstream stages (schema-create, validate-data, map-data) work against the canonical-DD TSVs without further dm-bip changes.

Refs #204.
Closes #204.

Architecture

[upstream cohorts.yaml] → load_cohorts (Python)
                              ↓
[dbGaP FTP] ────fetch───→ .dbgap-cache/<cohort>/<phs.v>/pheno_variable_summaries/
                              │  paired *.data_dict.xml + *.var_report.xml
                              ↓
                 schemauto adapt-dbgap  (per pair, via Make)
                              ↓
                  output/<cohort>/dd/<phs>.<pht>.dd.tsv
                              ↓
                 schema-create / map-data / ...

What's in this PR

  • src/dm_bip/prepare_study/fetch_digests.py — cohort registry loader (sourced from upstream RTIInternational/NHLBI-BDC-DMC-HV/hv-lint/cohorts.yaml), FTP fetcher with local cache, directory-listing scraper. No parsing — that lives in schema-automator.
  • dm-bip fetch-digests <cohort-key> CLI: --list, --refresh, --cache-dir.
  • New Make targets in pipeline.Makefile:
    • fetch-digests — populates the cache for DM_COHORT.
    • adapt-digests — pattern rule that calls schemauto adapt-dbgap per pair, writes output/<cohort>/dd/<phs>.<pht>.dd.tsv.

History — how we got here

Earlier commits on this branch shipped an inline ad-hoc adapter: parsers for data_dict.xml / var_report.xml, a translator from dbGaP's type vocabulary to schema-automator's canonical 10-type system, and a TSV writer for the canonical DD format. During review we decided that work belonged upstream in the linkml-map-driven adapter ecosystem rather than as bespoke dm-bip code.

That parser/translator/writer chunk now lives in linkml/schema-automator#207 (closes linkml/schema-automator#206), which:

  • Adds a LinkML schema describing the dbGaP digest XML form.
  • Adds a declarative linkml-map trans-spec doing the type-vocab translation — more comprehensive than what we had (handles dbGaP typos, composite types, empty-type fallback to calculated_type).
  • Adds the schemauto adapt-dbgap CLI consumed by our new Make target.

This dm-bip PR has been rewritten to be just the fetcher + orchestration glue. The fetch + Make split also matches the rest of pipeline.Makefile (each stage is a Make target with explicit inputs/outputs).

Blocked on

Things deliberately deferred

  • BDC bucket fetching — auth complexity; deferred.
  • Variable corpus assembly — the downstream consumer of #204; tracked separately.
  • adapt-digests idempotency under repeated runs — depends on schema-automator's output being byte-stable, which in turn depends on linkml-map not introducing nondeterministic ordering. Out of scope here; revisit if Make rebuild churn becomes a problem.

Test plan

  • make test — fetcher-layer unit tests pass.
  • uv run ruff check . — lint clean.
  • dm-bip fetch-digests --list — fetches upstream cohorts.yaml, prints all cohorts.
  • make fetch-digests DM_COHORT=jhs — populates the cache.
  • After SA release: make adapt-digests DM_COHORT=jhs — produces TSVs under output/jhs/dd/.

Introduces src/dm_bip/prepare_study/fetch_digests.py and a CLI command
(`dm-bip fetch-digests`) for fetching data_dict.xml + var_report.xml
files from dbGaP's public FTP, with local caching and dataclass-based
parsers for both file types.

Cohort version pins are sourced from upstream NHLBI-BDC-DMC-HV's
hv-lint/cohorts.yaml — partial structure brought over ahead of the
full hv-lint migration tracked in #312.

Refs #204
Adds a `parse-digests` CLI command that reads cached data_dict.xml files
for a cohort and writes one TSV per data table in the schema-automator
canonical data dictionary format (linkml/schema-automator#201).

Outputs land at `output/<cohort>/dd/<phs>.<pht>.dd.tsv` with all Spec A
columns plus `uri` from Spec B. dbGaP types are translated to the
canonical 10-value vocabulary; encoded values are rendered REDCap-style
(`code, label | code, label`); each variable's `uri` carries the dbGaP
phv accession as a CURIE for traceability. `unit`, `min`, and `max` are
emitted empty pending richer var_report parsing.

Refs #204
@amc-corey-cox
Copy link
Copy Markdown
Collaborator Author

Follow-up commit (`68731b0`) adds the consumable endpoint we discussed: `dm-bip parse-digests `.

What's new:

  • Reads cached `data_dict.xml` files (output of `fetch-digests`)
  • Writes one TSV per data table in schema-automator canonical DD format: `output//dd/..dd.tsv`
  • All Spec A columns (`name`, `type`, `description`, `codes`, `unit`, `min`, `max`) plus `uri` from Spec B
  • dbGaP types translated to canonical vocabulary (`String`/`string` → `string`, `encoded value` → `permissible_values`, etc.); unknowns default to `string`
  • Encoded values rendered REDCap-style (`code, label | code, label`) with label sanitization for separator characters
  • `uri` column carries `dbgap:` for traceability
  • `unit`, `min`, `max` emitted empty for now (deferred until richer var_report parsing)

Why canonical DD output: gives the rest of the pipeline (#307 schema enrichment, eventually schema-automator's #192 ingestion) a real artifact to consume on disk, not just in-memory dataclasses.

Additional smoke test for the test plan:

  • `dm-bip parse-digests jhs` — produces `output/jhs/dd/phs000286.pht*.dd.tsv` files; spot-check one TSV opens cleanly in a spreadsheet/editor

#204 still stays open: this gives us a usable endpoint and probably enough for a first pass at #307, but BDC bucket fetching, the legacy-format adapter for `prepare_metadata.py`, and any "richer corpus" work remain.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 6, 2026

Codecov Report

❌ Patch coverage is 76.31579% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.32%. Comparing base (8f4f431) to head (dcf5b3b).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/dm_bip/cli.py 0.00% 18 Missing ⚠️
src/dm_bip/prepare_study/fetch_digests.py 90.62% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
+ Coverage   75.52%   79.32%   +3.80%     
==========================================
  Files           9       10       +1     
  Lines         625      740     +115     
==========================================
+ Hits          472      587     +115     
  Misses        153      153              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a foundational “fetch + parse” layer for dbGaP variable digest files (supporting #204) by introducing a new digest module, new CLI commands to fetch/parse into a canonical TSV shape, and fixtures/tests to validate the XML parsers.

Changes:

  • Added prepare_study.fetch_digests module with cohort registry loading (from upstream cohorts.yaml), XML parsing for data_dict.xml / var_report.xml, and cached fetching from NCBI.
  • Added new CLI commands dm-bip fetch-digests and dm-bip parse-digests.
  • Added parser unit tests plus real XML fixtures; added defusedxml dependency and ignored .dbgap-cache/.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/dm_bip/prepare_study/fetch_digests.py Implements cohort loading, digest fetching with local cache, XML parsers, and canonical TSV writer.
src/dm_bip/cli.py Exposes fetch-digests and parse-digests commands via Typer.
tests/unit/test_fetch_digests.py Adds unit tests covering XML parsing and canonical TSV output.
tests/input/dbgap_digests/JHS_Subject.data_dict.xml Real-world fixture for data_dict.xml parser tests.
tests/input/dbgap_digests/JHS_Subject.var_report.xml Real-world fixture for var_report.xml parser tests.
pyproject.toml Adds direct dependency on defusedxml.
uv.lock Locks defusedxml into the environment.
.gitignore Ignores the local .dbgap-cache/ directory created by the new CLI.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/dm_bip/prepare_study/fetch_digests.py
Comment thread src/dm_bip/prepare_study/fetch_digests.py
Copy link
Copy Markdown
Collaborator

@csiege csiege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the useful foundation here. I found three correctness issues that look worth addressing before this becomes a dependable canonical-DD path:

  1. Incorrect type translation for many real dbGaP dictionaries

    src/dm_bip/prepare_study/fetch_digests.py only maps a small set of exact dbGaP type strings, then silently defaults unknown types to string. In the local dbGaP cache, common real values include numeric, encoded, decimal, encoded, num, integer, encoded, continuous decimal, enumerated integer, and encoded values. Those would all become string, so parse-digests would emit materially wrong canonical data dictionaries: numeric fields lose numeric type, encoded fields lose permissible_values, and downstream schema generation would be incorrect.

  2. Codes encoding mutates labels instead of following the canonical-DD escaping rules

    _sanitize_label() replaces commas with semicolons and pipes with slashes inside labels. The referenced schema-automator spec says labels may contain unescaped commas after the first separator comma, and literal pipe/backslash/comma in code values should be escaped with backslashes. The current implementation changes source metadata content, and the new test locks in that incorrect behavior by asserting comma removal. Labels like Black, non-Hispanic should stay as written, not become Black; non-Hispanic.

  3. Generated numeric rows are knowingly non-conformant for unit, min, and max

    _dd_row() always emits empty unit, min, and max. The schema-automator canonical spec distinguishes empty cells from the explicit none token, and says empty cells are conformance issues when these fields apply. This may be acceptable as an interim draft output, but the CLI description says it converts digests into canonical-DD TSVs, so consumers may reasonably expect valid canonical rows. At minimum, numeric fields should probably emit none where genuinely unitless/unbounded, or the command should make clear that the output is incomplete/non-strict.

@amc-corey-cox amc-corey-cox changed the title Add dbGaP variable digest fetcher and parser Add dbGaP digest fetcher and adapt-digests pipeline stage May 12, 2026
…mator

Earlier commits on this branch shipped an inline ad-hoc adapter: XML
parsers, a type-vocab translator, and a canonical-DD TSV writer. That
work now lives upstream in linkml/schema-automator#207 (merged), via a
LinkML schema + declarative linkml-map trans-spec + schemauto
adapt-dbgap CLI.

dm-bip now provides:
- fetch_digests.py: cohort registry loader (sourced from upstream
  NHLBI-BDC-DMC-HV/hv-lint/cohorts.yaml), FTP fetcher with local cache.
- Pair discovery + digest_pairs.mk emission, because dbGaP's
  data_dict.xml and var_report.xml filenames don't share a stem
  (var_report has an extra .p<participant_set> segment).
- pipeline.Makefile targets fetch-digests and adapt-digests. adapt-digests
  uses .SECONDEXPANSION and the included digest_pairs.mk to dispatch one
  schemauto adapt-dbgap call per pair.

Dropped: parse_data_dict, parse_var_report, _translate_type,
_DBGAP_TYPE_MAP, _sanitize_label, _encode_codes, _dd_row,
write_canonical_dd, _dd_output_filename, parse_cached_digests,
DD_TSV_COLUMNS, plus their dataclasses and the parse-digests CLI
command. defusedxml is no longer a direct dependency.

Still pending: a schema-automator release that includes the dbgap
adapter, after which the pin in pyproject.toml gets bumped.

Refs #204
@amc-corey-cox amc-corey-cox marked this pull request as ready for review May 12, 2026 19:15
- Tighten _DIGEST_FILENAME_RE to reject path separators in scraped hrefs
  (defense-in-depth against directory traversal from the FTP listing).
- Add tests for list_digest_files, fetch_digests (cached vs refresh),
  pair_digests (with and without .p<N> participant-set segment, unmatched
  case logs a warning), and write_pairs_mk output format.
Caught by CI's `ruff format --check`: a multi-line implicit string concat
in test_returns_sorted_unique_filenames fits in 120 chars on one line.
@amc-corey-cox amc-corey-cox added the staged Work ready or in progress, waiting on upstream release label May 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

staged Work ready or in progress, waiting on upstream release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add variable digest files to pipeline dbGaP variable digest adapter for canonical DD format

4 participants