Fix documentation and deduplication issues by enriquea · Pull Request #188 · bigbio/hvantk

enriquea · 2026-06-08T08:56:05Z

This pull request introduces robust support for documentation-only ("stub") datasets in the drift detection pipeline. It standardizes the handling and reporting of sources that lack programmatically accessible data, ensuring they are surfaced as visible warnings rather than producing misleading results. Additionally, the pull request improves deduplication logic in gene set burden calculations and makes minor catalog corrections.

Drift detection improvements:

Added a new status="stub" for datasets with no programmatic probe, updating the drift pipeline and reporting to handle these as surfaced warnings rather than silent successes or failures. This includes a new handle_stub function, updates to CLI summary output, and changes to the drift runner logic to recognize and propagate stub status. (.github/scripts/drift_to_pr.py, hvantk/core/plugin/drift_runner.py) [1] [2] [3] [4] [5] [6] [7]
Introduced a standardized API for stub fingerprints (stub_fingerprint, PROBE_STATUS_STUB, STUB_FINGERPRINT_TOKEN) to be used by drift probes that cannot programmatically fingerprint their source. (hvantk/core/plugin/api.py)
Updated all stub drift probes (e.g., alphagenome, cosmic_cgc, dbnsfp, ensembl_gene, gevir) to use the new stub_fingerprint helper and provide clear reasons for stub status. (hvantk/skills/alphagenome/drift_probe.py, hvantk/skills/cosmic_cgc/drift_probe.py, hvantk/skills/dbnsfp/drift_probe.py, hvantk/skills/ensembl_gene/drift_probe.py, hvantk/skills/gevir/drift_probe.py) [1] [2] [3] [4] [5]
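
A minimal sketch of how the stub sentinel and the runner's short-circuit could fit together. The names `stub_fingerprint`, `PROBE_STATUS_STUB`, and `STUB_FINGERPRINT_TOKEN` come from this PR. The constant values and payload keys follow the commit message, and the `classify` helper is hypothetical; none of this is copied from `hvantk/core/plugin/api.py` or `drift_runner.py`.

```python
from typing import Any, Dict, Optional

# Names from the PR text; the values follow the commit message and are assumed.
PROBE_STATUS_STUB = "stub"
STUB_FINGERPRINT_TOKEN = "stub:no-programmatic-source"


def stub_fingerprint(reason: str) -> Dict[str, str]:
    """Return a structured sentinel for a documentation-only source."""
    return {
        "probe_status": PROBE_STATUS_STUB,
        "reason": reason,
        "fingerprint": STUB_FINGERPRINT_TOKEN,
    }


def classify(observed: Dict[str, Any], expected: Optional[Dict[str, Any]]) -> str:
    """Hypothetical helper: map a probe payload and a baseline to a status."""
    # The stub check runs before the baseline is consulted, so a stub needs none.
    if observed.get("probe_status") == PROBE_STATUS_STUB:
        return "stub"
    # Simplification: a missing baseline is reported as probe_failed.
    if expected is None:
        return "probe_failed"
    return "clean" if observed == expected else "drifted"
```

Checking the stub first is what keeps a documentation-only dataset from being reported as `probe_failed` just because it has no committed baseline.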

Gene set burden calculation improvements:

Improved deduplication of gene lists within gene sets to ensure accurate burden calculations and consistent reporting, regardless of input order or duplicates. (hvantk/algorithms/enrichex/burden.py) [1] [2]
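
The deduplication step can be illustrated in plain Python, leaving out the Hail tables that the real code works with. The helper name `dedup_gene_sets` and the sample gene sets are made up for this sketch; only the use of `dict.fromkeys` is taken from the commit message.

```python
from typing import Dict, List


def dedup_gene_sets(gene_sets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop repeated genes within each set, keeping first-seen order."""
    # dict.fromkeys preserves insertion order (Python 3.7+), unlike set().
    return {name: list(dict.fromkeys(genes)) for name, genes in gene_sets.items()}


raw = {"pathway_a": ["BRCA1", "TP53", "BRCA1"], "pathway_b": ["EGFR"]}
clean = dedup_gene_sets(raw)

# Sizes computed from the deduplicated lists agree with the membership.
sizes = {name: len(genes) for name, genes in clean.items()}
```

Normalizing once, before any size or coverage value is computed, means every downstream number is derived from the same gene list.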

Data catalog corrections:

Updated the data_source field in several dataset catalogs to reflect the actual upstream source instead of "Custom" and fixed a file format typo from "gff" to "gff3". (hvantk/skills/clingen/catalog/datasets.json, hvantk/skills/clinvar/catalog/datasets.json, hvantk/skills/dbnsfp/catalog/datasets.json, hvantk/skills/ensembl_gene/catalog/datasets.json, hvantk/skills/gevir/catalog/datasets.json) [1] [2] [3] [4] [5] [6]

Reporting improvements:

Improved summary markdown rendering for QC reports to use scientific notation for very small values, preventing them from being displayed as "0.000". (hvantk/algorithms/hgc/qc_report.py)
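
As a rough illustration of the formatting rule, the sketch below switches to scientific notation for small non-zero values. The function name `format_metric`, the `1e-3` cutoff, and the two-digit mantissa are assumptions; the PR text only states that exact zero stays `"0.000"` and that very small magnitudes no longer collapse to it.

```python
def format_metric(value: float, tiny: float = 1e-3) -> str:
    """Format a number for a markdown table without hiding tiny values."""
    # Exact zero keeps the fixed-point form.
    if value == 0:
        return "0.000"
    # Non-zero magnitudes below the (assumed) cutoff use scientific notation.
    if abs(value) < tiny:
        return f"{value:.2e}"
    return f"{value:.3f}"
```

With this rule a Hardy-Weinberg p-value such as `3.2e-12` stays distinguishable from a true zero in the rendered report.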

Summary by CodeRabbit

New Features
- Added a formal "stub" status for documentation-only datasets and CLI warnings for stub results; drift results now include richer context (observed/expected/diff/errors).
Bug Fixes
- Deduplicated genes within gene-sets to prevent double-counting.
- Improved numeric formatting in QC reports.
- Clearer catalog validation error messages.
Documentation
- Updated dataset source metadata across multiple catalogs.
Tests
- Added tests for gene-set deduplication and stub probe behavior.

) (#186)

* fix(drift): structured stub sentinel + WARNING for doc-only probes (#177)

The 7 documentation-only plugins (alphagenome, cosmic_cgc, dbnsfp, ensembl_gene, gevir, gnomad_metrics, pqtl) returned a placeholder 'sha256:phase-k-stub-not-implemented' fingerprint. That fake sha256 was stamped into provenance as if a real probe ran, and at drift time the missing baseline surfaced as an indistinct 'probe_failed' (or a false-green 'clean' had a baseline been regenerated from the stub).

Option 1 from #177:
- Add api.stub_fingerprint(reason) returning a structured sentinel {probe_status: 'stub', reason, fingerprint: 'stub:no-programmatic-source'}. run_builder records the honest token instead of a fake sha256.
- drift_runner invokes the probe first and short-circuits a probe_status == 'stub' result to a distinct status='stub' (no baseline required).
- drift_cli prints a visible WARNING (stderr) for stubs and exits clean, so an intentional doc-only stub is never a silent false-green.
- Rewrite the 7 stub probes to call stub_fingerprint() with a provider-specific reason.

Adds one regression test guarding stub -> status='stub' (not clean/probe_failed).

* fix(drift): emit stub WARNING to stderr in --json mode + surface stubs in CI summary (#186 review)

Addresses Copilot's review on PR #186: the stub WARNING was only emitted on the human-readable output path, so `hvantk drift --all --json` (used by .github/workflows/drift.yml, which captures stdout to a file) re-hid documentation-only stubs.
- drift_cli: move the stub WARNING into a shared loop that runs in BOTH --json and human modes, writing to stderr so machine-readable stdout stays clean while CI step logs stay visibly non-green.
- drift_to_pr.py: count `stub` entries in the summary and (new handle_stub) write a `- STUB:` line to GITHUB_STEP_SUMMARY, symmetric with drifted/probe_failed, so stubs are not dropped from the rendered CI summary. Docstring updated to document the `stub` status.
- test: regression test asserting --json mode emits the stub WARNING to stderr (and keeps stdout clean JSON).

The drift_to_pr.py step-summary handler was surfaced by an adversarial review of the CLI fix and verified offline via a --dry-run synthetic report.

* test(drift): make stub-WARNING test work on Click >= 8.2 (mix_stderr removed)

CI runs Click >= 8.2, which removed the `CliRunner(mix_stderr=...)` kwarg (streams are always captured separately there), so the test raised TypeError: CliRunner.__init__() got an unexpected keyword argument 'mix_stderr'.
- Construct the runner with mix_stderr=False on Click 8.1.x, falling back to CliRunner() on >= 8.2 (try/except TypeError).
- Assert on `result.stdout` (stdout-only on both versions) rather than `result.output` (which mixes stderr back in on >= 8.2).

Verified the exact assertions pass under both Click 8.1.8 and 8.4.1.
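
The `--json` fix described in these commit messages amounts to emitting warnings in a loop that runs before the output mode is chosen. The sketch below shows that shape using only the standard library; the `report` function, its parameters, and the warning wording are hypothetical and do not reproduce `drift_cli`.

```python
import json
import sys
from typing import Any, Dict, List, Optional, TextIO


def report(results: List[Dict[str, Any]], as_json: bool,
           out: Optional[TextIO] = None, err: Optional[TextIO] = None) -> None:
    """Print drift results; stub warnings always go to the error stream."""
    out = out if out is not None else sys.stdout
    err = err if err is not None else sys.stderr
    # Shared loop: it runs for both output modes, so JSON mode cannot hide stubs.
    for r in results:
        if r.get("status") == "stub":
            reason = (r.get("observed") or {}).get("reason", "no reason given")
            print(f"WARNING: {r['dataset_name']} is a stub: {reason}", file=err)
    if as_json:
        # Only JSON reaches the output stream, so it stays machine-readable.
        json.dump(results, out)
    else:
        for r in results:
            print(f"{r['dataset_name']}: {r['status']}", file=out)
```

Because the warning never touches the output stream, a workflow that redirects stdout to a file still shows the stub in its step log.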

…rden dedup, qc sci-notation, lint) (#187)

* fix(#185): safe deferred-review polish (catalog data_source, gff3, burden dedup, qc sci-notation, lint)

Addresses the mechanical / behavior-preserving subset of #185. Each item was re-verified at HEAD (post PR-5..PR-8 refactor) before changing.

Catalog metadata:
- Normalize data_source "Custom" -> canonical names across the 10 migrated genomics catalogs (ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB) so `catalog list --data-source <Name>` and unified_registry.search(data_source=...) match the real source name.
- ensembl_gene file format "gff" -> "gff3" (path is *.gff3.gz; cosmetic).

Code:
- enrichex/burden.py _build_gene_to_sets_ht: dedup genes within a set via dict.fromkeys (order-preserving) so a duplicated gene is not double-counted after explode_rows. Latent (parse_geneset_tsv dedups upstream); guards other callers. + regression test (verified non-vacuous).
- hgc/qc_report.py render_qc_summary_markdown: scientific notation for very small magnitudes (tiny HWE p-values) instead of collapsing to "0.000"; exact 0 stays "0.000".
- tools/plugins/plugins_cli.py: split the bundled-schema read and the user catalog read into separate try blocks (correct error attribution) and chain exceptions with `from exc` (Ruff B904).
- tools/enrichex/burden_cli.py: drop placeholder-less f-string prefixes (Ruff F541), 7 occurrences.

Deliberately NOT included (need maintainer decisions, tracked in #185):
- schema-enum tightening for data_source
- gnomad_metrics variant_types SV/CNV scope (domain call)
- gwas_catalog / msigdb URL pinning vs provenance-only
- unified_registry.py fail-fast for an unmapped primary_domain (blast radius)

* fix(#185): dedup gene_sets at compute_geneset_burden_mt entry for size/coverage consistency (#187 review)

Copilot flagged that deduping only inside _build_gene_to_sets_ht left gene_set_size (len(genes)), gene_coverage_pct, and the min_gene_set_size filter computed from the RAW list, so a caller passing duplicate genes within a set would get a deduped membership (n_genes_found) inconsistent with a non-deduped size/denominator.

Fix: normalize each set's gene list once at the top of compute_geneset_burden_mt (order-preserving dict.fromkeys) so the min-size filter, gene_set_size, gene_coverage_pct, and the gene->set membership all derive from the same deduped genes. No-op for the canonical pipeline (parse_geneset_tsv already dedups). The _build_gene_to_sets_ht dedup is kept as defense for direct callers.

+ regression test asserting gene_set_size reflects the deduped count.

qodo-code-review · 2026-06-08T08:56:09Z

Qodo reviews are paused for this user.


coderabbitai · 2026-06-08T08:56:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d98497f-797c-4a0a-bc05-1c07e4012e75

📥 Commits

Reviewing files that changed from the base of the PR and between 9424342 and c4484ab.

📒 Files selected for processing (2)

hvantk/core/plugin/drift_runner.py
hvantk/tests/test_drift_runner.py

🚧 Files skipped from review as they are similar to previous changes (1)

hvantk/core/plugin/drift_runner.py

📝 Walkthrough

Adds a "stub" probe sentinel and shared stub helper; enriches drift-runner results and flows to surface stub warnings; converts several probe modules to use the stub helper; adds gene-set deduplication and tests; corrects dataset catalog metadata and minor CLI/formatting tweaks.

Changes

Drift Probe Stub Status Infrastructure

Layer / File(s)	Summary
Core stub sentinel and helper function `hvantk/core/plugin/api.py`	Adds `PROBE_STATUS_STUB`, `STUB_FINGERPRINT_TOKEN`, and `stub_fingerprint(reason)` to produce a standardized stub fingerprint payload.
DriftResult enrichment and probe ordering `hvantk/core/plugin/drift_runner.py`	Expands `DriftResult` with `observed`, `expected`, `diff`, and `probe_error`; runs probe before loading baseline; returns early for `stub` with observed payload; wraps/augments probe errors and includes observed when baseline missing.
CLI warning emission for stub status `hvantk/tools/plugins/drift_cli.py`	Emits a formatted stderr WARNING for results with `status == "stub"` (JSON and human modes), extracting reason from observed payloads while preserving exit-code logic.
PR generation script stub handling `.github/scripts/drift_to_pr.py`	Counts `stub` entries in summaries and adds `handle_stub()` to record stubs into the GitHub step summary instead of creating PRs.
Drift probe modules converted to shared stub helper `hvantk/skills/{alphagenome,cosmic_cgc,dbnsfp,ensembl_gene,gevir,gnomad_metrics,pqtl}/drift_probe.py`	Replaces hardcoded placeholder dicts with calls to `stub_fingerprint(_REASON)` and updated module docstrings explaining stub rationale.
Drift CLI and runner regression tests for stub status `hvantk/tests/test_drift_cli.py`, `hvantk/tests/test_drift_runner.py`	Adds tests verifying stub probes yield `status == "stub"`, CLI emits stderr warnings without contaminating JSON stdout, and probe-failure + missing-baseline preserves both messages.

Gene-Set Burden Deduplication

Layer / File(s)	Summary
Deduplication in gene-to-set mapping and burden computation `hvantk/algorithms/enrichex/burden.py`	Deduplicates gene lists per set (order-preserving via dict.fromkeys) in `_build_gene_to_sets_ht` and early in `compute_geneset_burden_mt` before size filtering and membership/coverage calculations.
Deduplication regression tests `hvantk/tests/enrichex/test_burden_hail.py`	Adds tests asserting deduplication prevents double-counting and that compute_geneset_burden_mt uses deduplicated gene_set_size.

Dataset Catalog and Metadata Updates

Layer / File(s)	Summary
Dataset data_source field corrections `hvantk/skills/{clingen,clinvar,dbnsfp,ensembl_gene,gevir,gnomad_metrics,gtex_eqtl,gwas_catalog,insider,msigdb}/catalog/datasets.json`	Updates `data_source` from `"Custom"` to canonical upstream source names (ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB).
File format specification correction `hvantk/skills/ensembl_gene/catalog/datasets.json`	Changes the GFF file `format` from `"gff"` to `"gff3"`.

Supporting Improvements

Layer / File(s)	Summary
QC report numeric formatting rule `hvantk/algorithms/hgc/qc_report.py`	Refines numeric formatting selection to avoid collapsing very small non-zero values to `"0.000"` while keeping exact zero fixed.
CLI output formatting and error handling `hvantk/tools/enrichex/burden_cli.py`, `hvantk/tools/plugins/plugins_cli.py`	Removes f-strings from burden CLI headers for consistency and splits plugin validate error handling to provide clearer, separate messages for bundled schema vs catalog file read/parse failures.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

bigbio/hvantk#184: Both PRs touch hvantk/algorithms/hgc/qc_report.py—this PR refines numeric formatting while the related PR introduced related rendering helpers.

Suggested labels

documentation, enhancement, Review effort 5/5

Suggested reviewers

ypriverol

Poem

🐰 I stamp a stub with gentle cheer,
A warning bell the runners hear.
Genes pruned clean so counts stay true,
Catalogs fixed and tests anew.
The rabbit hops — the changes clear!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title "Fix documentation and deduplication issues" is generic and vague, using a high-level umbrella term without clearly conveying the main technical contribution of the PR.	Consider using a more specific title that highlights the primary change, such as "Add stub status support for documentation-only datasets" or "Support documentation-only datasets with stub status in drift detection."

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


codacy-production · 2026-06-08T08:56:33Z

Not up to standards ⛔

🔴 Issues 12 high · 26 minor

Alerts:
⚠ 38 issues (≤ 0 issues of at least minor severity)

Results:
38 new issues

Category Results

Documentation 26 minor

ErrorProne 1 high

Security 11 high

🟢 Metrics 8 complexity

Metric Results

Complexity 8


Copilot

Pull request overview

This PR improves the drift-detection pipeline to explicitly support documentation-only (“stub”) datasets, ensuring they surface as visible warnings rather than producing misleading “clean” or “probe_failed” results. It also tightens gene set burden calculations by deduplicating within-set gene lists consistently, and applies small catalog and report-rendering corrections.

Changes:

Add a first-class stub probe contract (stub_fingerprint / PROBE_STATUS_STUB) and propagate stub status through drift runner, CLI output, and CI reporting.
Normalize/deduplicate gene lists within gene sets to prevent double-counting and inconsistent set-size/coverage metrics; add regression tests.
Apply minor catalog metadata fixes and improve QC markdown formatting for very small numeric values.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.

File	Description
`.github/scripts/drift_to_pr.py`	Surfaces `stub` entries in CI step summary output without opening PRs for them.
`hvantk/core/plugin/api.py`	Introduces standardized stub fingerprint API/constants for probes that cannot be programmatically fingerprinted.
`hvantk/core/plugin/drift_runner.py`	Propagates stub probe results as `status="stub"` and changes runner ordering to probe-first to avoid false “probe_failed” on stub baselines.
`hvantk/tools/plugins/drift_cli.py`	Emits stub WARNINGs to stderr in both human and `--json` modes while keeping stdout machine-readable.
`hvantk/tools/plugins/plugins_cli.py`	Improves error handling when reading bundled catalog schema / catalog JSON.
`hvantk/algorithms/enrichex/burden.py`	Deduplicates genes within each gene set (order-preserving) to keep sizes/membership consistent and avoid double counting.
`hvantk/tools/enrichex/burden_cli.py`	Minor string formatting cleanup in CLI output.
`hvantk/algorithms/hgc/qc_report.py`	Uses scientific notation for very small magnitudes to avoid rendering as `0.000`.
`hvantk/tests/test_drift_runner.py`	Adds regression test ensuring stub probes are reported as `stub` (not false-green or probe_failed) when baseline is missing.
`hvantk/tests/test_drift_cli.py`	Adds regression test ensuring stub WARNING appears on stderr even in `--json` mode.
`hvantk/tests/enrichex/test_burden_hail.py`	Adds regression tests for within-set deduplication affecting `gene_set_size` and gene→set membership.
`hvantk/skills/alphagenome/drift_probe.py`	Converts stub probe implementation to standardized `stub_fingerprint(...)`.
`hvantk/skills/cosmic_cgc/drift_probe.py`	Converts stub probe implementation to standardized `stub_fingerprint(...)`.
`hvantk/skills/dbnsfp/drift_probe.py`	Converts stub probe implementation to standardized `stub_fingerprint(...)`.
`hvantk/skills/ensembl_gene/drift_probe.py`	Converts stub probe implementation to standardized `stub_fingerprint(...)`.
`hvantk/skills/gevir/drift_probe.py`	Converts stub probe implementation to standardized `stub_fingerprint(...)`.
`hvantk/skills/gnomad_metrics/drift_probe.py`	Converts stub probe implementation to standardized `stub_fingerprint(...)`.
`hvantk/skills/pqtl/drift_probe.py`	Converts stub probe implementation to standardized `stub_fingerprint(...)`.
`hvantk/skills/clingen/catalog/datasets.json`	Corrects `data_source` metadata.
`hvantk/skills/clinvar/catalog/datasets.json`	Corrects `data_source` metadata.
`hvantk/skills/dbnsfp/catalog/datasets.json`	Corrects `data_source` metadata.
`hvantk/skills/ensembl_gene/catalog/datasets.json`	Corrects `data_source` metadata and fixes file `format` from `gff` to `gff3`.
`hvantk/skills/gevir/catalog/datasets.json`	Corrects `data_source` metadata.
`hvantk/skills/gnomad_metrics/catalog/datasets.json`	Corrects `data_source` metadata.
`hvantk/skills/gtex_eqtl/catalog/datasets.json`	Corrects `data_source` metadata.
`hvantk/skills/gwas_catalog/catalog/datasets.json`	Corrects `data_source` metadata.
`hvantk/skills/insider/catalog/datasets.json`	Corrects `data_source` metadata.
`hvantk/skills/msigdb/catalog/datasets.json`	Corrects `data_source` metadata.

enriquea · 2026-06-08T10:15:00Z

    try:
        observed = _invoke_with_timeout(spec.drift_probe, timeout=timeout)
    except DriftProbeError as exc:
        return DriftResult(
            dataset_name=spec.name,
            status="probe_failed",
-            expected=expected,
            probe_error=exc,


Valid — verified at drift_runner.py (the #186 probe-first reordering): the two probe-failure except branches early-returned probe_failed without ever reading spec.test_paths.drift_fingerprint, so a probe exception combined with a missing baseline surfaced only the probe error.

Fixed in c4484ab: both probe-failure paths now go through a _probe_failed() helper that appends expected fingerprint is missing at <path> when the baseline file is absent, while preserving the underlying probe error. No change when a baseline exists (so real plugins are unaffected). Added a regression test asserting the error mentions both the probe failure and the missing fingerprint.
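
A sketch of what such a helper might look like, under stated assumptions: the reduced `DriftResult` dataclass, the `probe_failed` signature, and storing the error as a string are all illustrative (the diff above passes the exception object itself). Only the appended message text is taken from this comment.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DriftResult:
    # Reduced stand-in for the real DriftResult; only the fields used here.
    dataset_name: str
    status: str
    probe_error: Optional[str] = None


def probe_failed(name: str, exc: Exception, baseline: Path) -> DriftResult:
    """Build a probe_failed result that also reports a missing baseline."""
    message = str(exc)
    # Append the baseline hint only when the file is absent, so results for
    # plugins with a committed baseline are unchanged.
    if not baseline.exists():
        message += f"; expected fingerprint is missing at {baseline}"
    return DriftResult(dataset_name=name, status="probe_failed", probe_error=message)
```

Routing every failure branch through one helper keeps the two error conditions from drifting apart again if another early return is added later.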

…188 review)

Copilot (PR #188) noted that the probe-first ordering introduced in #186 means a probe_failed early-return no longer mentions a missing expected fingerprint: if the probe raises AND no baseline is committed, only the probe error surfaced, masking the missing-baseline config error.

Route both probe-failure paths through a _probe_failed() helper that appends "expected fingerprint is missing at <path>" when the baseline file is absent, while preserving the underlying probe error. No change when a baseline exists.

+ regression test (probe raises + no baseline -> error mentions both).

enriquea added 2 commits June 5, 2026 00:42

enriquea requested a review from Copilot June 8, 2026 08:56

enriquea self-assigned this Jun 8, 2026

Copilot started reviewing on behalf of enriquea June 8, 2026 08:56 View session

Copilot AI reviewed Jun 8, 2026

enriquea merged commit 63f166e into main Jun 8, 2026
6 of 7 checks passed

enriquea merged 3 commits into main from dev
