Skip to content

Fix documentation and deduplication issues#188

Merged
enriquea merged 3 commits into
mainfrom
dev
Jun 8, 2026
Merged

Fix documentation and deduplication issues#188
enriquea merged 3 commits into
mainfrom
dev

Conversation

@enriquea

@enriquea enriquea commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

This pull request introduces robust support for documentation-only ("stub") datasets in the drift detection pipeline. It standardizes the handling and reporting of sources that lack programmatically accessible data, ensuring they are surfaced as visible warnings rather than producing misleading results. Additionally, the pull request improves deduplication logic in gene set burden calculations and makes minor catalog corrections.

Drift detection improvements:

  • Added a new status="stub" for datasets with no programmatic probe, updating the drift pipeline and reporting to handle these as surfaced warnings rather than silent successes or failures. This includes a new handle_stub function, updates to CLI summary output, and changes to the drift runner logic to recognize and propagate stub status. (.github/scripts/drift_to_pr.py, hvantk/core/plugin/drift_runner.py) [1] [2] [3] [4] [5] [6] [7]
  • Introduced a standardized API for stub fingerprints (stub_fingerprint, PROBE_STATUS_STUB, STUB_FINGERPRINT_TOKEN) to be used by drift probes that cannot programmatically fingerprint their source. (hvantk/core/plugin/api.py)
  • Updated all stub drift probes (e.g., alphagenome, cosmic_cgc, dbnsfp, ensembl_gene, gevir) to use the new stub_fingerprint helper and provide clear reasons for stub status. (hvantk/skills/alphagenome/drift_probe.py, hvantk/skills/cosmic_cgc/drift_probe.py, hvantk/skills/dbnsfp/drift_probe.py, hvantk/skills/ensembl_gene/drift_probe.py, hvantk/skills/gevir/drift_probe.py) [1] [2] [3] [4] [5]

Gene set burden calculation improvements:

  • Improved deduplication of gene lists within gene sets to ensure accurate burden calculations and consistent reporting, regardless of input order or duplicates. (hvantk/algorithms/enrichex/burden.py) [1] [2]

Data catalog corrections:

  • Updated the data_source field in several dataset catalogs to reflect the actual upstream source instead of "Custom" and fixed a file format typo from "gff" to "gff3". (hvantk/skills/clingen/catalog/datasets.json, hvantk/skills/clinvar/catalog/datasets.json, hvantk/skills/dbnsfp/catalog/datasets.json, hvantk/skills/ensembl_gene/catalog/datasets.json, hvantk/skills/gevir/catalog/datasets.json) [1] [2] [3] [4] [5] [6]

Reporting improvements:

  • Improved summary markdown rendering for QC reports to use scientific notation for very small values, preventing them from being displayed as "0.000". (hvantk/algorithms/hgc/qc_report.py)

Summary by CodeRabbit

  • New Features

    • Added a formal "stub" status for documentation-only datasets and CLI warnings for stub results; drift results now include richer context (observed/expected/diff/errors).
  • Bug Fixes

    • Deduplicated genes within gene-sets to prevent double-counting.
    • Improved numeric formatting in QC reports.
    • Clearer catalog validation error messages.
  • Documentation

    • Updated dataset source metadata across multiple catalogs.
  • Tests

    • Added tests for gene-set deduplication and stub probe behavior.

enriquea added 2 commits June 5, 2026 00:42
) (#186)

* fix(drift): structured stub sentinel + WARNING for doc-only probes (#177)

The 7 documentation-only plugins (alphagenome, cosmic_cgc, dbnsfp,
ensembl_gene, gevir, gnomad_metrics, pqtl) returned a placeholder
'sha256:phase-k-stub-not-implemented' fingerprint. That fake sha256 was
stamped into provenance as if a real probe ran, and at drift time the
missing baseline surfaced as an indistinct 'probe_failed' (or a
false-green 'clean' had a baseline been regenerated from the stub).

Option 1 from #177:
- Add api.stub_fingerprint(reason) returning a structured sentinel
  {probe_status: 'stub', reason, fingerprint: 'stub:no-programmatic-source'}.
  run_builder records the honest token instead of a fake sha256.
- drift_runner invokes the probe first and short-circuits a probe_status
  == 'stub' result to a distinct status='stub' (no baseline required).
- drift_cli prints a visible WARNING (stderr) for stubs and exits clean,
  so an intentional doc-only stub is never a silent false-green.
- Rewrite the 7 stub probes to call stub_fingerprint() with a
  provider-specific reason.

Adds one regression test guarding stub -> status='stub' (not clean/
probe_failed).

* fix(drift): emit stub WARNING to stderr in --json mode + surface stubs in CI summary (#186 review)

Addresses Copilot's review on PR #186: the stub WARNING was only emitted
on the human-readable output path, so `hvantk drift --all --json` (used by
.github/workflows/drift.yml, which captures stdout to a file) re-hid
documentation-only stubs.

- drift_cli: move the stub WARNING into a shared loop that runs in BOTH
  --json and human modes, writing to stderr so machine-readable stdout
  stays clean while CI step logs stay visibly non-green.
- drift_to_pr.py: count `stub` entries in the summary and (new handle_stub)
  write a `- STUB:` line to GITHUB_STEP_SUMMARY, symmetric with
  drifted/probe_failed, so stubs are not dropped from the rendered CI
  summary. Docstring updated to document the `stub` status.
- test: regression test asserting --json mode emits the stub WARNING to
  stderr (and keeps stdout clean JSON).

The drift_to_pr.py step-summary handler was surfaced by an adversarial
review of the CLI fix and verified offline via a --dry-run synthetic report.

* test(drift): make stub-WARNING test work on Click >= 8.2 (mix_stderr removed)

CI runs Click >= 8.2, which removed the `CliRunner(mix_stderr=...)` kwarg
(streams are always captured separately there), so the test raised
TypeError: CliRunner.__init__() got an unexpected keyword argument
'mix_stderr'.

- Construct the runner with mix_stderr=False on Click 8.1.x, falling back
  to CliRunner() on >= 8.2 (try/except TypeError).
- Assert on `result.stdout` (stdout-only on both versions) rather than
  `result.output` (which mixes stderr back in on >= 8.2).

Verified the exact assertions pass under both Click 8.1.8 and 8.4.1.
…rden dedup, qc sci-notation, lint) (#187)

* fix(#185): safe deferred-review polish (catalog data_source, gff3, burden dedup, qc sci-notation, lint)

Addresses the mechanical / behavior-preserving subset of #185. Each item was
re-verified at HEAD (post PR-5..PR-8 refactor) before changing.

Catalog metadata:
- Normalize data_source "Custom" -> canonical names across the 10 migrated
  genomics catalogs (ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx,
  GWAS Catalog, INSIDER, MSigDB) so `catalog list --data-source <Name>` and
  unified_registry.search(data_source=...) match the real source name.
- ensembl_gene file format "gff" -> "gff3" (path is *.gff3.gz; cosmetic).

Code:
- enrichex/burden.py _build_gene_to_sets_ht: dedup genes within a set via
  dict.fromkeys (order-preserving) so a duplicated gene is not double-counted
  after explode_rows. Latent (parse_geneset_tsv dedups upstream); guards other
  callers. + regression test (verified non-vacuous).
- hgc/qc_report.py render_qc_summary_markdown: scientific notation for very
  small magnitudes (tiny HWE p-values) instead of collapsing to "0.000";
  exact 0 stays "0.000".
- tools/plugins/plugins_cli.py: split the bundled-schema read and the user
  catalog read into separate try blocks (correct error attribution) and chain
  exceptions with `from exc` (Ruff B904).
- tools/enrichex/burden_cli.py: drop placeholder-less f-string prefixes (Ruff
  F541), 7 occurrences.

Deliberately NOT included (need maintainer decisions, tracked in #185):
- schema-enum tightening for data_source
- gnomad_metrics variant_types SV/CNV scope (domain call)
- gwas_catalog / msigdb URL pinning vs provenance-only
- unified_registry.py fail-fast for an unmapped primary_domain (blast radius)

* fix(#185): dedup gene_sets at compute_geneset_burden_mt entry for size/coverage consistency (#187 review)

Copilot flagged that deduping only inside _build_gene_to_sets_ht left
gene_set_size (len(genes)), gene_coverage_pct, and the min_gene_set_size
filter computed from the RAW list — so a caller passing duplicate genes
within a set would get a deduped membership (n_genes_found) inconsistent
with a non-deduped size/denominator.

Fix: normalize each set's gene list once at the top of
compute_geneset_burden_mt (order-preserving dict.fromkeys) so the min-size
filter, gene_set_size, gene_coverage_pct, and the gene->set membership all
derive from the same deduped genes. No-op for the canonical pipeline
(parse_geneset_tsv already dedups). The _build_gene_to_sets_ht dedup is kept
as defense for direct callers.

+ regression test asserting gene_set_size reflects the deduped count.
@qodo-code-review

Copy link
Copy Markdown
Contributor

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@enriquea enriquea requested a review from Copilot June 8, 2026 08:56
@enriquea enriquea self-assigned this Jun 8, 2026
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d98497f-797c-4a0a-bc05-1c07e4012e75

📥 Commits

Reviewing files that changed from the base of the PR and between 9424342 and c4484ab.

📒 Files selected for processing (2)
  • hvantk/core/plugin/drift_runner.py
  • hvantk/tests/test_drift_runner.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • hvantk/core/plugin/drift_runner.py

📝 Walkthrough

Walkthrough

Adds a "stub" probe sentinel and shared stub helper; enriches drift-runner results and flows to surface stub warnings; converts several probe modules to use the stub helper; adds gene-set deduplication and tests; corrects dataset catalog metadata and minor CLI/formatting tweaks.

Changes

Drift Probe Stub Status Infrastructure

Layer / File(s) Summary
Core stub sentinel and helper function
hvantk/core/plugin/api.py
Adds PROBE_STATUS_STUB, STUB_FINGERPRINT_TOKEN, and stub_fingerprint(reason) to produce a standardized stub fingerprint payload.
DriftResult enrichment and probe ordering
hvantk/core/plugin/drift_runner.py
Expands DriftResult with observed, expected, diff, and probe_error; runs probe before loading baseline; returns early for stub with observed payload; wraps/augments probe errors and includes observed when baseline missing.
CLI warning emission for stub status
hvantk/tools/plugins/drift_cli.py
Emits a formatted stderr WARNING for results with status == "stub" (JSON and human modes), extracting reason from observed payloads while preserving exit-code logic.
PR generation script stub handling
.github/scripts/drift_to_pr.py
Counts stub entries in summaries and adds handle_stub() to record stubs into the GitHub step summary instead of creating PRs.
Drift probe modules converted to shared stub helper
hvantk/skills/{alphagenome,cosmic_cgc,dbnsfp,ensembl_gene,gevir,gnomad_metrics,pqtl}/drift_probe.py
Replaces hardcoded placeholder dicts with calls to stub_fingerprint(_REASON) and updated module docstrings explaining stub rationale.
Drift CLI and runner regression tests for stub status
hvantk/tests/test_drift_cli.py, hvantk/tests/test_drift_runner.py
Adds tests verifying stub probes yield status == "stub", CLI emits stderr warnings without contaminating JSON stdout, and probe-failure + missing-baseline preserves both messages.

Gene-Set Burden Deduplication

Layer / File(s) Summary
Deduplication in gene-to-set mapping and burden computation
hvantk/algorithms/enrichex/burden.py
Deduplicates gene lists per set (order-preserving via dict.fromkeys) in _build_gene_to_sets_ht and early in compute_geneset_burden_mt before size filtering and membership/coverage calculations.
Deduplication regression tests
hvantk/tests/enrichex/test_burden_hail.py
Adds tests asserting deduplication prevents double-counting and that compute_geneset_burden_mt uses deduplicated gene_set_size.

Dataset Catalog and Metadata Updates

Layer / File(s) Summary
Dataset data_source field corrections
hvantk/skills/{clingen,clinvar,dbnsfp,ensembl_gene,gevir,gnomad_metrics,gtex_eqtl,gwas_catalog,insider,msigdb}/catalog/datasets.json
Updates data_source from "Custom" to canonical upstream source names (ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB).
File format specification correction
hvantk/skills/ensembl_gene/catalog/datasets.json
Changes the GFF file format from "gff" to "gff3".

Supporting Improvements

Layer / File(s) Summary
QC report numeric formatting rule
hvantk/algorithms/hgc/qc_report.py
Refines numeric formatting selection to avoid collapsing very small non-zero values to "0.000" while keeping exact zero fixed.
CLI output formatting and error handling
hvantk/tools/enrichex/burden_cli.py, hvantk/tools/plugins/plugins_cli.py
Removes f-strings from burden CLI headers for consistency and splits plugin validate error handling to provide clearer, separate messages for bundled schema vs catalog file read/parse failures.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • bigbio/hvantk#184: Both PRs touch hvantk/algorithms/hgc/qc_report.py—this PR refines numeric formatting while the related PR introduced related rendering helpers.

Suggested labels

documentation, enhancement, Review effort 5/5

Suggested reviewers

  • ypriverol

Poem

🐰 I stamp a stub with gentle cheer,
A warning bell the runners hear.
Genes pruned clean so counts stay true,
Catalogs fixed and tests anew.
The rabbit hops — the changes clear!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title "Fix documentation and deduplication issues" is generic and vague, using a high-level umbrella term without clearly conveying the main technical contribution of the PR. Consider using a more specific title that highlights the primary change, such as "Add stub status support for documentation-only datasets" or "Support documentation-only datasets with stub status in drift detection."
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production

codacy-production Bot commented Jun 8, 2026

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 12 high · 26 minor

Alerts:
⚠ 38 issues (≤ 0 issues of at least minor severity)

Results:
38 new issues

Category Results
Documentation 26 minor
ErrorProne 1 high
Security 11 high

View in Codacy

🟢 Metrics 8 complexity

Metric Results
Complexity 8

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the drift-detection pipeline to explicitly support documentation-only (“stub”) datasets, ensuring they surface as visible warnings rather than producing misleading “clean” or “probe_failed” results. It also tightens gene set burden calculations by deduplicating within-set gene lists consistently, and applies small catalog and report-rendering corrections.

Changes:

  • Add a first-class stub probe contract (stub_fingerprint / PROBE_STATUS_STUB) and propagate stub status through drift runner, CLI output, and CI reporting.
  • Normalize/deduplicate gene lists within gene sets to prevent double-counting and inconsistent set-size/coverage metrics; add regression tests.
  • Apply minor catalog metadata fixes and improve QC markdown formatting for very small numeric values.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
.github/scripts/drift_to_pr.py Surfaces stub entries in CI step summary output without opening PRs for them.
hvantk/core/plugin/api.py Introduces standardized stub fingerprint API/constants for probes that cannot be programmatically fingerprinted.
hvantk/core/plugin/drift_runner.py Propagates stub probe results as status="stub" and changes runner ordering to probe-first to avoid false “probe_failed” on stub baselines.
hvantk/tools/plugins/drift_cli.py Emits stub WARNINGs to stderr in both human and --json modes while keeping stdout machine-readable.
hvantk/tools/plugins/plugins_cli.py Improves error handling when reading bundled catalog schema / catalog JSON.
hvantk/algorithms/enrichex/burden.py Deduplicates genes within each gene set (order-preserving) to keep sizes/membership consistent and avoid double counting.
hvantk/tools/enrichex/burden_cli.py Minor string formatting cleanup in CLI output.
hvantk/algorithms/hgc/qc_report.py Uses scientific notation for very small magnitudes to avoid rendering as 0.000.
hvantk/tests/test_drift_runner.py Adds regression test ensuring stub probes are reported as stub (not false-green or probe_failed) when baseline is missing.
hvantk/tests/test_drift_cli.py Adds regression test ensuring stub WARNING appears on stderr even in --json mode.
hvantk/tests/enrichex/test_burden_hail.py Adds regression tests for within-set deduplication affecting gene_set_size and gene→set membership.
hvantk/skills/alphagenome/drift_probe.py Converts stub probe implementation to standardized stub_fingerprint(...).
hvantk/skills/cosmic_cgc/drift_probe.py Converts stub probe implementation to standardized stub_fingerprint(...).
hvantk/skills/dbnsfp/drift_probe.py Converts stub probe implementation to standardized stub_fingerprint(...).
hvantk/skills/ensembl_gene/drift_probe.py Converts stub probe implementation to standardized stub_fingerprint(...).
hvantk/skills/gevir/drift_probe.py Converts stub probe implementation to standardized stub_fingerprint(...).
hvantk/skills/gnomad_metrics/drift_probe.py Converts stub probe implementation to standardized stub_fingerprint(...).
hvantk/skills/pqtl/drift_probe.py Converts stub probe implementation to standardized stub_fingerprint(...).
hvantk/skills/clingen/catalog/datasets.json Corrects data_source metadata.
hvantk/skills/clinvar/catalog/datasets.json Corrects data_source metadata.
hvantk/skills/dbnsfp/catalog/datasets.json Corrects data_source metadata.
hvantk/skills/ensembl_gene/catalog/datasets.json Corrects data_source metadata and fixes file format from gff to gff3.
hvantk/skills/gevir/catalog/datasets.json Corrects data_source metadata.
hvantk/skills/gnomad_metrics/catalog/datasets.json Corrects data_source metadata.
hvantk/skills/gtex_eqtl/catalog/datasets.json Corrects data_source metadata.
hvantk/skills/gwas_catalog/catalog/datasets.json Corrects data_source metadata.
hvantk/skills/insider/catalog/datasets.json Corrects data_source metadata.
hvantk/skills/msigdb/catalog/datasets.json Corrects data_source metadata.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread hvantk/core/plugin/drift_runner.py Outdated
Comment on lines 47 to 53
try:
observed = _invoke_with_timeout(spec.drift_probe, timeout=timeout)
except DriftProbeError as exc:
return DriftResult(
dataset_name=spec.name,
status="probe_failed",
expected=expected,
probe_error=exc,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid — verified at drift_runner.py (the #186 probe-first reordering): the two probe-failure except branches early-returned probe_failed without ever reading spec.test_paths.drift_fingerprint, so a probe exception combined with a missing baseline surfaced only the probe error.

Fixed in c4484ab: both probe-failure paths now go through a _probe_failed() helper that appends expected fingerprint is missing at <path> when the baseline file is absent, while preserving the underlying probe error. No change when a baseline exists (so real plugins are unaffected). Added a regression test asserting the error mentions both the probe failure and the missing fingerprint.

…188 review)

Copilot (PR #188) noted that the probe-first ordering introduced in #186
means a probe_failed early-return no longer mentions a missing expected
fingerprint: if the probe raises AND no baseline is committed, only the probe
error surfaced, masking the missing-baseline config error.

Route both probe-failure paths through a _probe_failed() helper that appends
"expected fingerprint is missing at <path>" when the baseline file is absent,
while preserving the underlying probe error. No change when a baseline exists.

+ regression test (probe raises + no baseline -> error mentions both).
@enriquea enriquea merged commit 63f166e into main Jun 8, 2026
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants