fix(#185): safe deferred-review polish (catalog data_source, gff3, burden dedup, qc sci-notation, lint) by enriquea · Pull Request #187 · bigbio/hvantk

enriquea · 2026-06-04T23:05:53Z

Addresses the safe, mechanical / behavior-preserving subset of #185 (deferred review findings from the #184 release). Each item was re-verified at HEAD (after the PR-5..PR-8 refactor moved code under algorithms/ and tools/) via a parallel verification pass before any edit; no existing test asserted the old values.

Included

Catalog metadata

data_source normalization — the 10 migrated genomics catalogs carried the placeholder "Custom", so catalog list --data-source ClinGen and unified_registry.search(data_source=…) never matched the real source name. Normalized as a set to canonical names: ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB. (Verified: catalog list --data-source ClinGen now returns the ClinGen entry.)
ensembl_gene file format "gff" → "gff3" (path is *.gff3.gz, description says GFF3). Cosmetic.

Code

enrichex/burden.py _build_gene_to_sets_ht — dedup genes within a set via dict.fromkeys (order-preserving) so a gene listed twice in one set isn't double-counted after explode_rows (n_genes_found / burden). Latent today (parse_geneset_tsv dedups upstream); this guards other callers. + regression test (TestBuildGeneToSetsHt), verified non-vacuous (fails on the pre-fix code, passes after).
hgc/qc_report.py render_qc_summary_markdown — scientific notation for very small magnitudes (e.g. tiny HWE p-values) instead of collapsing to 0.000; an exact 0 stays 0.000.
tools/plugins/plugins_cli.py — split the bundled-schema read and the user-catalog read into separate try blocks (so the error message blames the right file) and chain exceptions with raise … from exc (Ruff B904).
tools/enrichex/burden_cli.py — drop placeholder-less f prefixes (Ruff F541), 7 occurrences.

Deliberately NOT included — need maintainer decisions (still tracked in #185)

Tightening the catalog-schema data_source to an enum.
gnomad_metrics variant_types SV/CNV scope (domain call — gnomAD v4 does publish SV/CNV).
gwas_catalog / msigdb URL: pin to a release-specific URL vs mark provenance-only.
resources/unified_registry.py fail-fast on an unmapped primary_domain (deliberate design choice with real blast radius).

Verification

10 catalog JSONs parse; catalog list --data-source filter works by canonical name.
flake8 (CI selection E9,F63,F7,F82) clean; all 7 F541 cleared in burden_cli.py.
Non-Hail suites pass: test_plugins_cli, test_unified_registry_per_plugin, test_cli_catalog, hgc/test_cli_qc, hgc/test_hgc_qc_visualization, and the full enrichex/ dir (32).
Hail dedup regression test passes; proven discriminating against the pre-fix code.

…rden dedup, qc sci-notation, lint) Addresses the mechanical / behavior-preserving subset of #185. Each item was re-verified at HEAD (post PR-5..PR-8 refactor) before changing. Catalog metadata: - Normalize data_source "Custom" -> canonical names across the 10 migrated genomics catalogs (ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB) so `catalog list --data-source <Name>` and unified_registry.search(data_source=...) match the real source name. - ensembl_gene file format "gff" -> "gff3" (path is *.gff3.gz; cosmetic). Code: - enrichex/burden.py _build_gene_to_sets_ht: dedup genes within a set via dict.fromkeys (order-preserving) so a duplicated gene is not double-counted after explode_rows. Latent (parse_geneset_tsv dedups upstream); guards other callers. + regression test (verified non-vacuous). - hgc/qc_report.py render_qc_summary_markdown: scientific notation for very small magnitudes (tiny HWE p-values) instead of collapsing to "0.000"; exact 0 stays "0.000". - tools/plugins/plugins_cli.py: split the bundled-schema read and the user catalog read into separate try blocks (correct error attribution) and chain exceptions with `from exc` (Ruff B904). - tools/enrichex/burden_cli.py: drop placeholder-less f-string prefixes (Ruff F541), 7 occurrences. Deliberately NOT included (need maintainer decisions, tracked in #185): - schema-enum tightening for data_source - gnomad_metrics variant_types SV/CNV scope (domain call) - gwas_catalog / msigdb URL pinning vs provenance-only - unified_registry.py fail-fast for an unmapped primary_domain (blast radius)

qodo-code-review · 2026-06-04T23:05:57Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

coderabbitai · 2026-06-04T23:06:01Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8ead58b3-4d1c-48ec-b9a7-6f7f9eea97cc

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/185-safe-polish

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR applies a set of behavior-preserving (or safety-tightening) “polish” fixes tracked in #185: normalizing plugin catalog metadata for better search/filtering, tightening CLI error reporting/lint compliance, and adding a regression guard against within-gene-set duplicate genes affecting burden results.

Changes:

Normalize data_source values across migrated genomics catalogs (and fix Ensembl GFF3 format) to make registry/CLI filtering match real sources.
Prevent duplicate genes within a single gene set from being double-counted in burden calculations, and add a Hail regression test.
Improve QC markdown formatting for tiny statistics, and polish CLI code (exception chaining + remove placeholder-less f-strings).

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
hvantk/tools/plugins/plugins_cli.py	Splits schema vs catalog reads into separate `try` blocks and chains exceptions for clearer CLI errors (Ruff B904).
hvantk/tools/enrichex/burden_cli.py	Removes placeholder-less f-strings (Ruff F541) in CLI output paths.
hvantk/tests/enrichex/test_burden_hail.py	Adds Hail regression tests for within-set dedup behavior in `_build_gene_to_sets_ht`.
hvantk/skills/msigdb/catalog/datasets.json	Normalizes `data_source` to `MSigDB`.
hvantk/skills/insider/catalog/datasets.json	Normalizes `data_source` to `INSIDER`.
hvantk/skills/gwas_catalog/catalog/datasets.json	Normalizes `data_source` to `GWAS Catalog`.
hvantk/skills/gtex_eqtl/catalog/datasets.json	Normalizes `data_source` to `GTEx`.
hvantk/skills/gnomad_metrics/catalog/datasets.json	Normalizes `data_source` to `gnomAD`.
hvantk/skills/gevir/catalog/datasets.json	Normalizes `data_source` to `GeVIR`.
hvantk/skills/ensembl_gene/catalog/datasets.json	Normalizes `data_source` to `Ensembl` and corrects file `format` from `gff` to `gff3`.
hvantk/skills/dbnsfp/catalog/datasets.json	Normalizes `data_source` to `dbNSFP`.
hvantk/skills/clinvar/catalog/datasets.json	Normalizes `data_source` to `ClinVar`.
hvantk/skills/clingen/catalog/datasets.json	Normalizes `data_source` to `ClinGen`.
hvantk/algorithms/hgc/qc_report.py	Uses scientific notation for very small magnitudes in QC summary markdown rendering.
hvantk/algorithms/enrichex/burden.py	Dedups genes within each gene set (order-preserving) to prevent downstream double-counting after `explode_rows`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

enriquea · 2026-06-04T23:30:11Z

+        # dict.fromkeys dedups within a single set while preserving order, so a
+        # gene listed twice in one set is not double-counted after explode_rows
+        # (n_genes_found / burden). set() would lose order; the canonical
+        # parse_geneset_tsv already dedups upstream, so this guards other callers.
+        for gene in dict.fromkeys(genes):


Valid — verified in compute_geneset_burden_mt: min_gene_set_size filtered on raw len(v) (line 404), gene_set_size was len(genes) (line 537), and gene_coverage_pct divided by that raw size (line 552), while n_genes_found came from the deduped membership — so a non-deduplicated caller would get inconsistent size/coverage.

Fixed in 9c3ce41 by normalizing each set's gene list once at the entry of compute_geneset_burden_mt (gene_sets = {name: list(dict.fromkeys(genes)) …}), so the min-size filter, gene_set_size, gene_coverage_pct, and the gene→set membership all derive from the same deduped genes. It's a no-op for the canonical pipeline (parse_geneset_tsv already dedups). Kept the _build_gene_to_sets_ht dedup as defense for direct callers, and added a regression test asserting gene_set_size reflects the deduped count.

…e/coverage consistency (#187 review) Copilot flagged that deduping only inside _build_gene_to_sets_ht left gene_set_size (len(genes)), gene_coverage_pct, and the min_gene_set_size filter computed from the RAW list — so a caller passing duplicate genes within a set would get a deduped membership (n_genes_found) inconsistent with a non-deduped size/denominator. Fix: normalize each set's gene list once at the top of compute_geneset_burden_mt (order-preserving dict.fromkeys) so the min-size filter, gene_set_size, gene_coverage_pct, and the gene->set membership all derive from the same deduped genes. No-op for the canonical pipeline (parse_geneset_tsv already dedups). The _build_gene_to_sets_ht dedup is kept as defense for direct callers. + regression test asserting gene_set_size reflects the deduped count.

codacy-production · 2026-06-04T23:52:11Z

Not up to standards ⛔

🔴 Issues 9 minor

Alerts:
⚠ 9 issues (≤ 0 issues of at least minor severity)

Results:
9 new issues

Category Results

Documentation 9 minor

View in Codacy

🟢 Metrics 4 complexity

Metric Results

Complexity 4

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

enriquea requested a review from Copilot June 4, 2026 23:21

Copilot started reviewing on behalf of enriquea June 4, 2026 23:22 View session

Copilot AI reviewed Jun 4, 2026

View reviewed changes

enriquea merged commit 9424342 into dev Jun 4, 2026
2 checks passed

enriquea deleted the fix/185-safe-polish branch June 4, 2026 23:36

enriquea mentioned this pull request Jun 5, 2026

Follow-up: deferred review findings from release #184 (catalog metadata + minor polish) #185

Open

enriquea mentioned this pull request Jun 15, 2026

fix(catalog): resolve remaining #185 deferred findings #196

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(#185): safe deferred-review polish (catalog data_source, gff3, burden dedup, qc sci-notation, lint)#187

fix(#185): safe deferred-review polish (catalog data_source, gff3, burden dedup, qc sci-notation, lint)#187
enriquea merged 2 commits into
devfrom
fix/185-safe-polish

enriquea commented Jun 4, 2026

Uh oh!

qodo-code-review Bot commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

Review skipped

Uh oh!

Copilot AI left a comment

Uh oh!

enriquea Jun 4, 2026

Uh oh!

Uh oh!

codacy-production Bot commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

enriquea commented Jun 4, 2026

Included

Deliberately NOT included — need maintainer decisions (still tracked in #185)

Verification

Uh oh!

qodo-code-review Bot commented Jun 4, 2026

Qodo reviews are paused for this user.

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

enriquea Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codacy-production Bot commented Jun 4, 2026

Not up to standards ⛔

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading