Skip to content

fix(#185): safe deferred-review polish (catalog data_source, gff3, burden dedup, qc sci-notation, lint)#187

Merged
enriquea merged 2 commits into
devfrom
fix/185-safe-polish
Jun 4, 2026
Merged

fix(#185): safe deferred-review polish (catalog data_source, gff3, burden dedup, qc sci-notation, lint)#187
enriquea merged 2 commits into
devfrom
fix/185-safe-polish

Conversation

@enriquea

@enriquea enriquea commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Addresses the safe, mechanical / behavior-preserving subset of #185 (deferred review findings from the #184 release). Each item was re-verified at HEAD (after the PR-5..PR-8 refactor moved code under algorithms/ and tools/) via a parallel verification pass before any edit; no existing test asserted the old values.

Included

Catalog metadata

  • data_source normalization — the 10 migrated genomics catalogs carried the placeholder "Custom", so catalog list --data-source ClinGen and unified_registry.search(data_source=…) never matched the real source name. Normalized as a set to canonical names: ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB. (Verified: catalog list --data-source ClinGen now returns the ClinGen entry.)
  • ensembl_gene file format "gff""gff3" (path is *.gff3.gz, description says GFF3). Cosmetic.

Code

  • enrichex/burden.py _build_gene_to_sets_ht — dedup genes within a set via dict.fromkeys (order-preserving) so a gene listed twice in one set isn't double-counted after explode_rows (n_genes_found / burden). Latent today (parse_geneset_tsv dedups upstream); this guards other callers. + regression test (TestBuildGeneToSetsHt), verified non-vacuous (fails on the pre-fix code, passes after).
  • hgc/qc_report.py render_qc_summary_markdown — scientific notation for very small magnitudes (e.g. tiny HWE p-values) instead of collapsing to 0.000; an exact 0 stays 0.000.
  • tools/plugins/plugins_cli.py — split the bundled-schema read and the user-catalog read into separate try blocks (so the error message blames the right file) and chain exceptions with raise … from exc (Ruff B904).
  • tools/enrichex/burden_cli.py — drop placeholder-less f prefixes (Ruff F541), 7 occurrences.

Deliberately NOT included — need maintainer decisions (still tracked in #185)

  • Tightening the catalog-schema data_source to an enum.
  • gnomad_metrics variant_types SV/CNV scope (domain call — gnomAD v4 does publish SV/CNV).
  • gwas_catalog / msigdb URL: pin to a release-specific URL vs mark provenance-only.
  • resources/unified_registry.py fail-fast on an unmapped primary_domain (deliberate design choice with real blast radius).

Verification

  • 10 catalog JSONs parse; catalog list --data-source filter works by canonical name.
  • flake8 (CI selection E9,F63,F7,F82) clean; all 7 F541 cleared in burden_cli.py.
  • Non-Hail suites pass: test_plugins_cli, test_unified_registry_per_plugin, test_cli_catalog, hgc/test_cli_qc, hgc/test_hgc_qc_visualization, and the full enrichex/ dir (32).
  • Hail dedup regression test passes; proven discriminating against the pre-fix code.

…rden dedup, qc sci-notation, lint)

Addresses the mechanical / behavior-preserving subset of #185. Each item was
re-verified at HEAD (post PR-5..PR-8 refactor) before changing.

Catalog metadata:
- Normalize data_source "Custom" -> canonical names across the 10 migrated
  genomics catalogs (ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx,
  GWAS Catalog, INSIDER, MSigDB) so `catalog list --data-source <Name>` and
  unified_registry.search(data_source=...) match the real source name.
- ensembl_gene file format "gff" -> "gff3" (path is *.gff3.gz; cosmetic).

Code:
- enrichex/burden.py _build_gene_to_sets_ht: dedup genes within a set via
  dict.fromkeys (order-preserving) so a duplicated gene is not double-counted
  after explode_rows. Latent (parse_geneset_tsv dedups upstream); guards other
  callers. + regression test (verified non-vacuous).
- hgc/qc_report.py render_qc_summary_markdown: scientific notation for very
  small magnitudes (tiny HWE p-values) instead of collapsing to "0.000";
  exact 0 stays "0.000".
- tools/plugins/plugins_cli.py: split the bundled-schema read and the user
  catalog read into separate try blocks (correct error attribution) and chain
  exceptions with `from exc` (Ruff B904).
- tools/enrichex/burden_cli.py: drop placeholder-less f-string prefixes (Ruff
  F541), 7 occurrences.

Deliberately NOT included (need maintainer decisions, tracked in #185):
- schema-enum tightening for data_source
- gnomad_metrics variant_types SV/CNV scope (domain call)
- gwas_catalog / msigdb URL pinning vs provenance-only
- unified_registry.py fail-fast for an unmapped primary_domain (blast radius)
@qodo-code-review

Copy link
Copy Markdown
Contributor

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8ead58b3-4d1c-48ec-b9a7-6f7f9eea97cc

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/185-safe-polish

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR applies a set of behavior-preserving (or safety-tightening) “polish” fixes tracked in #185: normalizing plugin catalog metadata for better search/filtering, tightening CLI error reporting/lint compliance, and adding a regression guard against within-gene-set duplicate genes affecting burden results.

Changes:

  • Normalize data_source values across migrated genomics catalogs (and fix Ensembl GFF3 format) to make registry/CLI filtering match real sources.
  • Prevent duplicate genes within a single gene set from being double-counted in burden calculations, and add a Hail regression test.
  • Improve QC markdown formatting for tiny statistics, and polish CLI code (exception chaining + remove placeholder-less f-strings).

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
hvantk/tools/plugins/plugins_cli.py Splits schema vs catalog reads into separate try blocks and chains exceptions for clearer CLI errors (Ruff B904).
hvantk/tools/enrichex/burden_cli.py Removes placeholder-less f-strings (Ruff F541) in CLI output paths.
hvantk/tests/enrichex/test_burden_hail.py Adds Hail regression tests for within-set dedup behavior in _build_gene_to_sets_ht.
hvantk/skills/msigdb/catalog/datasets.json Normalizes data_source to MSigDB.
hvantk/skills/insider/catalog/datasets.json Normalizes data_source to INSIDER.
hvantk/skills/gwas_catalog/catalog/datasets.json Normalizes data_source to GWAS Catalog.
hvantk/skills/gtex_eqtl/catalog/datasets.json Normalizes data_source to GTEx.
hvantk/skills/gnomad_metrics/catalog/datasets.json Normalizes data_source to gnomAD.
hvantk/skills/gevir/catalog/datasets.json Normalizes data_source to GeVIR.
hvantk/skills/ensembl_gene/catalog/datasets.json Normalizes data_source to Ensembl and corrects file format from gff to gff3.
hvantk/skills/dbnsfp/catalog/datasets.json Normalizes data_source to dbNSFP.
hvantk/skills/clinvar/catalog/datasets.json Normalizes data_source to ClinVar.
hvantk/skills/clingen/catalog/datasets.json Normalizes data_source to ClinGen.
hvantk/algorithms/hgc/qc_report.py Uses scientific notation for very small magnitudes in QC summary markdown rendering.
hvantk/algorithms/enrichex/burden.py Dedups genes within each gene set (order-preserving) to prevent downstream double-counting after explode_rows.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +284 to +288
# dict.fromkeys dedups within a single set while preserving order, so a
# gene listed twice in one set is not double-counted after explode_rows
# (n_genes_found / burden). set() would lose order; the canonical
# parse_geneset_tsv already dedups upstream, so this guards other callers.
for gene in dict.fromkeys(genes):

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid — verified in compute_geneset_burden_mt: min_gene_set_size filtered on raw len(v) (line 404), gene_set_size was len(genes) (line 537), and gene_coverage_pct divided by that raw size (line 552), while n_genes_found came from the deduped membership — so a non-deduplicated caller would get inconsistent size/coverage.

Fixed in 9c3ce41 by normalizing each set's gene list once at the entry of compute_geneset_burden_mt (gene_sets = {name: list(dict.fromkeys(genes)) …}), so the min-size filter, gene_set_size, gene_coverage_pct, and the gene→set membership all derive from the same deduped genes. It's a no-op for the canonical pipeline (parse_geneset_tsv already dedups). Kept the _build_gene_to_sets_ht dedup as defense for direct callers, and added a regression test asserting gene_set_size reflects the deduped count.

…e/coverage consistency (#187 review)

Copilot flagged that deduping only inside _build_gene_to_sets_ht left
gene_set_size (len(genes)), gene_coverage_pct, and the min_gene_set_size
filter computed from the RAW list — so a caller passing duplicate genes
within a set would get a deduped membership (n_genes_found) inconsistent
with a non-deduped size/denominator.

Fix: normalize each set's gene list once at the top of
compute_geneset_burden_mt (order-preserving dict.fromkeys) so the min-size
filter, gene_set_size, gene_coverage_pct, and the gene->set membership all
derive from the same deduped genes. No-op for the canonical pipeline
(parse_geneset_tsv already dedups). The _build_gene_to_sets_ht dedup is kept
as defense for direct callers.

+ regression test asserting gene_set_size reflects the deduped count.
@enriquea enriquea merged commit 9424342 into dev Jun 4, 2026
2 checks passed
@enriquea enriquea deleted the fix/185-safe-polish branch June 4, 2026 23:36
@codacy-production

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 9 minor

Alerts:
⚠ 9 issues (≤ 0 issues of at least minor severity)

Results:
9 new issues

Category Results
Documentation 9 minor

View in Codacy

🟢 Metrics 4 complexity

Metric Results
Complexity 4

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants