fix(#185): safe deferred-review polish (catalog data_source, gff3, burden dedup, qc sci-notation, lint)#187
Conversation
…rden dedup, qc sci-notation, lint) Addresses the mechanical / behavior-preserving subset of #185. Each item was re-verified at HEAD (post PR-5..PR-8 refactor) before changing. Catalog metadata: - Normalize data_source "Custom" -> canonical names across the 10 migrated genomics catalogs (ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB) so `catalog list --data-source <Name>` and unified_registry.search(data_source=...) match the real source name. - ensembl_gene file format "gff" -> "gff3" (path is *.gff3.gz; cosmetic). Code: - enrichex/burden.py _build_gene_to_sets_ht: dedup genes within a set via dict.fromkeys (order-preserving) so a duplicated gene is not double-counted after explode_rows. Latent (parse_geneset_tsv dedups upstream); guards other callers. + regression test (verified non-vacuous). - hgc/qc_report.py render_qc_summary_markdown: scientific notation for very small magnitudes (tiny HWE p-values) instead of collapsing to "0.000"; exact 0 stays "0.000". - tools/plugins/plugins_cli.py: split the bundled-schema read and the user catalog read into separate try blocks (correct error attribution) and chain exceptions with `from exc` (Ruff B904). - tools/enrichex/burden_cli.py: drop placeholder-less f-string prefixes (Ruff F541), 7 occurrences. Deliberately NOT included (need maintainer decisions, tracked in #185): - schema-enum tightening for data_source - gnomad_metrics variant_types SV/CNV scope (domain call) - gwas_catalog / msigdb URL pinning vs provenance-only - unified_registry.py fail-fast for an unmapped primary_domain (blast radius)
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR applies a set of behavior-preserving (or safety-tightening) “polish” fixes tracked in #185: normalizing plugin catalog metadata for better search/filtering, tightening CLI error reporting/lint compliance, and adding a regression guard against within-gene-set duplicate genes affecting burden results.
Changes:
- Normalize
data_sourcevalues across migrated genomics catalogs (and fix Ensembl GFF3format) to make registry/CLI filtering match real sources. - Prevent duplicate genes within a single gene set from being double-counted in burden calculations, and add a Hail regression test.
- Improve QC markdown formatting for tiny statistics, and polish CLI code (exception chaining + remove placeholder-less f-strings).
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| hvantk/tools/plugins/plugins_cli.py | Splits schema vs catalog reads into separate try blocks and chains exceptions for clearer CLI errors (Ruff B904). |
| hvantk/tools/enrichex/burden_cli.py | Removes placeholder-less f-strings (Ruff F541) in CLI output paths. |
| hvantk/tests/enrichex/test_burden_hail.py | Adds Hail regression tests for within-set dedup behavior in _build_gene_to_sets_ht. |
| hvantk/skills/msigdb/catalog/datasets.json | Normalizes data_source to MSigDB. |
| hvantk/skills/insider/catalog/datasets.json | Normalizes data_source to INSIDER. |
| hvantk/skills/gwas_catalog/catalog/datasets.json | Normalizes data_source to GWAS Catalog. |
| hvantk/skills/gtex_eqtl/catalog/datasets.json | Normalizes data_source to GTEx. |
| hvantk/skills/gnomad_metrics/catalog/datasets.json | Normalizes data_source to gnomAD. |
| hvantk/skills/gevir/catalog/datasets.json | Normalizes data_source to GeVIR. |
| hvantk/skills/ensembl_gene/catalog/datasets.json | Normalizes data_source to Ensembl and corrects file format from gff to gff3. |
| hvantk/skills/dbnsfp/catalog/datasets.json | Normalizes data_source to dbNSFP. |
| hvantk/skills/clinvar/catalog/datasets.json | Normalizes data_source to ClinVar. |
| hvantk/skills/clingen/catalog/datasets.json | Normalizes data_source to ClinGen. |
| hvantk/algorithms/hgc/qc_report.py | Uses scientific notation for very small magnitudes in QC summary markdown rendering. |
| hvantk/algorithms/enrichex/burden.py | Dedups genes within each gene set (order-preserving) to prevent downstream double-counting after explode_rows. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # dict.fromkeys dedups within a single set while preserving order, so a | ||
| # gene listed twice in one set is not double-counted after explode_rows | ||
| # (n_genes_found / burden). set() would lose order; the canonical | ||
| # parse_geneset_tsv already dedups upstream, so this guards other callers. | ||
| for gene in dict.fromkeys(genes): |
There was a problem hiding this comment.
Valid — verified in compute_geneset_burden_mt: min_gene_set_size filtered on raw len(v) (line 404), gene_set_size was len(genes) (line 537), and gene_coverage_pct divided by that raw size (line 552), while n_genes_found came from the deduped membership — so a non-deduplicated caller would get inconsistent size/coverage.
Fixed in 9c3ce41 by normalizing each set's gene list once at the entry of compute_geneset_burden_mt (gene_sets = {name: list(dict.fromkeys(genes)) …}), so the min-size filter, gene_set_size, gene_coverage_pct, and the gene→set membership all derive from the same deduped genes. It's a no-op for the canonical pipeline (parse_geneset_tsv already dedups). Kept the _build_gene_to_sets_ht dedup as defense for direct callers, and added a regression test asserting gene_set_size reflects the deduped count.
…e/coverage consistency (#187 review) Copilot flagged that deduping only inside _build_gene_to_sets_ht left gene_set_size (len(genes)), gene_coverage_pct, and the min_gene_set_size filter computed from the RAW list — so a caller passing duplicate genes within a set would get a deduped membership (n_genes_found) inconsistent with a non-deduped size/denominator. Fix: normalize each set's gene list once at the top of compute_geneset_burden_mt (order-preserving dict.fromkeys) so the min-size filter, gene_set_size, gene_coverage_pct, and the gene->set membership all derive from the same deduped genes. No-op for the canonical pipeline (parse_geneset_tsv already dedups). The _build_gene_to_sets_ht dedup is kept as defense for direct callers. + regression test asserting gene_set_size reflects the deduped count.
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| Documentation | 9 minor |
🟢 Metrics 4 complexity
Metric Results Complexity 4
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Addresses the safe, mechanical / behavior-preserving subset of #185 (deferred review findings from the #184 release). Each item was re-verified at HEAD (after the PR-5..PR-8 refactor moved code under
algorithms/andtools/) via a parallel verification pass before any edit; no existing test asserted the old values.Included
Catalog metadata
data_sourcenormalization — the 10 migrated genomics catalogs carried the placeholder"Custom", socatalog list --data-source ClinGenandunified_registry.search(data_source=…)never matched the real source name. Normalized as a set to canonical names: ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB. (Verified:catalog list --data-source ClinGennow returns the ClinGen entry.)ensembl_genefileformat"gff"→"gff3"(path is*.gff3.gz, description says GFF3). Cosmetic.Code
enrichex/burden.py_build_gene_to_sets_ht— dedup genes within a set viadict.fromkeys(order-preserving) so a gene listed twice in one set isn't double-counted afterexplode_rows(n_genes_found/burden). Latent today (parse_geneset_tsvdedups upstream); this guards other callers. + regression test (TestBuildGeneToSetsHt), verified non-vacuous (fails on the pre-fix code, passes after).hgc/qc_report.pyrender_qc_summary_markdown— scientific notation for very small magnitudes (e.g. tiny HWE p-values) instead of collapsing to0.000; an exact0stays0.000.tools/plugins/plugins_cli.py— split the bundled-schema read and the user-catalog read into separatetryblocks (so the error message blames the right file) and chain exceptions withraise … from exc(Ruff B904).tools/enrichex/burden_cli.py— drop placeholder-lessfprefixes (Ruff F541), 7 occurrences.Deliberately NOT included — need maintainer decisions (still tracked in #185)
data_sourceto an enum.gnomad_metricsvariant_typesSV/CNV scope (domain call — gnomAD v4 does publish SV/CNV).gwas_catalog/msigdbURL: pin to a release-specific URL vs mark provenance-only.resources/unified_registry.pyfail-fast on an unmappedprimary_domain(deliberate design choice with real blast radius).Verification
catalog list --data-sourcefilter works by canonical name.E9,F63,F7,F82) clean; all 7 F541 cleared inburden_cli.py.test_plugins_cli,test_unified_registry_per_plugin,test_cli_catalog,hgc/test_cli_qc,hgc/test_hgc_qc_visualization, and the fullenrichex/dir (32).