Captures the CodeRabbit findings from the dev → main release PR #184 that were deferred rather than fixed in that release. Each was independently verified against the code at HEAD; all are either pre-existing (ported verbatim by #178 / preserved byte-identically by the PR-5..PR-8 refactors) or design/data decisions — so fixing them inside a behavior-preserving release-to-main PR was out of scope. They are real-but-non-blocking and belong here.
Catalog metadata (plugin-sourced catalog, #178)
data_source: "Custom" normalization — clingen/clinvar/dbnsfp (and all 10 migrated genomics catalogs) carry the placeholder "Custom", which unified_registry.search(data_source=…) and catalog list --data-source filter on, so e.g. --data-source ClinGen won't match. Schema (catalog_entry.schema.json) allows free-text, so it's valid but not searchable by real source name. Fix as a set: normalize all to canonical names (ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB) and consider tightening the schema to an enum. (Doing only the 3 CodeRabbit-flagged entries would leave the genomics set inconsistent.)
ensembl_gene file format: "gff" → should be "gff3" (path is *.gff3.gz, description says GFF3). Cosmetic; no consumer reads the literal.
gnomad_metrics variant_types lists SV/CNV but only *.sites.vcf.bgz (SNV/indel) files are declared. Either narrow to ["SNV","indel"] or add the gnomAD v4 SV/CNV release files. (Domain judgment — gnomAD v4 does publish SV/CNV products.)
gwas_catalog URL — entry is accessioned (e115 / r2026-04-27) but url is the rolling …/downloads/full endpoint (which drift_probe.py notes is now 404). Pin to a release-specific URL or explicitly mark unpinned; also align SKILL.md's stale registry path.
msigdb URL — pins c2.cp.v2026.1.Hs.symbols.gmt + size but url is the generic collections landing page (source is login/license-gated → manual). Mark provenance-only or record the versioned URL. (Same landing-page pattern as insider/ucsc/gtex.)
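The data_source normalization described in the first item of this section could look roughly like the sketch below. It assumes each catalog entry is a dict with a `catalog` identifier and a `data_source` field. The `catalog` key, the mapping table, and both helper names are hypothetical and only cover catalog ids named in this issue; the real entry layout in the repo may differ.

```python
# Canonical names proposed in this issue (candidate values for a schema enum).
CANONICAL_SOURCES = (
    "ClinGen", "ClinVar", "dbNSFP", "Ensembl", "GeVIR",
    "gnomAD", "GTEx", "GWAS Catalog", "INSIDER", "MSigDB",
)

# Hypothetical catalog-id -> canonical-name table; only ids mentioned in this issue.
CATALOG_TO_SOURCE = {
    "clingen": "ClinGen",
    "clinvar": "ClinVar",
    "dbnsfp": "dbNSFP",
    "ensembl_gene": "Ensembl",
    "gnomad_metrics": "gnomAD",
    "gwas_catalog": "GWAS Catalog",
    "msigdb": "MSigDB",
}


def normalize_data_source(entry: dict) -> dict:
    """Return a copy of `entry` with the "Custom" placeholder replaced."""
    if entry.get("data_source") != "Custom":
        return dict(entry)  # already carries a real source name
    catalog_id = entry["catalog"]
    canonical = CATALOG_TO_SOURCE.get(catalog_id)
    if canonical is None:
        # Fail loudly so a partial migration cannot leave the set inconsistent.
        raise KeyError(f"no canonical data_source known for catalog {catalog_id!r}")
    return {**entry, "data_source": canonical}


def filter_by_source(entries: list, data_source: str) -> list:
    """Exact-match filter, standing in for a --data-source style lookup."""
    return [e for e in entries if e.get("data_source") == data_source]
```

With the placeholder in place, `filter_by_source(entries, "ClinGen")` returns nothing; after mapping every entry through `normalize_data_source`, the same filter finds the ClinGen entry. Raising on an unmapped catalog id is one way to enforce the "fix as a set" point.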
Code polish
enrichex/burden.py _build_gene_to_sets_ht — no per-gene-set dedup, so a duplicated gene in a set would be double-counted after explode_rows (n_genes_found/burden). Latent only: the canonical parse_geneset_tsv already dedupes upstream, so the normal pipeline never hits it. Order-preserving fix: for gene in dict.fromkeys(genes): (not set(genes) — nondeterministic order).
hgc/qc_report.py render_qc_summary_markdown — tiny stats (e.g. HWE p-values) format as 0.000 (:.3f for abs < 1000). Pre-existing (byte-identical to the old inline CLI block). Use sci-notation for abs < 1e-3. (Note repo p-value-display conventions when fixing.)
resources/unified_registry.py: a catalog-owning provider whose primary_domain isn't in _DOMAIN_TO_OMICS is silently continue-skipped (drops its datasets + skips dup-accession enforcement). Currently unreachable (all catalog providers map). Replacing the skip with a fail-fast raise is a deliberate design choice with real blast radius (a future unmapped-domain plugin would hard-crash the whole registry/CLI). Decide intent + add a regression test.
tools/enrichex/burden_cli.py:45 — click.echo(f"…") f-string without placeholders (Ruff F541; not in the CI flake8 selection). Drop the f.
tools/plugins/plugins_cli.py:81-87 — one try covers both the bundled catalog_entry.schema.json read and the user catalog read, but the error always blames the user catalog; also Ruff B904 (raise … from). Split the reads + chain the exception.
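For the plugins_cli.py item above, splitting the two reads and chaining the exception might look like the following sketch. `CatalogLoadError`, `_read_json`, and `load_schema_and_catalog` are hypothetical names, not the repo's actual functions, and the real CLI presumably raises its own error type.

```python
import json
from pathlib import Path


class CatalogLoadError(Exception):
    """Hypothetical error type standing in for whatever the CLI raises."""


def _read_json(path: Path, label: str) -> dict:
    """Read one JSON file; on failure, name the file that actually failed."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # ValueError covers json.JSONDecodeError and UnicodeDecodeError.
        # "from exc" keeps the original traceback attached (Ruff B904).
        raise CatalogLoadError(f"could not read {label} at {path}: {exc}") from exc


def load_schema_and_catalog(schema_path: Path, catalog_path: Path) -> tuple:
    """Read the bundled schema and the user catalog as two separate steps."""
    schema = _read_json(schema_path, "bundled schema")
    catalog = _read_json(catalog_path, "user catalog")
    return schema, catalog
```

Because each read has its own label, a missing or corrupt bundled schema is reported as such instead of being attributed to the user catalog.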
Not actionable (verified false / non-issue)
expression_atlas/shared/datasets.py file-type keys — not a bug: the catalog encodes the machine key in description as "File type: <key>" and hydration strips that prefix; reconstruction yields the correct transcript-tpm/sdrf keys (verified across the whole catalog).
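The description round-trip that makes this a non-issue can be illustrated with a minimal sketch. The function names are hypothetical; only the "File type: <key>" convention and the transcript-tpm/sdrf keys come from the finding above.

```python
FILE_TYPE_PREFIX = "File type: "


def encode_file_type(key: str) -> str:
    """Store the machine key in the human-readable description field."""
    return f"{FILE_TYPE_PREFIX}{key}"


def decode_file_type(description: str) -> str:
    """Recover the machine key by stripping the prefix during hydration."""
    if not description.startswith(FILE_TYPE_PREFIX):
        raise ValueError(f"description lacks the file-type prefix: {description!r}")
    return description[len(FILE_TYPE_PREFIX):]
```

As long as every catalog description follows the prefix convention, decoding returns exactly the key that was encoded, which matches the verification noted above.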
Surfaced by CodeRabbit on #184; each verified independently before deferral.