
Follow-up: deferred review findings from release #184 (catalog metadata + minor polish) #185

@enriquea


Captures the CodeRabbit findings from the dev → main release PR #184 that were deferred rather than fixed in that release. Each was independently verified against the code at HEAD; all are either pre-existing (ported verbatim by #178 / preserved byte-identically by the PR-5..PR-8 refactors) or design/data decisions — so fixing them inside a behavior-preserving release-to-main PR was out of scope. They are real-but-non-blocking and belong here.

Catalog metadata (plugin-sourced catalog, #178)

  • data_source: "Custom" normalization — clingen/clinvar/dbnsfp (and all 10 migrated genomics catalogs) carry the placeholder "Custom", which unified_registry.search(data_source=…) and catalog list --data-source filter on, so e.g. --data-source ClinGen won't match. Schema (catalog_entry.schema.json) allows free-text, so it's valid but not searchable by real source name. Fix as a set: normalize all to canonical names (ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB) and consider tightening the schema to an enum. (Doing only the 3 CodeRabbit-flagged entries would leave the genomics set inconsistent.)
  • ensembl_gene file format: "gff" → should be "gff3" (path is *.gff3.gz, description says GFF3). Cosmetic; no consumer reads the literal.
  • gnomad_metrics variant_types lists SV/CNV but only *.sites.vcf.bgz (SNV/indel) files are declared. Either narrow to ["SNV","indel"] or add the gnomAD v4 SV/CNV release files. (Domain judgment — gnomAD v4 does publish SV/CNV products.)
  • gwas_catalog URL — entry is accessioned (e115 / r2026-04-27) but url is the rolling …/downloads/full endpoint (which drift_probe.py notes is now 404). Pin to a release-specific URL or explicitly mark unpinned; also align SKILL.md's stale registry path.
  • msigdb URL — pins c2.cp.v2026.1.Hs.symbols.gmt + size but url is the generic collections landing page (source is login/license-gated → manual). Mark provenance-only or record the versioned URL. (Same landing-page pattern as insider/ucsc/gtex.)
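The data_source normalization in the first bullet could be sketched roughly as follows. This is a minimal illustration, not the registry's actual code: the entry key `id`, the helper names, and the exact-match filter are assumptions standing in for whatever unified_registry.search does, and only the three CodeRabbit-flagged entries are mapped here; the remaining genomics catalogs would need the same treatment.

```python
# Hypothetical sketch: replace the "Custom" placeholder with a canonical
# source name so that filtering by the real source name can match.
# The "id" key and both helper names are assumptions for illustration.

PLACEHOLDER = "Custom"

# Only the three flagged entries are listed; the other migrated genomics
# catalogs would be added to this table as part of the same fix.
CANONICAL_DATA_SOURCE = {
    "clingen": "ClinGen",
    "clinvar": "ClinVar",
    "dbnsfp": "dbNSFP",
}


def normalize_data_source(entry: dict) -> dict:
    """Return a copy of the entry with the placeholder replaced, if known."""
    if entry.get("data_source") != PLACEHOLDER:
        return dict(entry)
    canonical = CANONICAL_DATA_SOURCE.get(entry.get("id"))
    if canonical is None:
        # Unknown entry: keep the placeholder rather than guessing.
        return dict(entry)
    return {**entry, "data_source": canonical}


def filter_by_data_source(entries: list, data_source: str) -> list:
    """Exact-match filter, standing in for the real search/list filter."""
    return [e for e in entries if e.get("data_source") == data_source]


entries = [
    {"id": "clingen", "data_source": "Custom"},
    {"id": "some_other_catalog", "data_source": "Custom"},
]
normalized = [normalize_data_source(e) for e in entries]
```

Before normalization a filter on "ClinGen" returns nothing; afterwards it returns the clingen entry, while unmapped entries keep the placeholder.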

Code polish

  • enrichex/burden.py _build_gene_to_sets_ht — no per-gene-set dedup, so a duplicated gene in a set would be double-counted after explode_rows (n_genes_found/burden). Latent only: the canonical parse_geneset_tsv already dedupes upstream, so the normal pipeline never hits it. Order-preserving fix: for gene in dict.fromkeys(genes): (not set(genes) — nondeterministic order).
  • hgc/qc_report.py render_qc_summary_markdown — tiny stats (e.g. HWE p-values) format as 0.000 (:.3f for abs < 1000). Pre-existing (byte-identical to the old inline CLI block). Use sci-notation for abs < 1e-3. (Note repo p-value-display conventions when fixing.)
  • resources/unified_registry.py — a catalog-owning provider whose primary_domain isn't in _DOMAIN_TO_OMICS is silently continue-skipped (drops its datasets + skips dup-accession enforcement). Currently unreachable (all catalog providers map), but a fail-fast raise is a deliberate design choice with real blast radius (a future unmapped-domain plugin would hard-crash the whole registry/CLI). Decide intent + add a regression test.
  • tools/enrichex/burden_cli.py:45, click.echo(f"…"): f-string without placeholders (Ruff F541; not in the CI flake8 selection). Drop the f.
  • tools/plugins/plugins_cli.py:81-87 — one try covers both the bundled catalog_entry.schema.json read and the user catalog read, but the error always blames the user catalog; also Ruff B904 (raise … from). Split the reads + chain the exception.
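Three of the fixes above are mechanical enough to sketch. The block below is illustrative only: every function name, the CatalogLoadError type, the JSON format of the user catalog, and the formatting used for abs >= 1000 are assumptions, and the sci-notation precision would still need to follow the repo's p-value display conventions.

```python
import json
from pathlib import Path


def dedupe_preserving_order(genes):
    """Drop repeated genes while keeping first-seen order.

    dict.fromkeys keeps insertion order, unlike set(genes).
    """
    return list(dict.fromkeys(genes))


def format_stat(value: float) -> str:
    """Format a QC statistic; tiny non-zero values use sci-notation."""
    magnitude = abs(value)
    if magnitude != 0 and magnitude < 1e-3:
        return f"{value:.3e}"
    if magnitude < 1000:
        return f"{value:.3f}"
    # Assumption: the issue does not describe the abs >= 1000 branch,
    # so this sketch falls back to sci-notation there as a placeholder.
    return f"{value:.3e}"


class CatalogLoadError(Exception):
    """Hypothetical error type; the real CLI may raise something else."""


def load_schema_and_catalog(schema_path: Path, catalog_path: Path):
    """Read both files in separate try blocks so each error names its file."""
    try:
        schema = json.loads(schema_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        # "from exc" chains the cause, which is what Ruff B904 asks for.
        raise CatalogLoadError(f"cannot read bundled schema: {schema_path}") from exc
    try:
        # Assumption: the user catalog is JSON; adjust if it is another format.
        catalog = json.loads(catalog_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise CatalogLoadError(f"cannot read user catalog: {catalog_path}") from exc
    return schema, catalog
```

With this shape, a corrupt bundled schema is no longer reported as a user catalog problem, and the original traceback stays attached through `__cause__`.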

Not actionable (verified false / non-issue)

  • expression_atlas/shared/datasets.py file-type keys — not a bug: the catalog encodes the machine key in description as "File type: <key>" and hydration strips that prefix; reconstruction yields the correct transcript-tpm/sdrf keys (verified across the whole catalog).
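The round trip described in this bullet can be pictured with a small sketch. The helper names below are hypothetical and are not the functions in expression_atlas/shared/datasets.py; only the "File type: <key>" convention comes from the finding above.

```python
# Hypothetical sketch of the "File type: <key>" convention: the catalog
# stores the machine key inside the description, and hydration strips
# the prefix to recover it.

FILE_TYPE_PREFIX = "File type: "


def encode_description(file_type_key: str) -> str:
    """Build the description string stored in the catalog."""
    return f"{FILE_TYPE_PREFIX}{file_type_key}"


def decode_file_type(description: str) -> str:
    """Recover the machine key from a catalog description."""
    if not description.startswith(FILE_TYPE_PREFIX):
        raise ValueError(f"description lacks the expected prefix: {description!r}")
    return description[len(FILE_TYPE_PREFIX):]


recovered = [decode_file_type(encode_description(k)) for k in ("transcript-tpm", "sdrf")]
```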

Surfaced by CodeRabbit on #184; each verified independently before deferral.

Labels: enhancement