Follow-up: deferred review findings from release #184 (catalog metadata + minor polish) #185

Captures the CodeRabbit findings from the `dev → main` release PR #184 that were **deferred** rather than fixed in that release. Each was independently verified against the code at HEAD; all are either **pre-existing** (ported verbatim by #178 / preserved byte-identically by the PR-5..PR-8 refactors) or **design/data decisions** — so fixing them inside a behavior-preserving release-to-`main` PR was out of scope. They are real-but-non-blocking and belong here.

### Catalog metadata (plugin-sourced catalog, #178)
- **`data_source: "Custom"` normalization** — clingen/clinvar/dbnsfp (and all 10 migrated *genomics* catalogs) carry the placeholder `"Custom"`, which `unified_registry.search(data_source=…)` and `catalog list --data-source` filter on, so e.g. `--data-source ClinGen` won't match. Schema (`catalog_entry.schema.json`) allows free-text, so it's valid but not searchable by real source name. **Fix as a set:** normalize all to canonical names (ClinGen, ClinVar, dbNSFP, Ensembl, GeVIR, gnomAD, GTEx, GWAS Catalog, INSIDER, MSigDB) and consider tightening the schema to an enum. (Doing only the 3 CodeRabbit-flagged entries would leave the genomics set inconsistent.)
- **`ensembl_gene` file `format: "gff"`** → should be `"gff3"` (path is `*.gff3.gz`, description says GFF3). Cosmetic; no consumer reads the literal.
- **`gnomad_metrics` `variant_types` lists `SV`/`CNV`** but only `*.sites.vcf.bgz` (SNV/indel) files are declared. Either narrow to `["SNV","indel"]` or add the gnomAD v4 SV/CNV release files. (Domain judgment — gnomAD v4 does publish SV/CNV products.)
- **`gwas_catalog` URL** — entry is accessioned (`e115 / r2026-04-27`) but `url` is the rolling `…/downloads/full` endpoint (which `drift_probe.py` notes is now 404). Pin to a release-specific URL or explicitly mark unpinned; also align SKILL.md's stale registry path.
- **`msigdb` URL** — pins `c2.cp.v2026.1.Hs.symbols.gmt` + size but `url` is the generic collections landing page (source is login/license-gated → manual). Mark provenance-only or record the versioned URL. (Same landing-page pattern as insider/ucsc/gtex.)
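
For the `data_source` item, a minimal sketch of what "normalize as a set, then tighten to an enum" could look like. The canonical names are the ten listed above; the helper names and the `id` key on each entry are hypothetical, and the real catalog layout and `catalog_entry.schema.json` may differ.

```python
# Canonical source names, taken from the list in this issue.
CANONICAL_SOURCES = (
    "ClinGen", "ClinVar", "dbNSFP", "Ensembl", "GeVIR",
    "gnomAD", "GTEx", "GWAS Catalog", "INSIDER", "MSigDB",
)


def _key(name):
    """Case- and punctuation-insensitive lookup key (assumed matching rule)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


_BY_KEY = {_key(name): name for name in CANONICAL_SOURCES}


def normalize_data_source(value):
    """Return the canonical spelling, or raise ValueError if unknown.

    The placeholder "Custom" is deliberately not accepted, which is the
    behaviour an enum in the schema would enforce.
    """
    try:
        return _BY_KEY[_key(value)]
    except KeyError:
        raise ValueError(f"unknown data_source: {value!r}") from None


def non_canonical_entries(entries):
    """List the ids of entries whose data_source is not a canonical name.

    `entries` is assumed to be an iterable of dicts with hypothetical
    "id" and "data_source" keys.
    """
    return [
        entry["id"]
        for entry in entries
        if entry.get("data_source") not in CANONICAL_SOURCES
    ]
```

Running something like `non_canonical_entries` over the whole catalog before and after the change would confirm the set was fixed together rather than only the three flagged entries.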

### Code polish
- **`enrichex/burden.py` `_build_gene_to_sets_ht`** — no per-gene-set dedup, so a duplicated gene in a set would be double-counted after `explode_rows` (`n_genes_found`/`burden`). Latent only: the canonical `parse_geneset_tsv` already dedupes upstream, so the normal pipeline never hits it. Order-preserving fix: `for gene in dict.fromkeys(genes):` (not `set(genes)` — nondeterministic order).
- **`hgc/qc_report.py` `render_qc_summary_markdown`** — tiny stats (e.g. HWE p-values) format as `0.000` (`:.3f` for `abs < 1000`). Pre-existing (byte-identical to the old inline CLI block). Use sci-notation for `abs < 1e-3`. (Note repo p-value-display conventions when fixing.)
- **`resources/unified_registry.py`** — a catalog-owning provider whose `primary_domain` isn't in `_DOMAIN_TO_OMICS` is silently `continue`-skipped (drops its datasets + skips dup-accession enforcement). Currently unreachable (all catalog providers map), but a fail-fast `raise` is a deliberate design choice with real blast radius (a future unmapped-domain plugin would hard-crash the whole registry/CLI). Decide intent + add a regression test.
- **`tools/enrichex/burden_cli.py:45`** — `click.echo(f"…")` f-string without placeholders (Ruff F541; not in the CI flake8 selection). Drop the `f`.
- **`tools/plugins/plugins_cli.py:81-87`** — one `try` covers both the bundled `catalog_entry.schema.json` read and the user catalog read, but the error always blames the user catalog; also Ruff B904 (`raise … from`). Split the reads + chain the exception.
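
Two of the fixes above are small enough to sketch in isolation. This is illustrative only: the function names are hypothetical, the three-digit scientific-notation precision is an assumption (the repo's p-value display conventions take precedence), and the formatter does not attempt to reproduce whatever the existing code does for `abs >= 1000`.

```python
def dedupe_preserving_order(genes):
    """Drop repeated genes while keeping first-seen order.

    dict.fromkeys keeps insertion order, unlike set(), so the result is
    deterministic across runs.
    """
    return list(dict.fromkeys(genes))


def format_stat(value):
    """Format a statistic so tiny nonzero values do not collapse to 0.000.

    Assumption: nonzero values with abs < 1e-3 switch to scientific
    notation; everything else keeps the existing fixed-point style.
    """
    if value != 0 and abs(value) < 1e-3:
        return f"{value:.3e}"
    return f"{value:.3f}"
```

Exact zero stays in fixed-point form here; whether that is the desired display is a decision for whoever picks up the fix.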

### Not actionable (verified false / non-issue)
- `expression_atlas/shared/datasets.py` file-type keys — **not a bug**: the catalog encodes the machine key in `description` as `"File type: <key>"` and hydration strips that prefix; reconstruction yields the correct `transcript-tpm`/`sdrf` keys (verified across the whole catalog).
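
For reference, the prefix-stripping described in that item can be pictured roughly as follows. This is a hypothetical stand-in, not the code in `datasets.py`; the function name and the `None` fallback are assumptions.

```python
FILE_TYPE_PREFIX = "File type: "


def file_type_key(description):
    """Recover the machine key from a "File type: <key>" description.

    Returns None when the description does not carry the prefix
    (assumed fallback; the real hydration code may behave differently).
    """
    if description.startswith(FILE_TYPE_PREFIX):
        return description[len(FILE_TYPE_PREFIX):]
    return None
```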

_Surfaced by CodeRabbit on #184; each verified independently before deferral._
