Add gedi/price test data: chr19+chr22 Ribo-seq cohort#2061

Merged: pinin4fjords merged 3 commits into nf-core:modules from pinin4fjords:gedi-price-test-data on May 19, 2026

pinin4fjords commented May 19, 2026


Summary

Test fixture set for nf-core/modules#11693 (gedi/price) under data/genomics/homo_sapiens/riboseq_expression/price/.

PRICE's Bayesian model needs a much higher candidate-ORF count than the existing chr20 fixtures provide for its expectation-maximisation to converge. This PR adds a minimised chr19+chr22 cohort that gives PRICE enough signal to call ORFs in CI.

PRICE output on this fixture set: 380 ORF rows (381-line orfs.tsv, including the header line).
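
The row count quoted above can be checked with a few lines of Python. This is a minimal sketch, assuming orfs.tsv is tab-separated with a single header line and that any metadata lines start with `#`. Both are assumptions about the file layout rather than something confirmed here, and `count_orf_rows` is a hypothetical helper, not part of GEDI.

```python
import io


def count_orf_rows(handle, comment_prefix="#"):
    """Count data rows in a TSV stream.

    Assumptions (not confirmed against real PRICE output):
    - blank lines and lines starting with `comment_prefix` are not ORF rows
    - the first remaining line is the header
    """
    rows = 0
    header_seen = False
    for line in handle:
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue
        if not header_seen:
            header_seen = True  # skip the header line exactly once
            continue
        rows += 1
    return rows


# Toy input; the column names are illustrative, not PRICE's real header.
example = io.StringIO("Gene\tId\tLocation\ng1\to1\t19+:1-10\ng2\to2\t22-:5-20\n")
print(count_orf_rows(example))
```

On the real fixture output one would pass `open("orfs.tsv")` (path assumed) and compare the result against the expected 380.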

Files

| File | Size | Notes |
| --- | --- | --- |
| Homo_sapiens.GRCh38_chr19_22.pc_exon_masked.fa.gz | 3.43 MB | chr19+chr22 primary assembly, hard-masked outside protein-coding exons |
| Homo_sapiens.GRCh38.111_chr19_22.pc.gtf.gz | 1.66 MB | chr19+chr22 protein-coding-gene-only, lean attributes |
| 4x bams/SRX1178088{5,6,7,8}.chr19_22.ds50.bam | 1.06-1.52 MB each | 4-sample Ribo-seq cohort, 50% downsample, filtered to protein-coding gene loci |
| 4x .bai | ~90 KB each | |
| README.md | < 4 KB | derivation recipe |

Each file is < 4 MB. Total: 10.88 MB across 11 files (down from 19.68 MB across 14 files in an earlier revision; see force-push history).
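
As a sanity check on the totals, the listed sizes can be turned into bounds. This is a rough sketch: the BAM sizes are only given as a range, so the code computes a lower and an upper bound rather than the exact figure, and the `.bai` size is treated as exactly 0.09 MB, which is an approximation of "~90 KB".

```python
# Sizes in MB as listed in the Files table.
FASTA_MB = 3.43
GTF_MB = 1.66
BAM_RANGE_MB = (1.06, 1.52)  # smallest and largest BAM
BAI_MB = 0.09                # "~90 KB each" (approximate)
README_MAX_MB = 0.004        # "< 4 KB"
N_SAMPLES = 4

# FASTA + GTF + one BAM and one index per sample + README
n_files = 1 + 1 + N_SAMPLES + N_SAMPLES + 1

fixed = FASTA_MB + GTF_MB + N_SAMPLES * BAI_MB
lower = fixed + N_SAMPLES * BAM_RANGE_MB[0]
upper = fixed + N_SAMPLES * BAM_RANGE_MB[1] + README_MAX_MB

print(n_files)
print(round(lower, 2), round(upper, 2))
```

The file count comes out at 11, and the stated 10.88 MB total lies between the two bounds (roughly 9.69 MB and 11.53 MB).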

Why these sizes — empirical justification

These fixtures are at the floor of what makes PRICE produce non-empty output. Every dimension was minimised iteratively against a real gedi -e Price invocation. Probes that didn't make it:

| Variant | Result |
| --- | --- |
| chr20 alone, 2-sample cohort (existing fixtures) | ~235 candidate ORFs → NoiseModel.computeMeanSpline crashes during inference |
| chr20 alone, 6-sample cohort | reaches inference, produces 0 ORFs |
| chr19 alone, 6-sample @ds50 | 0 ORFs (PRICE estimates model but finds nothing) |
| chr22 alone, 6-sample @ds50 | 0 ORFs |
| chr19+chr22, 3-sample @ds50 | Index 0 out of bounds for length 0 in PriceOrfInference |
| chr19+chr22, 2-sample @ds50 | same crash |
| chr19+chr22 4-sample, CDS-only FASTA mask | PRICE crashes because GTF UTRs reference N positions in the masked FASTA |
| chr19+chr22 4-sample, ds30 | 0 ORFs |
| chr19+chr22 4-sample, ds50 (this PR) | 380 ORFs ✓ |

The empirical floor along each axis:

  • 2 chromosomes are needed. Neither chr19 nor chr22 alone produced ORFs in any configuration tried (insufficient candidate ORFs per chromosome).
  • 4 samples are needed. 3-sample and 2-sample cohorts crash PRICE's noise-model inference.
  • Protein-coding-exon masking, not CDS-only. UTR regions in the GTF must reference real sequence (not Ns) or PRICE crashes.
  • 50% downsample is the floor. 30% downsample produces 0 ORFs.
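
The probe results above can also be written down as data, which makes the "only one configuration works" conclusion easy to query. This is an illustrative sketch only: the `Probe` record and its field names are hypothetical, `None` marks values the table does not state, and `cds_only_mask=False` simply means the table does not describe that probe as CDS-only masked.

```python
from typing import NamedTuple, Optional, Tuple


class Probe(NamedTuple):
    chromosomes: Tuple[str, ...]
    samples: int
    downsample_pct: Optional[int]  # None where the table gives no level
    cds_only_mask: bool
    orfs: Optional[int]            # None means PRICE crashed


PROBES = [
    Probe(("chr20",), 2, None, False, None),
    Probe(("chr20",), 6, None, False, 0),
    Probe(("chr19",), 6, 50, False, 0),
    Probe(("chr22",), 6, 50, False, 0),
    Probe(("chr19", "chr22"), 3, 50, False, None),
    Probe(("chr19", "chr22"), 2, 50, False, None),
    Probe(("chr19", "chr22"), 4, None, True, None),
    Probe(("chr19", "chr22"), 4, 30, False, 0),
    Probe(("chr19", "chr22"), 4, 50, False, 380),
]

successful = [p for p in PROBES if p.orfs]          # non-empty output
crashed = [p for p in PROBES if p.orfs is None]     # PRICE raised an error
empty = [p for p in PROBES if p.orfs == 0]          # ran, but called nothing

for p in successful:
    print(p)
```

Only the chr19+chr22, 4-sample, ds50 probe ends up in `successful`; every other probe either crashed or produced an empty call set.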

Could we go smaller?

Three avenues considered and rejected:

  1. One chromosome with deeper coverage. Both chr19 and chr22 alone failed at any cohort/downsample tried; PRICE needs the candidate-ORF count from two chromosomes.
  2. More aggressive FASTA masking (CDS-only). Crashes PRICE because UTRs in the GTF dereference to N positions.
  3. Smaller cohort (≤3 samples). Crashes PRICE's noise model.

The 3.43 MB FASTA is the largest file in the set. It's chr19+chr22 protein-coding exons of GRCh38 primary assembly with everything outside those regions hard-masked; this is the most aggressive masking PRICE tolerates while still producing ORFs.
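
To make the masking idea concrete, here is a toy sketch of hard-masking a sequence outside a set of kept intervals. It is not the recipe used to build the fixture (that lives in the README); `hard_mask_outside` is a hypothetical helper and the coordinates are invented for illustration.

```python
def hard_mask_outside(sequence, keep_intervals):
    """Return `sequence` with every base outside `keep_intervals` set to 'N'.

    Intervals are 0-based, half-open (start, end) pairs and may overlap.
    """
    masked = ["N"] * len(sequence)
    for start, end in keep_intervals:
        start = max(start, 0)
        end = min(end, len(sequence))
        # Copy the original bases back in for the kept region.
        masked[start:end] = sequence[start:end]
    return "".join(masked)


seq = "ACGTACGTAC"
utr = (0, 2)   # toy UTR annotated in the GTF
cds = (2, 5)   # toy coding region

exon_masked = hard_mask_outside(seq, [utr, cds])  # keeps whole exons
cds_masked = hard_mask_outside(seq, [cds])        # keeps coding bases only

print(exon_masked)                 # NNs only outside the exon
print(cds_masked[utr[0]:utr[1]])   # the UTR now reads as Ns
```

In the CDS-only variant the toy UTR interval resolves to `NN`, which mirrors the situation described above: the GTF still annotates the UTR, but the FASTA no longer holds real sequence there.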

Source

Original BAMs: SRX1178088{5,6,7,8}.Aligned.sortedByCoord.out.bam from a full-scale nf-core/riboseq test_full run. Same upstream sample identifiers as the existing chr20 fixtures in this branch.
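
For readers unfamiliar with the brace shorthand, the four source BAMs and their fixture counterparts expand as follows. A small sketch; the variable names are illustrative.

```python
PREFIX = "SRX1178088"
SUFFIXES = ["5", "6", "7", "8"]

# Full-scale alignments the fixtures were derived from
source_bams = [f"{PREFIX}{s}.Aligned.sortedByCoord.out.bam" for s in SUFFIXES]

# Downsampled chr19+chr22 fixtures added by this PR
fixture_bams = [f"bams/{PREFIX}{s}.chr19_22.ds50.bam" for s in SUFFIXES]

for src, dst in zip(source_bams, fixture_bams):
    print(src, "->", dst)
```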

Test plan

  • modules/nf-core/gedi/price/tests/main.nf.test in nf-core/modules#11693 consumes these fixtures via raw.githubusercontent.com URLs pinned to this PR's branch; the URLs will be switched to the modules branch after this PR merges
  • nf-core modules test --profile docker gedi/price produces 380-ORF orfs.tsv with two-pass snapshot stability
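
The branch-pinned URLs mentioned in the test plan follow the usual raw.githubusercontent.com layout of owner, repository, ref, then path. A minimal sketch of the post-merge form, assuming the fixtures sit directly under the price/ directory named in the summary; `raw_url` is a hypothetical helper.

```python
def raw_url(owner, repo, ref, path):
    """Build a raw.githubusercontent.com URL for a file at a given ref."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"


FIXTURE_DIR = "data/genomics/homo_sapiens/riboseq_expression/price"

# After the merge the fixtures live on the `modules` branch of nf-core/test-datasets.
gtf_url = raw_url(
    "nf-core",
    "test-datasets",
    "modules",
    f"{FIXTURE_DIR}/Homo_sapiens.GRCh38.111_chr19_22.pc.gtf.gz",
)
print(gtf_url)
```

Before the merge, the same helper would be called with the fork owner and the `gedi-price-test-data` branch as the ref; the fork's repository name is not stated here, so it is left out of the example.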

Four Ribo-seq samples downsampled to chr19+chr22 protein-coding-gene
loci, with a protein-coding-only reference. Sized so every file is
under 4 MiB and PRICE still produces a non-empty orfs.tsv (381 lines).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an entry under genomics/homo_sapiens/riboseq_expression for the new
price/ fixtures, matching the existing plastid/ and ribocode/ block style.
Notes why a second chromosome (chr19+chr22) and 4-sample cohort are
needed - PRICE's candidate-ORF count and noise-model floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords marked this pull request as ready for review May 19, 2026 16:26
Replaced the reference to "Seqera Platform stage commit c4cb19dc" with
the persistent SRA accession trail (SRR15480788/9/90/91 from GSE182201)
plus the alignment tooling. The Platform workdir wouldn't be reachable
to anyone reading this README later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
luisas commented May 19, 2026

Nice! LGTM!

pinin4fjords merged commit 7552f5d into nf-core:modules May 19, 2026
1 check passed
pinin4fjords added a commit to pinin4fjords/nf-core-modules that referenced this pull request May 19, 2026
nf-core/test-datasets#2061 merged; fixtures now live on the modules branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
delfiterradas pushed a commit to grst/modules that referenced this pull request May 19, 2026
* feat(gedi): add gedi/indexgenome and gedi/price modules

Adds two modules wrapping the GEDI / PRICE toolkit (`bioconda::gedi=1.0.6a`) for Ribo-seq translated-ORF discovery. PRICE (Erhard et al. 2018, doi:10.1038/nmeth.4631) calls translated ORFs from ribosome profiling data with near-cognate start codon detection.

`gedi/indexgenome` wraps `gedi -e IndexGenome`, producing the `.oml` genome index directory consumed by PRICE.

`gedi/price` wraps `bamlist2cit` + `gedi -e Price`, taking a cohort of Ribo-seq BAMs plus the genome index and emitting ORF predictions (`*.orfs.tsv` + `*.cit` + sidecars). One-shot across the cohort - PRICE is not per-sample.

Both modules use Wave-built community containers from `bioconda::gedi=1.0.6a`. The bioconda recipe was merged 2026-05-16; using Wave directly for now.

Source: nf-core/riboseq#174.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(gedi/indexgenome): use ${prefix} for the index output directory

Previously hard-coded the output directory as `price_index`. Switching to
`${prefix}` (default `${meta.id}`, overridable via `task.ext.prefix`) lets
callers control the directory name and matches the nf-core convention for
publishable directory outputs.

The default ${meta.id} keeps the directory keyed to the reference id, so
when `gedi/price` opens `${index}/${meta2.id}.oml`, the lookup still
resolves provided meta ids match (already the case in the test chain).

Snapshot regenerated: the index directory name in the output snapshot
changes from `price_index` to the test's `homo_sapiens_chr20` (its meta.id).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(gedi/price): add real test using minimised chr19+chr22 fixtures

Replaces the stub-only PRICE test with an end-to-end test that runs
PRICE on a minimal cohort of four Ribo-seq samples (chr19+chr22,
protein-coding-only reference). The cohort produces 380 ORF calls;
snapshot captures the orfs.tsv line count for stability validation.

Fixtures published in nf-core/test-datasets PR nf-core#2061.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gedi/indexgenome): update meta.yml output name after ${prefix} refactor

The earlier `${prefix}` refactor (commit 0ca4c45) changed the index
output declaration from `path("price_index")` to `path("${prefix}")`,
but the meta.yml output entry still hard-coded `price_index` — causing
CI lint to flag `correct_meta_outputs: Module meta.yml does not match
main.nf`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(gedi/indexgenome): collapse leftover alignment padding on index emit

After the `${prefix}` refactor (commit 0ca4c45) the index output line
was the only `tuple val(meta), path(...)` emit in the module, so the
52-space alignment padding it kept from when the path was `price_index`
no longer aligns with anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gedi): correct licence (GPL-3.0) and Gedi description in meta.yml

Two cross-cutting fixes from review of nf-core#11693:

- Licence was Apache-2.0 in both meta.yml files; the upstream repo
  erhard-lab/gedi is GPL-3.0. Corrected.
- "GEDI (Gene Expression Data Integration)" was unverified — the
  upstream README/wiki/paper don't expand the acronym that way.
  Replaced with the upstream one-liner phrasing. PRICE meta.yml also
  adds the verified PRICE expansion (Probabilistic Inference of Codon
  Activities by an EM algorithm) from the GEDI wiki.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(gedi/price): point fixtures at nf-core/test-datasets@modules

nf-core/test-datasets#2061 merged; fixtures now live on the modules branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
manascripts pushed a commit to manascripts/modules that referenced this pull request May 21, 2026