Commit 53ed2bc

Add gedi/price test data: chr19+chr22 Ribo-seq cohort
PRICE requires substantially more ribosome-profiling signal than the existing chr20 fixture provides (~235 candidate ORFs) for its expectation-maximisation to converge. This fixture set covers chr19+chr22 across a 6-sample cohort (SRX11780885-90 from GSE182201, ds50%), giving ~660 candidate ORFs - enough for PRICE to call ~250 ORFs after multiple-testing correction. The FASTA is hard-masked outside annotated exons to keep it under 5 MB while preserving every base PRICE actually inspects; the GTF has non-essential attributes stripped to fit under 2 MB.
1 parent 53bae9e commit 53ed2bc

15 files changed

Lines changed: 38 additions & 0 deletions
Binary file not shown.
@@ -0,0 +1,38 @@
# Test data for `gedi/price`

A cohort of six Ribo-seq samples covering chr19 + chr22 (Ensembl GRCh38), trimmed and downsampled so that PRICE's expectation-maximisation converges and produces a non-empty `orfs.tsv` while keeping fixtures small enough for CI.

## Why two chromosomes plus a cohort?

PRICE's ORF inference fails (`/ by zero` in `NoiseModel`) when fewer than ~500 candidate ORFs feed the noise model. The existing chr20 fixture yields ~235 candidates from a 2-sample cohort. chr19 alone with all six samples reaches ~465 - still too sparse. chr19 + chr22 with all six samples reaches ~660 candidates, which is sufficient for the EM and gives ~250 ORFs after multiple-testing correction.

## Files

| File | Size | Description |
|---|---|---|
| `Homo_sapiens.GRCh38_chr19_22.exon_masked.fa.gz` | 4.3 MB | chr19+chr22 from the Ensembl GRCh38 primary assembly, with all non-exonic sequence (intergenic and intronic) hard-masked to `N`. Reduces fixture size by ~85% while preserving every exon (gedi/PRICE only needs the codon sequences under reads). |
| `Homo_sapiens.GRCh38.111_chr19_22.lean.gtf.gz` | 1.9 MB | chr19+chr22 from the Ensembl 111 GTF, with the attribute column trimmed to `gene_id`, `transcript_id`, `gene_biotype`, `gene_name`, `transcript_biotype`. All feature types retained. |
| `bams/SRX1178088{5,6,7,8,9}.chr19_22.ds50.bam` | 1.6-2.6 MB each | Five of the six Ribo-seq samples from GSE182201 (cohort SRA runs SRR154807{88,89,90,91,92,93}), STAR-aligned to GRCh38, filtered to chr19+chr22, downsampled to 50% of reads. |
| `bams/SRX11780890.chr19_22.ds50.bam` | 1.3 MB | Sixth sample, processed the same way. |
| `bams/*.bai` | <105 KB each | BAM indexes. |

Total: ~22 MB across 14 files.

## How they were derived

1. Source BAMs were taken from a successful `test_full` run of nf-core/riboseq (commit `c4cb19dc`) on Seqera Platform stage.
2. Each `*.genome.sorted.bam` was filtered to chr19+chr22 with `samtools view -bh -F 256 <bam> 19 22` (`-F 256` drops secondary alignments).
3. Each filtered BAM was downsampled to 50% with `samtools view -bh -s 1.5` (in `-s`, the integer part `1` is the random seed and the fractional part `.5` is the fraction of reads kept).
4. The FASTA was built from `Homo_sapiens.GRCh38.dna.chromosome.{19,22}.fa.gz` (Ensembl 111), then N-masked outside exon intervals from the chr19+chr22 GTF (no flank).
5. The GTF was subset to chr19+chr22 from `Homo_sapiens.GRCh38.111.chr.gtf.gz`, then stripped of non-essential attributes.
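
Steps 4 and 5 can be sketched in plain Python. This is a minimal illustration only, not the script used to build these fixtures (this README does not say which tool performed the masking or the attribute trimming). The function names `exon_intervals`, `mask_outside_exons` and `trim_gtf_attributes` are hypothetical; the list of kept attribute keys is taken from the Files table.

```python
import re

# Attribute keys retained in the lean GTF (from the Files table).
KEEP_KEYS = ("gene_id", "transcript_id", "gene_biotype", "gene_name", "transcript_biotype")

# Matches quoted GTF attributes such as: gene_id "ENSG...";
# Unquoted attributes are not matched and are therefore dropped.
_ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def exon_intervals(gtf_lines):
    """Collect 0-based, half-open exon intervals per chromosome from GTF lines."""
    intervals = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # GTF coordinates are 1-based and inclusive.
        intervals.setdefault(fields[0], []).append((int(fields[3]) - 1, int(fields[4])))
    return intervals


def mask_outside_exons(sequence, exons):
    """Return `sequence` with every base outside `exons` replaced by N (no flank)."""
    masked = ["N"] * len(sequence)
    for start, end in exons:
        masked[start:end] = sequence[start:end]
    return "".join(masked)


def trim_gtf_attributes(line, keep=KEEP_KEYS):
    """Drop attributes whose key is not in `keep`; columns 1-8 are left untouched."""
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return line
    kept = [f'{key} "{value}";' for key, value in _ATTR_RE.findall(fields[8]) if key in keep]
    fields[8] = " ".join(kept)
    return "\t".join(fields) + "\n"


# Toy example: one exon at 1-based positions 3-5 on chromosome 19.
gtf = ['19\thavana\texon\t3\t5\t.\t+\t.\tgene_id "G1"; gene_version "2"; transcript_id "T1"; gene_name "X";\n']
print(mask_outside_exons("ACGTACGT", exon_intervals(gtf)["19"]))  # NNGTANNN
print(trim_gtf_attributes(gtf[0]), end="")
```

Masking to `N` rather than deleting sequence keeps every coordinate in the GTF and BAMs valid, and long runs of `N` compress well under gzip.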

## Verified

Running `gedi -e Price -reads <cit> -genomic <oml> -prefix test` on this cohort yields:

```
INFO Found 563 ORFs
INFO Remaining after multiple testing correction: 250 ORFs
```

Used by `modules/nf-core/gedi/price/tests/main.nf.test`.
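
A check on the two log counts reported under "Verified" could look like the following. This is a minimal sketch that assumes the log lines have exactly the form quoted in this README; `parse_orf_counts` is a hypothetical helper and is not part of gedi or of the nf-test file.

```python
import re

# Patterns assume the exact wording of the two log lines quoted in this README.
FOUND_RE = re.compile(r"Found (\d+) ORFs")
KEPT_RE = re.compile(r"Remaining after multiple testing correction: (\d+) ORFs")


def parse_orf_counts(log_text):
    """Return (found, kept) ORF counts from PRICE log text; None for a missing line."""
    found = FOUND_RE.search(log_text)
    kept = KEPT_RE.search(log_text)
    return (
        int(found.group(1)) if found else None,
        int(kept.group(1)) if kept else None,
    )


log = """INFO Found 563 ORFs
INFO Remaining after multiple testing correction: 250 ORFs
"""
found, kept = parse_orf_counts(log)
# A non-empty result is the property this fixture set exists to guarantee.
assert kept is not None and 0 < kept <= found
print(found, kept)  # 563 250
```

Asserting only that the corrected ORF count is positive, rather than pinning it to 250, keeps such a check stable if a different PRICE version shifts the exact numbers.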