|
2 | 2 |
|
3 | 3 | # Phase 17: Features + Polish |
4 | 4 |
|
5 | | -**Status**: In Progress (17.1, 17.5, 17.A, 17.C complete) |
| 5 | +**Status**: In Progress (17.1, 17.5, 17.8, 17.A, 17.C complete) |
6 | 6 |
|
7 | 7 | **Goal**: Production-ready features and quality-of-life improvements. |
8 | 8 |
|
|
20 | 20 | | 17.5 | Fix clippy warnings (0 warnings) | ✅ Complete | |
21 | 21 | | 17.6 | `--outStd SAM/BAM` (stdout output for piping) | Planned | |
22 | 22 | | 17.7 | GTF tag parameters (`sjdbGTFchrPrefix`, etc.) | Planned | |
23 | | -| 17.8 | `--quantMode GeneCounts` | Planned | |
| 23 | +| 17.8 | `--quantMode GeneCounts` | ✅ Complete | |
24 | 24 | | 17.9 | `--outBAMcompression` / `--limitBAMsortRAM` | Planned | |
25 | 25 | | 17.10 | Chimeric Tier 3 (re-map soft-clipped regions) | Planned | |
26 | 26 | | 17.11 | `--chimOutType WithinBAM` (supplementary FLAG 0x800) | Planned | |
|
80 | 80 |
|
81 | 81 | --- |
82 | 82 |
|
| 83 | +## Phase 17.8: `--quantMode GeneCounts` ✅ (2026-04-17) |
| 84 | + |
| 85 | +**Goal**: Output `ReadsPerGene.out.tab` matching STAR's HTSeq-union gene-level counting. |
| 86 | + |
| 87 | +**Implementation**: New `src/quant/mod.rs` with: |
| 88 | +- `GeneAnnotation`: per-chromosome sorted interval list (absolute genome coords) built from GTF exons |
| 89 | +- `GeneCounts`: atomic per-gene counters + 3 independent N_noFeature/N_ambiguous arrays |
| 90 | +- `QuantContext`: `Arc`-shared bundle for rayon parallel threads |
| 91 | +- `--quantMode GeneCounts` + `--sjdbGTFfile` validation in `params.rs` |
| 92 | +- SE and PE counting paths in `lib.rs` |
| 93 | + |
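The `GeneAnnotation` bullet above can be sketched as follows. This is a minimal illustration, not the code in `src/quant/mod.rs`: the names `ExonInterval`, `gtf_exon_to_absolute`, `build_sorted`, and `overlapping_genes` are hypothetical, a plain `chr_start` slice stands in for `genome.chr_start`, and the internal 0-based half-open convention is an assumption.

```rust
/// One exon interval in absolute (concatenated-genome) coordinates.
/// Assumption: 0-based, half-open [start, end).
#[derive(Debug, Clone, PartialEq)]
pub struct ExonInterval {
    pub start: u64,
    pub end: u64,
    pub gene_idx: usize,
}

/// Convert a GTF exon (1-based, inclusive, chromosome-relative) into
/// absolute coordinates by adding the chromosome's start offset.
pub fn gtf_exon_to_absolute(
    chr_start: &[u64],
    chr_idx: usize,
    gtf_start: u64,
    gtf_end: u64,
    gene_idx: usize,
) -> ExonInterval {
    let offset = chr_start[chr_idx];
    ExonInterval {
        start: offset + gtf_start - 1, // 1-based inclusive -> 0-based
        end: offset + gtf_end,         // inclusive end -> exclusive end
        gene_idx,
    }
}

/// Sort intervals by start so that a lookup can cut off the tail with a binary search.
pub fn build_sorted(mut exons: Vec<ExonInterval>) -> Vec<ExonInterval> {
    exons.sort_by_key(|e| (e.start, e.end));
    exons
}

/// Return the sorted, de-duplicated gene indices whose exons overlap [start, end).
pub fn overlapping_genes(sorted: &[ExonInterval], start: u64, end: u64) -> Vec<usize> {
    // Exons that start at or after `end` cannot overlap the query.
    let cut = sorted.partition_point(|e| e.start < end);
    let mut genes: Vec<usize> = sorted[..cut]
        .iter()
        .filter(|e| e.end > start)
        .map(|e| e.gene_idx)
        .collect();
    genes.sort_unstable();
    genes.dedup();
    genes
}
```

Storing absolute coordinates at build time means an aligned block can be compared against the interval list without a per-read chromosome lookup. The linear filter over the prefix is kept for brevity; a real implementation would likely bound it further.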
| 94 | +**Three bugs fixed vs initial implementation**: |
| 95 | +1. **Coordinate mismatch**: GTF exon positions were stored chr-relative; `Transcript.exon.genome_start` uses absolute concatenated-genome coords. Fix: add `genome.chr_start[chr_idx]` offset when converting GTF positions. |
| 96 | +2. **Single counting pass**: The initial code ran one pass, so all 3 columns were identical. STAR runs 3 independent passes, one per column: col1 (any strand), col2 (same strand as read), col3 (opposite strand). Each pass keeps its own N_noFeature and N_ambiguous. |
| 97 | +3. **Too-many-loci bucket**: Reads mapped to too many loci were counted as N_multimapping. STAR counts them as N_unmapped. |
| 98 | + |
| 99 | +**Results vs STAR (10k SE yeast)**: |
| 100 | + |
| 101 | +| Metric | STAR | ruSTAR | |
| 102 | +|--------|------|--------| |
| 103 | +| N_unmapped | 1073 | 1074 (+1) | |
| 104 | +| N_multimapping | 661 | 661 | |
| 105 | +| N_noFeature col1/col2/col3 | 131/3653/4240 | 131/3653/4240 | |
| 106 | +| N_ambiguous col1 | 567 | 566 (-1) | |
| 107 | +| Gene total col1 | 7568 | 7568 | |
| 108 | +| Col1 gene disagreements | — | **0** | |
| 109 | +| Col2/col3 gene disagreements | — | 1 each (boundary edge case) | |
| 110 | + |
| 111 | +The ±1 discrepancies (N_unmapped +1, N_ambiguous col1 -1) come from a single read at a gene overlap boundary, likely a minor coordinate boundary difference. |
| 112 | + |
| 113 | +**Files**: `src/quant/mod.rs` (new), `src/params.rs`, `src/junction/mod.rs` (pub(crate) gtf), `src/lib.rs` |
| 114 | + |
| 115 | +**Tests**: 274/274 (added 6 new quant unit tests), 0 clippy warnings. |
| 116 | + |
83 | 117 | --- |
84 | 118 |
|
85 | 119 | ## Phase 17.A: scoreSeedBest Pre-Extension ✅ (2026-04-16) |
|