|
2 | 2 |
|
3 | 3 | # Phase 17: Features + Polish |
4 | 4 |
|
5 | | -**Status**: In Progress (17.1, 17.5 complete) |
| 5 | +**Status**: In Progress (17.1, 17.5, 17.8, 17.A, 17.B, 17.C complete) |
6 | 6 |
|
7 | 7 | **Goal**: Production-ready features and quality-of-life improvements. |
8 | 8 |
|
|
11 | 11 | | Sub-phase | Description | Status | |
12 | 12 | |-----------|-------------|--------| |
13 | 13 | | 17.1 | Log.final.out statistics file (MultiQC/RNA-SeQC) | ✅ Complete | |
| 14 | +| 17.A | `scoreSeedBest` pre-extension on WA entries (STAR faithful) | ✅ Complete | |
| 15 | +| 17.B | Per-mate seeding (fix `.18919121`, `.6302610` arch failures) | ✅ Complete — `.18919121` fixed; regressions under investigation | |
| 16 | +| 17.C | STAR-faithful SCORE-GATE + mappedFilter for PE (fix 4 MAPQ inflations) | ✅ Complete | |
14 | 17 | | 17.2 | Coordinate-sorted BAM (`--outSAMtype BAM SortedByCoordinate`) | Planned | |
15 | 18 | | 17.3 | Paired-end chimeric detection | Planned | |
16 | 19 | | 17.4 | `--outReadsUnmapped Fastx` | Planned | |
17 | 20 | | 17.5 | Fix clippy warnings (0 warnings) | ✅ Complete | |
18 | 21 | | 17.6 | `--outStd SAM/BAM` (stdout output for piping) | Planned | |
19 | 22 | | 17.7 | GTF tag parameters (`sjdbGTFchrPrefix`, etc.) | Planned | |
20 | | -| 17.8 | `--quantMode GeneCounts` | Planned | |
| 23 | +| 17.8 | `--quantMode GeneCounts` | ✅ Complete | |
21 | 24 | | 17.9 | `--outBAMcompression` / `--limitBAMsortRAM` | Planned | |
22 | 25 | | 17.10 | Chimeric Tier 3 (re-map soft-clipped regions) | Planned | |
23 | 26 | | 17.11 | `--chimOutType WithinBAM` (supplementary FLAG 0x800) | Planned | |
|
77 | 80 |
|
78 | 81 | --- |
79 | 82 |
|
| 83 | +## Phase 17.8: `--quantMode GeneCounts` ✅ (2026-04-17) |
| 84 | + |
| 85 | +**Goal**: Output `ReadsPerGene.out.tab` matching STAR's HTSeq-union gene-level counting. |
| 86 | + |
| 87 | +**Implementation**: New `src/quant/mod.rs` with: |
| 88 | +- `GeneAnnotation`: per-chromosome sorted interval list (absolute genome coords) built from GTF exons |
| 89 | +- `GeneCounts`: atomic per-gene counters + 3 independent N_noFeature/N_ambiguous arrays |
| 90 | +- `QuantContext`: `Arc`-shared bundle for rayon parallel threads |
| 91 | +- `--quantMode GeneCounts` + `--sjdbGTFfile` validation in `params.rs` |
| 92 | +- SE and PE counting paths in `lib.rs` |
| 93 | + |
| 94 | +**Three bugs fixed vs initial implementation**: |
| 95 | +1. **Coordinate mismatch**: GTF exon positions were stored chr-relative; `Transcript.exon.genome_start` uses absolute concatenated-genome coords. Fix: add `genome.chr_start[chr_idx]` offset when converting GTF positions. |
| 96 | +2. **Single counting pass**: All 3 columns were identical. STAR runs 3 INDEPENDENT passes — col1 (any strand), col2 (same strand as read), col3 (opposite strand) — each with separate N_noFeature and N_ambiguous. |
| 97 | +3. **Too-many-loci bucket**: These were going to N_multimapping. STAR puts them in N_unmapped. |
| 98 | + |
| 99 | +**Results vs STAR (10k SE yeast)**: |
| 100 | + |
| 101 | +| Metric | STAR | ruSTAR | |
| 102 | +|--------|------|--------| |
| 103 | +| N_unmapped | 1073 | 1074 (+1) | |
| 104 | +| N_multimapping | 661 | 661 | |
| 105 | +| N_noFeature col1/col2/col3 | 131/3653/4240 | 131/3653/4240 | |
| 106 | +| N_ambiguous col1 | 567 | 566 (-1) | |
| 107 | +| Gene total col1 | 7568 | 7568 | |
| 108 | +| Col1 gene disagreements | — | **0** | |
| 109 | +| Col2/col3 gene disagreements | — | 1 each (boundary edge case) | |
| 110 | + |
| 111 | +The ±1 discrepancies (N_unmapped + N_ambiguous) are a single read at a gene overlap boundary — likely a minor coordinate boundary difference. |
| 112 | + |
| 113 | +**Files**: `src/quant/mod.rs` (new), `src/params.rs`, `src/junction/mod.rs` (pub(crate) gtf), `src/lib.rs` |
| 114 | + |
| 115 | +**Tests**: 274/274 (added 6 new quant unit tests), 0 clippy warnings. |
| 116 | + |
| 117 | +--- |
| 118 | + |
| 119 | +## Phase 17.A: scoreSeedBest Pre-Extension ✅ (2026-04-16) |
| 120 | + |
| 121 | +**Goal**: Match STAR's `ReadAlign_stitchWindowSeeds.cpp` — pre-extend each seed left+right before the recursive DP and store the result as `pre_ext_score` on each `WindowAlignment` entry. |
| 122 | + |
| 123 | +**What STAR does**: Before `stitchWindowAligns`, STAR computes `scoreSeedBest[iS]` for every seed in the window via a two-level DP: (1) base case: `length + left_ext`, (2) chain case: `stitchAlignToTranscript(iS2→iS1) + scoreSeedBest[iS2]`. Then adds `right_ext` universally. Used for seed ordering in the recursive aligner (start from highest-scoring seed). |
| 124 | + |
| 125 | +**Implementation**: |
| 126 | + |
| 127 | +1. **`src/align/stitch.rs`** — `WindowAlignment` struct: added `pub pre_ext_score: i32` field. All construction sites updated (`pre_ext_score: length as i32` default). |
| 128 | + |
| 129 | +2. **`src/align/score.rs`** — `AlignmentScorer`: added `pub out_filter_score_min_over_lread: f64`. All constructor paths updated. |
| 130 | + |
| 131 | +3. **`src/chimeric/detect.rs`** — `WindowAlignment` construction updated. |
| 132 | + |
| 133 | +4. **`src/align/stitch.rs`** — `stitch_seeds_core`: inserted pre-extension block after seed dedup/sort, before `stitch_recurse`: |
| 134 | + - EXTEND_ORDER respected: left-first for forward clusters (`!stitch_is_reverse`), right-first for reverse clusters (matching `stitch_recurse` base case) |
| 135 | + - `right_len_prev = wa.length + first_ext.extend_len` (mirrors base case's `len_after_first`) |
| 136 | + - Chain DP: `dp[i] = max(dp[i], dp[j] + wa_entries[i].pre_ext_score)` with colinearity check |
| 137 | + - No hard pre-filter gate: STAR uses `scoreSeedBest` for ordering only, not window rejection |
| 138 | + |
| 139 | +**Key finding during implementation**: A pre-filter gate at full `outFilterScoreMinOverLread * (Lread-1)` threshold caused 42 false rejections — reads with only short seeds (9-16bp) in low-quality windows, where the full WT extension (starting from leftmost seed) can reach the threshold even though no individual seed's pre-extension does. STAR does NOT apply this gate; `scoreSeedBest` is used for seed ordering in `stitchWindowAligns` only. |
| 140 | + |
| 141 | +**Result**: 268/268 tests, 0 warnings, 8796/8926 SE (baseline maintained), 8390/8390 PE (baseline maintained). `pre_ext_score` ready for Phase 17.B seed ordering. |
| 142 | + |
| 143 | +--- |
| 144 | + |
| 145 | +## Phase 17.B: Per-Mate Seeding ✅ (2026-04-17) |
| 146 | + |
| 147 | +**What this fixes**: `.18919121` (was STAR-only) — adapter-RC at start of rc_read1 caused a 15bp Nstart shift in the combined read's mate1 seed position, triggering reverse-cluster rejection. Per-mate seeding finds mate1 seeds from `mate1_seq` directly, avoiding the adapter-RC contamination. |
| 148 | + |
| 149 | +**Root cause of original failures** (combined-read approach): |
| 150 | +- `.18919121`: Nstart positions 21, 63, 106 in the 301bp combined-read fell within `rc_mate2` = RC(adapter-contaminated mate2). The adapter RC at stitch_read[155:171] caused a 15bp seed shift for mate1, firing the reverse-cluster reject condition. |
| 151 | +- `.6302610`: In the forward cluster, rc_mate2 seeds at sa_pos=126596 (inside mate1's genome range) slipped through `fwd_reject` because the combined read blurred the mate boundary. |
| 152 | + |
| 153 | +**Implementation** (`src/align/read_align.rs`): |
| 154 | +1. **Per-mate seeding**: `Seed::find_seeds(mate1_seq, ...)` and `Seed::find_seeds(mate2_seq, ...)` separately. Each mate seeded with its own Nstart positions (0, 37, 74, 112 for 150bp reads). |
| 155 | +2. **Independent clustering**: `cluster_seeds()` called separately for each mate. |
| 156 | +3. **Independent stitching**: `stitch_seeds_with_jdb_debug()` per mate-cluster. Reverse clusters receive `mate2_seq` directly; stitch internally does RC and sets `is_reverse=true`. |
| 157 | +4. **Pairwise matching**: `try_pair_transcripts()` — checks same chr, opposite strands, within `win_bin_window_dist()` span, combined score gate. |
| 158 | +5. **Half-mapped fallback**: if no valid pair but one mate individually passes quality threshold, report as HalfMapped. |
| 159 | + |
| 160 | +**Removed from stitch.rs**: `stitch_seeds_working`, `find_mate_boundary`, `split_working_transcript`, `adjust_mate2_coords`, `adjust_wt_read_coords` — no longer needed. |
| 161 | + |
| 162 | +**Result**: `.18919121` now mapped as VIII:452300 15S134M1S + VIII:452301 133M17S (STAR: 16S133M1S + 133M17S). 1bp CIGAR difference is a seed-level tie. |
| 163 | + |
| 164 | +**Regressions from per-mate approach (known, to fix later)**: |
| 165 | +- **15 rDNA inter-copy junction reads missed**: Reads spanning the boundary between two adjacent rDNA repeat units (yeast chr XII, ~9.1kb inserts). STAR's combined-read boundary seed at position ~171 uniquely identifies the inter-copy junction. Per-mate seeding generates 55 mate1 × 9 mate2 = ~76 candidate pairs, hitting the TooManyLoci limit (>20). Root fix: apply position-dedup before TooManyLoci check (STAR's actual ordering), or implement targeted cross-boundary rescue. |
| 166 | +- **~366 extra both-mapped pairs**: Cross-copy pairings created by combining mate1 and mate2 transcripts from different repeat copies. These inflate NH counts for some multi-mappers. |
| 167 | +- **248 half-mapped pairs**: New behavior — reads where one mate individually maps but cannot pair. STAR doesn't output these by default (--outSAMunmapped None). |
| 168 | + |
| 169 | +**Test status**: 274/274, 0 clippy warnings, SE 8796/8926 maintained. |
| 170 | + |
| 171 | +--- |
| 172 | + |
| 173 | +## Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter ✅ (2026-04-17) |
| 174 | + |
| 175 | +**Problem**: 4 PE MAPQ inflations for rDNA/repeat multi-mappers. ruSTAR NH=2 vs STAR NH=3 for reads with cross-rDNA-copy pairs (M1@copy1 + M2@copy2, 9037bp gap), causing MAPQ=3 vs STAR's MAPQ=1. |
| 176 | + |
| 177 | +**Root cause**: Two distinct bugs: |
| 178 | + |
| 179 | +1. **Per-WT absolute threshold too strict** (`read_align.rs` forward/reverse cluster processing): |
| 180 | + - ruSTAR used `if adjusted_score < combined_score_threshold { continue; }` (hard cutoff at `outFilterScoreMinOverLread * (Lread-1)`) |
| 181 | + - STAR's `stitchWindowAligns.cpp:324` SCORE-GATE uses a RELATIVE criterion: `Score + outFilterMultimapScoreRange >= wTr[0]->maxScore` (within `scoreRange=1` of window best) |
| 182 | + - For cross-copy pairs: same-copy score=198 (g_span=100bp, penalty=-2), cross-copy score=197 (g_span=9237bp, penalty=-3). ruSTAR rejected cross-copy (197 < 198); STAR accepted it (197+1 ≥ 198) |
| 183 | + |
| 184 | +2. **filter_paired_transcripts applied absolute threshold per-pair** (not just to best): |
| 185 | + - ruSTAR checked every pair's `combined_wt_score < absolute_threshold` → removed cross-copy (197 < 198) |
| 186 | + - STAR's `ReadAlign_mappedFilter.cpp` checks only `trBest->maxScore >= threshold` — if the best passes, ALL pairs in the score window are kept |
| 187 | + |
| 188 | +**Fix**: |
| 189 | + |
| 190 | +1. **`src/align/read_align.rs`** — both forward and reverse cluster processing (lines 750, 972): |
| 191 | + ```rust |
| 192 | + // Old: |
| 193 | + if adjusted_score < combined_score_threshold { continue; } |
| 194 | + // New: |
| 195 | + if adjusted_score + params.out_filter_multimap_score_range < combined_score_threshold { continue; } |
| 196 | + ``` |
| 197 | + |
| 198 | +2. **`src/align/read_align.rs`** — `filter_paired_transcripts` (line 1373): |
| 199 | + - Changed from per-pair retain to best-pair quality check |
| 200 | + - Find best pair (max `combined_wt_score`); if best fails any threshold → clear all (read unmapped) |
| 201 | + - If best passes → keep all pairs (they already passed multMapSelect relative criterion) |
| 202 | + |
| 203 | +**Verification**: STAR debug trace on `.19790508` confirmed Score=197 cross-copy pair is INSERTED (`TR-INSERTED`) with `global_pass=1` because `scoreRange=1` (`outFilterMultimapScoreRange`). STAR's `mappedFilter` only checks `trBest->maxScore=198 >= 198` — passes. |
| 204 | + |
| 205 | +**Result**: 268/268 tests, 0 warnings, 8796/8926 SE (maintained), 8390/8390 PE (maintained), **0 MAPQ inflations** (was 4), **0 MAPQ deflations**, faithfulness 98.915% (was 98.903%). |
| 206 | + |
80 | 207 | --- |
81 | 208 |
|
82 | 209 | ## Phase 17.2: Coordinate-Sorted BAM — Planned |
|
0 commit comments