|
2 | 2 |
|
3 | 3 | # Phase 17: Features + Polish |
4 | 4 |
|
5 | | -**Status**: In Progress (17.1, 17.5, 17.8, 17.A, 17.C complete) |
| 5 | +**Status**: In Progress (17.1, 17.5, 17.8, 17.A, 17.B, 17.C complete) |
6 | 6 |
|
7 | 7 | **Goal**: Production-ready features and quality-of-life improvements. |
8 | 8 |
|
|
12 | 12 | |-----------|-------------|--------| |
13 | 13 | | 17.1 | Log.final.out statistics file (MultiQC/RNA-SeQC) | ✅ Complete | |
14 | 14 | | 17.A | `scoreSeedBest` pre-extension on WA entries (STAR faithful) | ✅ Complete | |
15 | | -| 17.B | Per-mate seeding (fix `.18919121`, `.6302610` arch failures) | Blocked — per-mate clustering approach caused regressions; root cause under investigation | |
| 15 | +| 17.B | Per-mate seeding (fix `.18919121`, `.6302610` arch failures) | ✅ Complete — `.18919121` fixed; regressions under investigation | |
16 | 16 | | 17.C | STAR-faithful SCORE-GATE + mappedFilter for PE (fix 4 MAPQ inflations) | ✅ Complete | |
17 | 17 | | 17.2 | Coordinate-sorted BAM (`--outSAMtype BAM SortedByCoordinate`) | Planned | |
18 | 18 | | 17.3 | Paired-end chimeric detection | Planned | |
@@ -142,19 +142,31 @@ The ±1 discrepancies (N_unmapped + N_ambiguous) are a single read at a gene ove |
142 | 142 |
|
143 | 143 | --- |
144 | 144 |
|
145 | | -## Phase 17.B: Per-Mate Seeding — Blocked (architectural) |
| 145 | +## Phase 17.B: Per-Mate Seeding ✅ (2026-04-17) |
146 | 146 |
|
147 | | -**What this fixes**: `.18919121` (STAR-only) and `.6302610` (ruSTAR-only FP) — both caused by adapter-contaminated seeds in the combined read crossing mate boundaries. |
| 147 | +**What this fixes**: `.18919121` (was STAR-only) — adapter-RC at start of rc_read1 caused a 15bp Nstart shift in the combined read's mate1 seed position, triggering reverse-cluster rejection. Per-mate seeding finds mate1 seeds from `mate1_seq` directly, avoiding the adapter-RC contamination. |
148 | 148 |
|
149 | | -**Implementation attempted (2026-04-16)**: Seeded mate1 and RC(mate2) separately, clustered each mate independently, then paired compatible (same-chr, same-strand, within-gap) clusters. This correctly prevents cross-mate adapter seed contamination. |
| 149 | +**Root cause of original failures** (combined-read approach): |
| 150 | +- `.18919121`: Nstart positions 21, 63, 106 in the 301bp combined-read fell within `rc_mate2` = RC(adapter-contaminated mate2). The adapter RC at stitch_read[155:171] caused a 15bp seed shift for mate1, firing the reverse-cluster reject condition. |
| 151 | +- `.6302610`: In the forward cluster, rc_mate2 seeds at sa_pos=126596 (inside mate1's genome range) slipped through `fwd_reject` because the combined read blurred the mate boundary. |
150 | 152 |
|
151 | | -**Outcome**: The per-mate clustering approach introduced 10 new ruSTAR-only FPs and 2 new STAR-only cases (net regression of +8 FPs), while NOT fixing the original 2 FPs. Reverted to Phase 17.A state. |
| 153 | +**Implementation** (`src/align/read_align.rs`): |
| 154 | +1. **Per-mate seeding**: `Seed::find_seeds(mate1_seq, ...)` and `Seed::find_seeds(mate2_seq, ...)` separately. Each mate seeded with its own Nstart positions (0, 37, 74, 112 for 150bp reads). |
| 155 | +2. **Independent clustering**: `cluster_seeds()` called separately for each mate. |
| 156 | +3. **Independent stitching**: `stitch_seeds_with_jdb_debug()` per mate-cluster. Reverse clusters receive `mate2_seq` directly; stitch internally does RC and sets `is_reverse=true`. |
| 157 | +4. **Pairwise matching**: `try_pair_transcripts()` — checks same chr, opposite strands, within `win_bin_window_dist()` span, combined score gate. |
| 158 | +5. **Half-mapped fallback**: if no valid pair but one mate individually passes quality threshold, report as HalfMapped. |
152 | 159 |
|
153 | | -**Root cause of regressions**: In unified clustering, a mate2 anchor creates windows that mate1 non-anchor seeds join (and vice versa). Per-mate clustering breaks this inter-mate window sharing that legitimate reads depend on (e.g. short-insert pairs where one mate has a strong anchor and the other has only weak seeds). The single-mate fallback clusters added to the pairing output also caused spurious pairings. |
| 160 | +**Removed from stitch.rs**: `stitch_seeds_working`, `find_mate_boundary`, `split_working_transcript`, `adjust_mate2_coords`, `adjust_wt_read_coords` — no longer needed. |
154 | 161 |
|
155 | | -**What STAR actually does**: STAR's `winBin` is `winBin[strand][bin]` — a 2D array indexed by strand (0=fwd, 1=rev) and genomic bin, allocated as `winBin[2][genome_size/65536]`. This is **strand-aware and identical in semantics to ruSTAR's `HashMap<(bool, u64), usize>`**. The earlier theory that STAR uses a bin-only key was incorrect (verified against STAR source in `ReadAlign_assignAlignToWindow.cpp` line 9: `uint iW=winBin[aStr][a1>>P.winBinNbits]`). |
| 162 | +**Result**: `.18919121` now mapped as VIII:452300 15S134M1S + VIII:452301 133M17S (STAR: 16S133M1S + 133M17S). 1bp CIGAR difference is a seed-level tie. |
156 | 163 |
|
157 | | -**Status**: Blocked — per-mate clustering caused regressions; actual root cause of `.6302610` and `.18919121` (post Phase 17.A) still needs fresh debugging to confirm. The 2 FPs and 2 STAR-only cases remain as known limitations. |
| 164 | +**Regressions from per-mate approach (known, to fix later)**: |
| 165 | +- **15 rDNA inter-copy junction reads missed**: Reads spanning the boundary between two adjacent rDNA repeat units (yeast chr XII, ~9.1kb inserts). STAR's combined-read boundary seed at position ~171 uniquely identifies the inter-copy junction. Per-mate seeding generates 55 mate1 × 9 mate2 = ~76 candidate pairs, hitting the TooManyLoci limit (>20). Root fix: apply position-dedup before TooManyLoci check (STAR's actual ordering), or implement targeted cross-boundary rescue. |
| 166 | +- **~366 extra both-mapped pairs**: Cross-copy pairings created by combining mate1 and mate2 transcripts from different repeat copies. These inflate NH counts for some multi-mappers. |
| 167 | +- **248 half-mapped pairs**: New behavior — reads where one mate individually maps but cannot pair. STAR doesn't output these by default (--outSAMunmapped None). |
| 168 | + |
| 169 | +**Test status**: 274/274, 0 clippy warnings, SE 8796/8926 maintained. |
158 | 170 |
|
159 | 171 | --- |
160 | 172 |
|
|
0 commit comments