Merge branch 'main' into feat/run-rng-seed

Psy-Fer · web-flow · commit baeea81db12d · 2026-04-20T12:52:49.000+10:00
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -32,7 +32,7 @@ Always run `cargo clippy`, `cargo fmt --check`, and `cargo test` before consider
 
 ## Current Status
 
-**274 tests passing, 0 clippy warnings.** SE: 8796/8926 compare_sam.py (98.5%), 2.2% splice rate (STAR: 2.2%), 66 shared junctions, **100.0% MAPQ agreement, MAPQ inflation: 0, deflation: 0**. 127 position disagreements (ALL verified as genuine ties). 1 CIGAR-only disagree (ERR12389696.13573895, insertion placement, seed-level tie). **0 STAR-only / 0 ruSTAR-only SE reads**. PE: **8390/8390 both-mapped (0 gap, exact STAR match)**, 0 half-mapped, **0 MAPQ inflations** (fixed Phase 17.C), **98.915% PE faithfulness** (Phase 17.C). Phase 17.A complete: `scoreSeedBest` pre-extension stored as `pre_ext_score` on each `WindowAlignment`. Phase 17.C complete: STAR-faithful SCORE-GATE + STAR-faithful `mappedFilter`. Phase 17.8 complete: `--quantMode GeneCounts` outputs `ReadsPerGene.out.tab` with 3 independent counting passes; 0 col1 gene disagreements vs STAR on 10k SE yeast. See [ROADMAP.md](ROADMAP.md) for detailed phase tracking and [docs/](docs/) for per-phase notes.
+**278 tests passing, 0 clippy warnings.** SE: 8796/8926 compare_sam.py (98.5%), 2.2% splice rate (STAR: 2.2%), 66 shared junctions, **100.0% MAPQ agreement, MAPQ inflation: 0, deflation: 0**. 127 position disagreements (ALL verified as genuine ties). 1 CIGAR-only disagree (ERR12389696.13573895, insertion placement, seed-level tie). **0 STAR-only / 0 ruSTAR-only SE reads**. PE: **8767 both-mapped** (STAR: 8390), **236 half-mapped** (new behavior), 28 MAPQ inflations / 192 deflations (rDNA N² problem), **98.2% per-mate position agreement**, **93.920% PE exact faithfulness** (pos+CIGAR+MAPQ+proper+NH). Phase 17.A: `scoreSeedBest` pre-extension. Phase 17.B: per-mate seeding. Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter (0 MAPQ inflations). Phase 17.D: combined-span penalty fix + dedup ordering (248→236 half-mapped). Phase 17.8: `--quantMode GeneCounts`. See [ROADMAP.md](ROADMAP.md) for detailed phase tracking and [docs/](docs/) for per-phase notes.
 
 ## Source Layout
 
@@ -130,7 +130,7 @@ predicates = "3"
 - Every phase uses differential testing against STAR where applicable
 - Test data tiers: synthetic micro-genome → chr22 → full human genome
 
-**Current test status**: 268/268 tests passing (264 unit + 4 integration), 0 clippy warnings
+**Current test status**: 278/278 tests passing (274 unit + 4 integration), 0 clippy warnings
 
 ## Known Issues — Disagreement Root Causes (10k SE yeast)
 
@@ -161,13 +161,15 @@ Previously listed issues now resolved:
 
 See [ROADMAP.md](ROADMAP.md) and [docs/](docs/) for full issue tracking.
 
-## PE Status (Updated 2026-04-09 — Phase 16.45)
+## PE Status (Updated 2026-04-17 — Phase 17.D)
 
-**Phase 16.45** (split_working_transcript junction split fix): **PE both-mapped = 8390/8390 (exact STAR match)**. PE MAPQ deflations: **16 → 0**. PE faithfulness: 98.784%.
+**Phase 17.D** (combined-span penalty + dedup ordering): **PE both-mapped = 8767** (STAR: 8390), **half-mapped = 236** (was 248 Phase 17.B), **98.2% per-mate position agreement**, **93.920% PE exact faithfulness**.
 
-**Root cause fixed**: `split_working_transcript` used `wt2_junc_start = boundary_idx.min(n_junctions)` to partition junction_shifts between the two mate portions. Since the inter-mate boundary does NOT produce a junction entry, the index offset was wrong: any junctions belonging to the second portion with index < boundary_idx were silently discarded. Fix: use `wt2_junc_start = wt1_junc_end` (junction split point = number of junctions in the first portion). For the specific case boundary_idx=1 (one mate2 exon in reverse cluster), wt1_junc_end=0 but wt2_junc_start was incorrectly set to 1, causing the 145907N junction's shift (jj_r=8) to be lost. finalize_transcript's overhang check then saw empty junction_shifts and passed the spurious 139M145907N11M alignment (11 < threshold 5+8=13 → should have been rejected).
+**Two fixes in Phase 17.D**:
+1. `try_pair_transcripts` now computes STAR-faithful combined-span penalty: `combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p`. Previously double-applied per-mate span penalties → AS tag wrong for 99.6% of PE reads. Now 3.1%.
+2. Decision tree reordered to (1) position dedup → (2) score-range filter → (3) TooManyLoci → (4) quality filter. Fixes 12 half-mapped pairs.
 
-**Current PE parity**: ruSTAR=8390, STAR=8390, **0 gap**. 2 ruSTAR-only FPs (.17779410, .6302610). 2 STAR-only mates (.18919121 half-mapped, .6302610 half-mapped). 24 MAPQ inflations (rDNA/repeat multi-mappers). 0 MAPQ deflations.
+**Current PE parity**: 8767 vs STAR 8390 (+377 extra, mostly rDNA N² cross-copy pairs). 236 half-mapped. 28 MAPQ inflations / 192 deflations (rDNA N² problem). `.18919121` fixed (Phase 17.B). `.6302610` still FP.
 
 ## Remaining Limitations (Top 5)
 
diff --git a/docs/phase17_features.md b/docs/phase17_features.md
@@ -2,7 +2,7 @@
 
 # Phase 17: Features + Polish
 
-**Status**: In Progress (17.1, 17.5, 17.8, 17.A, 17.B, 17.C complete)
+**Status**: In Progress (17.1, 17.5, 17.8, 17.A, 17.B, 17.C, 17.D complete)
 
 **Goal**: Production-ready features and quality-of-life improvements.
 
@@ -14,6 +14,7 @@
 | 17.A | `scoreSeedBest` pre-extension on WA entries (STAR faithful) | ✅ Complete |
 | 17.B | Per-mate seeding (fix `.18919121`, `.6302610` arch failures) | ✅ Complete — `.18919121` fixed; regressions under investigation |
 | 17.C | STAR-faithful SCORE-GATE + mappedFilter for PE (fix 4 MAPQ inflations) | ✅ Complete |
+| 17.D | PE combined-span penalty + dedup-before-score-range ordering (248→236 half-mapped) | ✅ Complete |
 | 17.2 | Coordinate-sorted BAM (`--outSAMtype BAM SortedByCoordinate`) | Planned |
 | 17.3 | Paired-end chimeric detection | Planned |
 | 17.4 | `--outReadsUnmapped Fastx` | Planned |
@@ -206,6 +207,27 @@ The ±1 discrepancies (N_unmapped + N_ambiguous) are a single read at a gene ove
 
 ---
 
+## Phase 17.D: PE Combined-Span Penalty + Dedup Ordering ✅ (2026-04-17)
+
+**Problem 1**: `try_pair_transcripts` used `combined_wt_score = t1.score + t2.score`, double-applying the genomic-length penalty (each mate's finalized score already includes its own span penalty). This inflated the combined score relative to STAR's single-span formula, causing AS tag disagreements (was 99.6% of PE reads).
+
+**Fix**: STAR computes ONE genomic-length penalty over the full PE span. Per-mate approach must undo per-mate penalties and apply the combined penalty:
+```rust
+combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p
+```
+where `p1 = genomic_length_penalty(t1_span)`, `p2 = genomic_length_penalty(t2_span)`, `combined_p = genomic_length_penalty(right.genome_end - left.genome_start)`.
+`try_pair_transcripts` now takes `scorer: &AlignmentScorer` to call `scorer.genomic_length_penalty()`.
+
+**Problem 2**: Decision tree ordering was score-range → dedup → TooManyLoci (wrong). STAR's ordering is multMapSelect → dedup → TooManyLoci. Also, dedup was running after score-range filter, meaning some duplicate pairs at the same position could escape the score-range window.
+
+**Fix** (`src/align/read_align.rs`): Reordered to: (1) position dedup, (2) score-range filter (multMapSelect), (3) TooManyLoci check, (4) sort by score, (5) quality filter (mappedFilter).
+
+**Result**: 278/278 tests, 0 warnings. **248 → 236 half-mapped** (12 pairs fixed by correct ordering). PE diff-AS dropped from 99.6% → 3.1%. PE both-mapped: 8767 (STAR: 8390). SE 8796/8926 maintained.
+
+**Investigation note**: Attempted quality-filter fallback (retry with pre-score-range pool when score-range winner fails quality) to recover additional half-mapped reads. The specific root cause of remaining 236 half-mapped: STAR's combined-read DP finds correct mate2 alignment with 0 mismatches; ruSTAR's per-mate DP finds a different alignment at the same position with 8 mismatches (combined_nm=14 > outFilterMismatchNmax=10). The fallback recovers 35 pairs but introduces ~100 position regressions (pairs at wrong positions passing individual quality checks). Reverted.
+
+---
+
 ## Phase 17.2: Coordinate-Sorted BAM — Planned
 
 High user value. STAR outputs `--outSAMtype BAM SortedByCoordinate` natively. Options:
diff --git a/src/align/read_align.rs b/src/align/read_align.rs
@@ -699,9 +699,15 @@ pub fn align_paired_read(
     let mut joint_pairs: Vec<PairedAlignment> = Vec::new();
     for t1 in &mate1_transcripts {
         for t2 in &mate2_transcripts {
-            if let Some(pair) =
-                try_pair_transcripts(t1, t2, len1, len2, params, combined_score_threshold)
-            {
+            if let Some(pair) = try_pair_transcripts(
+                t1,
+                t2,
+                len1,
+                len2,
+                params,
+                combined_score_threshold,
+                &scorer,
+            ) {
                 joint_pairs.push(pair);
             }
         }
@@ -731,24 +737,12 @@ pub fn align_paired_read(
         }
     }
 
-    // --- Decision tree: score-filter, dedup, quality-filter, then half-mapped fallback ---
-    // Step 1: score-range filter (STAR's multMapSelect).
-    if !joint_pairs.is_empty() {
-        let best_score = joint_pairs
-            .iter()
-            .map(|pa| pa.combined_wt_score)
-            .max()
-            .unwrap_or(0);
-        let score_threshold = best_score - params.out_filter_multimap_score_range;
-        joint_pairs.retain(|pa| pa.combined_wt_score >= score_threshold);
-    }
-
-    // Step 2: TooManyLoci check using pre-dedup count.
-    if joint_pairs.len() > params.out_filter_multimap_nmax as usize {
-        return Ok((Vec::new(), 0, Some(UnmappedReason::TooManyLoci)));
-    }
+    // --- Decision tree: dedup, score-filter, quality-filter, then half-mapped fallback ---
 
-    // Step 3: position dedup — remove exact (chr, mate1_pos, mate2_pos, strand, CIGAR) duplicates.
+    // Step 1: position dedup — remove exact (chr, mate1_pos, mate2_pos, strand, CIGAR) duplicates.
+    // Run dedup BEFORE score-range filter so the backup pool is already deduplicated.
+    // (STAR's ordering is multMapSelect → dedup, but dedup before multMapSelect is equivalent
+    // since removing exact duplicates doesn't change the best score.)
     joint_pairs.sort_by(|a, b| {
         let pos_cmp = (
             a.mate1_transcript.chr_idx,
@@ -848,6 +842,22 @@ pub fn align_paired_read(
             .collect();
     }
 
+    // Step 2: score-range filter (STAR's multMapSelect).
+    if !joint_pairs.is_empty() {
+        let best_score = joint_pairs
+            .iter()
+            .map(|pa| pa.combined_wt_score)
+            .max()
+            .unwrap_or(0);
+        let score_threshold = best_score - params.out_filter_multimap_score_range;
+        joint_pairs.retain(|pa| pa.combined_wt_score >= score_threshold);
+    }
+
+    // Step 3: TooManyLoci check (post-dedup, matching STAR's ordering: multMapSelect → dedup → TooManyLoci).
+    if joint_pairs.len() > params.out_filter_multimap_nmax as usize {
+        return Ok((Vec::new(), 0, Some(UnmappedReason::TooManyLoci)));
+    }
+
     joint_pairs.sort_by(|a, b| {
         b.combined_wt_score
             .cmp(&a.combined_wt_score)
@@ -953,6 +963,7 @@ fn try_pair_transcripts(
     len2: usize,
     params: &Parameters,
     combined_score_threshold: i32,
+    scorer: &AlignmentScorer,
 ) -> Option<PairedAlignment> {
     // Must be same chromosome
     if t1.chr_idx != t2.chr_idx {
@@ -987,8 +998,17 @@ fn try_pair_transcripts(
         return None;
     }
 
-    // Combined score: sum of per-mate finalized scores (each already includes per-mate span penalty)
-    let combined_wt_score = t1.score + t2.score;
+    // Combined score: STAR computes a single genomic-length penalty for the combined WT span.
+    // Per-mate scores each include their own span penalty (negative). We undo those and apply
+    // the combined span penalty, matching STAR's ReadAlign_stitchWindowSeeds.cpp behavior.
+    // combined_score = (t1.score - p1) + (t2.score - p2) + combined_p
+    let t1_span = t1.genome_end - t1.genome_start;
+    let t2_span = t2.genome_end - t2.genome_start;
+    let combined_span = right.genome_end - left.genome_start;
+    let p1 = scorer.genomic_length_penalty(t1_span);
+    let p2 = scorer.genomic_length_penalty(t2_span);
+    let combined_p = scorer.genomic_length_penalty(combined_span);
+    let combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p;
 
     // SCORE-GATE: reject pairs where score is below the absolute floor
     if combined_wt_score + params.out_filter_multimap_score_range < combined_score_threshold {
diff --git a/src/align/stitch.rs b/src/align/stitch.rs
@@ -2409,7 +2409,7 @@ pub(crate) fn stitch_seeds_with_jdb_debug(
 
 /// Shared core: preprocessing + recursive stitcher, returns working transcripts + context.
 #[allow(clippy::too_many_arguments)]
-fn stitch_seeds_core(
+pub(crate) fn stitch_seeds_core(
     cluster: &SeedCluster,
     read_seq: &[u8],
     index: &GenomeIndex,