Skip to content

Commit e508928

Browse files
committed
combined score and dedupe toomanyloci check reorder
1 parent 90af39c commit e508928

3 files changed

Lines changed: 65 additions & 26 deletions

File tree

CLAUDE.md

Lines changed: 8 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ Always run `cargo clippy`, `cargo fmt --check`, and `cargo test` before consider
3232

3333
## Current Status
3434

35-
**274 tests passing, 0 clippy warnings.** SE: 8796/8926 compare_sam.py (98.5%), 2.2% splice rate (STAR: 2.2%), 66 shared junctions, **100.0% MAPQ agreement, MAPQ inflation: 0, deflation: 0**. 127 position disagreements (ALL verified as genuine ties). 1 CIGAR-only disagree (ERR12389696.13573895, insertion placement, seed-level tie). **0 STAR-only / 0 ruSTAR-only SE reads**. PE: **8390/8390 both-mapped (0 gap, exact STAR match)**, 0 half-mapped, **0 MAPQ inflations** (fixed Phase 17.C), **98.915% PE faithfulness** (Phase 17.C). Phase 17.A complete: `scoreSeedBest` pre-extension stored as `pre_ext_score` on each `WindowAlignment`. Phase 17.C complete: STAR-faithful SCORE-GATE + STAR-faithful `mappedFilter`. Phase 17.8 complete: `--quantMode GeneCounts` outputs `ReadsPerGene.out.tab` with 3 independent counting passes; 0 col1 gene disagreements vs STAR on 10k SE yeast. See [ROADMAP.md](ROADMAP.md) for detailed phase tracking and [docs/](docs/) for per-phase notes.
35+
**278 tests passing, 0 clippy warnings.** SE: 8796/8926 compare_sam.py (98.5%), 2.2% splice rate (STAR: 2.2%), 66 shared junctions, **100.0% MAPQ agreement, MAPQ inflation: 0, deflation: 0**. 127 position disagreements (ALL verified as genuine ties). 1 CIGAR-only disagree (ERR12389696.13573895, insertion placement, seed-level tie). **0 STAR-only / 0 ruSTAR-only SE reads**. PE: **8767 both-mapped** (STAR: 8390), **236 half-mapped** (new behavior), 28 MAPQ inflations / 192 deflations (rDNA N² problem), **98.2% per-mate position agreement**, **93.920% PE exact faithfulness** (pos+CIGAR+MAPQ+proper+NH). Phase 17.A: `scoreSeedBest` pre-extension. Phase 17.B: per-mate seeding. Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter (0 MAPQ inflations). Phase 17.D: combined-span penalty fix + dedup ordering (248→236 half-mapped). Phase 17.8: `--quantMode GeneCounts`. See [ROADMAP.md](ROADMAP.md) for detailed phase tracking and [docs/](docs/) for per-phase notes.
3636

3737
## Source Layout
3838

@@ -130,7 +130,7 @@ predicates = "3"
130130
- Every phase uses differential testing against STAR where applicable
131131
- Test data tiers: synthetic micro-genome → chr22 → full human genome
132132

133-
**Current test status**: 268/268 tests passing (264 unit + 4 integration), 0 clippy warnings
133+
**Current test status**: 278/278 tests passing (274 unit + 4 integration), 0 clippy warnings
134134

135135
## Known Issues — Disagreement Root Causes (10k SE yeast)
136136

@@ -161,13 +161,15 @@ Previously listed issues now resolved:
161161

162162
See [ROADMAP.md](ROADMAP.md) and [docs/](docs/) for full issue tracking.
163163

164-
## PE Status (Updated 2026-04-09 — Phase 16.45)
164+
## PE Status (Updated 2026-04-17 — Phase 17.D)
165165

166-
**Phase 16.45** (split_working_transcript junction split fix): **PE both-mapped = 8390/8390 (exact STAR match)**. PE MAPQ deflations: **16 → 0**. PE faithfulness: 98.784%.
166+
**Phase 17.D** (combined-span penalty + dedup ordering): **PE both-mapped = 8767** (STAR: 8390), **half-mapped = 236** (was 248 Phase 17.B), **98.2% per-mate position agreement**, **93.920% PE exact faithfulness**.
167167

168-
**Root cause fixed**: `split_working_transcript` used `wt2_junc_start = boundary_idx.min(n_junctions)` to partition junction_shifts between the two mate portions. Since the inter-mate boundary does NOT produce a junction entry, the index offset was wrong: any junctions belonging to the second portion with index < boundary_idx were silently discarded. Fix: use `wt2_junc_start = wt1_junc_end` (junction split point = number of junctions in the first portion). For the specific case boundary_idx=1 (one mate2 exon in reverse cluster), wt1_junc_end=0 but wt2_junc_start was incorrectly set to 1, causing the 145907N junction's shift (jj_r=8) to be lost. finalize_transcript's overhang check then saw empty junction_shifts and passed the spurious 139M145907N11M alignment (11 < threshold 5+8=13 → should have been rejected).
168+
**Two fixes in Phase 17.D**:
169+
1. `try_pair_transcripts` now computes STAR-faithful combined-span penalty: `combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p`. Previously double-applied per-mate span penalties → AS tag wrong for 99.6% of PE reads. Now 3.1%.
170+
2. Decision tree reordered to (1) position dedup → (2) score-range filter → (3) TooManyLoci → (4) quality filter. Fixes 12 half-mapped pairs.
169171

170-
**Current PE parity**: ruSTAR=8390, STAR=8390, **0 gap**. 2 ruSTAR-only FPs (.17779410, .6302610). 2 STAR-only mates (.18919121 half-mapped, .6302610 half-mapped). 24 MAPQ inflations (rDNA/repeat multi-mappers). 0 MAPQ deflations.
172+
**Current PE parity**: 8767 vs STAR 8390 (+377 extra, mostly rDNA N² cross-copy pairs). 236 half-mapped. 28 MAPQ inflations / 192 deflations (rDNA N² problem). `.18919121` fixed (Phase 17.B). `.6302610` still FP.
171173

172174
## Remaining Limitations (Top 5)
173175

docs/phase17_features.md

Lines changed: 23 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
# Phase 17: Features + Polish
44

5-
**Status**: In Progress (17.1, 17.5, 17.8, 17.A, 17.B, 17.C complete)
5+
**Status**: In Progress (17.1, 17.5, 17.8, 17.A, 17.B, 17.C, 17.D complete)
66

77
**Goal**: Production-ready features and quality-of-life improvements.
88

@@ -14,6 +14,7 @@
1414
| 17.A | `scoreSeedBest` pre-extension on WA entries (STAR faithful) | ✅ Complete |
1515
| 17.B | Per-mate seeding (fix `.18919121`, `.6302610` arch failures) | ✅ Complete — `.18919121` fixed; regressions under investigation |
1616
| 17.C | STAR-faithful SCORE-GATE + mappedFilter for PE (fix 4 MAPQ inflations) | ✅ Complete |
17+
| 17.D | PE combined-span penalty + dedup-before-score-range ordering (248→236 half-mapped) | ✅ Complete |
1718
| 17.2 | Coordinate-sorted BAM (`--outSAMtype BAM SortedByCoordinate`) | Planned |
1819
| 17.3 | Paired-end chimeric detection | Planned |
1920
| 17.4 | `--outReadsUnmapped Fastx` | Planned |
@@ -206,6 +207,27 @@ The ±1 discrepancies (N_unmapped + N_ambiguous) are a single read at a gene ove
206207

207208
---
208209

210+
## Phase 17.D: PE Combined-Span Penalty + Dedup Ordering ✅ (2026-04-17)
211+
212+
**Problem 1**: `try_pair_transcripts` used `combined_wt_score = t1.score + t2.score`, double-applying the genomic-length penalty (each mate's finalized score already includes its own span penalty). This inflated the combined score relative to STAR's single-span formula, causing AS tag disagreements (was 99.6% of PE reads).
213+
214+
**Fix**: STAR computes ONE genomic-length penalty over the full PE span. Per-mate approach must undo per-mate penalties and apply the combined penalty:
215+
```rust
216+
combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p
217+
```
218+
where `p1 = genomic_length_penalty(t1_span)`, `p2 = genomic_length_penalty(t2_span)`, `combined_p = genomic_length_penalty(right.genome_end - left.genome_start)`.
219+
`try_pair_transcripts` now takes `scorer: &AlignmentScorer` to call `scorer.genomic_length_penalty()`.
220+
221+
**Problem 2**: Decision tree ordering was score-range → dedup → TooManyLoci (wrong). STAR's ordering is multMapSelect → dedup → TooManyLoci. Also, dedup was running after score-range filter, meaning some duplicate pairs at the same position could escape the score-range window.
222+
223+
**Fix** (`src/align/read_align.rs`): Reordered to: (1) position dedup, (2) score-range filter (multMapSelect), (3) TooManyLoci check, (4) sort by score, (5) quality filter (mappedFilter).
224+
225+
**Result**: 278/278 tests, 0 warnings. **248 → 236 half-mapped** (12 pairs fixed by correct ordering). PE diff-AS dropped from 99.6% → 3.1%. PE both-mapped: 8767 (STAR: 8390). SE 8796/8926 maintained.
226+
227+
**Investigation note**: Attempted quality-filter fallback (retry with pre-score-range pool when score-range winner fails quality) to recover additional half-mapped reads. The specific root cause of remaining 236 half-mapped: STAR's combined-read DP finds correct mate2 alignment with 0 mismatches; ruSTAR's per-mate DP finds a different alignment at the same position with 8 mismatches (combined_nm=14 > outFilterMismatchNmax=10). The fallback recovers 35 pairs but introduces ~100 position regressions (pairs at wrong positions passing individual quality checks). Reverted.
228+
229+
---
230+
209231
## Phase 17.2: Coordinate-Sorted BAM — Planned
210232

211233
High user value. STAR outputs `--outSAMtype BAM SortedByCoordinate` natively. Options:

src/align/read_align.rs

Lines changed: 34 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -663,6 +663,7 @@ pub fn align_paired_read(
663663
len2,
664664
params,
665665
combined_score_threshold,
666+
&scorer,
666667
) {
667668
joint_pairs.push(pair);
668669
}
@@ -693,24 +694,12 @@ pub fn align_paired_read(
693694
}
694695
}
695696

696-
// --- Decision tree: score-filter, dedup, quality-filter, then half-mapped fallback ---
697-
// Step 1: score-range filter (STAR's multMapSelect).
698-
if !joint_pairs.is_empty() {
699-
let best_score = joint_pairs
700-
.iter()
701-
.map(|pa| pa.combined_wt_score)
702-
.max()
703-
.unwrap_or(0);
704-
let score_threshold = best_score - params.out_filter_multimap_score_range;
705-
joint_pairs.retain(|pa| pa.combined_wt_score >= score_threshold);
706-
}
697+
// --- Decision tree: dedup, score-filter, quality-filter, then half-mapped fallback ---
707698

708-
// Step 2: TooManyLoci check using pre-dedup count.
709-
if joint_pairs.len() > params.out_filter_multimap_nmax as usize {
710-
return Ok((Vec::new(), 0, Some(UnmappedReason::TooManyLoci)));
711-
}
712-
713-
// Step 3: position dedup — remove exact (chr, mate1_pos, mate2_pos, strand, CIGAR) duplicates.
699+
// Step 1: position dedup — remove exact (chr, mate1_pos, mate2_pos, strand, CIGAR) duplicates.
700+
// Run dedup BEFORE score-range filter so the backup pool is already deduplicated.
701+
// (STAR's ordering is multMapSelect → dedup, but dedup before multMapSelect is equivalent
702+
// since removing exact duplicates doesn't change the best score.)
714703
joint_pairs.sort_by(|a, b| {
715704
let pos_cmp = (
716705
a.mate1_transcript.chr_idx,
@@ -810,6 +799,22 @@ pub fn align_paired_read(
810799
.collect();
811800
}
812801

802+
// Step 2: score-range filter (STAR's multMapSelect).
803+
if !joint_pairs.is_empty() {
804+
let best_score = joint_pairs
805+
.iter()
806+
.map(|pa| pa.combined_wt_score)
807+
.max()
808+
.unwrap_or(0);
809+
let score_threshold = best_score - params.out_filter_multimap_score_range;
810+
joint_pairs.retain(|pa| pa.combined_wt_score >= score_threshold);
811+
}
812+
813+
// Step 3: TooManyLoci check (post-dedup, matching STAR's ordering: multMapSelect → dedup → TooManyLoci).
814+
if joint_pairs.len() > params.out_filter_multimap_nmax as usize {
815+
return Ok((Vec::new(), 0, Some(UnmappedReason::TooManyLoci)));
816+
}
817+
813818
joint_pairs.sort_by(|a, b| {
814819
b.combined_wt_score
815820
.cmp(&a.combined_wt_score)
@@ -910,6 +915,7 @@ fn try_pair_transcripts(
910915
len2: usize,
911916
params: &Parameters,
912917
combined_score_threshold: i32,
918+
scorer: &AlignmentScorer,
913919
) -> Option<PairedAlignment> {
914920
// Must be same chromosome
915921
if t1.chr_idx != t2.chr_idx {
@@ -944,8 +950,17 @@ fn try_pair_transcripts(
944950
return None;
945951
}
946952

947-
// Combined score: sum of per-mate finalized scores (each already includes per-mate span penalty)
948-
let combined_wt_score = t1.score + t2.score;
953+
// Combined score: STAR computes a single genomic-length penalty for the combined WT span.
954+
// Per-mate scores each include their own span penalty (negative). We undo those and apply
955+
// the combined span penalty, matching STAR's ReadAlign_stitchWindowSeeds.cpp behavior.
956+
// combined_score = (t1.score - p1) + (t2.score - p2) + combined_p
957+
let t1_span = t1.genome_end - t1.genome_start;
958+
let t2_span = t2.genome_end - t2.genome_start;
959+
let combined_span = right.genome_end - left.genome_start;
960+
let p1 = scorer.genomic_length_penalty(t1_span);
961+
let p2 = scorer.genomic_length_penalty(t2_span);
962+
let combined_p = scorer.genomic_length_penalty(combined_span);
963+
let combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p;
949964

950965
// SCORE-GATE: reject pairs where score is below the absolute floor
951966
if combined_wt_score + params.out_filter_multimap_score_range < combined_score_threshold {

0 commit comments

Comments
 (0)