Skip to content

Commit baeea81

Browse files
authored
Merge branch 'main' into feat/run-rng-seed
2 parents 1514169 + e508928 commit baeea81

4 files changed

Lines changed: 74 additions & 30 deletions

File tree

CLAUDE.md

Lines changed: 8 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ Always run `cargo clippy`, `cargo fmt --check`, and `cargo test` before consider
3232

3333
## Current Status
3434

35-
**274 tests passing, 0 clippy warnings.** SE: 8796/8926 compare_sam.py (98.5%), 2.2% splice rate (STAR: 2.2%), 66 shared junctions, **100.0% MAPQ agreement, MAPQ inflation: 0, deflation: 0**. 127 position disagreements (ALL verified as genuine ties). 1 CIGAR-only disagree (ERR12389696.13573895, insertion placement, seed-level tie). **0 STAR-only / 0 ruSTAR-only SE reads**. PE: **8390/8390 both-mapped (0 gap, exact STAR match)**, 0 half-mapped, **0 MAPQ inflations** (fixed Phase 17.C), **98.915% PE faithfulness** (Phase 17.C). Phase 17.A complete: `scoreSeedBest` pre-extension stored as `pre_ext_score` on each `WindowAlignment`. Phase 17.C complete: STAR-faithful SCORE-GATE + STAR-faithful `mappedFilter`. Phase 17.8 complete: `--quantMode GeneCounts` outputs `ReadsPerGene.out.tab` with 3 independent counting passes; 0 col1 gene disagreements vs STAR on 10k SE yeast. See [ROADMAP.md](ROADMAP.md) for detailed phase tracking and [docs/](docs/) for per-phase notes.
35+
**278 tests passing, 0 clippy warnings.** SE: 8796/8926 compare_sam.py (98.5%), 2.2% splice rate (STAR: 2.2%), 66 shared junctions, **100.0% MAPQ agreement, MAPQ inflation: 0, deflation: 0**. 127 position disagreements (ALL verified as genuine ties). 1 CIGAR-only disagree (ERR12389696.13573895, insertion placement, seed-level tie). **0 STAR-only / 0 ruSTAR-only SE reads**. PE: **8767 both-mapped** (STAR: 8390), **236 half-mapped** (new behavior), 28 MAPQ inflations / 192 deflations (rDNA N² problem), **98.2% per-mate position agreement**, **93.920% PE exact faithfulness** (pos+CIGAR+MAPQ+proper+NH). Phase 17.A: `scoreSeedBest` pre-extension. Phase 17.B: per-mate seeding. Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter (0 MAPQ inflations). Phase 17.D: combined-span penalty fix + dedup ordering (248→236 half-mapped). Phase 17.8: `--quantMode GeneCounts`. See [ROADMAP.md](ROADMAP.md) for detailed phase tracking and [docs/](docs/) for per-phase notes.
3636

3737
## Source Layout
3838

@@ -130,7 +130,7 @@ predicates = "3"
130130
- Every phase uses differential testing against STAR where applicable
131131
- Test data tiers: synthetic micro-genome → chr22 → full human genome
132132

133-
**Current test status**: 268/268 tests passing (264 unit + 4 integration), 0 clippy warnings
133+
**Current test status**: 278/278 tests passing (274 unit + 4 integration), 0 clippy warnings
134134

135135
## Known Issues — Disagreement Root Causes (10k SE yeast)
136136

@@ -161,13 +161,15 @@ Previously listed issues now resolved:
161161

162162
See [ROADMAP.md](ROADMAP.md) and [docs/](docs/) for full issue tracking.
163163

164-
## PE Status (Updated 2026-04-09 — Phase 16.45)
164+
## PE Status (Updated 2026-04-17 — Phase 17.D)
165165

166-
**Phase 16.45** (split_working_transcript junction split fix): **PE both-mapped = 8390/8390 (exact STAR match)**. PE MAPQ deflations: **16 → 0**. PE faithfulness: 98.784%.
166+
**Phase 17.D** (combined-span penalty + dedup ordering): **PE both-mapped = 8767** (STAR: 8390), **half-mapped = 236** (was 248 Phase 17.B), **98.2% per-mate position agreement**, **93.920% PE exact faithfulness**.
167167

168-
**Root cause fixed**: `split_working_transcript` used `wt2_junc_start = boundary_idx.min(n_junctions)` to partition junction_shifts between the two mate portions. Since the inter-mate boundary does NOT produce a junction entry, the index offset was wrong: any junctions belonging to the second portion with index < boundary_idx were silently discarded. Fix: use `wt2_junc_start = wt1_junc_end` (junction split point = number of junctions in the first portion). For the specific case boundary_idx=1 (one mate2 exon in reverse cluster), wt1_junc_end=0 but wt2_junc_start was incorrectly set to 1, causing the 145907N junction's shift (jj_r=8) to be lost. finalize_transcript's overhang check then saw empty junction_shifts and passed the spurious 139M145907N11M alignment (11 < threshold 5+8=13 → should have been rejected).
168+
**Two fixes in Phase 17.D**:
169+
1. `try_pair_transcripts` now computes STAR-faithful combined-span penalty: `combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p`. Previously double-applied per-mate span penalties → AS tag wrong for 99.6% of PE reads. Now 3.1%.
170+
2. Decision tree reordered to (1) position dedup → (2) score-range filter → (3) TooManyLoci → (4) quality filter. Fixes 12 half-mapped pairs.
169171

170-
**Current PE parity**: ruSTAR=8390, STAR=8390, **0 gap**. 2 ruSTAR-only FPs (.17779410, .6302610). 2 STAR-only mates (.18919121 half-mapped, .6302610 half-mapped). 24 MAPQ inflations (rDNA/repeat multi-mappers). 0 MAPQ deflations.
172+
**Current PE parity**: 8767 vs STAR 8390 (+377 extra, mostly rDNA N² cross-copy pairs). 236 half-mapped. 28 MAPQ inflations / 192 deflations (rDNA N² problem). `.18919121` fixed (Phase 17.B). `.6302610` still FP.
171173

172174
## Remaining Limitations (Top 5)
173175

docs/phase17_features.md

Lines changed: 23 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
# Phase 17: Features + Polish
44

5-
**Status**: In Progress (17.1, 17.5, 17.8, 17.A, 17.B, 17.C complete)
5+
**Status**: In Progress (17.1, 17.5, 17.8, 17.A, 17.B, 17.C, 17.D complete)
66

77
**Goal**: Production-ready features and quality-of-life improvements.
88

@@ -14,6 +14,7 @@
1414
| 17.A | `scoreSeedBest` pre-extension on WA entries (STAR faithful) | ✅ Complete |
1515
| 17.B | Per-mate seeding (fix `.18919121`, `.6302610` arch failures) | ✅ Complete — `.18919121` fixed; regressions under investigation |
1616
| 17.C | STAR-faithful SCORE-GATE + mappedFilter for PE (fix 4 MAPQ inflations) | ✅ Complete |
17+
| 17.D | PE combined-span penalty + dedup-before-score-range ordering (248→236 half-mapped) | ✅ Complete |
1718
| 17.2 | Coordinate-sorted BAM (`--outSAMtype BAM SortedByCoordinate`) | Planned |
1819
| 17.3 | Paired-end chimeric detection | Planned |
1920
| 17.4 | `--outReadsUnmapped Fastx` | Planned |
@@ -206,6 +207,27 @@ The ±1 discrepancies (N_unmapped + N_ambiguous) are a single read at a gene ove
206207

207208
---
208209

210+
## Phase 17.D: PE Combined-Span Penalty + Dedup Ordering ✅ (2026-04-17)
211+
212+
**Problem 1**: `try_pair_transcripts` used `combined_wt_score = t1.score + t2.score`, double-applying the genomic-length penalty (each mate's finalized score already includes its own span penalty). This inflated the combined score relative to STAR's single-span formula, causing AS tag disagreements (was 99.6% of PE reads).
213+
214+
**Fix**: STAR computes ONE genomic-length penalty over the full PE span. Per-mate approach must undo per-mate penalties and apply the combined penalty:
215+
```rust
216+
combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p
217+
```
218+
where `p1 = genomic_length_penalty(t1_span)`, `p2 = genomic_length_penalty(t2_span)`, `combined_p = genomic_length_penalty(right.genome_end - left.genome_start)`.
219+
`try_pair_transcripts` now takes `scorer: &AlignmentScorer` to call `scorer.genomic_length_penalty()`.
220+
221+
**Problem 2**: Decision tree ordering was score-range → dedup → TooManyLoci (wrong). STAR's ordering is multMapSelect → dedup → TooManyLoci. Also, dedup was running after score-range filter, meaning some duplicate pairs at the same position could escape the score-range window.
222+
223+
**Fix** (`src/align/read_align.rs`): Reordered to: (1) position dedup, (2) score-range filter (multMapSelect), (3) TooManyLoci check, (4) sort by score, (5) quality filter (mappedFilter).
224+
225+
**Result**: 278/278 tests, 0 warnings. **248 → 236 half-mapped** (12 pairs fixed by correct ordering). PE diff-AS dropped from 99.6% → 3.1%. PE both-mapped: 8767 (STAR: 8390). SE 8796/8926 maintained.
226+
227+
**Investigation note**: Attempted quality-filter fallback (retry with pre-score-range pool when score-range winner fails quality) to recover additional half-mapped reads. The specific root cause of remaining 236 half-mapped: STAR's combined-read DP finds correct mate2 alignment with 0 mismatches; ruSTAR's per-mate DP finds a different alignment at the same position with 8 mismatches (combined_nm=14 > outFilterMismatchNmax=10). The fallback recovers 35 pairs but introduces ~100 position regressions (pairs at wrong positions passing individual quality checks). Reverted.
228+
229+
---
230+
209231
## Phase 17.2: Coordinate-Sorted BAM — Planned
210232

211233
High user value. STAR outputs `--outSAMtype BAM SortedByCoordinate` natively. Options:

src/align/read_align.rs

Lines changed: 42 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -699,9 +699,15 @@ pub fn align_paired_read(
699699
let mut joint_pairs: Vec<PairedAlignment> = Vec::new();
700700
for t1 in &mate1_transcripts {
701701
for t2 in &mate2_transcripts {
702-
if let Some(pair) =
703-
try_pair_transcripts(t1, t2, len1, len2, params, combined_score_threshold)
704-
{
702+
if let Some(pair) = try_pair_transcripts(
703+
t1,
704+
t2,
705+
len1,
706+
len2,
707+
params,
708+
combined_score_threshold,
709+
&scorer,
710+
) {
705711
joint_pairs.push(pair);
706712
}
707713
}
@@ -731,24 +737,12 @@ pub fn align_paired_read(
731737
}
732738
}
733739

734-
// --- Decision tree: score-filter, dedup, quality-filter, then half-mapped fallback ---
735-
// Step 1: score-range filter (STAR's multMapSelect).
736-
if !joint_pairs.is_empty() {
737-
let best_score = joint_pairs
738-
.iter()
739-
.map(|pa| pa.combined_wt_score)
740-
.max()
741-
.unwrap_or(0);
742-
let score_threshold = best_score - params.out_filter_multimap_score_range;
743-
joint_pairs.retain(|pa| pa.combined_wt_score >= score_threshold);
744-
}
745-
746-
// Step 2: TooManyLoci check using pre-dedup count.
747-
if joint_pairs.len() > params.out_filter_multimap_nmax as usize {
748-
return Ok((Vec::new(), 0, Some(UnmappedReason::TooManyLoci)));
749-
}
740+
// --- Decision tree: dedup, score-filter, quality-filter, then half-mapped fallback ---
750741

751-
// Step 3: position dedup — remove exact (chr, mate1_pos, mate2_pos, strand, CIGAR) duplicates.
742+
// Step 1: position dedup — remove exact (chr, mate1_pos, mate2_pos, strand, CIGAR) duplicates.
743+
// Run dedup BEFORE score-range filter so the backup pool is already deduplicated.
744+
// (STAR's ordering is multMapSelect → dedup, but dedup before multMapSelect is equivalent
745+
// since removing exact duplicates doesn't change the best score.)
752746
joint_pairs.sort_by(|a, b| {
753747
let pos_cmp = (
754748
a.mate1_transcript.chr_idx,
@@ -848,6 +842,22 @@ pub fn align_paired_read(
848842
.collect();
849843
}
850844

845+
// Step 2: score-range filter (STAR's multMapSelect).
846+
if !joint_pairs.is_empty() {
847+
let best_score = joint_pairs
848+
.iter()
849+
.map(|pa| pa.combined_wt_score)
850+
.max()
851+
.unwrap_or(0);
852+
let score_threshold = best_score - params.out_filter_multimap_score_range;
853+
joint_pairs.retain(|pa| pa.combined_wt_score >= score_threshold);
854+
}
855+
856+
// Step 3: TooManyLoci check (post-dedup, matching STAR's ordering: multMapSelect → dedup → TooManyLoci).
857+
if joint_pairs.len() > params.out_filter_multimap_nmax as usize {
858+
return Ok((Vec::new(), 0, Some(UnmappedReason::TooManyLoci)));
859+
}
860+
851861
joint_pairs.sort_by(|a, b| {
852862
b.combined_wt_score
853863
.cmp(&a.combined_wt_score)
@@ -953,6 +963,7 @@ fn try_pair_transcripts(
953963
len2: usize,
954964
params: &Parameters,
955965
combined_score_threshold: i32,
966+
scorer: &AlignmentScorer,
956967
) -> Option<PairedAlignment> {
957968
// Must be same chromosome
958969
if t1.chr_idx != t2.chr_idx {
@@ -987,8 +998,17 @@ fn try_pair_transcripts(
987998
return None;
988999
}
9891000

990-
// Combined score: sum of per-mate finalized scores (each already includes per-mate span penalty)
991-
let combined_wt_score = t1.score + t2.score;
1001+
// Combined score: STAR computes a single genomic-length penalty for the combined WT span.
1002+
// Per-mate scores each include their own span penalty (negative). We undo those and apply
1003+
// the combined span penalty, matching STAR's ReadAlign_stitchWindowSeeds.cpp behavior.
1004+
// combined_score = (t1.score - p1) + (t2.score - p2) + combined_p
1005+
let t1_span = t1.genome_end - t1.genome_start;
1006+
let t2_span = t2.genome_end - t2.genome_start;
1007+
let combined_span = right.genome_end - left.genome_start;
1008+
let p1 = scorer.genomic_length_penalty(t1_span);
1009+
let p2 = scorer.genomic_length_penalty(t2_span);
1010+
let combined_p = scorer.genomic_length_penalty(combined_span);
1011+
let combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p;
9921012

9931013
// SCORE-GATE: reject pairs where score is below the absolute floor
9941014
if combined_wt_score + params.out_filter_multimap_score_range < combined_score_threshold {

src/align/stitch.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2409,7 +2409,7 @@ pub(crate) fn stitch_seeds_with_jdb_debug(
24092409

24102410
/// Shared core: preprocessing + recursive stitcher, returns working transcripts + context.
24112411
#[allow(clippy::too_many_arguments)]
2412-
fn stitch_seeds_core(
2412+
pub(crate) fn stitch_seeds_core(
24132413
cluster: &SeedCluster,
24142414
read_seq: &[u8],
24152415
index: &GenomeIndex,

0 commit comments

Comments
 (0)