Status: Complete (Phases 13.1-13.14)
Goal: Optimize alignment performance to approach STAR speeds, fix classification issues, and reduce accuracy gaps.
- Fixed clustering explosion (11,982 clusters → ~200 with STAR limits)
- Implemented exon building from CIGAR (line 354 TODO completed)
- Added match scoring (+1 per matched base for DP stitching)
- Implemented soft clip logic for partial alignments
- 100 reads: 100% mapped in ~5s (was: >60s, 0% mapped)
Files: src/params.rs, src/align/stitch.rs, src/align/read_align.rs
See: ALIGNMENT_FIXES.md
Problem: 66% unmapped reads (STAR: 11%) due to inflated mismatch counts.
Root Causes & Fixes:
- Reverse-strand genome offset —
count_mismatches()was reading forward genome for reverse reads. Fixed: addn_genomeoffset. - Seed length overestimation —
extend_match()verified atsa_startonly. Fixed:verify_match_at_position()per expanded seed. - DP chain start wrong — Used
expanded_seeds.first()instead of tracingprev_seed. Fixed chain start tracing. - Combined gap CIGAR missing — When
read_gap > 0ANDgenome_gap > 0, seeds merged incorrectly. Fixed to emitMatch(shared) + Del/Ins(excess). - Removed incorrect read reverse-complement — Seeds match read-as-is against RC genome region.
| Metric | Before | After | STAR |
|---|---|---|---|
| Unique | 32% | 82.4% | 82% |
| Multi | 2% | 8.3% | 7% |
| Unmapped | 66% | 9.3% | 11% |
Files: src/align/stitch.rs, src/align/score.rs, src/align/read_align.rs
Profiling (1000 reads, single-threaded): 29.5% malloc, 18.1% PackedArray::read, 15.9% cluster_seeds, 14.7% stitch_seeds.
Optimizations:
- Eliminated Vec allocations —
Seed::genome_positions()lazy iterator, pre-allocated Vecs - Optimized PackedArray::read — Fast path: direct 8-byte slice read
- Binary search
position_to_chr— O(n) → O(log n) viapartition_point() - Deduplicated expanded seeds — Cap at 200 per cluster
- Deferred CIGAR clone — Build only once for winner
| Metric | Before | After |
|---|---|---|
| Wall time (1000 reads) | ~3.0s | ~0.77s (3.9x faster) |
Files: src/align/seed.rs, src/index/packed_array.rs, src/genome/mod.rs, src/align/stitch.rs
Symptoms: CIGAR 4294953882D (near 2³²), junction coords beyond chr boundaries.
Root Causes:
- Integer overflow — Gap calculations produce negative values for overlapping seeds;
as u32wraps - CIGAR merging failure — Consecutive Match ops not merged
- Global vs per-chromosome coords — Missing
pos - genome.chr_start[chr_idx]
Files: src/align/stitch.rs, src/io/sam.rs, src/junction/sj_output.rs
outFilterIntronMotifsdefault →None(was wronglyRemoveNoncanonical)seedMultimapNmaxrestored to 10000 (was 100)- Removed multi-chr anchor check (STAR processes each independently)
- STAR-style seed overlap trimming (advance start, don't skip)
- Gap mismatch scoring:
shared_score = shared_bases - 2*mismatches alignSJstitchMismatchNmaxfilter
Files: src/params.rs, src/align/stitch.rs
Problem: 65% soft clips vs STAR's 26%, only 42% position agreement.
Implementation: extend_alignment() walks base-by-base, scoring +1 match / -1 mismatch, stops at mismatch limit. Applied at both ends of seed chain, replacing unconditional soft clips.
| Metric | Before | After | STAR |
|---|---|---|---|
| Soft clip rate | 65% | 26.6% | 25.8% |
| Position agreement | 42% | 51.0% | — |
Files: src/align/stitch.rs, src/align/score.rs, src/lib.rs
- Added
CtAc,CtGc,GtAtvariants toSpliceMotifenum detect_splice_motif()always reads forward genome, checks all 6 patterns- 100% motif agreement on 31/31 shared junctions (was 89.7%)
Files: src/align/score.rs
Enforced alignSJoverhangMin (5bp) and alignSJDBoverhangMin (3bp) during DP stitching. Previously only used in two-pass junction filtering.
Files: src/align/score.rs, src/align/stitch.rs, src/lib.rs
When alignIntronMax=0 (default), STAR computes max as 2^winBinNbits * winAnchorDistNbins = 589,824bp. Gaps exceeding this are treated as deletions. Added enforcement in both DP scoring and two-pass filtering.
Files: src/align/score.rs, src/align/stitch.rs, src/lib.rs, src/junction/mod.rs
- P0: Fixed overhang calculation (was hardcoded
5u32). Walks CIGAR formin(left_exon, right_exon). - P1:
outFilterIntronStrands = RemoveInconsistentStrands(STAR default). Rejects conflicting junction strands. - P2:
outSJfilter*params (OverhangMin, CountUniqueMin, CountTotalMin, DistToOtherSJmin) — 4-element Vecs indexed by motif category. - P3: Two-pass
filter_novel_junctions()uses motif-specific thresholds.
Files: src/lib.rs, src/params.rs, src/align/score.rs, src/align/read_align.rs, src/junction/sj_output.rs, src/junction/mod.rs
Root Cause: SA reverse-strand position encoding bug. Positions stored as RC genome offsets [0, n_genome) but used directly as forward coords. ~94% of disagreements were different-chromosome.
Fixes:
sa_pos_to_forward()— convertsn_genome - sa_pos - match_lengthfor reverse strandcluster_seeds()— convert beforeposition_to_chr()stitch_seeds()— convert for chr bounds; keep raw for DP genome access- SAM SEQ/QUAL — reverse-complement SEQ, reverse QUAL for FLAG & 16
| Metric | Before | After |
|---|---|---|
| Position agreement | 51.0% | 94.5% |
| Diff-chr MAPQ=255 | 3497 | 2 |
| STAR-only mapped | 629 | 42 |
Files: src/index/mod.rs, src/align/stitch.rs, src/io/sam.rs, src/io/fastq.rs
- CIGAR reversal for reverse strand — DP builds in RC order, SAM requires forward
- Strand-aware splice motif —
score_gap_with_strand()converts RC donor:forward_donor = n_genome - rc_donor - intron_len scoreGenomicLengthLog2scale(-0.25) — penalizes long-spanning alignments:ceil(log2(span) * scale - 0.5)
| Metric | Before | After |
|---|---|---|
| Position agreement | 94.5% | 95.3% |
| CIGAR agreement | 84.3% | 96.5% |
| Splice rate | 5.8% | 4.1% |
Files: src/align/stitch.rs, src/align/score.rs, src/params.rs
Added secondary sort keys for equal-score transcripts: smallest chr_idx → smallest genome_start → forward strand first. Output now deterministic across runs.
Files: src/align/read_align.rs
- 13.10a: Extension mismatch boundary — confirmed STAR uses
>(not>=). No change. - 13.10b: Terminal exon overhang — 12bp min for novel, 3bp for annotated (
alignSJDBoverhangMin).stitch_seeds_with_jdb(). - 13.10c:
outSJfilterIntronMaxVsReadN— [50000, 100000, 200000] max intron by read count. - 13.10d:
winReadCoverageRelativeMin(0.5) — discard sparse clusters. - 13.10e: Annotation bonus during DP —
sjdbScore(+2) for annotated junctions in DP transitions. - 13.10f: Seed/window caps —
seedPerReadNmax(1000),seedPerWindowNmax(50),alignWindowsPerReadNmax(10000).
| Metric | Before | After | STAR |
|---|---|---|---|
| Position agreement | 95.3% | 96.3% | — |
| CIGAR agreement | 96.5% | 97.4% | — |
| Splice rate | 4.1% | 0.4% | 2.5% |
| rustar-aligner-only junctions | 33 | 3 | — |
Files: src/align/stitch.rs, src/align/read_align.rs, src/align/seed.rs, src/params.rs, src/junction/sj_output.rs
Implementation (src/align/seed.rs):
Seed.search_rc: boolfield,reverse_complement_read()helperfind_seeds()runs L→R loop then R→L loop on RC read, sharedseedPerReadNmaxcapgenome_positions()converts RC seeds:(n_genome - pos - len, !is_rev)
| Metric | Before | After |
|---|---|---|
| Multi-mapped | 3.65% | 4.92% |
| Shared junctions | 9 | 30 |
| Splice rate | 0.4% | 0.9% |
Files: src/align/seed.rs, src/align/stitch.rs
Root Cause: Junction strand derived from transcript.is_reverse (wrong). STAR derives from splice motif dinucleotides.
Fixes:
- Strand from
motif.implied_strand()(GT/AG→1, CT/AC→2, None→0) - Direct
encode_motif()mapping without strand transformation
Result: Motif agreement 80% → 100% on shared junctions.
Files: src/lib.rs, src/junction/sj_output.rs
Removed 12bp terminal exon floor from DP (Phase 13.10b was too aggressive). STAR only applies 12bp at SJ.out.tab write time, not during DP. Simplified to alignSJoverhangMin (5bp novel, 3bp annotated).
| Metric | Before | After | STAR |
|---|---|---|---|
| Splice rate | 0.9% | 3.4% | 2.2% |
| Shared junctions | 30 | 50 | 72 total |
Files: src/align/stitch.rs
After all reads aligned, compute surviving junctions via outSJfilter* thresholds, then filter reads with non-surviving junctions.
compute_surviving_junctions()insj_output.rs- BySJout buffering in
lib.rs— buffers all SAM records, filters, writes survivors undo_mapped_record_bysj()instats.rs— atomic CAS for stat correction
| Mode | Position | CIGAR | Splice rate |
|---|---|---|---|
| Normal | 95.7% | 97.3% | 3.4% |
| BySJout | 96.7% | 98.3% | 1.1% |
Key Insight: BySJout too aggressive without GTF — all junctions novel → strict thresholds. With GTF, annotated junctions bypass filters.
Files: src/junction/sj_output.rs, src/junction/mod.rs, src/lib.rs, src/stats.rs