Skip to content

Latest commit

 

History

History
439 lines (291 loc) · 27.5 KB

File metadata and controls

439 lines (291 loc) · 27.5 KB

← Back to ROADMAP

Phase 17: Features + Polish

Status: In Progress (17.1, 17.5, 17.8, 17.A, 17.B, 17.C, 17.D, 17.2, 17.3, 17.4, 17.6, 17.7, 17.9, 17.10, 17.11, 17.12, 17.13 complete)

Goal: Production-ready features and quality-of-life improvements.

Sub-phase Status

Sub-phase Description Status
17.1 Log.final.out statistics file (MultiQC/RNA-SeQC) ✅ Complete
17.A scoreSeedBest pre-extension on WA entries (STAR faithful) ✅ Complete
17.B Per-mate seeding (fix .18919121, .6302610 arch failures) ✅ Complete — .18919121 fixed; regressions under investigation
17.C STAR-faithful SCORE-GATE + mappedFilter for PE (fix 4 MAPQ inflations) ✅ Complete
17.D PE combined-span penalty + dedup-before-score-range ordering (248→236 half-mapped) ✅ Complete
17.2 Coordinate-sorted BAM (--outSAMtype BAM SortedByCoordinate) ✅ Complete
17.3 Paired-end chimeric detection ✅ Complete
17.4 --outReadsUnmapped Fastx ✅ Complete
17.5 Fix clippy warnings (0 warnings) ✅ Complete
17.6 --outStd SAM/BAM (stdout output for piping) ✅ Complete
17.7 GTF tag parameters (sjdbGTFchrPrefix, etc.) ✅ Complete
17.8 --quantMode GeneCounts ✅ Complete
17.9 --outBAMcompression / --limitBAMsortRAM ✅ Complete
17.10 Chimeric Tier 3 (re-map soft-clipped regions) ✅ Complete
17.11 --chimOutType WithinBAM (supplementary FLAG 0x800) ✅ Complete
17.12 BySJout memory optimization (disk buffering for 100M+ reads) ✅ Complete
17.13 Integration tests for Phase 17 features (8 tests, synthetic 20kb genome) ✅ Complete

Phase 17.1: Log.final.out ✅

Problem: No Log.final.out — MultiQC and RNA-SeQC can't parse results.

Implementation:

  1. Cargo.toml — Added chrono = "0.4" for timestamps

  2. src/stats.rs — Major expansion (+673 lines):

    • UnmappedReason enum (Other, TooShort, TooManyMismatches)
    • 15 new AtomicU64 counters: read_bases, mapped_bases/mismatches, ins/del count/bases, splices_by_motif[7], splices_annotated, unmapped breakdown, chimeric_reads
    • record_transcript_stats(&Transcript) — walks CIGAR for all counters
    • write_log_final(path, time_start, time_map_start, time_finish)
  3. src/align/read_align.rsalign_read() returns 4-tuple with Option<UnmappedReason>

  4. src/lib.rs — Timing via chrono::Local::now(), stats collected before BySJout (matches STAR)

Format: 47-char right-justified field names + |\t separator. All 37 STAR fields present.

Differential Test (10k SE):

Field rustar-aligner STAR
Input reads 10000 10000
Avg read length 150 150
Uniquely mapped % 83.11% 82.65%
Avg mapped length 146.99 146.99
Mismatch rate 0.40% 0.40%

Files: Cargo.toml, src/stats.rs, src/align/read_align.rs, src/lib.rs


Phase 17.5: Clippy Cleanup ✅

Problem: 13 clippy warnings creating noise during debugging.

Changes:

  • Removed dead code: verify_match_at_position(), unused read_seq param from cluster_seeds()
  • cluster_seeds(): 9 args → 3 args — now takes &Parameters instead of 6 windowing params
  • search_direction_sparse(): 8 args → 7 args — folded effective_start_lmax into body
  • ChimericSegment::new(): Removed constructor, all 6 call sites use struct literal syntax
  • Added AlignReadResult type alias for align_read() return type
  • Idiomatic fixes: .contains(), .div_ceil(), .saturating_sub()
  • #[allow(clippy::too_many_arguments)] on 4 functions with genuinely many distinct args

Result: 0 clippy warnings, 264/264 tests passing.


Phase 17.8: --quantMode GeneCounts ✅ (2026-04-17)

Goal: Output ReadsPerGene.out.tab matching STAR's HTSeq-union gene-level counting.

Implementation: New src/quant/mod.rs with:

  • GeneAnnotation: per-chromosome sorted interval list (absolute genome coords) built from GTF exons
  • GeneCounts: atomic per-gene counters + 3 independent N_noFeature/N_ambiguous arrays
  • QuantContext: Arc-shared bundle for rayon parallel threads
  • --quantMode GeneCounts + --sjdbGTFfile validation in params.rs
  • SE and PE counting paths in lib.rs

Three bugs fixed vs initial implementation:

  1. Coordinate mismatch: GTF exon positions were stored chr-relative; Transcript.exon.genome_start uses absolute concatenated-genome coords. Fix: add genome.chr_start[chr_idx] offset when converting GTF positions.
  2. Single counting pass: All 3 columns were identical. STAR runs 3 INDEPENDENT passes — col1 (any strand), col2 (same strand as read), col3 (opposite strand) — each with separate N_noFeature and N_ambiguous.
  3. Too-many-loci bucket: These were going to N_multimapping. STAR puts them in N_unmapped.

Results vs STAR (10k SE yeast):

Metric STAR rustar-aligner
N_unmapped 1073 1074 (+1)
N_multimapping 661 661
N_noFeature col1/col2/col3 131/3653/4240 131/3653/4240
N_ambiguous col1 567 566 (-1)
Gene total col1 7568 7568
Col1 gene disagreements 0
Col2/col3 gene disagreements 1 each (boundary edge case)

The ±1 discrepancies (N_unmapped + N_ambiguous) are a single read at a gene overlap boundary — likely a minor coordinate boundary difference.

Files: src/quant/mod.rs (new), src/params.rs, src/junction/mod.rs (pub(crate) gtf), src/lib.rs

Tests: 274/274 (added 6 new quant unit tests), 0 clippy warnings.


Phase 17.A: scoreSeedBest Pre-Extension ✅ (2026-04-16)

Goal: Match STAR's ReadAlign_stitchWindowSeeds.cpp — pre-extend each seed left+right before the recursive DP and store the result as pre_ext_score on each WindowAlignment entry.

What STAR does: Before stitchWindowAligns, STAR computes scoreSeedBest[iS] for every seed in the window via a two-level DP: (1) base case: length + left_ext, (2) chain case: stitchAlignToTranscript(iS2→iS1) + scoreSeedBest[iS2]. Then adds right_ext universally. Used for seed ordering in the recursive aligner (start from highest-scoring seed).

Implementation:

  1. src/align/stitch.rsWindowAlignment struct: added pub pre_ext_score: i32 field. All construction sites updated (pre_ext_score: length as i32 default).

  2. src/align/score.rsAlignmentScorer: added pub out_filter_score_min_over_lread: f64. All constructor paths updated.

  3. src/chimeric/detect.rsWindowAlignment construction updated.

  4. src/align/stitch.rsstitch_seeds_core: inserted pre-extension block after seed dedup/sort, before stitch_recurse:

    • EXTEND_ORDER respected: left-first for forward clusters (!stitch_is_reverse), right-first for reverse clusters (matching stitch_recurse base case)
    • right_len_prev = wa.length + first_ext.extend_len (mirrors base case's len_after_first)
    • Chain DP: dp[i] = max(dp[i], dp[j] + wa_entries[i].pre_ext_score) with colinearity check
    • No hard pre-filter gate: STAR uses scoreSeedBest for ordering only, not window rejection

Key finding during implementation: A pre-filter gate at full outFilterScoreMinOverLread * (Lread-1) threshold caused 42 false rejections — reads with only short seeds (9-16bp) in low-quality windows, where the full WT extension (starting from leftmost seed) can reach the threshold even though no individual seed's pre-extension does. STAR does NOT apply this gate; scoreSeedBest is used for seed ordering in stitchWindowAligns only.

Result: 268/268 tests, 0 warnings, 8796/8926 SE (baseline maintained), 8390/8390 PE (baseline maintained). pre_ext_score ready for Phase 17.B seed ordering.


Phase 17.B: Per-Mate Seeding ✅ (2026-04-17)

What this fixes: .18919121 (was STAR-only) — adapter-RC at start of rc_read1 caused a 15bp Nstart shift in the combined read's mate1 seed position, triggering reverse-cluster rejection. Per-mate seeding finds mate1 seeds from mate1_seq directly, avoiding the adapter-RC contamination.

Root cause of original failures (combined-read approach):

  • .18919121: Nstart positions 21, 63, 106 in the 301bp combined-read fell within rc_mate2 = RC(adapter-contaminated mate2). The adapter RC at stitch_read[155:171] caused a 15bp seed shift for mate1, firing the reverse-cluster reject condition.
  • .6302610: In the forward cluster, rc_mate2 seeds at sa_pos=126596 (inside mate1's genome range) slipped through fwd_reject because the combined read blurred the mate boundary.

Implementation (src/align/read_align.rs):

  1. Per-mate seeding: Seed::find_seeds(mate1_seq, ...) and Seed::find_seeds(mate2_seq, ...) separately. Each mate seeded with its own Nstart positions (0, 37, 74, 112 for 150bp reads).
  2. Independent clustering: cluster_seeds() called separately for each mate.
  3. Independent stitching: stitch_seeds_with_jdb_debug() per mate-cluster. Reverse clusters receive mate2_seq directly; stitch internally does RC and sets is_reverse=true.
  4. Pairwise matching: try_pair_transcripts() — checks same chr, opposite strands, within win_bin_window_dist() span, combined score gate.
  5. Half-mapped fallback: if no valid pair but one mate individually passes quality threshold, report as HalfMapped.

Removed from stitch.rs: stitch_seeds_working, find_mate_boundary, split_working_transcript, adjust_mate2_coords, adjust_wt_read_coords — no longer needed.

Result: .18919121 now mapped as VIII:452300 15S134M1S + VIII:452301 133M17S (STAR: 16S133M1S + 133M17S). 1bp CIGAR difference is a seed-level tie.

Regressions from per-mate approach (known, to fix later):

  • 15 rDNA inter-copy junction reads missed: Reads spanning the boundary between two adjacent rDNA repeat units (yeast chr XII, ~9.1kb inserts). STAR's combined-read boundary seed at position ~171 uniquely identifies the inter-copy junction. Per-mate seeding generates 55 mate1 × 9 mate2 = ~76 candidate pairs, hitting the TooManyLoci limit (>20). Root fix: apply position-dedup before TooManyLoci check (STAR's actual ordering), or implement targeted cross-boundary rescue.
  • ~366 extra both-mapped pairs: Cross-copy pairings created by combining mate1 and mate2 transcripts from different repeat copies. These inflate NH counts for some multi-mappers.
  • 248 half-mapped pairs: New behavior — reads where one mate individually maps but cannot pair. STAR doesn't output these by default (--outSAMunmapped None).

Test status: 274/274, 0 clippy warnings, SE 8796/8926 maintained.


Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter ✅ (2026-04-17)

Problem: 4 PE MAPQ inflations for rDNA/repeat multi-mappers. rustar-aligner NH=2 vs STAR NH=3 for reads with cross-rDNA-copy pairs (M1@copy1 + M2@copy2, 9037bp gap), causing MAPQ=3 vs STAR's MAPQ=1.

Root cause: Two distinct bugs:

  1. Per-WT absolute threshold too strict (read_align.rs forward/reverse cluster processing):

    • rustar-aligner used if adjusted_score < combined_score_threshold { continue; } (hard cutoff at outFilterScoreMinOverLread * (Lread-1))
    • STAR's stitchWindowAligns.cpp:324 SCORE-GATE uses a RELATIVE criterion: Score + outFilterMultimapScoreRange >= wTr[0]->maxScore (within scoreRange=1 of window best)
    • For cross-copy pairs: same-copy score=198 (g_span=100bp, penalty=-2), cross-copy score=197 (g_span=9237bp, penalty=-3). rustar-aligner rejected cross-copy (197 < 198); STAR accepted it (197+1 ≥ 198)
  2. filter_paired_transcripts applied absolute threshold per-pair (not just to best):

    • rustar-aligner checked every pair's combined_wt_score < absolute_threshold → removed cross-copy (197 < 198)
    • STAR's ReadAlign_mappedFilter.cpp checks only trBest->maxScore >= threshold — if the best passes, ALL pairs in the score window are kept

Fix:

  1. src/align/read_align.rs — both forward and reverse cluster processing (lines 750, 972):

    // Old:
    if adjusted_score < combined_score_threshold { continue; }
    // New:
    if adjusted_score + params.out_filter_multimap_score_range < combined_score_threshold { continue; }
  2. src/align/read_align.rsfilter_paired_transcripts (line 1373):

    • Changed from per-pair retain to best-pair quality check
    • Find best pair (max combined_wt_score); if best fails any threshold → clear all (read unmapped)
    • If best passes → keep all pairs (they already passed multMapSelect relative criterion)

Verification: STAR debug trace on .19790508 confirmed Score=197 cross-copy pair is INSERTED (TR-INSERTED) with global_pass=1 because scoreRange=1 (outFilterMultimapScoreRange). STAR's mappedFilter only checks trBest->maxScore=198 >= 198 — passes.

Result: 268/268 tests, 0 warnings, 8796/8926 SE (maintained), 8390/8390 PE (maintained), 0 MAPQ inflations (was 4), 0 MAPQ deflations, faithfulness 98.915% (was 98.903%).


Phase 17.D: PE Combined-Span Penalty + Dedup Ordering ✅ (2026-04-17)

Problem 1: try_pair_transcripts used combined_wt_score = t1.score + t2.score, double-applying the genomic-length penalty (each mate's finalized score already includes its own span penalty). This inflated the combined score relative to STAR's single-span formula, causing AS tag disagreements (was 99.6% of PE reads).

Fix: STAR computes ONE genomic-length penalty over the full PE span. Per-mate approach must undo per-mate penalties and apply the combined penalty:

combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p

where p1 = genomic_length_penalty(t1_span), p2 = genomic_length_penalty(t2_span), combined_p = genomic_length_penalty(right.genome_end - left.genome_start). try_pair_transcripts now takes scorer: &AlignmentScorer to call scorer.genomic_length_penalty().

Problem 2: Decision tree ordering was score-range → dedup → TooManyLoci (wrong). STAR's ordering is multMapSelect → dedup → TooManyLoci. Also, dedup was running after score-range filter, meaning some duplicate pairs at the same position could escape the score-range window.

Fix (src/align/read_align.rs): Reordered to: (1) position dedup, (2) score-range filter (multMapSelect), (3) TooManyLoci check, (4) sort by score, (5) quality filter (mappedFilter).

Result: 278/278 tests, 0 warnings. 248 → 236 half-mapped (12 pairs fixed by correct ordering). PE diff-AS dropped from 99.6% → 3.1%. PE both-mapped: 8767 (STAR: 8390). SE 8796/8926 maintained.

Investigation note: Attempted quality-filter fallback (retry with pre-score-range pool when score-range winner fails quality) to recover additional half-mapped reads. The specific root cause of remaining 236 half-mapped: STAR's combined-read DP finds correct mate2 alignment with 0 mismatches; rustar-aligner's per-mate DP finds a different alignment at the same position with 8 mismatches (combined_nm=14 > outFilterMismatchNmax=10). The fallback recovers 35 pairs but introduces ~100 position regressions (pairs at wrong positions passing individual quality checks). Reverted.


Phase 17.3: Paired-End Chimeric Detection ✅ (2026-05-01)

Goal: Detect chimeric alignments for paired-end reads.

Implementation:

  • PairedAlignResult type alias in read_align.rs resolves clippy::type_complexity for align_paired_read's 4-tuple return.
  • Two detection paths:
    1. Intra-mate: When a mate has ≥2 clusters, split by wa.mate_id + adjust mate2 read_pos -= len1+1ChimericDetector::detect_from_multi_clusters.
    2. Inter-mate: detect_inter_mate_chimeric on best single-mate transcripts (before half-mapped fallback). Detects diff-chr, same-strand, or >1Mb discordant pairs.
  • detect_inter_mate_chimeric re-exported from chimeric/mod.rs.
  • No benchmark regression (8390 both-mapped, 0 half-mapped).

Files: src/align/read_align.rs, src/chimeric/detect.rs, src/chimeric/mod.rs


Phase 17.11: --chimOutType WithinBAM ✅ (2026-05-01)

Goal: Write chimeric alignments as supplementary SAM/BAM records (FLAG 0x800) in the primary output file.

Implementation:

  • build_within_bam_records(alignment, genome, mapq) in chimeric/output.rs: returns 2 RecordBufs.
    • Donor: normal FLAGS + full SEQ (RC'd if is_reverse) + SA tag.
    • Acceptor: FLAG 0x800 supplementary + empty SEQ + SA tag.
    • SA tag format: chr,pos,strand,CIGAR,mapQ,NM; (pos = 1-based per-chromosome).
  • chim_out_junctions() / chim_out_within_bam() helpers in params.rs.
  • Junction file creation gated on chim_out_junctions() (allows WithinBAM-only mode).
  • All 4 lib.rs write paths (SE normal, SE bysj, PE normal, PE bysj) emit WithinBAM records.
  • Supports --chimOutType Junctions WithinBAM simultaneously.
  • convert_cigar changed to pub(crate) in sam.rs for use by output.rs.

Files: src/chimeric/output.rs, src/chimeric/mod.rs, src/params.rs, src/io/sam.rs, src/lib.rs


Phase 17.7: GTF Tag Parameters ✅ (2026-05-01)

Goal: Support non-standard GTF files via STAR's four GTF configuration parameters.

Parameters added (src/params.rs):

Parameter Default Purpose
--sjdbGTFchrPrefix "" Prefix to add to GTF chromosome names (e.g. "chr" when GTF uses bare 1,2,3)
--sjdbGTFfeatureExon "exon" Feature column value to use as exons
--sjdbGTFtagExonParentTranscript "transcript_id" GTF attribute for grouping exons into transcripts
--sjdbGTFtagExonParentGene "gene_id" GTF attribute for gene grouping in quantification

Implementation strategy: _configured variant functions carry the params; original functions remain as backward-compatible wrappers (parse_gtf, extract_junctions_from_exons, from_gtf_exons) calling with defaults. Zero test disruption (~50 test call sites unchanged).

  • parse_gtf_configured(path, feature_exon, chr_prefix) — filters by feature_exon, prepends chr_prefix to seqnames.
  • extract_junctions_configured(exons, genome, transcript_tag) — groups transcripts by transcript_tag.
  • GeneAnnotation::from_gtf_exons_configured(exons, genome, gene_tag) — uses gene_tag for quant.
  • TranscriptomeIndex::from_gtf_exons_configured(exons, genome, transcript_tag, gene_tag).
  • SpliceJunctionDb::from_gtf_configured(path, genome, feature_exon, chr_prefix, transcript_tag).
  • QuantContext::build(path, genome, feature_exon, chr_prefix, gene_tag) — updated signature.

All 4 production paths thread params: index/mod.rs (genomeGenerate), index/io.rs (load-time fallback), lib.rs (quantMode GeneCounts), junction/mod.rs (junction DB).

Tests added (3 new in gtf.rs): test_parse_gtf_configured_chr_prefix, test_parse_gtf_configured_custom_feature, test_extract_junctions_configured_custom_transcript_tag.

Files: src/params.rs, src/junction/gtf.rs, src/junction/mod.rs, src/quant/mod.rs, src/quant/transcriptome.rs, src/index/mod.rs, src/index/io.rs, src/lib.rs

Result: 379/379 tests, 0 clippy warnings.


Phase 17.9: --outBAMcompression / --limitBAMsortRAM ✅ (2026-05-01)

Goal: Control BGZF compression level for BAM output and cap memory usage during coordinate-sorted BAM buffering.

Parameters added (src/params.rs):

Parameter Default Purpose
--outBAMcompression 1 BGZF compression level: -1/0=uncompressed, 1-8=flate2 levels, ≥9=BEST
--limitBAMsortRAM 0 Max bytes for sorted BAM in-memory buffering; 0=unlimited

Implementation (src/io/bam.rs):

  • bgzf_compression(level: i32) -> noodles::bgzf::writer::CompressionLevel — maps: ≤0→NONE, ≥9→BEST, 1-8→try_from(u8).
  • make_bgzf_writer<W: Write>(inner, compression) -> noodles::bgzf::Writer<W> — uses noodles::bgzf::writer::Builder::default().set_compression_level(...).build_from_writer(inner).
  • SortedBamWriter + SortedBamStdoutWriter: added compression: i32 and limit_bam_sort_ram: u64 fields.
  • SortedBamWriter::estimated_ram()records.len() as u64 * 400 (rough 400 bytes/record).
  • SortedBamWriter::check_ram_limit() → returns Error::Alignment if limit > 0 && estimated > limit.
  • BamWriter::with_header(header, path, compression) — 3rd arg; callers pass params.out_bam_compression.
  • All 4 BAM writers use make_bgzf_writer with the configured level.

Tests added (3 new): test_bam_compression_zero, test_bam_sort_ram_unlimited, test_bam_sort_ram_exceeded.

Files: src/params.rs, src/io/bam.rs

Result: 382/382 tests, 0 clippy warnings.


Phase 17.2: Coordinate-Sorted BAM ✅ (2026-04-29)

SortedBamWriter in src/io/bam.rs buffers all RecordBuf records, sorts by (reference_sequence_id, alignment_start) on finish(), writes one BAM with SO:coordinate in @HD. Output filename: Aligned.sortedByCoord.out.bam. Verified with samtools quickcheck + samtools index.


PE chimericDetectionOld ✅ (2026-05-01)

Goal: Run Tier 1 chimeric detection (detect_chimeric_old) per-mate for paired-end reads, mirroring STAR's post-stitching chimeric search for each fragment.

Implementation (src/align/read_align.rs):

  • Added all_m1_transcripts: Vec<Transcript> and all_m2_transcripts: Vec<Transcript> before the cluster loop.
  • In the Some((m1_wt, m2_wt)) joint-pair branch: push clones when chim_segment_min > 0.
  • In single-mate all_m1/all_m2 branches: push before single_mate_transcripts.push(t).
  • After filter_paired_transcripts, before BothMapped early return: runs detect_chimeric_old on each mate's pool (using best transcript as tr_best), filters by chim_segment_min + chim_score_min, extends pe_chimeric.

Phase 17.12: BySJout Disk Buffering ✅ (2026-05-01)

Goal: Replace in-memory Vec<AlignmentBatchResults> with temp-file + compact metadata to support 100M+ read datasets without requiring ~60GB RAM for BySJout mode.

Implementation:

  • BySJReadMeta struct: n_sam_records: u32, junction_keys, chimeric_alns, transcriptome_records — only per-read metadata stays in RAM.
  • SAM records go to tempfile::NamedTempFile via bysj_write_records (noodles SAM writer).
  • Filter phase: drop(bysj_temp_writer) flushes BufWriter, reopen via tf.reopen(), create SAM reader, iterate bysj_meta — pass reads via bysj_read_n_records(..., true), skip filtered reads via bysj_read_n_records(..., false).
  • tempfile moved from [dev-dependencies] to [dependencies] in Cargo.toml.
  • Helper functions in src/io/sam.rs: create_bysj_writer, bysj_write_records, bysj_read_n_records.
  • Applied to both SE (align_reads_single_end) and PE (align_reads_paired_end).

Phase 17.13: Integration Tests ✅ (2026-05-01)

Goal: 8 end-to-end integration tests covering Phase 17 features using a synthetic 20kb genome with planted splice structure.

Genome design (tests/alignment_features.rs):

  • 20kb pseudo-random genome with planted GT-AG intron (Exon1@10000-10049, GT intron@10050-10249, Exon2@10250-10299).
  • Uses --genomeSAindexNbases 7 (satisfies 2^14 ≤ 20000).
  • GTF: two exon records for G1/T1 transcript.

Tests:

  1. test_bam_unsorted_output--outSAMtype BAM Unsorted produces valid BAM
  2. test_bam_sorted_output--outSAMtype BAM SortedByCoordinate produces sorted BAM
  3. test_paired_end_alignment — PE reads mapped, NH=1 each
  4. test_spliced_alignment — reads spanning planted intron get expected CIGAR (25M200N25M)
  5. test_bysj_filtering — BySJout mode passes spliced reads, filters unspliced
  6. test_gene_counts_outputReadsPerGene.out.tab written with correct columns
  7. test_unmapped_reads_output--outReadsUnmapped Fastx writes Unmapped.out.mate1
  8. test_two_pass_mode — two-pass alignment runs without error

Phase 12.2: SE Chimeric Tier 1b — Soft-Clip Re-mapping ✅ (2026-05-04)

Goal: When detect_chimeric_old finds no chimeric partner in the existing transcript pool, re-seed the primary alignment's soft-clipped bases to find a chimeric partner de novo.

Implementation (src/chimeric/detect.rs):

  • ChimericDetector::detect_from_soft_clips — extracts right/left soft-clip sub-sequence from read_seq, runs Seed::find_seedscluster_seedsstitch_seeds_with_jdb, adjusts exon read positions for right clips (adds clip_start offset), then applies the same score/overhang/geometry filters as detect_chimeric_old.
  • adjust_read_positions(tr, offset) helper — shifts all exon read_start/read_end by offset for sub-sequence stitching results.
  • Called as Step 3c in src/align/read_align.rs, only when chimeric_alignments.is_empty() after Step 3b.

Result: 396/396 tests, 0 clippy warnings.


Phase 17.10: Chimeric Tier 3 — Residual Re-mapping ✅ (2026-05-04)

Goal: After any chimeric pair is found (by Tier 1 or 2), re-seed the outer read regions not covered by either segment to detect multi-junction gene fusions (3-way chimeras).

Implementation (src/chimeric/detect.rs):

  • ChimericDetector::detect_from_chimeric_residuals — computes left_covered = min(donor.read_start, acceptor.read_start) and right_covered = max(donor.read_end, acceptor.read_end); for each outer span >= chimSegmentMin, re-seeds the uncovered sub-sequence and pairs the result with the adjacent chimeric segment using the same score/overhang/geometry filters.
  • Called as Step 3d in src/align/read_align.rs on each element of chimeric_alignments after Steps 3b and 3c; results are appended to chimeric_alignments.

Chimeric pipeline is now 4-tier: Tier 1 (transcript-pair search) → Tier 2 (multi-cluster) → Tier 1b (primary soft-clip re-seed) → Tier 3 (residual outer re-seed).

Result: 396/396 tests, 0 clippy warnings.


PE AS Diff Investigation — Root Causes (2026-05-01)

Starting state: 6 PE AS diffs → 4 after Phase G3 SA fix.

All 4 remaining diffs are rustar-aligner improvements, not bugs.

.844151 (2 mates, AS diff = +12)

  • rustar-aligner: VIII:451791 146M4S/3S146M1S MAPQ=255 AS=290 nM=0 (perfect match)
  • STAR: VII:1001391 146M4S/3S146M1S MAPQ=255 AS=278 nM=6 (3 mismatches per mate)

STAR's window-based seeding finds WINDOW[0] at Chr VIII (1 seed per mate: MMP len=146, unique) and WINDOW[1] at Chr VII (2 seeds per mate: shorter seeds from RC direction that also match the no-mismatch region pos 111-149). STAR only forms a combined PE pair score in windows with multiple seeds per mate — WINDOW[0] gets individual mate scores (144/145), WINDOW[1] gets combined pair score=278. Since 278 > max(144,145), STAR picks Chr VII.

rustar-aligner's per-mate approach independently finds each mate at VIII (perfect 0mm), try_pair_transcripts combines → AS=290 > 278 → picks the objectively better VIII alignment.

.4972950 (2 mates, AS diff = +12)

  • rustar-aligner mate2: X:120783 1S33M72N50M186N65M1S — 148 bases matched, 2 canonical introns, AS=260
  • STAR mate2: X:120953 27S122M1S — 122 bases matched, 10 combined mismatches, AS=248

rustar-aligner's stitching finds the proper spliced alignment (GT-AG junctions, more bases covered). STAR settles for a lower-scoring unspliced alignment with 27bp soft-clipping. rustar-aligner wins by 12 AS points and 26 more matched bases.

Conclusion: Fixing these would require replicating STAR's suboptimal window-stitching failure at VIII, and degrading rustar-aligner's correct splice detection. Accept as known improvements.


Phase 14: STARsolo (Single-Cell) — DEFERRED

Prerequisite: All accuracy gaps resolved, position agreement >99%. (Current: 99.92% parity excluding unavoidable ties)