Phase 17: Features + Polish

Status: In Progress (17.1, 17.5, 17.8, 17.A, 17.B, 17.C, 17.D, 17.2, 17.3, 17.4, 17.6, 17.7, 17.9, 17.10, 17.11, 17.12, 17.13 complete)

Goal: Production-ready features and quality-of-life improvements.

Sub-phase Status

Sub-phase	Description	Status
17.1	Log.final.out statistics file (MultiQC/RNA-SeQC)	✅ Complete
17.A	`scoreSeedBest` pre-extension on WA entries (STAR faithful)	✅ Complete
17.B	Per-mate seeding (fix `.18919121`, `.6302610` arch failures)	✅ Complete — `.18919121` fixed; regressions under investigation
17.C	STAR-faithful SCORE-GATE + mappedFilter for PE (fix 4 MAPQ inflations)	✅ Complete
17.D	PE combined-span penalty + dedup-before-score-range ordering (248→236 half-mapped)	✅ Complete
17.2	Coordinate-sorted BAM (`--outSAMtype BAM SortedByCoordinate`)	✅ Complete
17.3	Paired-end chimeric detection	✅ Complete
17.4	`--outReadsUnmapped Fastx`	✅ Complete
17.5	Fix clippy warnings (0 warnings)	✅ Complete
17.6	`--outStd SAM/BAM` (stdout output for piping)	✅ Complete
17.7	GTF tag parameters (`sjdbGTFchrPrefix`, etc.)	✅ Complete
17.8	`--quantMode GeneCounts`	✅ Complete
17.9	`--outBAMcompression` / `--limitBAMsortRAM`	✅ Complete
17.10	Chimeric Tier 3 (re-map soft-clipped regions)	✅ Complete
17.11	`--chimOutType WithinBAM` (supplementary FLAG 0x800)	✅ Complete
17.12	BySJout memory optimization (disk buffering for 100M+ reads)	✅ Complete
17.13	Integration tests for Phase 17 features (8 tests, synthetic 20kb genome)	✅ Complete

Phase 17.1: Log.final.out ✅

Problem: No Log.final.out — MultiQC and RNA-SeQC can't parse results.

Implementation:

Cargo.toml — Added chrono = "0.4" for timestamps
src/stats.rs — Major expansion (+673 lines):
- UnmappedReason enum (Other, TooShort, TooManyMismatches)
- 15 new AtomicU64 counters: read_bases, mapped_bases/mismatches, ins/del count/bases, splices_by_motif[7], splices_annotated, unmapped breakdown, chimeric_reads
- record_transcript_stats(&Transcript) — walks CIGAR for all counters
- write_log_final(path, time_start, time_map_start, time_finish)
src/align/read_align.rs — align_read() returns 4-tuple with Option<UnmappedReason>
src/lib.rs — Timing via chrono::Local::now(), stats collected before BySJout (matches STAR)

Format: 47-char right-justified field names + |\t separator. All 37 STAR fields present.

Differential Test (10k SE):

Field	rustar-aligner	STAR
Input reads	10000	10000
Avg read length	150	150
Uniquely mapped %	83.11%	82.65%
Avg mapped length	146.99	146.99
Mismatch rate	0.40%	0.40%

Files: Cargo.toml, src/stats.rs, src/align/read_align.rs, src/lib.rs

Phase 17.5: Clippy Cleanup ✅

Problem: 13 clippy warnings creating noise during debugging.

Changes:

Removed dead code: verify_match_at_position(), unused read_seq param from cluster_seeds()
cluster_seeds(): 9 args → 3 args — now takes &Parameters instead of 6 windowing params
search_direction_sparse(): 8 args → 7 args — folded effective_start_lmax into body
ChimericSegment::new(): Removed constructor, all 6 call sites use struct literal syntax
Added AlignReadResult type alias for align_read() return type
Idiomatic fixes: .contains(), .div_ceil(), .saturating_sub()
#[allow(clippy::too_many_arguments)] on 4 functions with genuinely many distinct args

Result: 0 clippy warnings, 264/264 tests passing.

Phase 17.8: `--quantMode GeneCounts` ✅ (2026-04-17)

Goal: Output ReadsPerGene.out.tab matching STAR's HTSeq-union gene-level counting.

Implementation: New src/quant/mod.rs with:

GeneAnnotation: per-chromosome sorted interval list (absolute genome coords) built from GTF exons
GeneCounts: atomic per-gene counters + 3 independent N_noFeature/N_ambiguous arrays
QuantContext: Arc-shared bundle for rayon parallel threads
--quantMode GeneCounts + --sjdbGTFfile validation in params.rs
SE and PE counting paths in lib.rs

Three bugs fixed vs initial implementation:

Coordinate mismatch: GTF exon positions were stored chr-relative; Transcript.exon.genome_start uses absolute concatenated-genome coords. Fix: add genome.chr_start[chr_idx] offset when converting GTF positions.
Single counting pass: All 3 columns were identical. STAR runs 3 INDEPENDENT passes — col1 (any strand), col2 (same strand as read), col3 (opposite strand) — each with separate N_noFeature and N_ambiguous.
Too-many-loci bucket: These were going to N_multimapping. STAR puts them in N_unmapped.

Results vs STAR (10k SE yeast):

Metric	STAR	rustar-aligner
N_unmapped	1073	1074 (+1)
N_multimapping	661	661
N_noFeature col1/col2/col3	131/3653/4240	131/3653/4240
N_ambiguous col1	567	566 (-1)
Gene total col1	7568	7568
Col1 gene disagreements	—	0
Col2/col3 gene disagreements	—	1 each (boundary edge case)

The ±1 discrepancies (N_unmapped + N_ambiguous) are a single read at a gene overlap boundary — likely a minor coordinate boundary difference.

Files: src/quant/mod.rs (new), src/params.rs, src/junction/mod.rs (pub(crate) gtf), src/lib.rs

Tests: 274/274 (added 6 new quant unit tests), 0 clippy warnings.

Phase 17.A: scoreSeedBest Pre-Extension ✅ (2026-04-16)

Goal: Match STAR's ReadAlign_stitchWindowSeeds.cpp — pre-extend each seed left+right before the recursive DP and store the result as pre_ext_score on each WindowAlignment entry.

What STAR does: Before stitchWindowAligns, STAR computes scoreSeedBest[iS] for every seed in the window via a two-level DP: (1) base case: length + left_ext, (2) chain case: stitchAlignToTranscript(iS2→iS1) + scoreSeedBest[iS2]. Then adds right_ext universally. Used for seed ordering in the recursive aligner (start from highest-scoring seed).

Implementation:

src/align/stitch.rs — WindowAlignment struct: added pub pre_ext_score: i32 field. All construction sites updated (pre_ext_score: length as i32 default).
src/align/score.rs — AlignmentScorer: added pub out_filter_score_min_over_lread: f64. All constructor paths updated.
src/chimeric/detect.rs — WindowAlignment construction updated.
src/align/stitch.rs — stitch_seeds_core: inserted pre-extension block after seed dedup/sort, before stitch_recurse:
- EXTEND_ORDER respected: left-first for forward clusters (!stitch_is_reverse), right-first for reverse clusters (matching stitch_recurse base case)
- right_len_prev = wa.length + first_ext.extend_len (mirrors base case's len_after_first)
- Chain DP: dp[i] = max(dp[i], dp[j] + wa_entries[i].pre_ext_score) with colinearity check
- No hard pre-filter gate: STAR uses scoreSeedBest for ordering only, not window rejection

Key finding during implementation: A pre-filter gate at full outFilterScoreMinOverLread * (Lread-1) threshold caused 42 false rejections — reads with only short seeds (9-16bp) in low-quality windows, where the full WT extension (starting from leftmost seed) can reach the threshold even though no individual seed's pre-extension does. STAR does NOT apply this gate; scoreSeedBest is used for seed ordering in stitchWindowAligns only.

Result: 268/268 tests, 0 warnings, 8796/8926 SE (baseline maintained), 8390/8390 PE (baseline maintained). pre_ext_score ready for Phase 17.B seed ordering.

Phase 17.B: Per-Mate Seeding ✅ (2026-04-17)

What this fixes: .18919121 (was STAR-only) — adapter-RC at start of rc_read1 caused a 15bp Nstart shift in the combined read's mate1 seed position, triggering reverse-cluster rejection. Per-mate seeding finds mate1 seeds from mate1_seq directly, avoiding the adapter-RC contamination.

Root cause of original failures (combined-read approach):

.18919121: Nstart positions 21, 63, 106 in the 301bp combined-read fell within rc_mate2 = RC(adapter-contaminated mate2). The adapter RC at stitch_read[155:171] caused a 15bp seed shift for mate1, firing the reverse-cluster reject condition.
.6302610: In the forward cluster, rc_mate2 seeds at sa_pos=126596 (inside mate1's genome range) slipped through fwd_reject because the combined read blurred the mate boundary.

Implementation (src/align/read_align.rs):

Per-mate seeding: Seed::find_seeds(mate1_seq, ...) and Seed::find_seeds(mate2_seq, ...) separately. Each mate seeded with its own Nstart positions (0, 37, 74, 112 for 150bp reads).
Independent clustering: cluster_seeds() called separately for each mate.
Independent stitching: stitch_seeds_with_jdb_debug() per mate-cluster. Reverse clusters receive mate2_seq directly; stitch internally does RC and sets is_reverse=true.
Pairwise matching: try_pair_transcripts() — checks same chr, opposite strands, within win_bin_window_dist() span, combined score gate.
Half-mapped fallback: if no valid pair but one mate individually passes quality threshold, report as HalfMapped.

Removed from stitch.rs: stitch_seeds_working, find_mate_boundary, split_working_transcript, adjust_mate2_coords, adjust_wt_read_coords — no longer needed.

Result: .18919121 now mapped as VIII:452300 15S134M1S + VIII:452301 133M17S (STAR: 16S133M1S + 133M17S). 1bp CIGAR difference is a seed-level tie.

Regressions from per-mate approach (known, to fix later):

15 rDNA inter-copy junction reads missed: Reads spanning the boundary between two adjacent rDNA repeat units (yeast chr XII, ~9.1kb inserts). STAR's combined-read boundary seed at position ~171 uniquely identifies the inter-copy junction. Per-mate seeding generates 55 mate1 × 9 mate2 = ~76 candidate pairs, hitting the TooManyLoci limit (>20). Root fix: apply position-dedup before TooManyLoci check (STAR's actual ordering), or implement targeted cross-boundary rescue.
~366 extra both-mapped pairs: Cross-copy pairings created by combining mate1 and mate2 transcripts from different repeat copies. These inflate NH counts for some multi-mappers.
248 half-mapped pairs: New behavior — reads where one mate individually maps but cannot pair. STAR doesn't output these by default (--outSAMunmapped None).

Test status: 274/274, 0 clippy warnings, SE 8796/8926 maintained.

Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter ✅ (2026-04-17)

Problem: 4 PE MAPQ inflations for rDNA/repeat multi-mappers. rustar-aligner NH=2 vs STAR NH=3 for reads with cross-rDNA-copy pairs (M1@copy1 + M2@copy2, 9037bp gap), causing MAPQ=3 vs STAR's MAPQ=1.

Root cause: Two distinct bugs:

Per-WT absolute threshold too strict (read_align.rs forward/reverse cluster processing):
- rustar-aligner used if adjusted_score < combined_score_threshold { continue; } (hard cutoff at outFilterScoreMinOverLread * (Lread-1))
- STAR's stitchWindowAligns.cpp:324 SCORE-GATE uses a RELATIVE criterion: Score + outFilterMultimapScoreRange >= wTr[0]->maxScore (within scoreRange=1 of window best)
- For cross-copy pairs: same-copy score=198 (g_span=100bp, penalty=-2), cross-copy score=197 (g_span=9237bp, penalty=-3). rustar-aligner rejected cross-copy (197 < 198); STAR accepted it (197+1 ≥ 198)
filter_paired_transcripts applied absolute threshold per-pair (not just to best):
- rustar-aligner checked every pair's combined_wt_score < absolute_threshold → removed cross-copy (197 < 198)
- STAR's ReadAlign_mappedFilter.cpp checks only trBest->maxScore >= threshold — if the best passes, ALL pairs in the score window are kept

Fix:

src/align/read_align.rs — both forward and reverse cluster processing (lines 750, 972):

// Old:
if adjusted_score < combined_score_threshold { continue; }
// New:
if adjusted_score + params.out_filter_multimap_score_range < combined_score_threshold { continue; }

src/align/read_align.rs — filter_paired_transcripts (line 1373):
- Changed from per-pair retain to best-pair quality check
- Find best pair (max combined_wt_score); if best fails any threshold → clear all (read unmapped)
- If best passes → keep all pairs (they already passed multMapSelect relative criterion)

Verification: STAR debug trace on .19790508 confirmed Score=197 cross-copy pair is INSERTED (TR-INSERTED) with global_pass=1 because scoreRange=1 (outFilterMultimapScoreRange). STAR's mappedFilter only checks trBest->maxScore=198 >= 198 — passes.

Result: 268/268 tests, 0 warnings, 8796/8926 SE (maintained), 8390/8390 PE (maintained), 0 MAPQ inflations (was 4), 0 MAPQ deflations, faithfulness 98.915% (was 98.903%).

Phase 17.D: PE Combined-Span Penalty + Dedup Ordering ✅ (2026-04-17)

Problem 1: try_pair_transcripts used combined_wt_score = t1.score + t2.score, double-applying the genomic-length penalty (each mate's finalized score already includes its own span penalty). This inflated the combined score relative to STAR's single-span formula, causing AS tag disagreements (was 99.6% of PE reads).

Fix: STAR computes ONE genomic-length penalty over the full PE span. Per-mate approach must undo per-mate penalties and apply the combined penalty:

combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p

where p1 = genomic_length_penalty(t1_span), p2 = genomic_length_penalty(t2_span), combined_p = genomic_length_penalty(right.genome_end - left.genome_start). try_pair_transcripts now takes scorer: &AlignmentScorer to call scorer.genomic_length_penalty().

Problem 2: Decision tree ordering was score-range → dedup → TooManyLoci (wrong). STAR's ordering is multMapSelect → dedup → TooManyLoci. Also, dedup was running after score-range filter, meaning some duplicate pairs at the same position could escape the score-range window.

Fix (src/align/read_align.rs): Reordered to: (1) position dedup, (2) score-range filter (multMapSelect), (3) TooManyLoci check, (4) sort by score, (5) quality filter (mappedFilter).

Result: 278/278 tests, 0 warnings. 248 → 236 half-mapped (12 pairs fixed by correct ordering). PE diff-AS dropped from 99.6% → 3.1%. PE both-mapped: 8767 (STAR: 8390). SE 8796/8926 maintained.

Investigation note: Attempted quality-filter fallback (retry with pre-score-range pool when score-range winner fails quality) to recover additional half-mapped reads. The specific root cause of remaining 236 half-mapped: STAR's combined-read DP finds correct mate2 alignment with 0 mismatches; rustar-aligner's per-mate DP finds a different alignment at the same position with 8 mismatches (combined_nm=14 > outFilterMismatchNmax=10). The fallback recovers 35 pairs but introduces ~100 position regressions (pairs at wrong positions passing individual quality checks). Reverted.

Phase 17.3: Paired-End Chimeric Detection ✅ (2026-05-01)

Goal: Detect chimeric alignments for paired-end reads.

Implementation:

PairedAlignResult type alias in read_align.rs resolves clippy::type_complexity for align_paired_read's 4-tuple return.
Two detection paths:
1. Intra-mate: When a mate has ≥2 clusters, split by wa.mate_id + adjust mate2 read_pos -= len1+1 → ChimericDetector::detect_from_multi_clusters.
2. Inter-mate: detect_inter_mate_chimeric on best single-mate transcripts (before half-mapped fallback). Detects diff-chr, same-strand, or >1Mb discordant pairs.
detect_inter_mate_chimeric re-exported from chimeric/mod.rs.
No benchmark regression (8390 both-mapped, 0 half-mapped).

Files: src/align/read_align.rs, src/chimeric/detect.rs, src/chimeric/mod.rs

Phase 17.11: `--chimOutType WithinBAM` ✅ (2026-05-01)

Goal: Write chimeric alignments as supplementary SAM/BAM records (FLAG 0x800) in the primary output file.

Implementation:

build_within_bam_records(alignment, genome, mapq) in chimeric/output.rs: returns 2 RecordBufs.
- Donor: normal FLAGS + full SEQ (RC'd if is_reverse) + SA tag.
- Acceptor: FLAG 0x800 supplementary + empty SEQ + SA tag.
- SA tag format: chr,pos,strand,CIGAR,mapQ,NM; (pos = 1-based per-chromosome).
chim_out_junctions() / chim_out_within_bam() helpers in params.rs.
Junction file creation gated on chim_out_junctions() (allows WithinBAM-only mode).
All 4 lib.rs write paths (SE normal, SE bysj, PE normal, PE bysj) emit WithinBAM records.
Supports --chimOutType Junctions WithinBAM simultaneously.
convert_cigar changed to pub(crate) in sam.rs for use by output.rs.

Files: src/chimeric/output.rs, src/chimeric/mod.rs, src/params.rs, src/io/sam.rs, src/lib.rs

Phase 17.7: GTF Tag Parameters ✅ (2026-05-01)

Goal: Support non-standard GTF files via STAR's four GTF configuration parameters.

Parameters added (src/params.rs):

Parameter	Default	Purpose
`--sjdbGTFchrPrefix`	`""`	Prefix to add to GTF chromosome names (e.g. `"chr"` when GTF uses bare `1,2,3`)
`--sjdbGTFfeatureExon`	`"exon"`	Feature column value to use as exons
`--sjdbGTFtagExonParentTranscript`	`"transcript_id"`	GTF attribute for grouping exons into transcripts
`--sjdbGTFtagExonParentGene`	`"gene_id"`	GTF attribute for gene grouping in quantification

Implementation strategy: _configured variant functions carry the params; original functions remain as backward-compatible wrappers (parse_gtf, extract_junctions_from_exons, from_gtf_exons) calling with defaults. Zero test disruption (~50 test call sites unchanged).

parse_gtf_configured(path, feature_exon, chr_prefix) — filters by feature_exon, prepends chr_prefix to seqnames.
extract_junctions_configured(exons, genome, transcript_tag) — groups transcripts by transcript_tag.
GeneAnnotation::from_gtf_exons_configured(exons, genome, gene_tag) — uses gene_tag for quant.
TranscriptomeIndex::from_gtf_exons_configured(exons, genome, transcript_tag, gene_tag).
SpliceJunctionDb::from_gtf_configured(path, genome, feature_exon, chr_prefix, transcript_tag).
QuantContext::build(path, genome, feature_exon, chr_prefix, gene_tag) — updated signature.

All 4 production paths thread params: index/mod.rs (genomeGenerate), index/io.rs (load-time fallback), lib.rs (quantMode GeneCounts), junction/mod.rs (junction DB).

Tests added (3 new in gtf.rs): test_parse_gtf_configured_chr_prefix, test_parse_gtf_configured_custom_feature, test_extract_junctions_configured_custom_transcript_tag.

Files: src/params.rs, src/junction/gtf.rs, src/junction/mod.rs, src/quant/mod.rs, src/quant/transcriptome.rs, src/index/mod.rs, src/index/io.rs, src/lib.rs

Result: 379/379 tests, 0 clippy warnings.

Phase 17.9: `--outBAMcompression` / `--limitBAMsortRAM` ✅ (2026-05-01)

Goal: Control BGZF compression level for BAM output and cap memory usage during coordinate-sorted BAM buffering.

Parameters added (src/params.rs):

Parameter	Default	Purpose
`--outBAMcompression`	`1`	BGZF compression level: -1/0=uncompressed, 1-8=flate2 levels, ≥9=BEST
`--limitBAMsortRAM`	`0`	Max bytes for sorted BAM in-memory buffering; 0=unlimited

Implementation (src/io/bam.rs):

bgzf_compression(level: i32) -> noodles::bgzf::writer::CompressionLevel — maps: ≤0→NONE, ≥9→BEST, 1-8→try_from(u8).
make_bgzf_writer<W: Write>(inner, compression) -> noodles::bgzf::Writer<W> — uses noodles::bgzf::writer::Builder::default().set_compression_level(...).build_from_writer(inner).
SortedBamWriter + SortedBamStdoutWriter: added compression: i32 and limit_bam_sort_ram: u64 fields.
SortedBamWriter::estimated_ram() → records.len() as u64 * 400 (rough 400 bytes/record).
SortedBamWriter::check_ram_limit() → returns Error::Alignment if limit > 0 && estimated > limit.
BamWriter::with_header(header, path, compression) — 3rd arg; callers pass params.out_bam_compression.
All 4 BAM writers use make_bgzf_writer with the configured level.

Tests added (3 new): test_bam_compression_zero, test_bam_sort_ram_unlimited, test_bam_sort_ram_exceeded.

Files: src/params.rs, src/io/bam.rs

Result: 382/382 tests, 0 clippy warnings.

Phase 17.2: Coordinate-Sorted BAM ✅ (2026-04-29)

SortedBamWriter in src/io/bam.rs buffers all RecordBuf records, sorts by (reference_sequence_id, alignment_start) on finish(), writes one BAM with SO:coordinate in @HD. Output filename: Aligned.sortedByCoord.out.bam. Verified with samtools quickcheck + samtools index.

PE chimericDetectionOld ✅ (2026-05-01)

Goal: Run Tier 1 chimeric detection (detect_chimeric_old) per-mate for paired-end reads, mirroring STAR's post-stitching chimeric search for each fragment.

Implementation (src/align/read_align.rs):

Added all_m1_transcripts: Vec<Transcript> and all_m2_transcripts: Vec<Transcript> before the cluster loop.
In the Some((m1_wt, m2_wt)) joint-pair branch: push clones when chim_segment_min > 0.
In single-mate all_m1/all_m2 branches: push before single_mate_transcripts.push(t).
After filter_paired_transcripts, before BothMapped early return: runs detect_chimeric_old on each mate's pool (using best transcript as tr_best), filters by chim_segment_min + chim_score_min, extends pe_chimeric.

Phase 17.12: BySJout Disk Buffering ✅ (2026-05-01)

Goal: Replace in-memory Vec<AlignmentBatchResults> with temp-file + compact metadata to support 100M+ read datasets without requiring ~60GB RAM for BySJout mode.

Implementation:

BySJReadMeta struct: n_sam_records: u32, junction_keys, chimeric_alns, transcriptome_records — only per-read metadata stays in RAM.
SAM records go to tempfile::NamedTempFile via bysj_write_records (noodles SAM writer).
Filter phase: drop(bysj_temp_writer) flushes BufWriter, reopen via tf.reopen(), create SAM reader, iterate bysj_meta — pass reads via bysj_read_n_records(..., true), skip filtered reads via bysj_read_n_records(..., false).
tempfile moved from [dev-dependencies] to [dependencies] in Cargo.toml.
Helper functions in src/io/sam.rs: create_bysj_writer, bysj_write_records, bysj_read_n_records.
Applied to both SE (align_reads_single_end) and PE (align_reads_paired_end).

Phase 17.13: Integration Tests ✅ (2026-05-01)

Goal: 8 end-to-end integration tests covering Phase 17 features using a synthetic 20kb genome with planted splice structure.

Genome design (tests/alignment_features.rs):

20kb pseudo-random genome with planted GT-AG intron (Exon1@10000-10049, GT intron@10050-10249, Exon2@10250-10299).
Uses --genomeSAindexNbases 7 (satisfies 2^14 ≤ 20000).
GTF: two exon records for G1/T1 transcript.

Tests:

test_bam_unsorted_output — --outSAMtype BAM Unsorted produces valid BAM
test_bam_sorted_output — --outSAMtype BAM SortedByCoordinate produces sorted BAM
test_paired_end_alignment — PE reads mapped, NH=1 each
test_spliced_alignment — reads spanning planted intron get expected CIGAR (25M200N25M)
test_bysj_filtering — BySJout mode passes spliced reads, filters unspliced
test_gene_counts_output — ReadsPerGene.out.tab written with correct columns
test_unmapped_reads_output — --outReadsUnmapped Fastx writes Unmapped.out.mate1
test_two_pass_mode — two-pass alignment runs without error

Phase 12.2: SE Chimeric Tier 1b — Soft-Clip Re-mapping ✅ (2026-05-04)

Goal: When detect_chimeric_old finds no chimeric partner in the existing transcript pool, re-seed the primary alignment's soft-clipped bases to find a chimeric partner de novo.

Implementation (src/chimeric/detect.rs):

ChimericDetector::detect_from_soft_clips — extracts right/left soft-clip sub-sequence from read_seq, runs Seed::find_seeds → cluster_seeds → stitch_seeds_with_jdb, adjusts exon read positions for right clips (adds clip_start offset), then applies the same score/overhang/geometry filters as detect_chimeric_old.
adjust_read_positions(tr, offset) helper — shifts all exon read_start/read_end by offset for sub-sequence stitching results.
Called as Step 3c in src/align/read_align.rs, only when chimeric_alignments.is_empty() after Step 3b.

Result: 396/396 tests, 0 clippy warnings.

Phase 17.10: Chimeric Tier 3 — Residual Re-mapping ✅ (2026-05-04)

Goal: After any chimeric pair is found (by Tier 1 or 2), re-seed the outer read regions not covered by either segment to detect multi-junction gene fusions (3-way chimeras).

Implementation (src/chimeric/detect.rs):

ChimericDetector::detect_from_chimeric_residuals — computes left_covered = min(donor.read_start, acceptor.read_start) and right_covered = max(donor.read_end, acceptor.read_end); for each outer span >= chimSegmentMin, re-seeds the uncovered sub-sequence and pairs the result with the adjacent chimeric segment using the same score/overhang/geometry filters.
Called as Step 3d in src/align/read_align.rs on each element of chimeric_alignments after Steps 3b and 3c; results are appended to chimeric_alignments.

Chimeric pipeline is now 4-tier: Tier 1 (transcript-pair search) → Tier 2 (multi-cluster) → Tier 1b (primary soft-clip re-seed) → Tier 3 (residual outer re-seed).

Result: 396/396 tests, 0 clippy warnings.

PE AS Diff Investigation — Root Causes (2026-05-01)

Starting state: 6 PE AS diffs → 4 after Phase G3 SA fix.

All 4 remaining diffs are rustar-aligner improvements, not bugs.

`.844151` (2 mates, AS diff = +12)

rustar-aligner: VIII:451791 146M4S/3S146M1S MAPQ=255 AS=290 nM=0 (perfect match)
STAR: VII:1001391 146M4S/3S146M1S MAPQ=255 AS=278 nM=6 (3 mismatches per mate)

STAR's window-based seeding finds WINDOW[0] at Chr VIII (1 seed per mate: MMP len=146, unique) and WINDOW[1] at Chr VII (2 seeds per mate: shorter seeds from RC direction that also match the no-mismatch region pos 111-149). STAR only forms a combined PE pair score in windows with multiple seeds per mate — WINDOW[0] gets individual mate scores (144/145), WINDOW[1] gets combined pair score=278. Since 278 > max(144,145), STAR picks Chr VII.

rustar-aligner's per-mate approach independently finds each mate at VIII (perfect 0mm), try_pair_transcripts combines → AS=290 > 278 → picks the objectively better VIII alignment.

`.4972950` (2 mates, AS diff = +12)

rustar-aligner mate2: X:120783 1S33M72N50M186N65M1S — 148 bases matched, 2 canonical introns, AS=260
STAR mate2: X:120953 27S122M1S — 122 bases matched, 10 combined mismatches, AS=248

rustar-aligner's stitching finds the proper spliced alignment (GT-AG junctions, more bases covered). STAR settles for a lower-scoring unspliced alignment with 27bp soft-clipping. rustar-aligner wins by 12 AS points and 26 more matched bases.

Conclusion: Fixing these would require replicating STAR's suboptimal window-stitching failure at VIII, and degrading rustar-aligner's correct splice detection. Accept as known improvements.

Phase 14: STARsolo (Single-Cell) — DEFERRED

Prerequisite: All accuracy gaps resolved, position agreement >99%. (Current: 99.92% parity excluding unavoidable ties)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 17: Features + Polish

Sub-phase Status

Phase 17.1: Log.final.out ✅

Phase 17.5: Clippy Cleanup ✅

Phase 17.8: `--quantMode GeneCounts` ✅ (2026-04-17)

Phase 17.A: scoreSeedBest Pre-Extension ✅ (2026-04-16)

Phase 17.B: Per-Mate Seeding ✅ (2026-04-17)

Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter ✅ (2026-04-17)

Phase 17.D: PE Combined-Span Penalty + Dedup Ordering ✅ (2026-04-17)

Phase 17.3: Paired-End Chimeric Detection ✅ (2026-05-01)

Phase 17.11: `--chimOutType WithinBAM` ✅ (2026-05-01)

Phase 17.7: GTF Tag Parameters ✅ (2026-05-01)

Phase 17.9: `--outBAMcompression` / `--limitBAMsortRAM` ✅ (2026-05-01)

Phase 17.2: Coordinate-Sorted BAM ✅ (2026-04-29)

PE chimericDetectionOld ✅ (2026-05-01)

Phase 17.12: BySJout Disk Buffering ✅ (2026-05-01)

Phase 17.13: Integration Tests ✅ (2026-05-01)

Phase 12.2: SE Chimeric Tier 1b — Soft-Clip Re-mapping ✅ (2026-05-04)

Phase 17.10: Chimeric Tier 3 — Residual Re-mapping ✅ (2026-05-04)

PE AS Diff Investigation — Root Causes (2026-05-01)

`.844151` (2 mates, AS diff = +12)

`.4972950` (2 mates, AS diff = +12)

Phase 14: STARsolo (Single-Cell) — DEFERRED

FilesExpand file tree

phase17_features.md

Latest commit

History

phase17_features.md

File metadata and controls

Phase 17: Features + Polish

Sub-phase Status

Phase 17.1: Log.final.out ✅

Phase 17.5: Clippy Cleanup ✅

Phase 17.8: --quantMode GeneCounts ✅ (2026-04-17)

Phase 17.A: scoreSeedBest Pre-Extension ✅ (2026-04-16)

Phase 17.B: Per-Mate Seeding ✅ (2026-04-17)

Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter ✅ (2026-04-17)

Phase 17.D: PE Combined-Span Penalty + Dedup Ordering ✅ (2026-04-17)

Phase 17.3: Paired-End Chimeric Detection ✅ (2026-05-01)

Phase 17.11: --chimOutType WithinBAM ✅ (2026-05-01)

Phase 17.7: GTF Tag Parameters ✅ (2026-05-01)

Phase 17.9: --outBAMcompression / --limitBAMsortRAM ✅ (2026-05-01)

Phase 17.2: Coordinate-Sorted BAM ✅ (2026-04-29)

PE chimericDetectionOld ✅ (2026-05-01)

Phase 17.12: BySJout Disk Buffering ✅ (2026-05-01)

Phase 17.13: Integration Tests ✅ (2026-05-01)

Phase 12.2: SE Chimeric Tier 1b — Soft-Clip Re-mapping ✅ (2026-05-04)

Phase 17.10: Chimeric Tier 3 — Residual Re-mapping ✅ (2026-05-04)

PE AS Diff Investigation — Root Causes (2026-05-01)

.844151 (2 mates, AS diff = +12)

.4972950 (2 mates, AS diff = +12)

Phase 14: STARsolo (Single-Cell) — DEFERRED

Phase 17.8: `--quantMode GeneCounts` ✅ (2026-04-17)

Phase 17.11: `--chimOutType WithinBAM` ✅ (2026-05-01)

Phase 17.9: `--outBAMcompression` / `--limitBAMsortRAM` ✅ (2026-05-01)

`.844151` (2 mates, AS diff = +12)

`.4972950` (2 mates, AS diff = +12)