Commit 4908cd1

rob-p and claude committed
docs(2.1.0): decoy-orphan SA fix, log-space eq weights, dovetail counter, reproducibility
Document this round's quant-correctness work: the concordant-decoy-pair orphan suppression fix + functional --allowDecoyOrphans in SA mode; log-space eq-class weight normalization (mapped-mass loss 190 -> 0.1 fragments); the now-wired num_dovetail_fragments counter; and the run-to-run reproducibility comparison (-p1 byte-identical; -p16 wobble ~0.26%, smaller than C++'s; full parallel determinism deferred as an improvement-over-C++ follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01B7JMur5DmDpECddErpi2JS
1 parent 89f6c34 commit 4908cd1

1 file changed

Lines changed: 57 additions & 0 deletions

docs/release-notes-2.1.0.md

@@ -292,6 +292,63 @@ simulated references (same clipped FASTA, matched thresholds) the per-read
target-set residual is 147 Rust-superset / 50 C++-superset, unchanged by the cap
fix now that both cap on the aligned-mapping count.

## Decoy-orphan handling in selective alignment (`--allowDecoyOrphans`)

Two related fixes to how a fragment that pairs concordantly on the genome decoy
but only orphans onto a transcript (one mate exonic, the other intronic/genomic;
very common with a decoy-aware index) is handled:

- **A concordant decoy pair no longer suppresses a transcript orphan.** The
orphan-fallback rule discarded *all* orphans whenever any concordant pair
existed, including a concordant pair to a *decoy*. That destroyed the
transcript orphan, leaving only the decoy pair, so the fragment was dropped as
decoy-dominated with no surviving non-decoy mapping, and `--allowDecoyOrphans`
could not even rescue it (it only acts when a transcript mapping survives). Now
only a concordant *transcript* pair suppresses orphans; a decoy pair leaves the
transcript orphan for the decoy-domination logic to adjudicate.
- **`--allowDecoyOrphans` now works as intended.** With the above fixed, the flag
recovers the transcript orphan when the other mate maps to the genome decoy
(default still drops it). On SRR1039508 (full) this raises the `--allowDecoyOrphans`
rate 93.92 % → 94.81 %; the **default rate is unchanged** (byte-identical), and
the recovered fragments match what C++ keeps as orphans. This is the
selective-alignment mirror of the sketch-mode decoy-orphan rescue above.
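
The change to the orphan-fallback rule can be illustrated with a small sketch. This is not the actual implementation: `Kind`, `Mapping`, `suppress_orphans`, and `suppress_orphans_old` are hypothetical names invented for illustration, and the real mapping records carry far more state.

```rust
/// Hypothetical, simplified kind of candidate mapping for one fragment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    ConcordantPair,
    Orphan,
}

/// Hypothetical, simplified mapping record (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Mapping {
    kind: Kind,
    is_decoy: bool,
}

/// Rule as described in the notes: orphans are discarded only when a
/// concordant pair to a transcript (non-decoy) target exists.
fn suppress_orphans(mappings: &[Mapping]) -> Vec<Mapping> {
    let has_transcript_pair = mappings
        .iter()
        .any(|m| m.kind == Kind::ConcordantPair && !m.is_decoy);
    mappings
        .iter()
        .copied()
        .filter(|m| !(has_transcript_pair && m.kind == Kind::Orphan))
        .collect()
}

/// Previous behaviour, for comparison: any concordant pair, decoy or not,
/// discarded every orphan.
fn suppress_orphans_old(mappings: &[Mapping]) -> Vec<Mapping> {
    let has_any_pair = mappings.iter().any(|m| m.kind == Kind::ConcordantPair);
    mappings
        .iter()
        .copied()
        .filter(|m| !(has_any_pair && m.kind == Kind::Orphan))
        .collect()
}
```

For a fragment with one concordant decoy pair and one transcript orphan, `suppress_orphans_old` leaves only the decoy pair, while `suppress_orphans` keeps both mappings so that later decoy-domination logic (not modelled here) can decide between them.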

## Equivalence-class weights: log-space normalization (no lost mapped mass)

Per-fragment equivalence-class weights are now normalized in **log** space
(`exp(auxProb − auxDenom)`, as C++ salmon does) rather than linearly (`w/Σw`). The
linear form, guarded by `Σw > 0`, silently produced all-zero weights for a
fragment whose implied lengths all have ~0 fragment-length-distribution
probability (every `w·exp(logFragProb)` underflows to 0); the VBEM then dropped
that equivalence class's count, **losing mapped mass**. The log-space form is
mathematically identical for the normal case (per-class scaling is EM-invariant)
but stays well-defined under total underflow. On SRR1039508 (full) the
mapped-mass loss (sum of `quant.sf` `NumReads` vs `num_mapped`) drops from **190
fragments to 0.1**, matching C++.
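
A minimal sketch of the two normalizations, assuming each mapping's weight is already available as a single log-probability (the fragment-length term folded in). The function names are hypothetical, and `aux_denom` is computed here with a standard log-sum-exp; the actual code may organize this differently.

```rust
/// Linear form `w / Σw`, guarded by `Σw > 0`. If every `exp` underflows to
/// zero, the guard fails and all weights come back as zero.
fn normalize_linear(log_weights: &[f64]) -> Vec<f64> {
    let w: Vec<f64> = log_weights.iter().map(|lw| lw.exp()).collect();
    let sum: f64 = w.iter().sum();
    if sum > 0.0 {
        w.iter().map(|x| x / sum).collect()
    } else {
        vec![0.0; w.len()]
    }
}

/// Log-space form `exp(aux_prob - aux_denom)`, where `aux_denom` is the
/// log-sum-exp of the per-mapping log weights.
fn normalize_log(log_weights: &[f64]) -> Vec<f64> {
    let max = log_weights
        .iter()
        .copied()
        .fold(f64::NEG_INFINITY, f64::max);
    if max == f64::NEG_INFINITY {
        // Empty input, or every mapping has probability exactly zero.
        return vec![0.0; log_weights.len()];
    }
    // Shift by the maximum so at least one term is exp(0) = 1.
    let shifted_sum: f64 = log_weights.iter().map(|lw| (lw - max).exp()).sum();
    let aux_denom = max + shifted_sum.ln();
    log_weights.iter().map(|lw| (lw - aux_denom).exp()).collect()
}
```

With log weights such as `[-800.0, -801.0]`, `exp` underflows to zero for both entries in `f64`, so `normalize_linear` returns all zeros while `normalize_log` still returns weights that sum to 1. For inputs that do not underflow, the two functions agree up to rounding.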

## `num_dovetail_fragments` is now reported

Rust always dropped dovetailed concordant pairs under the default no-dovetail
policy (matching C++) but reported `num_dovetail_fragments = 0`, because it only
inspected surviving pairs (never dovetailed after filtering). The counter is now
wired to the pairing stage and reports fragments whose only concordant pairing was
a dovetail. Diagnostic only: no change to mapping or quantification.
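
The reason the counter stayed at zero can be shown with a small sketch. `PairCandidate`, `drop_dovetails`, and `count_dovetail_fragments` are hypothetical names, and the sketch reads "only concordant pairing was a dovetail" as "the fragment had at least one concordant candidate and all of them were dovetails", which is an interpretation rather than something the notes spell out.

```rust
/// Hypothetical record of one concordant pairing candidate for a fragment.
#[derive(Debug, Clone, Copy)]
struct PairCandidate {
    is_dovetail: bool,
}

/// Default no-dovetail policy: dovetailed pairs are removed.
fn drop_dovetails(candidates: &[PairCandidate]) -> Vec<PairCandidate> {
    candidates.iter().copied().filter(|p| !p.is_dovetail).collect()
}

/// Count fragments that had at least one concordant candidate, all of which
/// were dovetails. To be meaningful this must see the candidates as they
/// exist at the pairing stage, before `drop_dovetails` runs.
fn count_dovetail_fragments(fragments: &[Vec<PairCandidate>]) -> usize {
    fragments
        .iter()
        .filter(|c| !c.is_empty() && c.iter().all(|p| p.is_dovetail))
        .count()
}
```

Calling `count_dovetail_fragments` on the output of `drop_dovetails` always yields 0, because no dovetailed candidate survives the filter; calling it on the unfiltered candidates gives the diagnostic count.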

## Run-to-run reproducibility

Quantification is **byte-identical run-to-run when single-threaded** (`-p 1`); the
duplicate-transcript symmetry fix (per-fragment FLD snapshot + uniform init) means
exact-duplicate groups converge deterministically. Multi-threaded runs have a
small residual wobble (~0.26 % of assigned reads on SRR1039508, nonzero-transcript
set unchanged) from the **stochastic FLD training** under nondeterministic
fragment→thread scheduling, the same mechanism present in C++ salmon. Measured
head-to-head (`-p 16`, two runs each), Rust is in fact **more reproducible than
C++**: about half the average and total per-transcript variation, and far fewer
transcripts shifting by >2 % (Rust 2.2 % of expressed transcripts vs C++ 5.1 %).
Full parallel determinism (per-fragment-seeded FLD acceptance + order-independent
accumulation) is tracked as a follow-up; it is an improvement *over* C++, not a
correctness gap.
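
One way to compute a "transcripts shifting by more than 2 %" figure from two runs is sketched below. The notes do not state the exact definition used, so everything here is an assumption: a transcript counts as expressed if either run assigns it a positive `NumReads`, the shift is the absolute difference divided by the larger of the two values, and `fraction_shifted` is a hypothetical helper name.

```rust
/// Fraction of expressed transcripts whose per-transcript read counts differ
/// by more than `threshold` (relative) between two runs.
/// Assumed definitions: "expressed" means a positive count in either run;
/// the relative shift is |a - b| / max(a, b).
fn fraction_shifted(run_a: &[f64], run_b: &[f64], threshold: f64) -> f64 {
    assert_eq!(run_a.len(), run_b.len(), "runs must cover the same transcripts");
    let mut expressed = 0usize;
    let mut shifted = 0usize;
    for (&a, &b) in run_a.iter().zip(run_b) {
        let larger = a.max(b);
        if larger <= 0.0 {
            continue; // not expressed in either run
        }
        expressed += 1;
        if (a - b).abs() / larger > threshold {
            shifted += 1;
        }
    }
    if expressed == 0 {
        0.0
    } else {
        shifted as f64 / expressed as f64
    }
}
```

Called with the `NumReads` columns of two `quant.sf` files from repeated `-p 16` runs (in the same transcript order) and `threshold = 0.02`, this returns a fraction comparable in spirit to the percentages quoted above, subject to the assumed definitions.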

## Related C++ fixes (salmon 1.12.1)

These were also applied to the final C++ line so the two implementations agree:
