Skip to content

Commit 2bccea8

Browse files
rob-pclaude
andcommitted
fix(quant): normalize eq-class weights in log space (conserve mapped mass)
The per-fragment eq-class weights were normalized in linear space (w/Σw, guarded by `wsum > 0`). When a fragment's implied lengths all have ~0 FLD probability (logFragProb at the no-mass sentinel), every linear weight `w*exp(logFragProb)` underflows to exactly 0, so Σw == 0, the `wsum > 0` guard leaves the weights all zero, the eq-class denom is 0, and the VBEM silently drops that class's count — losing mapped mass. (The EM's degenerate-class branch is a no-op; C++ salmon's is too, so C++ relies on never producing a zero denom.) Adopt C++'s normalization: compute each mapping's log weight (ln(score) + logFragProb) and subtract the per-fragment log-sum-exp (C++ `exp(auxProb - auxDenom)`). This is mathematically identical to w/Σw for the non-degenerate case (per-class scaling is EM-invariant) but stays well-defined under total underflow, yielding relative weights instead of all-zero — so no class is dropped. On SRR1039508 (full) the mapped-mass loss drops from 190.1 fragments to 0.1 (matching C++); the change vs the prior linear path is within run-to-run wobble (log-Pearson 0.99956, < the 0.99951 run-to-run baseline) and leaves the nonzero transcript count unchanged. Reverts the earlier special-case underflow guard in favor of this general normalization. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01B7JMur5DmDpECddErpi2JS
1 parent 3e2f559 commit 2bccea8

1 file changed

Lines changed: 37 additions & 14 deletions

File tree

crates/salmon-quant/src/processor.rs

Lines changed: 37 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ use paraseq::parallel::{PairedParallelProcessor, ParallelProcessor};
1515
use paraseq::Record;
1616
use piscem_rs::mapping::hit_searcher::{HitSearcher, SkippingStrategy};
1717

18+
use salmon_core::math::{log_add, LOG_0, LOG_1};
1819
use salmon_core::{is_compatible, LibraryFormat, MateStatus};
1920
use salmon_eqclass::{range_factorize_bins, EquivalenceClassBuilder, TranscriptGroup};
2021
use salmon_index::SalmonIndex;
@@ -450,24 +451,46 @@ fn record(
450451
} else {
451452
None
452453
};
453-
let mut pairs: Vec<(u32, f64)> = compat
454+
// Per-mapping FLD log-probability (the un-normalized log-pmf at the implied
455+
// fragment length; `LOG_1` = 0 when the auxiliary model is inactive or the
456+
// mapping is not a proper pair).
457+
let log_fps: Vec<f64> = compat
454458
.iter()
455-
.map(|(m, w)| {
456-
let log_frag_prob =
457-
if use_aux && m.status == MateStatus::PairedEndPaired && m.fragment_len > 0 {
458-
match fld_snap.as_deref() {
459-
Some(snap) if !snap.is_empty() => {
460-
snap[(m.fragment_len as usize).min(snap.len() - 1)]
461-
}
462-
// pre-first-refresh fallback (only the early pre-burn-in batches)
463-
_ => sh.fld.pmf(m.fragment_len as usize),
459+
.map(|(m, _)| {
460+
if use_aux && m.status == MateStatus::PairedEndPaired && m.fragment_len > 0 {
461+
match fld_snap.as_deref() {
462+
Some(snap) if !snap.is_empty() => {
463+
snap[(m.fragment_len as usize).min(snap.len() - 1)]
464464
}
465-
} else {
466-
0.0
467-
};
468-
(m.tid, *w * log_frag_prob.exp())
465+
// pre-first-refresh fallback (only the early pre-burn-in batches)
466+
_ => sh.fld.pmf(m.fragment_len as usize),
467+
}
468+
} else {
469+
LOG_1
470+
}
469471
})
470472
.collect();
473+
// Per-fragment conditional probabilities, normalized in *log* space — the
474+
// log weight of each mapping is `ln(score weight) + logFragProb`, and we
475+
// subtract the log-sum-exp over the fragment's mappings (C++'s
476+
// `exp(auxProb - auxDenom)`). This is the same normalization salmon's
477+
// C++ does and is mathematically identical to the linear `w/Σw`, but doing it
478+
// in log space keeps it well-defined when every FLD weight underflows: a
479+
// fragment whose implied lengths all have ~0 FLD probability (logFragProb at
480+
// the no-mass sentinel) would, in linear space, give all-zero weights — the
481+
// `wsum > 0` normalization below then leaves them zero, the eq-class denom is
482+
// 0, and the VBEM silently drops the fragment's count (lost mapped mass; the
483+
// EM's degenerate-class branch is a no-op, as in C++). In log space the same
484+
// fragment yields well-defined relative weights, so no mass is lost.
485+
let log_denom = compat
486+
.iter()
487+
.zip(&log_fps)
488+
.fold(LOG_0, |acc, ((_, w), &lfp)| log_add(acc, w.ln() + lfp));
489+
let mut pairs: Vec<(u32, f64)> = compat
490+
.iter()
491+
.zip(&log_fps)
492+
.map(|((m, w), &lfp)| (m.tid, (w.ln() + lfp - log_denom).exp()))
493+
.collect();
471494

472495
// Abundance-aware FLD training: accept each concordant compatible pair's
473496
// fragment length with probability = its abundance-aware online posterior

0 commit comments

Comments
 (0)