Skip to content

Commit 0b822a6

Browse files
rob-pclaude
andcommitted
fix(quant): include fragment-length probability in equivalence-class weights
The selective-alignment equivalence-class weights were built from the bare coverage weight, omitting the fragment-length-distribution (FLD) term that salmon folds into its per-fragment auxProb (logFragProb + logFragCov + logAlignCompatProb; SalmonQuantify.cpp). The missing term flattened the conditional weights for paralogs/isoforms whose implied insert size differs, coarsening the range-factorized eq-classes and slightly degrading multimapping resolution. Two changes in processor.rs: - Fold fld.pmf(fragment_len) into the eq-class weight (gated on pre_burnin to match salmon's numPreAuxModelSamples; start-position term intentionally excluded since effective length is applied separately in update_eff_lengths). - Train the FLD weighted by each fragment's best-pair posterior confidence, a deterministic analog of salmon's probability-weighted stochastic training, so ambiguous multimappers no longer overdisperse the FLD on paralog-rich data. Validation (vs C++ salmon 1.12.0): - Polyester ground-truth (human, 193,760 txps): closes 85-97% of the per-txp Spearman gap on easy/hard sims x VBEM/useEM. - Real-data 36M GEUVADIS C++ parity: NumReads Pearson 0.9986 -> 0.9995, Spearman 0.9645 -> 0.9706. - No mapping/perf regression (mapping rate identical; wall within noise). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 9164df2 commit 0b822a6

1 file changed

Lines changed: 53 additions & 12 deletions

File tree

crates/salmon-quant/src/processor.rs

Lines changed: 53 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -397,21 +397,62 @@ fn record(
397397
}
398398
}
399399

400-
let mut pairs: Vec<(u32, f64)> = compat.iter().map(|(m, w)| (m.tid, *w)).collect();
401-
402-
// Observe a fragment length from the best concordant compatible pair.
403-
let best_concordant = maps
400+
// Fold the fragment-length probability into the equivalence-class weight.
401+
// salmon's eq-class auxProb is `logFragProb + logFragCov + logAlignCompatProb`
402+
// (SalmonQuantify.cpp) -- i.e. it includes the FLD term but NOT the
403+
// start-position term (that enters only the abundance/`aln.logProb`; the
404+
// effective-length factor is applied separately in `update_eff_lengths`).
405+
// The port previously built the eq-class weight from the bare coverage
406+
// weight, flattening it for paralogs/isoforms whose implied insert size
407+
// differs and coarsening the range-factorized classes. The FLD term is only
408+
// applied once the auxiliary model is trained (after `pre_burnin` fragments),
409+
// matching salmon's `numPreAuxModelSamples` gating.
410+
let use_aux = sh.num_processed.load(Ordering::Relaxed) >= sh.pre_burnin;
411+
let mut pairs: Vec<(u32, f64)> = compat
404412
.iter()
405-
.filter(|m| {
406-
m.status == MateStatus::PairedEndPaired
407-
&& m.fragment_len > 0
408-
&& sh
409-
.expected_format
410-
.is_none_or(|exp| is_compatible(exp, m.format, m.is_fw, m.status))
413+
.map(|(m, w)| {
414+
let log_frag_prob = if m.status == MateStatus::PairedEndPaired && m.fragment_len > 0 {
415+
if use_aux {
416+
sh.fld.pmf(m.fragment_len as usize)
417+
} else {
418+
0.0
419+
}
420+
} else {
421+
0.0
422+
};
423+
(m.tid, *w * log_frag_prob.exp())
411424
})
412-
.max_by(|a, b| a.weight.total_cmp(&b.weight));
425+
.collect();
426+
427+
// Observe a fragment length from the best concordant compatible pair,
428+
// weighted by that pair's posterior confidence among the concordant pairs.
429+
// salmon trains its FLD stochastically (it accepts an alignment with
430+
// probability proportional to exp(aln.logProb)), so ambiguous multimappers
431+
// contribute little; adding every best pair at full weight overdisperses the
432+
// FLD on paralog/near-duplicate-rich inputs. Down-weighting by confidence is
433+
// the deterministic analog.
434+
let mut conc_sum = 0.0_f64;
435+
let mut best_concordant: Option<&ScoredMapping> = None;
436+
for m in maps.iter() {
437+
if m.status == MateStatus::PairedEndPaired
438+
&& m.fragment_len > 0
439+
&& sh
440+
.expected_format
441+
.is_none_or(|exp| is_compatible(exp, m.format, m.is_fw, m.status))
442+
{
443+
conc_sum += m.weight;
444+
if best_concordant.is_none_or(|b| m.weight > b.weight) {
445+
best_concordant = Some(m);
446+
}
447+
}
448+
}
413449
if let Some(best) = best_concordant {
414-
sh.fld.add_val(best.fragment_len as usize, 0.0);
450+
let conf = if conc_sum > 0.0 {
451+
best.weight / conc_sum
452+
} else {
453+
1.0
454+
};
455+
sh.fld.add_val(best.fragment_len as usize, conf.ln());
415456
}
416457

417458
// Build the equivalence class: sorted, de-duplicated transcript ids + weights.

0 commit comments

Comments
 (0)