feat(quant): abundance-aware fragment-length-distribution training
Make the online abundance estimate run unconditionally (not only when bias
correction is on) and use its per-fragment posterior to train the FLD by
stochastic acceptance, mirroring salmon's `if (r < exp(aln.logProb))
fragLengthDist.addVal(...)` where aln.logProb includes transcriptLogCount. For
reads shared between near-duplicate transcripts this preferentially samples the
dominant transcript's implied length, concentrating the FLD the way C++ salmon
does (the deterministic confidence-weighting in the previous commit narrowed the
FLD but did not match C++ salmon's).
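As a rough illustration of the stochastic-acceptance idea above, here is a minimal Python sketch. It is not this repository's code or salmon's: `FragLengthDist`, `add_val`, and `train_fld` are hypothetical names, and it assumes each alignment carries a per-fragment-normalised log posterior (`log_prob <= 0`) that already includes the abundance term.

```python
import math
import random


class FragLengthDist:
    """Toy histogram FLD (hypothetical; stands in for the real distribution)."""

    def __init__(self, max_len=1000):
        self.counts = [0] * (max_len + 1)

    def add_val(self, frag_len):
        # Ignore lengths outside the histogram range.
        if 0 <= frag_len < len(self.counts):
            self.counts[frag_len] += 1


def train_fld(fragments, fld, rng):
    """Train the FLD by stochastic acceptance.

    fragments: iterable of alignment lists; each alignment is a
               (log_prob, frag_len) pair, where log_prob is the assumed
               abundance-aware log posterior of that alignment.
    """
    for alignments in fragments:
        for log_prob, frag_len in alignments:
            # Accept with probability exp(log_prob): alignments to the
            # dominant transcript contribute their implied length far
            # more often than alignments to minor near-duplicates.
            if rng.random() < math.exp(log_prob):
                fld.add_val(frag_len)
```

With a read shared between a dominant transcript (posterior 0.9, implied length 200) and a minor one (posterior 0.1, implied length 300), most accepted samples land on 200, which is the concentrating effect described above.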
The offline EM does not read the online estimate, so running it unconditionally
affects only bias collection (unchanged when bias correction is off) and FLD
training.
Validation vs C++ salmon 1.12.0:
- Polyester ground truth (human, 193,760 txps): per-transcript Spearman gap to
C++ shrinks to 0.0019 (easy/VBEM), 0.0012 (easy/useEM), 0.0002 (hard/VBEM),
0.0004 (hard/useEM) -- at/near the run-to-run noise floor.
- Real-data 36M GEUVADIS C++ parity: NumReads Pearson 0.99955, Spearman 0.97096
(was 0.97052); mapping rate identical.
- Performance: no measurable cost (3-rep matched-load 36M means within noise;
online-update subsampling 1/33 changed neither wall time nor accuracy, so the
per-fragment online work is not a bottleneck).
- Determinism: run-to-run Spearman spread 0.00006 (tighter than the prior
0.00043); the FLD aggregate is stable enough that the per-thread RNG used for
stochastic acceptance adds no meaningful nondeterminism.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>