feat(quant): abundance-aware fragment-length-distribution training #1014
Merged
Make the online abundance estimate run unconditionally (not only when bias correction is on) and use its per-fragment posterior to train the FLD by stochastic acceptance, mirroring salmon's `if (r < exp(aln.logProb)) fragLengthDist.addVal(...)`, where `aln.logProb` includes `transcriptLogCount`. For reads shared between near-duplicate transcripts this preferentially samples the dominant transcript's implied length, concentrating the FLD the way C++ salmon does (the deterministic confidence-weighting in the previous commit narrowed but did not match it).

The offline EM does not read the online estimate, so always enabling it affects only bias collection (unchanged when off) and FLD training.

Validation vs C++ salmon 1.12.0:
- Polyester ground truth (human, 193,760 txps): per-transcript Spearman gap to C++ shrinks to 0.0019 (easy/VBEM), 0.0012 (easy/useEM), 0.0002 (hard/VBEM), 0.0004 (hard/useEM) -- at/near the run-to-run noise floor.
- Real-data 36M GEUVADIS C++ parity: NumReads Pearson 0.99955, Spearman 0.97096 (was 0.97052); mapping rate identical.
- Performance: no measurable cost (3-rep matched-load 36M means within noise; online-update subsampling 1/33 changed neither wall time nor accuracy, so the per-fragment online work is not a bottleneck).
- Determinism: run-to-run Spearman spread 0.00006 (tighter than the prior 0.00043); the FLD aggregate is stable enough that the per-thread RNG used for stochastic acceptance adds no meaningful nondeterminism.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Train the fragment-length distribution with abundance-aware stochastic acceptance, the last step to match C++ salmon's FLD on multimapping-heavy data. Builds on #1013.
The online abundance estimate now runs unconditionally (previously only when bias correction was on), and its per-fragment posterior is used to accept FLD samples, mirroring salmon's `if (r < exp(aln.logProb)) fragLengthDist.addVal(...)`, where `aln.logProb` includes `transcriptLogCount`. For reads shared between near-duplicates this preferentially samples the dominant transcript's implied length, concentrating the FLD (the deterministic confidence-weighting in #1013 narrowed but didn't fully match it). The offline EM does not read the online estimate, so always enabling it affects only bias collection (unchanged when off) and FLD training.
Why this is the default (not a flag)
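Investigated whether the always-on online estimate costs throughput; it does not: 3-rep matched-load 36M means are within noise, and subsampling the online update to 1/33 changed neither wall time nor accuracy, so the per-fragment online work is not a bottleneck. Run-to-run Spearman spread is 0.00006 (tighter than the prior 0.00043), so the per-thread RNG used for stochastic acceptance adds no meaningful nondeterminism.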
Validation vs C++ salmon 1.12.0
Polyester ground truth (human, 193,760 txps), per-transcript Spearman gap to C++:
- easy/VBEM: 0.0019
- easy/useEM: 0.0012
- hard/VBEM: 0.0002
- hard/useEM: 0.0004
Real-data 36M GEUVADIS C++ parity: NumReads Pearson 0.99955, Spearman 0.97096 (was 0.97052); mapping rate identical (92.18%).
fmt + clippy clean; tests pass.
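For anyone reproducing the correlation figures, the sketch below shows one way a per-transcript Spearman value and a "gap to C++" could be computed. It is an illustration, not the validation script used here: the function names are hypothetical, and the definition of the gap (the reference's Spearman against truth minus the candidate's Spearman against truth) is an assumption about how the numbers above were derived.

```rust
/// 1-based ranks; tied values share the mean of the positions they occupy.
fn average_ranks(xs: &[f64]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..xs.len()).collect();
    idx.sort_by(|&a, &b| xs[a].total_cmp(&xs[b]));
    let mut ranks = vec![0.0; xs.len()];
    let mut i = 0;
    while i < idx.len() {
        // Extend j over the run of values equal to the one at position i.
        let mut j = i;
        while j + 1 < idx.len() && xs[idx[j + 1]] == xs[idx[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for &k in &idx[i..=j] {
            ranks[k] = avg;
        }
        i = j + 1;
    }
    ranks
}

/// Pearson correlation of two equal-length, non-constant slices.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx).powi(2);
        syy += (b - my).powi(2);
    }
    sxy / (sxx * syy).sqrt()
}

/// Spearman correlation: Pearson correlation of the average ranks.
fn spearman(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len());
    pearson(&average_ranks(x), &average_ranks(y))
}

/// ASSUMED definition of the gap: how much lower the candidate's Spearman
/// against ground truth is than the reference implementation's.
fn spearman_gap(truth: &[f64], reference: &[f64], candidate: &[f64]) -> f64 {
    spearman(truth, reference) - spearman(truth, candidate)
}
```

Here `truth` would hold the simulated per-transcript counts, `reference` the C++ salmon estimates, and `candidate` this implementation's estimates, all in the same transcript order.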
The residual gaps are now at/near the run-to-run noise floor across both difficulty levels and both EM modes.
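For readers unfamiliar with the acceptance rule described in the summary, here is a minimal Rust sketch of it. This is not the code in this PR: `Candidate`, `FragLenDist`, `train_fld`, and the `SplitMix64` stand-in RNG are hypothetical names chosen for illustration. The only part taken from the description is the rule itself: draw `r` uniformly in [0, 1) and record the implied fragment length when `r < exp(log_prob)`, where `log_prob` is the abundance-aware per-fragment posterior.

```rust
/// Small SplitMix64 generator so the sketch needs no external crates.
/// Stand-in only; the PR mentions a per-thread RNG but not which one.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// One candidate mapping of a fragment (hypothetical type).
struct Candidate {
    /// Fragment length implied by mapping to this transcript.
    frag_len: usize,
    /// Log posterior of this mapping, assumed to already include the
    /// transcript's online abundance (the role of `transcriptLogCount`).
    log_prob: f64,
}

/// Histogram-backed fragment-length distribution (hypothetical type).
struct FragLenDist {
    counts: Vec<u64>,
}

impl FragLenDist {
    fn new(max_len: usize) -> Self {
        Self { counts: vec![0; max_len + 1] }
    }

    /// Record one observation; lengths beyond `max_len` are ignored.
    fn add_val(&mut self, len: usize) {
        if let Some(c) = self.counts.get_mut(len) {
            *c += 1;
        }
    }
}

/// Stochastic acceptance: each candidate contributes its implied length
/// with probability exp(log_prob), so mappings to the dominant transcript
/// are sampled far more often than mappings to a minor near-duplicate.
fn train_fld(fld: &mut FragLenDist, candidates: &[Candidate], rng: &mut SplitMix64) {
    for c in candidates {
        let r = rng.next_f64();
        if r < c.log_prob.exp() {
            fld.add_val(c.frag_len);
        }
    }
}
```

A fragment shared between two near-duplicate transcripts with posteriors 0.9 and 0.1 would, over many fragments, contribute roughly nine samples at the dominant transcript's implied length for every one at the minor transcript's. That is the concentration effect the deterministic confidence-weighting in #1013 only approximated.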
🤖 Generated with Claude Code