Commit ca98582
authored
trim: honor --expected-insert-size by storing the estimate in I-space (#5)
* trim: store overlap-stats estimate in I-space; honor --expected-insert-size
Fixes #3. `--expected-insert-size` had been silently a no-op: `Pipeline::new`
constructed `OverlapStats::new(hint, 0)` and the constructor required a
non-zero `read_len_hint` to honor the hint, which the production call site
never had at construction time.
The fix is a small refactor of how the overlap-walk estimate is stored.
* **Bug fix.** `OverlapStats::new(hint)` now takes only the hint and stores
it directly — no read-length-at-construction needed. The hint takes
effect on the very first pair.
* **I-space refactor.** The running estimate is now stored in I-space
(`expected_insert: Option<usize>`) instead of shift-space (`center: isize`
with `isize::MIN` sentinel). The shift used at each walk is derived
per-pair from the current pair's R1 length via the new `center_shift`
method. This also makes variable-length inputs (per-cycle trims, mixed
read lengths within a worker) behave correctly, since the shift is
computed from the actual read length each time rather than baked into a
one-time seed. Per-pair arithmetic cost is in-noise (≤0.5%) on a 5x WGS
PE benchmark (53.6M pairs, 8 threads).
* **Hysteresis margin semantics change.** The flap-prevention margin is
now expressed as 5% of the established insert estimate (`old / 20`)
rather than 5% of read length (`read_len / 20`). For typical 150 bp /
300 bp libraries this widens the margin from ~7 bp to ~15 bp — more
semantically meaningful ("don't move unless the running mean is more
than 5% off our current estimate") and symmetric for upward and
downward drift of the same magnitude.
Tests:
* New `expected_insert_size_hint_takes_effect_via_center_shift` — direct
regression for #3; also checks variable-read-length behavior.
* New `overlap_stats_margin_scales_with_insert_estimate` — locks in
proportional-margin behavior at a larger insert size.
* New `overlap_stats_learned_mean_can_replace_hint` — confirms the
running mean can override the user's hint once it drifts far enough.
* Existing `overlap_stats_*` tests adapted from shift-space to I-space
assertions.
* trim: anchor center_shift on R2 length, per shift = I − r2.len() definition
CodeRabbit (PR #5) caught that the I→shift conversion was using R1 length,
while the formal shift definition (and `try_shift_neg`'s i_value derivation)
anchor on R2 length. For symmetric PE Illumina (R1 == R2) the two match; for
asymmetric pairs the walk seeded at the wrong shift and burned extra probe
iterations converging. This was pre-existing — the prior shift-space code
also used R1 — just carried forward by the I-space refactor.
Adds an explicit test pinning the R2-anchored behavior for asymmetric pairs.1 parent e87b0fc commit ca98582
1 file changed
Lines changed: 196 additions & 86 deletions
0 commit comments