Skip to content

Commit ca98582

Browse files
authored
trim: honor --expected-insert-size by storing the estimate in I-space (#5)
* trim: store overlap-stats estimate in I-space; honor --expected-insert-size Fixes #3. `--expected-insert-size` had been silently a no-op: `Pipeline::new` constructed `OverlapStats::new(hint, 0)` and the constructor required a non-zero `read_len_hint` to honor the hint, which the production call site never had at construction time. The fix is a small refactor of how the overlap-walk estimate is stored. * **Bug fix.** `OverlapStats::new(hint)` now takes only the hint and stores it directly — no read-length-at-construction needed. The hint takes effect on the very first pair. * **I-space refactor.** The running estimate is now stored in I-space (`expected_insert: Option<usize>`) instead of shift-space (`center: isize` with `isize::MIN` sentinel). The shift used at each walk is derived per-pair from the current pair's R1 length via the new `center_shift` method. This also makes variable-length inputs (per-cycle trims, mixed read lengths within a worker) behave correctly, since the shift is computed from the actual read length each time rather than baked into a one-time seed. Per-pair arithmetic cost is in-noise (≤0.5%) on a 5x WGS PE benchmark (53.6M pairs, 8 threads). * **Hysteresis margin semantics change.** The flap-prevention margin is now expressed as 5% of the established insert estimate (`old / 20`) rather than 5% of read length (`read_len / 20`). For typical 150 bp / 300 bp libraries this widens the margin from ~7 bp to ~15 bp — more semantically meaningful ("don't move unless the running mean is more than 5% off our current estimate") and symmetric for upward and downward drift of the same magnitude. Tests: * New `expected_insert_size_hint_takes_effect_via_center_shift` — direct regression for #3; also checks variable-read-length behavior. * New `overlap_stats_margin_scales_with_insert_estimate` — locks in proportional-margin behavior at a larger insert size. * New `overlap_stats_learned_mean_can_replace_hint` — confirms the running mean can override the user's hint once it drifts far enough. * Existing `overlap_stats_*` tests adapted from shift-space to I-space assertions. * trim: anchor center_shift on R2 length, per shift = I − r2.len() definition CodeRabbit (PR #5) caught that the I→shift conversion was using R1 length, while the formal shift definition (and `try_shift_neg`'s i_value derivation) anchor on R2 length. For symmetric PE Illumina (R1 == R2) the two match; for asymmetric pairs the walk seeded at the wrong shift and burned extra probe iterations converging. This was pre-existing — the prior shift-space code also used R1 — just carried forward by the I-space refactor. Adds an explicit test pinning the R2-anchored behavior for asymmetric pairs.
1 parent e87b0fc commit ca98582

1 file changed

Lines changed: 196 additions & 86 deletions

File tree

0 commit comments

Comments
 (0)