Restore wasm SIMD autovectorization in ResonatorBank hot loop#5
Merged
Conversation
On wasm32+simd128, `f32::mul_add` lowered to per-lane `fmaf` calls and defeated autovectorization of the EWMA loop, leaving the "SIMD" build roughly 10x slower than it should be. Three changes in bank.rs recover the full speedup: - `mul_add(a, b, c)` helper: unfused (a*b + c) on wasm32+simd128 to keep the vector loop; `f32::mul_add` on native where vector FMA exists. - `process_samples` chunks to the next stabilization boundary so the per-sample modulo check moves out of the hot path. - `process_sample_inner` hoists the ten backing `Vec` fields to local `&mut [f32]` of known length `n`, letting LLVM drop bounds checks, hoist the length-min across slices, and trust disjointedness. Browser bench throughput (ns/sample, 48 kHz x 1 s): bins before after speedup 88 730 54 13.5x 264 2097 143 14.7x 440 3449 238 14.5x 880 6888 470 14.7x Native `cargo bench --bench bank` (aarch64): within noise at every bin count. Co-authored-by: Pengo Wray <pengowray@users.noreply.github.com>
The workspace version stayed at 0.1.0 while `resonators-py` was bumped independently to 0.1.1 for the AVX2/FMA wheel fix. Now that both are at 0.1.1, put `resonators-py` back on `version.workspace = true` and switch `pyproject.toml` to `dynamic = ["version"]` so maturin pulls from Cargo.toml. One place to bump going forward.
There was a problem hiding this comment.
Pull request overview
Restores WASM SIMD autovectorization in the ResonatorBank per-sample hot loop by avoiding f32::mul_add on wasm32+simd128, and tightens the sample-processing path to keep stabilization checks out of the inner loop while hoisting slices to help LLVM remove bounds checks.
Changes:
- Add a
mul_add(a,b,c)helper that usesa*b + conwasm32+simd128(preservingf32::mul_addelsewhere) to avoid wasm SIMD scalarization. - Rework
process_samplesto batch up to the next stabilization boundary and factor the hot loop intoprocess_sample_innerwith slice-hoisted locals. - Bump workspace version to
0.1.1and alignresonators-pyversioning viadynamic = ["version"]andversion.workspace = true.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| crates/resonators/src/bank.rs | Refactors the hot loop to restore wasm SIMD autovectorization and reduces per-sample overhead by batching stabilization checks. |
| crates/resonators-py/pyproject.toml | Switches Python package version to dynamic so maturin reads it from Cargo metadata. |
| crates/resonators-py/Cargo.toml | Unifies Python binding crate version with the workspace version. |
| Cargo.toml | Bumps workspace package version to 0.1.1. |
| Cargo.lock | Updates locked package versions to 0.1.1. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Contributor
|
Ah, that's a more elegant solution, getting autovectorization to kick in again, and much better performance too |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a wasm perf issue diagnosed in #1:
target-feature=+simd128was enabled, but the hot EWMA loop inResonatorBank::process_samplewas only half-vectorizing — LLVM emittedv128loads/stores but scalarized everyf32::mul_addinto per-lanefmafcalls (wasm has no vector FMA). The effect was a ~10× slowdown vs what the code looked like it was doing.Credit to @pengowray for the diagnosis and reproducer.
The fix (three changes in
bank.rs)mul_add(a, b, c)helper, cfg-gated to usea*b + conwasm32+simd128andf32::mul_addeverywhere else. On wasm this gives up fused rounding to keep the vector loop; on native (x86 FMA, aarch64 NEON) the fused instruction is preserved.process_samplesthat chunks up to the next stabilization boundary, so the modulo check no longer fires inside the per-sample loop.process_sample_innerbinds ten&mut [f32]locals of known lengthnonce per call. This lets LLVM drop bounds checks, hoist the length-min across all backing Vecs, and trust disjointedness inside the loop.Perf
Browser bench (
examples/web-bench):Before:
After:
Speedup: ~13.5× at 88 bins, ~14.7× at 264/880, ~14.5× at 440.
Native bench (
cargo bench --bench bank, aarch64): within noise at every bin count (tiny −2% at 88 bins, ±0% elsewhere). Native path intentionally unchanged.Release
Bumps workspace version
0.1.0 → 0.1.1. Re-unifiesresonators-pywith the workspace version now that both are at 0.1.1;pyproject.tomlusesdynamic = ["version"]so maturin pulls fromCargo.toml.