Restore wasm SIMD autovectorization in ResonatorBank hot loop by jhartquist · Pull Request #5 · jhartquist/resonators

jhartquist · 2026-04-24T21:19:53Z

Summary

Fixes a wasm perf issue diagnosed in #1. target-feature=+simd128 was enabled, but the hot EWMA loop in ResonatorBank::process_sample was only half-vectorizing: LLVM emitted v128 loads/stores but scalarized every f32::mul_add into per-lane fmaf calls (baseline wasm simd128 has no vector FMA). The result was a ~10× slowdown relative to what the code appeared to be doing.

Credit to @pengowray for the diagnosis and reproducer.

The fix (three changes in `bank.rs`)

- mul_add(a, b, c) helper, cfg-gated to use a*b + c on wasm32+simd128 and f32::mul_add everywhere else. On wasm this gives up fused rounding to keep the vector loop; on native (x86 FMA, aarch64 NEON) the fused instruction is preserved.
- Batched process_samples that chunks up to the next stabilization boundary, so the modulo check no longer fires inside the per-sample loop.
- Slice-hoisted process_sample_inner binds ten &mut [f32] locals of known length n once per call. This lets LLVM drop bounds checks, hoist the length-min across all backing Vecs, and trust disjointedness inside the loop.
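
A minimal sketch of what a cfg-gated helper of this kind might look like. It is not copied from `bank.rs`: the two-definition layout, the `#[inline(always)]` attribute, and the `ewma_update` loop are assumptions for illustration. Only the name `mul_add(a, b, c)` and the wasm32+simd128 condition come from the description above.

```rust
// wasm32 with simd128: keep the multiply and the add separate so LLVM can
// emit f32x4.mul + f32x4.add instead of per-lane scalar fmaf calls.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
#[inline(always)]
fn mul_add(a: f32, b: f32, c: f32) -> f32 {
    a * b + c
}

// Every other target: keep the fused operation (single rounding).
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
#[inline(always)]
fn mul_add(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}

// Hypothetical EWMA-style update, NOT the crate's actual loop:
// state[i] <- state[i] + alpha * (input[i] - state[i])
fn ewma_update(state: &mut [f32], input: &[f32], alpha: f32) {
    for (s, &x) in state.iter_mut().zip(input) {
        *s = mul_add(alpha, x - *s, *s);
    }
}

fn main() {
    let mut state = [0.0f32; 4];
    ewma_update(&mut state, &[1.0, 2.0, 3.0, 4.0], 0.5);
    println!("{state:?}");
}
```

Because the choice is made at compile time, callers use the same `mul_add` name on every target and the loop body stays identical.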

Perf

Browser bench (examples/web-bench):

Before:

# bins	ns / sample	μs / quantum	% budget
88	730	93	3.50%
264	2097	268	10.07%
440	3449	442	16.56%
880	6888	882	33.06%

After:

# bins	ns / sample	μs / quantum	% budget
88	54	7	0.26%
264	143	18	0.68%
440	238	30	1.14%
880	470	60	2.26%

Speedup: ~13.5× at 88 bins, ~14.7× at 264/880, ~14.5× at 440.
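
For readers checking the table columns, the sketch below shows how the derived figures relate to ns / sample. It assumes a quantum of 128 frames, which is inferred from the numbers (730 ns × 128 ≈ 93 μs) rather than stated in this PR, and the 48 kHz sample rate given in the commit message. The function names are made up for this example.

```rust
const SAMPLE_RATE_HZ: f64 = 48_000.0; // from the commit message
const QUANTUM_FRAMES: f64 = 128.0; // assumption: inferred from the table

/// Time spent processing one quantum, in microseconds.
fn us_per_quantum(ns_per_sample: f64) -> f64 {
    ns_per_sample * QUANTUM_FRAMES / 1_000.0
}

/// Share of the real-time budget for one quantum, in percent.
fn percent_of_budget(ns_per_sample: f64) -> f64 {
    let budget_us = QUANTUM_FRAMES / SAMPLE_RATE_HZ * 1_000_000.0; // about 2667 us
    100.0 * us_per_quantum(ns_per_sample) / budget_us
}

fn speedup(before_ns: f64, after_ns: f64) -> f64 {
    before_ns / after_ns
}

fn main() {
    // (bins, ns/sample before, ns/sample after), taken from the tables above.
    let rows = [
        (88, 730.0, 54.0),
        (264, 2097.0, 143.0),
        (440, 3449.0, 238.0),
        (880, 6888.0, 470.0),
    ];
    for (bins, before, after) in rows {
        println!(
            "{bins:>4} bins: {:.2}% -> {:.2}% of budget, {:.1}x",
            percent_of_budget(before),
            percent_of_budget(after),
            speedup(before, after)
        );
    }
}
```

Under these assumptions the computed values agree with the tables to within rounding of the reported ns / sample figures.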

Native bench (cargo bench --bench bank, aarch64): within noise at every bin count (tiny −2% at 88 bins, ±0% elsewhere). Native path intentionally unchanged.

Release

Bumps workspace version 0.1.0 → 0.1.1. Re-unifies resonators-py with the workspace version now that both are at 0.1.1; pyproject.toml uses dynamic = ["version"] so maturin pulls from Cargo.toml.

On wasm32+simd128, `f32::mul_add` lowered to per-lane `fmaf` calls and defeated autovectorization of the EWMA loop, leaving the "SIMD" build roughly 10x slower than it should be.

Three changes in bank.rs recover the full speedup:

- `mul_add(a, b, c)` helper: unfused (a*b + c) on wasm32+simd128 to keep the vector loop; `f32::mul_add` on native where vector FMA exists.
- `process_samples` chunks to the next stabilization boundary so the per-sample modulo check moves out of the hot path.
- `process_sample_inner` hoists the ten backing `Vec` fields to local `&mut [f32]` of known length `n`, letting LLVM drop bounds checks, hoist the length-min across slices, and trust disjointedness.

Browser bench throughput (ns/sample, 48 kHz x 1 s):

bins	before	after	speedup
88	730	54	13.5x
264	2097	143	14.7x
440	3449	238	14.5x
880	6888	470	14.7x

Native `cargo bench --bench bank` (aarch64): within noise at every bin count.

Co-authored-by: Pengo Wray <pengowray@users.noreply.github.com>
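
The chunking idea in the second bullet can be sketched roughly as follows. This is an illustration under assumptions, not the code from `bank.rs`: the struct fields, the counters, and the `stabilize` method are hypothetical stand-ins, and the real per-sample work is replaced by a counter.

```rust
// Hypothetical stand-in for the bank; field names are illustrative only.
struct Bank {
    stabilize_every: usize, // samples between stabilization passes
    since_stabilize: usize, // samples processed since the last pass
    processed: usize,
    stabilizations: usize,
}

impl Bank {
    fn new(stabilize_every: usize) -> Self {
        assert!(stabilize_every > 0, "interval must be non-zero");
        Bank { stabilize_every, since_stabilize: 0, processed: 0, stabilizations: 0 }
    }

    // Placeholder for the hot per-sample update; no boundary check in here.
    fn process_sample_inner(&mut self, _x: f32) {
        self.processed += 1;
    }

    // Placeholder for the periodic stabilization pass.
    fn stabilize(&mut self) {
        self.stabilizations += 1;
    }

    fn process_samples(&mut self, mut samples: &[f32]) {
        while !samples.is_empty() {
            // Run at most up to the next stabilization boundary.
            let until_boundary = self.stabilize_every - self.since_stabilize;
            let take = until_boundary.min(samples.len());
            let (chunk, rest) = samples.split_at(take);
            for &x in chunk {
                self.process_sample_inner(x);
            }
            // The boundary test runs once per chunk, not once per sample.
            self.since_stabilize += take;
            if self.since_stabilize == self.stabilize_every {
                self.stabilize();
                self.since_stabilize = 0;
            }
            samples = rest;
        }
    }
}

fn main() {
    let mut bank = Bank::new(4);
    bank.process_samples(&[0.0f32; 10]);
    println!("{} samples, {} stabilizations", bank.processed, bank.stabilizations);
}
```

Stabilization still happens after every `stabilize_every` samples regardless of how the input is split across calls; only the placement of the check changes.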

The workspace version stayed at 0.1.0 while `resonators-py` was bumped independently to 0.1.1 for the AVX2/FMA wheel fix. Now that both are at 0.1.1, put `resonators-py` back on `version.workspace = true` and switch `pyproject.toml` to `dynamic = ["version"]` so maturin pulls from Cargo.toml. One place to bump going forward.

Copilot

Pull request overview

Restores WASM SIMD autovectorization in the ResonatorBank per-sample hot loop by avoiding f32::mul_add on wasm32+simd128, and tightens the sample-processing path to keep stabilization checks out of the inner loop while hoisting slices to help LLVM remove bounds checks.

Changes:

- Add a mul_add(a,b,c) helper that uses a*b + c on wasm32+simd128 (preserving f32::mul_add elsewhere) to avoid wasm SIMD scalarization.
- Rework process_samples to batch up to the next stabilization boundary and factor the hot loop into process_sample_inner with slice-hoisted locals.
- Bump workspace version to 0.1.1 and align resonators-py versioning via dynamic = ["version"] and version.workspace = true.
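
To illustrate what "slice-hoisted locals" refers to in the second item, here is a small sketch. It uses three hypothetical fields (`re`, `im`, `alpha`) instead of the ten mentioned in the PR, and the update formula is invented for the example.

```rust
// Hypothetical bank with three backing Vecs; the real struct has more.
struct Bank {
    re: Vec<f32>,
    im: Vec<f32>,
    alpha: Vec<f32>,
}

impl Bank {
    fn process_sample_inner(&mut self, x: f32) {
        let n = self.alpha.len();
        // Re-slice every field to exactly n elements, once per call.
        // Any length mismatch panics here, before the loop starts.
        let re = &mut self.re[..n];
        let im = &mut self.im[..n];
        let alpha = &self.alpha[..n];
        // Each slice is now known to hold n elements, and the two mutable
        // slices come from distinct fields, so they cannot alias.
        for i in 0..n {
            re[i] += alpha[i] * (x - re[i]);
            im[i] += alpha[i] * (-x - im[i]);
        }
    }
}

fn main() {
    let mut bank = Bank {
        re: vec![0.0; 2],
        im: vec![0.0; 2],
        alpha: vec![0.5, 0.25],
    };
    bank.process_sample_inner(1.0);
    println!("{:?} {:?}", bank.re, bank.im);
}
```

With the lengths pinned before the loop, the optimizer has a much easier time proving that every `i` in `0..n` is in bounds for all slices, which is generally a precondition for vectorizing a loop like this.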

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

File	Description
crates/resonators/src/bank.rs	Refactors the hot loop to restore wasm SIMD autovectorization and reduces per-sample overhead by batching stabilization checks.
crates/resonators-py/pyproject.toml	Switches Python package version to dynamic so maturin reads it from Cargo metadata.
crates/resonators-py/Cargo.toml	Unifies Python binding crate version with the workspace version.
Cargo.toml	Bumps workspace package version to `0.1.1`.
Cargo.lock	Updates locked package versions to `0.1.1`.


pengowray · 2026-04-24T23:58:21Z

Ah, that's a more elegant solution, getting autovectorization to kick in again, and much better performance too

jhartquist and others added 2 commits April 24, 2026 14:11

Copilot AI review requested due to automatic review settings April 24, 2026 21:19

Copilot started reviewing on behalf of jhartquist April 24, 2026 21:20

Copilot AI reviewed Apr 24, 2026

jhartquist mentioned this pull request Apr 24, 2026

Add WASM SIMD128 hot loop + amortise stabilisation #1

Closed

jhartquist merged commit 9e3d42e into main Apr 24, 2026
10 checks passed

jhartquist deleted the perf/wasm-autovec branch April 24, 2026 21:28
