
Restore wasm SIMD autovectorization in ResonatorBank hot loop #5

Merged: jhartquist merged 2 commits into main from perf/wasm-autovec on Apr 24, 2026

@jhartquist
Owner

Summary

Fixes a wasm perf issue diagnosed in #1: target-feature=+simd128 was enabled, but the hot EWMA loop in ResonatorBank::process_sample was only half-vectorizing. LLVM emitted v128 loads/stores but scalarized every f32::mul_add into per-lane fmaf calls (wasm has no vector FMA). The result was a roughly 10× slowdown relative to the fully vectorized loop the source appeared to describe.

Credit to @pengowray for the diagnosis and reproducer.

The fix (three changes in bank.rs)

  1. mul_add(a, b, c) helper, cfg-gated to use a*b + c on wasm32+simd128 and f32::mul_add everywhere else. On wasm this gives up fused rounding to keep the vector loop; on native (x86 FMA, aarch64 NEON) the fused instruction is preserved.
  2. Batched process_samples that chunks up to the next stabilization boundary, so the modulo check no longer fires inside the per-sample loop.
  3. Slice-hoisted process_sample_inner binds ten &mut [f32] locals of known length n once per call. This lets LLVM drop bounds checks, hoist the length-min across all backing Vecs, and trust that the slices are disjoint inside the loop.
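
As a rough illustration of the first change, here is a minimal sketch of a cfg-gated helper. The name `mul_add` and the wasm32+simd128 condition come from the description above. The exact cfg predicate, the two-function layout, and the `ewma_update` loop are assumptions made for this example, not the code in bank.rs.

```rust
/// On wasm32 with simd128: unfused multiply-add, which keeps the loop
/// vectorizable at the cost of one extra rounding step.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
#[inline(always)]
fn mul_add(a: f32, b: f32, c: f32) -> f32 {
    a * b + c
}

/// Everywhere else: the fused form from the standard library.
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
#[inline(always)]
fn mul_add(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}

/// Hypothetical EWMA-style update, used only to show the helper in a loop:
/// state[i] += alpha * (input[i] - state[i]).
fn ewma_update(state: &mut [f32], input: &[f32], alpha: f32) {
    let n = state.len().min(input.len());
    let (state, input) = (&mut state[..n], &input[..n]);
    for i in 0..n {
        state[i] = mul_add(alpha, input[i] - state[i], state[i]);
    }
}
```

The two paths can differ in the last bit of the result, since only the fused form rounds once. Whether that matters depends on the caller's tolerance.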

Perf

Browser bench (examples/web-bench):

Before:

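| # bins | ns / sample | μs / quantum | % budget |
|-------:|------------:|-------------:|---------:|
| 88     | 730         | 93           | 3.50%    |
| 264    | 2097        | 268          | 10.07%   |
| 440    | 3449        | 442          | 16.56%   |
| 880    | 6888        | 882          | 33.06%   |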

After:

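| # bins | ns / sample | μs / quantum | % budget |
|-------:|------------:|-------------:|---------:|
| 88     | 54          | 7            | 0.26%    |
| 264    | 143         | 18           | 0.68%    |
| 440    | 238         | 30           | 1.14%    |
| 880    | 470         | 60           | 2.26%    |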

Speedup: ~13.5× at 88 bins, ~14.7× at 264/880, ~14.5× at 440.
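
The μs / quantum and % budget columns appear to follow from ns / sample, assuming a 128-sample render quantum at 48 kHz. The quantum size is an assumption here (the PR text only states 48 kHz), though it reproduces the reported numbers. A small sketch of that arithmetic, with hypothetical function names:

```rust
const SAMPLE_RATE_HZ: f64 = 48_000.0;
// Assumption: one quantum is 128 samples (the Web Audio render quantum).
const QUANTUM_SAMPLES: f64 = 128.0;

/// Processing time for one quantum, in microseconds.
fn us_per_quantum(ns_per_sample: f64) -> f64 {
    ns_per_sample * QUANTUM_SAMPLES / 1_000.0
}

/// Share of the real-time budget used by one quantum, in percent.
fn budget_percent(ns_per_sample: f64) -> f64 {
    // Wall-clock duration of one quantum, in microseconds (about 2666.7).
    let budget_us = QUANTUM_SAMPLES / SAMPLE_RATE_HZ * 1e6;
    us_per_quantum(ns_per_sample) / budget_us * 100.0
}

/// Ratio of the "before" cost to the "after" cost.
fn speedup(before_ns: f64, after_ns: f64) -> f64 {
    before_ns / after_ns
}
```

For the 88-bin row, 730 ns per sample gives about 93 μs per quantum and about 3.5% of the budget, and 730 / 54 gives roughly 13.5×.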

Native bench (cargo bench --bench bank, aarch64): within noise at every bin count (tiny −2% at 88 bins, ±0% elsewhere). Native path intentionally unchanged.

Release

Bumps workspace version 0.1.0 → 0.1.1. Re-unifies resonators-py with the workspace version now that both are at 0.1.1; pyproject.toml uses dynamic = ["version"] so maturin pulls from Cargo.toml.

jhartquist and others added 2 commits April 24, 2026 14:11
On wasm32+simd128, `f32::mul_add` lowered to per-lane `fmaf` calls and
defeated autovectorization of the EWMA loop, leaving the "SIMD" build
roughly 10x slower than it should be. Three changes in bank.rs recover the
full speedup:

- `mul_add(a, b, c)` helper: unfused (a*b + c) on wasm32+simd128 to keep
  the vector loop; `f32::mul_add` on native where vector FMA exists.
- `process_samples` chunks to the next stabilization boundary so the
  per-sample modulo check moves out of the hot path.
- `process_sample_inner` hoists the ten backing `Vec` fields to local
  `&mut [f32]` of known length `n`, letting LLVM drop bounds checks,
  hoist the length-min across slices, and trust disjointedness.
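
A minimal sketch of the chunking idea in the second bullet, under stated assumptions: the `Bank` struct, its fields, and the trivial accumulation in the inner loop are hypothetical stand-ins, not the real `ResonatorBank`. Only the shape of the control flow is meant to match.

```rust
/// Hypothetical stand-in for the bank; all names here are illustrative.
struct Bank {
    samples_since_stabilize: usize,
    stabilize_every: usize,
    acc: f32,
    stabilize_calls: usize,
}

impl Bank {
    fn new(stabilize_every: usize) -> Self {
        assert!(stabilize_every > 0, "stabilization period must be non-zero");
        Bank { samples_since_stabilize: 0, stabilize_every, acc: 0.0, stabilize_calls: 0 }
    }

    /// Process a buffer in chunks that each end at, or before, the next
    /// stabilization boundary, so the inner loop has no counter check.
    fn process_samples(&mut self, mut input: &[f32]) {
        while !input.is_empty() {
            // The counter is always below the period here, so this is >= 1.
            let until_boundary = self.stabilize_every - self.samples_since_stabilize;
            let take = until_boundary.min(input.len());
            let (chunk, rest) = input.split_at(take);

            // Hot loop: placeholder per-sample work, no modulo inside.
            for &x in chunk {
                self.acc += x;
            }

            // Boundary bookkeeping runs once per chunk, not once per sample.
            self.samples_since_stabilize += take;
            if self.samples_since_stabilize == self.stabilize_every {
                self.stabilize();
                self.samples_since_stabilize = 0;
            }
            input = rest;
        }
    }

    /// Placeholder for the periodic stabilization step.
    fn stabilize(&mut self) {
        self.stabilize_calls += 1;
    }
}
```

With a period of 4 and a 10-sample buffer, this sketch runs chunks of 4, 4, and 2 samples and stabilizes twice.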

Browser bench throughput (ns/sample, 48 kHz x 1 s):

  bins  before  after   speedup
   88     730     54    13.5x
  264    2097    143    14.7x
  440    3449    238    14.5x
  880    6888    470    14.7x

Native `cargo bench --bench bank` (aarch64): within noise at every bin count.

Co-authored-by: Pengo Wray <pengowray@users.noreply.github.com>

The workspace version stayed at 0.1.0 while `resonators-py` was bumped
independently to 0.1.1 for the AVX2/FMA wheel fix. Now that both are at
0.1.1, put `resonators-py` back on `version.workspace = true` and switch
`pyproject.toml` to `dynamic = ["version"]` so maturin pulls from Cargo.toml.
One place to bump going forward.
Copilot AI review requested due to automatic review settings April 24, 2026 21:19

Copilot AI left a comment


Pull request overview

Restores WASM SIMD autovectorization in the ResonatorBank per-sample hot loop by avoiding f32::mul_add on wasm32+simd128, and tightens the sample-processing path to keep stabilization checks out of the inner loop while hoisting slices to help LLVM remove bounds checks.

Changes:

  • Add a mul_add(a,b,c) helper that uses a*b + c on wasm32+simd128 (preserving f32::mul_add elsewhere) to avoid wasm SIMD scalarization.
  • Rework process_samples to batch up to the next stabilization boundary and factor the hot loop into process_sample_inner with slice-hoisted locals.
  • Bump workspace version to 0.1.1 and align resonators-py versioning via dynamic = ["version"] and version.workspace = true.
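
The slice hoisting mentioned in the second bullet can be sketched roughly as follows. The `State` struct, its three fields (the PR mentions ten), and the arithmetic are hypothetical. Whether LLVM actually removes the bounds checks depends on the compiler version and flags, so the generated code is worth inspecting.

```rust
/// Hypothetical state with three backing Vecs; names are illustrative.
struct State {
    re: Vec<f32>,
    im: Vec<f32>,
    coeff: Vec<f32>,
}

impl State {
    fn step(&mut self, x: f32) {
        // Take the minimum length once, outside the loop.
        let n = self.re.len().min(self.im.len()).min(self.coeff.len());

        // Reslice every field to exactly n elements. Borrows of distinct
        // fields are disjoint, so the compiler knows these do not alias.
        let re = &mut self.re[..n];
        let im = &mut self.im[..n];
        let coeff = &self.coeff[..n];

        // Every index is below n and every slice has length n, which gives
        // the optimizer what it needs to prove the accesses are in bounds.
        for i in 0..n {
            re[i] += coeff[i] * x;
            im[i] -= coeff[i] * x;
        }
    }
}
```

Indexing `self.re[i]` directly inside the loop would leave the optimizer to re-derive each Vec's length through `self` on every access, which is the pattern the hoisted locals avoid.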

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| crates/resonators/src/bank.rs | Refactors the hot loop to restore wasm SIMD autovectorization and reduces per-sample overhead by batching stabilization checks. |
| crates/resonators-py/pyproject.toml | Switches Python package version to dynamic so maturin reads it from Cargo metadata. |
| crates/resonators-py/Cargo.toml | Unifies Python binding crate version with the workspace version. |
| Cargo.toml | Bumps workspace package version to 0.1.1. |
| Cargo.lock | Updates locked package versions to 0.1.1. |


@jhartquist jhartquist merged commit 9e3d42e into main Apr 24, 2026
10 checks passed
@jhartquist jhartquist deleted the perf/wasm-autovec branch April 24, 2026 21:28
@pengowray
Contributor

Ah, that's a more elegant solution, getting autovectorization to kick in again, and much better performance too
