
Add WASM SIMD128 hot loop + amortise stabilisation #1

Closed

pengowray wants to merge 1 commit into jhartquist:main from pengowray:wasm-simd-pr


Conversation

@pengowray
Contributor

@pengowray pengowray commented Apr 24, 2026

Tried adding this lib to my audio viewer (oversample.com), which pre-caches audio, so performance is more noticeable.

It looks like although you have SIMD support generally, it wasn't getting used in WASM. Adding WASM SIMD made it ~7x faster on the web. This patch also takes a modulo check out of the hot loop for a slightly larger performance gain.

The performance benefits are WASM-specific. LLVM does a better job optimizing on other platforms (e.g. when making a Windows binary it already generates more optimized SIMD/SSE2 code than this explicit path does).

You can compare benchmarks, before (scalar, on main) and after (this patch):

    Before:

    # bins    ns / sample    μs / quantum    % budget
        88            555              71       2.67%
       264           1548             198       7.43%
       440           2615             335      12.55%
       880           5141             658      24.68%

    After:

    # bins    ns / sample    μs / quantum    % budget
        88             80              10       0.38%
       264            233              30       1.12%
       440            399              51       1.92%
       880            808             103       3.88%

Not going to pretend this patch was much more than me typing "make it SIMD for WASM", so I'm not too precious about the specifics.

Thanks for making this lib and introducing me to resonator banks / the Resonate algorithm.

Here's the AI's version of this PR:

Claude Code

The per-sample update in ResonatorBank::process_sample is O(n_bins) of independent per-bin work (EWMA + phasor rotate). On native targets LLVM auto-vectorises this to SSE2 / NEON cleanly, so there's no speedup to be had from explicit SIMD there. On WASM, however, auto-vectorisation to SIMD128 is not reliable, and the default scalar output leaves significant throughput on the table.
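The per-bin work described above could be sketched roughly like this. This is a scalar illustration under assumed names (the crate's actual fields and signatures may differ):

```rust
/// Scalar sketch of one sample's per-bin work: an EWMA of the input mixed
/// against each bin's phasor, then a phasor rotate. All names here are
/// illustrative, not the crate's real API.
fn process_sample_scalar(
    re: &mut [f32], im: &mut [f32],          // per-bin phasor state
    rot_re: &[f32], rot_im: &[f32],          // per-bin rotation step
    acc_re: &mut [f32], acc_im: &mut [f32],  // per-bin EWMA accumulators
    alpha: &[f32],                           // per-bin smoothing factors
    x: f32,                                  // one input sample
) {
    for i in 0..re.len() {
        // EWMA: pull the accumulator toward the input times the phasor
        acc_re[i] += alpha[i] * (x * re[i] - acc_re[i]);
        acc_im[i] += alpha[i] * (x * im[i] - acc_im[i]);
        // Rotate the phasor: (re + i*im) *= (rot_re + i*rot_im)
        let r = re[i] * rot_re[i] - im[i] * rot_im[i];
        let m = re[i] * rot_im[i] + im[i] * rot_re[i];
        re[i] = r;
        im[i] = m;
    }
}
```

Each bin's update touches only that bin's lanes, which is why the loop maps cleanly onto 4-wide SIMD.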

This adds a WASM-SIMD128-only explicit-SIMD path via `wide::f32x4`, cfg-gated to `target_arch = "wasm32", target_feature = "simd128"`. Other targets keep the upstream scalar loop unchanged.

Speedup measured in-browser (Firefox 130, Chrome 131) on a log-spaced bank at 48 kHz sample rate:

bins=  65: 7.9x
bins= 129: 6.4x
bins= 257: 7.0x
bins= 513: 6.8x

`wide` is a portable-SIMD wrapper; we pull it in only on the WASM+SIMD128 target via `[target.'cfg(...)'.dependencies]` so it doesn't affect non-WASM builds or wasm32 without `+simd128`.
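That dependency gate could look something like this in Cargo.toml (the version number here is illustrative):

```toml
# Pull `wide` in only for wasm32 builds compiled with +simd128;
# every other target never sees the dependency.
[target.'cfg(all(target_arch = "wasm32", target_feature = "simd128"))'.dependencies]
wide = "0.7"
```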

Loads / stores use `core::ptr::read_unaligned` / `write_unaligned` with `f32x4` casts — the `f32x4::new([a,b,c,d])` array-literal path generates per-lane inserts and defeats lowering to single 128-bit memory ops.
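The unaligned-access pattern looks roughly like this; the sketch below uses a plain `[f32; 4]` in place of `wide::f32x4` so it compiles on any target, but the cast-and-`read_unaligned` shape is the same:

```rust
/// Load four consecutive f32s without assuming 16-byte alignment.
fn load4(slice: &[f32], i: usize) -> [f32; 4] {
    assert!(i + 4 <= slice.len());
    // One 16-byte read; lowers to a single v128.load on wasm32+simd128.
    unsafe { core::ptr::read_unaligned(slice.as_ptr().add(i) as *const [f32; 4]) }
}

/// Store four f32s back, likewise without an alignment assumption.
fn store4(slice: &mut [f32], i: usize, v: [f32; 4]) {
    assert!(i + 4 <= slice.len());
    unsafe { core::ptr::write_unaligned(slice.as_mut_ptr().add(i) as *mut [f32; 4], v) }
}
```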

Also amortises the stabilisation modulo check: `process_samples` used to test `sample_count % STABILIZE_EVERY == 0` after every sample; now it batches samples between stabilisations, keeping the hot loop slightly tighter. Independent of SIMD, this gave a few percent on its own in native benches.
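The batching idea can be sketched like this (illustrative names, with a trivial stand-in for the per-bin update; a real version would also carry the sample count across calls so chunk boundaries stay aligned):

```rust
/// Hoist the stabilisation check out of the hot loop: process samples in
/// chunks of STABILIZE_EVERY and restabilise once per chunk, instead of
/// testing a modulo on every sample.
const STABILIZE_EVERY: usize = 256;

fn process_samples_batched(
    samples: &[f32],
    state: &mut [f32],
    stabilizations: &mut usize,
) {
    let mut rest = samples;
    while !rest.is_empty() {
        let n = rest.len().min(STABILIZE_EVERY);
        let (chunk, tail) = rest.split_at(n);
        for &x in chunk {
            // Hot loop: no branch or modulo per sample.
            for s in state.iter_mut() {
                *s += x; // stand-in for the real per-bin update
            }
        }
        *stabilizations += 1; // stand-in for renormalising the phasors
        rest = tail;
    }
}
```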

All 21 existing bank + resonator tests pass on the scalar fallback (verified on x86_64 Windows). The WASM SIMD path compiles cleanly with `RUSTFLAGS="-C target-feature=+simd128" cargo check --target wasm32-unknown-unknown`.

Signed-off-by: Pengo Wray <me@pengowray.com>
@pengowray
Contributor Author

pengowray commented Apr 24, 2026

Note: there are no workflow changes in this PR. (If it still says "2 workflows awaiting approval", it's because I edited GitHub Actions to change which branch was published to Pages and then reverted the change.)

@phayes

phayes commented Apr 24, 2026

Could there possibly be a better result for x64 if we used wider SIMD? Maybe AVX (8 floats at once) or AVX-512 (16 at once)?

@jhartquist
Owner

Thanks for opening this! I'm very new to SIMD myself. When I added

    rustflags = ["-C", "target-feature=+simd128"]

I remember seeing a decent speedup, so I figured it was working. I'll dig into it over the next few days.

@pengowray pengowray mentioned this pull request Apr 24, 2026
@pengowray
Contributor Author

@jhartquist No worries. It seems LLVM is shy about vectorizing for WASM and needs the extra hints to nudge it into using SIMD instructions. I only tried it on a whim, without realizing it was already set to attempt it.

@pengowray
Contributor Author

@phayes I had a try with AVX2 and AVX-512 and have put the results in #4

There are some decent performance increases to be had, but not nearly as much, because LLVM's x86 and aarch64 backends are more mature and already auto-vectorize aggressively. Still worthwhile, though.

@jhartquist
Owner

@pengowray I'm about to merge and release #5. I was able to get similar speedups on WASM without bringing in `wide` at this time. Nice catch with the stabilization amortization and with uncovering the performance opportunity, much appreciated!

@jhartquist jhartquist closed this Apr 24, 2026
