
Add WASM SIMD128 hot loop + amortise stabilisation #1

Closed

pengowray wants to merge 1 commit into jhartquist:main from pengowray:wasm-simd-pr


Conversation

@pengowray
Contributor

@pengowray pengowray commented Apr 24, 2026

Tried adding this lib to my audio viewer (oversample.com), which pre-caches audio, so performance is more noticeable.

It looks like although you have SIMD support generally, it wasn't getting used in WASM. Adding WASM SIMD made it ~7x faster on the web. This patch also takes a modulo check out of the hot loop for a slightly larger performance gain.

The performance benefits are WASM-specific. LLVM does a better job optimizing on other platforms (e.g. when making a Windows binary it already generates more optimized SIMD/SSE2 code than this explicit path does).

You can compare benchmarks, before (scalar, on main) and after (this patch):

    Before:

    # bins    ns / sample    μs / quantum    % budget
        88            555              71       2.67%
       264           1548             198       7.43%
       440           2615             335      12.55%
       880           5141             658      24.68%

    After:

    # bins    ns / sample    μs / quantum    % budget
        88             80              10       0.38%
       264            233              30       1.12%
       440            399              51       1.92%
       880            808             103       3.88%

Not going to pretend this patch was much more than me typing "make it SIMD for WASM", so I'm not too precious about the specifics.

Thanks for making this lib and introducing me to resonator banks / the Resonate algorithm.

Here's the AI's version of this PR:

Claude Code

The per-sample update in ResonatorBank::process_sample is O(n_bins) of independent per-bin work (EWMA + phasor rotate). On native targets LLVM auto-vectorises this to SSE2 / NEON cleanly, so there's no speedup to be had from explicit SIMD there. On WASM, however, auto-vectorisation to SIMD128 is not reliable, and the default scalar output leaves significant throughput on the table.
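The per-bin work described above could be sketched roughly like this. This is a scalar illustration under assumed names (the crate's actual fields and signatures may differ):

```rust
/// Scalar sketch of one sample's per-bin work: an EWMA of the input mixed
/// against each bin's phasor, then a phasor rotate. All names here are
/// illustrative, not the crate's real API.
fn process_sample_scalar(
    re: &mut [f32], im: &mut [f32],          // per-bin phasor state
    rot_re: &[f32], rot_im: &[f32],          // per-bin rotation step
    acc_re: &mut [f32], acc_im: &mut [f32],  // per-bin EWMA accumulators
    alpha: &[f32],                           // per-bin smoothing factors
    x: f32,                                  // one input sample
) {
    for i in 0..re.len() {
        // EWMA: pull the accumulator toward the input times the phasor
        acc_re[i] += alpha[i] * (x * re[i] - acc_re[i]);
        acc_im[i] += alpha[i] * (x * im[i] - acc_im[i]);
        // Rotate the phasor: (re + i*im) *= (rot_re + i*rot_im)
        let r = re[i] * rot_re[i] - im[i] * rot_im[i];
        let m = re[i] * rot_im[i] + im[i] * rot_re[i];
        re[i] = r;
        im[i] = m;
    }
}
```

Each bin's update touches only that bin's lanes, which is why the loop maps cleanly onto 4-wide SIMD.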

This adds a WASM-SIMD128-only explicit-SIMD path via `wide::f32x4`, cfg-gated to `target_arch = "wasm32", target_feature = "simd128"`. Other targets keep the upstream scalar loop unchanged.

Speedup measured in-browser (Firefox 130, Chrome 131) on a log-spaced bank at 48 kHz sample rate:

bins=  65: 7.9x
bins= 129: 6.4x
bins= 257: 7.0x
bins= 513: 6.8x

`wide` is a portable-SIMD wrapper; we pull it in only on the WASM+SIMD128 target via `[target.'cfg(...)'.dependencies]` so it doesn't affect non-WASM builds or wasm32 without `+simd128`.
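That dependency gate could look something like this in Cargo.toml (the version number here is illustrative):

```toml
# Pull `wide` in only for wasm32 builds compiled with +simd128;
# every other target never sees the dependency.
[target.'cfg(all(target_arch = "wasm32", target_feature = "simd128"))'.dependencies]
wide = "0.7"
```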

Loads / stores use `core::ptr::read_unaligned` / `write_unaligned` with `f32x4` casts — the `f32x4::new([a,b,c,d])` array-literal path generates per-lane inserts and defeats lowering to single 128-bit memory ops.
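The unaligned-access pattern looks roughly like this; the sketch below uses a plain `[f32; 4]` in place of `wide::f32x4` so it compiles on any target, but the cast-and-`read_unaligned` shape is the same:

```rust
/// Load four consecutive f32s without assuming 16-byte alignment.
fn load4(slice: &[f32], i: usize) -> [f32; 4] {
    assert!(i + 4 <= slice.len());
    // One 16-byte read; lowers to a single v128.load on wasm32+simd128.
    unsafe { core::ptr::read_unaligned(slice.as_ptr().add(i) as *const [f32; 4]) }
}

/// Store four f32s back, likewise without an alignment assumption.
fn store4(slice: &mut [f32], i: usize, v: [f32; 4]) {
    assert!(i + 4 <= slice.len());
    unsafe { core::ptr::write_unaligned(slice.as_mut_ptr().add(i) as *mut [f32; 4], v) }
}
```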

Also amortises the stabilisation modulo check: `process_samples` used to test `sample_count % STABILIZE_EVERY == 0` after every sample; now it batches samples between stabilisations, keeping the hot loop slightly tighter. Independent of SIMD, this gave a few percent on its own in native benches.
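The batching idea can be sketched like this (illustrative names, with a trivial stand-in for the per-bin update; a real version would also carry the sample count across calls so chunk boundaries stay aligned):

```rust
/// Hoist the stabilisation check out of the hot loop: process samples in
/// chunks of STABILIZE_EVERY and restabilise once per chunk, instead of
/// testing a modulo on every sample.
const STABILIZE_EVERY: usize = 256;

fn process_samples_batched(
    samples: &[f32],
    state: &mut [f32],
    stabilizations: &mut usize,
) {
    let mut rest = samples;
    while !rest.is_empty() {
        let n = rest.len().min(STABILIZE_EVERY);
        let (chunk, tail) = rest.split_at(n);
        for &x in chunk {
            // Hot loop: no branch or modulo per sample.
            for s in state.iter_mut() {
                *s += x; // stand-in for the real per-bin update
            }
        }
        *stabilizations += 1; // stand-in for renormalising the phasors
        rest = tail;
    }
}
```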

All 21 existing bank + resonator tests pass on the scalar fallback (verified on x86_64 Windows). The WASM SIMD path compiles cleanly with `RUSTFLAGS="-C target-feature=+simd128" cargo check --target wasm32-unknown-unknown`.

Signed-off-by: Pengo Wray <me@pengowray.com>
@pengowray
Contributor Author

pengowray commented Apr 24, 2026

Note: there are no workflow changes in this PR. (If it still says "2 workflows awaiting approval", it's because I edited GitHub Actions to change which branch was published to Pages and then reverted the change.)

@phayes

phayes commented Apr 24, 2026

Could there possibly be a better result for x64 if we used wider SIMD? Maybe AVX (8 floats at once) or AVX-512 (16 at once)?

@jhartquist
Owner

Thanks for opening this! I'm very new to SIMD myself. When I added

    rustflags = ["-C", "target-feature=+simd128"]

I remember seeing a decent speedup, so I figured it was working. I'll dig into it over the next few days.

@pengowray pengowray mentioned this pull request Apr 24, 2026
@pengowray
Contributor Author

@jhartquist No worries. It seems LLVM is shy about vectorizing for WASM and needs the extra hints to nudge it into using SIMD instructions. I only tried it on a whim, without realizing it was already set to attempt it.

@pengowray
Contributor Author

@phayes I had a try with AVX2 and AVX-512 and have put the results in #4

There are some decent performance increases to be had, but not nearly as much, because LLVM's x86 and aarch64 backends are more mature and already auto-vectorize aggressively. Still worthwhile, though.

@jhartquist
Owner

@pengowray I'm about to merge and release #5. I was able to get similar speedups on WASM without bringing in `wide` at this time. Nice catch with the stabilization amortization and with uncovering the performance opportunity, much appreciated!

@jhartquist jhartquist closed this Apr 24, 2026
