
Avx bench #4

Open
pengowray wants to merge 5 commits into jhartquist:main from pengowray:avx-bench

Conversation

@pengowray
Contributor

I was going to just make this a reply to @phayes but it's basically a PR now.

@jhartquist Feel free to merge this or otherwise implement it your own way.

@phayes said in #1:

> Could there possibly be a better result for x64 if we used a wider SIMD? Maybe AVX (8 at once) or AVX-512 (16 at once)?

My computers are too old to support AVX-512, so I had to benchmark on AWS. There are some performance improvements to be had, though not nearly as much as from the WASM SIMD tweaks.

This patch implements AVX2 and AVX-512F, with runtime dispatch so it hopefully doesn't require separate binaries.

Claude Code summary of results:

AVX2 / AVX-512F + runtime dispatch on x86_64

Branch: avx-bench (4 commits on top of wasm-simd-pr).

What's in the branch

  1. Explicit AVX2+FMA path — 8 bins/iter via __m256 + vfmadd231ps, gated with #[target_feature(enable = "avx2,fma")] so it compiles on any x86_64 build.
  2. Explicit AVX-512F path — 16 bins/iter via __m512. The up-to-15-bin tail uses a masked SIMD iteration (_mm512_maskz_loadu_ps + _mm512_mask_storeu_ps with k-mask = (1 << tail) - 1). Fault suppression on masked-off lanes means we never read past the end of the Vec, and the mask-store leaves bytes past the tail untouched. Replacing the former scalar tail loop turned out to be the largest single perf win at small bin counts (AVX-512 time at n=88 fell 48%).
  3. Runtime dispatch — a Backend::{Scalar, Avx2, Avx512} enum with Backend::detect() (a runtime is_x86_feature_detected! check), stored on ResonatorBank. new() auto-selects the widest supported; set_backend(..) overrides; backend() inspects. Callers don't change — bank.process_sample(s) now runs AVX-512 on Sapphire Rapids / Zen 4, AVX2+FMA on Haswell / Zen 2-3, and scalar elsewhere. Dispatch sits at the process_sample / process_samples boundary: a match on a single field set at construction, so the branch predictor locks onto it after the first call. A sketch follows this list.
  4. Correctness — avx2_matches_scalar and avx512_matches_scalar compare each backend against the scalar reference across bin counts that exercise both the vector body and the tail, including n=1, n=8, and n=15 for the AVX-512 "masked tail only, no body" case, plus dispatched_matches_forced_scalar end-to-end. All 27 tests pass on SPR.
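A minimal sketch of the dispatch shape (the real code is in bank.rs; the kernel methods are the #[target_feature] functions described above):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Backend { Scalar, Avx2, Avx512 }

impl Backend {
    /// Widest backend the host CPU supports, probed at runtime.
    pub fn detect() -> Self {
        if is_x86_feature_detected!("avx512f") {
            Backend::Avx512
        } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            Backend::Avx2
        } else {
            Backend::Scalar
        }
    }
}

impl ResonatorBank {
    pub fn process_sample(&mut self, s: f32) {
        // One predictable branch per sample: self.backend is set once at
        // construction, so the predictor locks onto it immediately.
        match self.backend {
            Backend::Scalar => self.process_sample_scalar(s),
            // SAFETY: these variants are only stored after the matching
            // is_x86_feature_detected! probe succeeded.
            Backend::Avx2 => unsafe { self.process_sample_avx2(s) },
            Backend::Avx512 => unsafe { self.process_sample_avx512(s) },
        }
    }
}
```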

Setup

AWS EC2 c7i.large (Intel Sapphire Rapids, 2 vCPU, shared tenancy), Amazon Linux 2023, rustc 1.95 stable, RUSTFLAGS="-C target-cpu=native". Scalar column compiled with native target — LLVM is free to auto-vectorise to AVX-512.

Shared-tenancy variance is ~5–15% between otherwise-identical runs (criterion reports spurious "Performance regressed/improved" from background contention); headline numbers below are single-run medians.

Results (44.1 kHz × 1 s signal)

| n_bins | scalar (auto-vec) | AVX2+FMA | AVX-512F | auto-dispatch | speedup vs scalar |
|-------:|------------------:|---------:|---------:|--------------:|------------------:|
| 88 | 4.77 ms | 1.84 ms | 1.74 ms | 1.40 ms | 3.4× |
| 264 | 7.48 ms | 5.05 ms | 4.33 ms | 5.20 ms | 1.7× |
| 440 | 11.74 ms | 8.28 ms | 8.71 ms | 7.40 ms | 1.6× |
| 880 | 20.52 ms | 17.83 ms | 13.16 ms | 14.05 ms | 1.6× |

Peak throughput is 32.6 Melem/s at n=88 on the auto-dispatch path; the bank/dispatch group confirms backend = Avx512 at runtime.

Dispatch lands within noise of forced AVX-512 (sometimes faster, sometimes slower across sizes — that's the shared-instance noise floor, not a real signal). The per-sample match cost is statistically zero at this kernel granularity.

Observations

  1. Explicit SIMD beats LLVM auto-vec by 1.6–3.4× even with -C target-cpu=native. The existing comment in bank.rs — "explicit SIMD matches or slightly regresses vs auto-vec on x86_64" — was written against 128-bit / SSE and does not hold at 256-bit or 512-bit width for this kernel.
  2. Biggest win at the smallest bin count (3.4× at n=88). Small banks fit in L1 / the register file, and auto-vec's per-bin setup overhead weighs heaviest at small sizes.
  3. AVX-512 vs AVX2 scales sub-linearly — ~1.3× best case, not 2×. The hot loop has 10 loads + 6 stores per chunk; SPR has 3 load + 2 store ports per cycle, so it's memory-port bound, not FMA-bound. Wider lanes don't help once the memory pipe is saturated.

Why the win is smaller than the WASM SIMD PR

The WASM SIMD128 PR reports 6–8× speedup over scalar; here we see 1.6–3.4×. The delta is almost entirely in the baseline, not the SIMD path:

  1. The x86 scalar baseline is already partly-vectorised. With -C target-cpu=native, LLVM emits AVX-512 instructions for the "scalar" loop — imperfectly, but definitely emitting wide SIMD. So x86 "scalar" is a partly-vectorised baseline, and explicit SIMD is improving on an already-non-trivial floor. On WASM, engines' JITs (Liftoff / V8 TurboFan / SpiderMonkey Ion) are far more conservative about auto-vectorising f32 loops — the WASM "scalar" baseline runs closer to truly one lane at a time.
  2. Width gains don't compound. WASM SIMD128 is 1→4 lanes (4× ceiling). x86 here is 8→16 lanes (2× ceiling, AVX2 to AVX-512). Doubling an already-wide baseline is less dramatic than quadrupling a scalar one.
  3. This kernel is memory-port bound on native x86. AVX-512 vs AVX2 gives ~1.3× at saturation because SPR's 3 load + 2 store ports per cycle cap memory throughput regardless of FMA width. On WASM, the "ports" (virtual execution pipeline) are simpler, so wider SIMD has more room to help.

Absolute throughput is still dramatically higher on native x86_64 (tens of Melem/s on a 2-vCPU VM) — it's the relative improvement over scalar that's smaller, because the scalar path on x86 wasn't as pessimised to begin with.

Not tested

  • Zen 4 (c7a): AVX-512 is implemented as double-pumped 256-bit units. Expect dispatch ≈ AVX-512 ≈ AVX2 on Zen 4 (wins come from front-end decode, not FMA throughput).
  • Ice Lake (c6i): native 512-bit, more AVX-512 frequency throttling on sustained loads. Expected between SPR and Zen 4.
  • Dedicated / c7i.2xlarge+: would remove the shared-tenancy noise.

The per-sample update in ResonatorBank::process_sample is O(n_bins)
of independent per-bin work (EWMA + phasor rotate). On native
targets LLVM auto-vectorises this to SSE2 / NEON cleanly, so there's
no speedup to be had from explicit SIMD there. On WASM, however,
auto-vectorisation to SIMD128 is not reliable, and the default
scalar output leaves significant throughput on the table.

This adds a WASM-SIMD128-only explicit-SIMD path via `wide::f32x4`,
cfg-gated to `target_arch = "wasm32", target_feature = "simd128"`.
Other targets keep the upstream scalar loop unchanged.
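The gate itself is small (method names hypothetical; the scalar path is the unchanged upstream loop):

    impl ResonatorBank {
        pub fn process_sample(&mut self, s: f32) {
            #[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
            self.process_sample_simd128(s);
            #[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
            self.process_sample_scalar(s);
        }
    }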

Speedup measured in-browser (Firefox 130, Chrome 131) on a
log-spaced bank at 48 kHz sample rate:

    bins=  65: 7.9x
    bins= 129: 6.4x
    bins= 257: 7.0x
    bins= 513: 6.8x

`wide` is a portable-SIMD wrapper; we pull it in only on the
WASM+SIMD128 target via `[target.'cfg(...)'.dependencies]` so it
doesn't affect non-WASM builds or wasm32 without +simd128.
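In Cargo.toml that looks roughly like this (version illustrative):

    [target.'cfg(all(target_arch = "wasm32", target_feature = "simd128"))'.dependencies]
    wide = "0.7"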

Loads / stores use `core::ptr::read_unaligned` / `write_unaligned`
with `f32x4` casts — the `f32x4::new([a,b,c,d])` array-literal
path generates per-lane inserts and defeats lowering to single
128-bit memory ops.
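Sketch of the helpers this implies (names hypothetical; callers must pass pointers to at least four valid f32s):

    use wide::f32x4;

    #[inline(always)]
    unsafe fn load4(p: *const f32) -> f32x4 {
        // One unaligned 128-bit load; lowers to a single v128.load.
        core::ptr::read_unaligned(p as *const f32x4)
    }

    #[inline(always)]
    unsafe fn store4(p: *mut f32, v: f32x4) {
        // One unaligned 128-bit store; lowers to a single v128.store.
        core::ptr::write_unaligned(p as *mut f32x4, v)
    }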

Also amortises the stabilisation modulo check: `process_samples`
used to test `sample_count % STABILIZE_EVERY == 0` after every
sample; now it batches samples between stabilisations, keeping
the hot loop slightly tighter. Independent of SIMD, this gave
a few percent on its own in native benches.
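The batching shape, as a sketch (field and constant names hypothetical, inside the bank's impl):

    pub fn process_samples(&mut self, samples: &[f32]) {
        let mut i = 0;
        while i < samples.len() {
            // Samples remaining until the next stabilisation point.
            let until = STABILIZE_EVERY - (self.sample_count % STABILIZE_EVERY);
            let end = (i + until).min(samples.len());
            for &s in &samples[i..end] {
                self.process_sample(s); // hot loop: no modulo check inside
            }
            self.sample_count += end - i;
            if self.sample_count % STABILIZE_EVERY == 0 {
                self.stabilize();
            }
            i = end;
        }
    }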

All 21 existing bank + resonator tests pass on the scalar fallback
(verified on x86_64 Windows). The WASM SIMD path compiles cleanly with
`RUSTFLAGS="-C target-feature=+simd128" cargo check --target wasm32-unknown-unknown`.

Signed-off-by: Pengo Wray <me@pengowray.com>

Adds explicit-SIMD companions to the scalar per-sample hot loop, in
the same style as the existing WASM SIMD128 path but for x86_64:

  - process_sample_avx2: 8 bins / iter via __m256 + vfmadd231ps
    (target_feature = "avx2,fma")
  - process_sample_avx512: 16 bins / iter via __m512
    (target_feature = "avx512f")

Both are #[doc(hidden)] pub unsafe methods, compiled unconditionally
on x86_64 via #[target_feature] so a single bench binary can compare
all three paths. The caller is responsible for checking CPU support
with is_x86_feature_detected! before invoking.
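Reduced to the intrinsics pattern, the AVX2 path looks like this (a sketch over plain slices; the real method updates the bank's per-bin state and handles the tail):

    use core::arch::x86_64::*;

    /// y[i] += a * x[i], eight lanes per iteration; scalar tail elided.
    /// Caller must have verified avx2+fma support before invoking.
    #[doc(hidden)]
    #[target_feature(enable = "avx2,fma")]
    pub unsafe fn fma_accumulate_avx2(a: f32, x: &[f32], y: &mut [f32]) {
        let va = _mm256_set1_ps(a);
        for (xc, yc) in x.chunks_exact(8).zip(y.chunks_exact_mut(8)) {
            let vx = _mm256_loadu_ps(xc.as_ptr());
            let vy = _mm256_loadu_ps(yc.as_ptr());
            // _mm256_fmadd_ps lowers to vfmadd231ps in this pattern.
            _mm256_storeu_ps(yc.as_mut_ptr(), _mm256_fmadd_ps(va, vx, vy));
        }
    }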

The bench (benches/bank.rs) now has three groups — bank/scalar,
bank/avx2, bank/avx512 — and uses runtime feature detection to skip
groups the host CPU can't run. `just bench-avx` wraps the invocation
with -C target-cpu=native so the SCALAR path gets LLVM's widest
auto-vectorisation available, which is the fair baseline for
comparing against the hand-rolled paths.
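The skip looks roughly like this (group, bank, and black_box come from the surrounding bench code):

    if is_x86_feature_detected!("avx512f") {
        group.bench_function("avx512", |b| {
            b.iter(|| unsafe { bank.process_sample_avx512(black_box(0.5)) });
        });
    } else {
        eprintln!("skipping bank/avx512: host CPU lacks avx512f");
    }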

Correctness: new avx2_matches_scalar / avx512_matches_scalar tests
compare each SIMD backend against the scalar loop across a range of
bin counts that exercise both the vector body and scalar tail
(including n % 8 != 0 and n % 16 != 0). Tolerance is a relative
1e-4 to account for FMA vs separate mul+add rounding.
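The check is essentially this (the near-zero floor here is an assumption, not necessarily what the tests use):

    fn assert_close(simd: f32, scalar: f32) {
        // Relative 1e-4, with a small absolute floor for near-zero values.
        let tol = 1e-4_f32 * scalar.abs().max(1e-6);
        assert!((simd - scalar).abs() <= tol, "simd={simd} scalar={scalar}");
    }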

Quick measurement on Ryzen 5 3600 (Zen 2, no AVX-512) at 264 bins,
44.1 kHz, 1s signal:

    bank/scalar/264   5.50 ms    8.02 Melem/s
    bank/avx2/264     3.85 ms   11.47 Melem/s   (1.43x)
    bank/avx512/264   skipped — CPU lacks avx512f

The ~1.4x win from explicit AVX2 over -C target-cpu=native scalar
contradicts the comment in bank.rs claiming auto-vec matches; worth
re-checking at the other bin sizes and on a more modern x86_64.
AVX-512 speedup is untested here (no hardware); branch exists so it
can be benched on a Zen 4 / Ice Lake+ / Sapphire Rapids box or a
cloud VM.

At the bench bin counts 88, 264, and 440, the AVX-512 path was spending
~35% of total sample time in the up-to-15-element scalar tail
(bin 80-87 at n=88, bin 256-263 at n=264, bin 432-439 at n=440).
This cost ~40 ns/sample on top of a ~70 ns SIMD body and was the
main reason AVX-512 lost to AVX2 at small bin counts on c7i.large.

Replace the scalar tail with one additional AVX-512F iteration
gated by a k-mask of `(1 << tail) - 1`. Loads use
`_mm512_maskz_loadu_ps` (fault suppression on masked-off lanes, so
we don't read past the Vec); stores use `_mm512_mask_storeu_ps`
so the garbage from zero-loaded lanes never hits the buffers.
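Sketched on a stand-alone buffer (the real code applies this to the bank's state arrays):

    use core::arch::x86_64::*;

    /// Scale the 0..=15 elements past `body_end` in one masked iteration.
    #[target_feature(enable = "avx512f")]
    unsafe fn scale_tail_avx512(buf: &mut [f32], g: f32, body_end: usize) {
        let tail = buf.len() - body_end;
        debug_assert!(tail < 16);
        if tail == 0 {
            return;
        }
        let k: __mmask16 = (1u16 << tail) - 1; // select only live lanes
        let p = buf.as_mut_ptr().add(body_end);
        // Fault suppression: masked-off lanes are never read, so there is
        // no access past the end of the Vec.
        let v = _mm512_maskz_loadu_ps(k, p);
        let r = _mm512_mul_ps(v, _mm512_set1_ps(g));
        // Masked store: bytes beyond the tail are left untouched.
        _mm512_mask_storeu_ps(p, k, r);
    }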

No change to the 880-bin case (which is already a multiple of 16
and has zero tail). Expected improvement at 88/264/440: tail goes
from ~40 ns to one SIMD-body-width iteration (~5 ns), cutting
per-sample cost by 30-35% on those sizes.

avx512_matches_scalar test already covers tail lengths via
n_bins in [1, 8, 15, 16, 17, 23, 64, 88]; the 1/8/15 cases now
exercise the "no SIMD body, masked tail only" path directly.

Criterion was warning about not hitting 50 samples within its 5 s
default target for the larger bin counts (17 ms/iter at 880 bins ×
50 samples plus warmup just barely exceeds 5 s). Numbers were still
statistically fine but the log was noisy.

10 s covers the slowest (scalar/880) with headroom for all three
backends, and silences the warnings without changing the method.
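Concretely, one builder call per group (c is the bench's &mut Criterion):

    use std::time::Duration;

    let mut group = c.benchmark_group("bank");
    group.measurement_time(Duration::from_secs(10)); // default is 5 s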
Adds a `Backend` enum (Scalar / Avx2 / Avx512) and a `backend` field
on `ResonatorBank`. `ResonatorBank::new` now calls `Backend::detect`
to auto-select the widest backend the host CPU supports at runtime,
and `process_sample` / `process_samples` dispatch via a match on
`self.backend`. Callers get the best-available SIMD path without
needing to know which one their CPU supports — the existing public
API (`process_sample`, `process_samples`) is unchanged for the
caller, but now runs AVX-512F on Sapphire Rapids / Zen 4, AVX2+FMA
on Haswell / Zen+, and the scalar loop elsewhere.

API additions (x86_64 only):

  - `pub enum Backend { Scalar, Avx2, Avx512 }`
  - `Backend::detect()` — widest supported on the host
  - `Backend::is_supported()` — check a specific variant
  - `ResonatorBank::backend()` — getter for the active backend
  - `ResonatorBank::set_backend(Backend) -> Result<(), Backend>` —
    override, errors if unsupported (useful for tests, or to avoid
    AVX-512 frequency throttling on sustained workloads)
  - `process_sample_scalar` (#[doc(hidden)]) — forces the scalar path
    regardless of `backend`, used by the bench to measure scalar
    throughput without the dispatch match in the way
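Hypothetical usage, e.g. from a test (construction elided):

    fn force_scalar(bank: &mut ResonatorBank) {
        assert_eq!(bank.backend(), Backend::detect()); // new() picked widest
        bank.set_backend(Backend::Scalar)
            .expect("Scalar is always supported");
        assert_eq!(bank.backend(), Backend::Scalar);
    }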

Dispatch is at the `process_sample` / `process_samples` boundary,
not inside the inner loops: the match runs once per sample (or
once per batch for block processing), and the branch is predictable
because `self.backend` is set once at construction. The inner
kernels stay `#[target_feature]`-gated and get inlined within their
respective arms by LLVM.

The bench adds a `bank/dispatch` group using the default (auto-
dispatched) API; the existing forced `bank/scalar`, `bank/avx2`,
`bank/avx512` groups stay for direct-backend comparison. Expected
delta between `bank/dispatch` and the forced backend matching
`Backend::detect()` is small (the dispatch match).

Tests:
  - `default_backend_is_widest_supported` — `new` picks `detect()`
  - `set_backend_scalar_always_ok`
  - `set_backend_unsupported_errors` — skipped if the host supports all backends
  - `dispatched_matches_forced_scalar` — end-to-end dispatch
    correctness across 33 bins × 1024 samples
