Avx bench #4
Open
pengowray wants to merge 5 commits into
Conversation
The per-sample update in ResonatorBank::process_sample is O(n_bins)
of independent per-bin work (EWMA + phasor rotate). On native
targets LLVM auto-vectorises this to SSE2 / NEON cleanly, so there's
no speedup to be had from explicit SIMD there. On WASM, however,
auto-vectorisation to SIMD128 is not reliable, and the default
scalar output leaves significant throughput on the table.
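For orientation, a scalar sketch of the per-bin work being described (field names and the exact update formula are illustrative stand-ins, not the crate's real `process_sample` body):

```rust
// Stand-in for the independent per-bin work: rotate the bin's complex
// phasor by a fixed per-bin angle, then fold the driven value into an
// exponential moving average. The real kernel differs in detail.
fn update_bin(re: &mut f32, im: &mut f32, cos_w: f32, sin_w: f32,
              mag: &mut f32, x: f32, alpha: f32) {
    // phasor rotate: (re + i*im) *= (cos_w + i*sin_w)
    let new_re = *re * cos_w - *im * sin_w;
    let new_im = *re * sin_w + *im * cos_w;
    *re = new_re;
    *im = new_im;
    // EWMA of the driven response (stand-in formula)
    *mag += alpha * (x * new_re - *mag);
}
```

Each bin only touches its own state, which is why the loop vectorises so cleanly once the compiler (or explicit SIMD) is allowed to.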
This adds a WASM-SIMD128-only explicit-SIMD path via `wide::f32x4`,
cfg-gated behind `all(target_arch = "wasm32", target_feature = "simd128")`.
Other targets keep the upstream scalar loop unchanged.
Speedup measured in-browser (Firefox 130, Chrome 131) on a
log-spaced bank at 48 kHz sample rate:
bins= 65: 7.9x
bins= 129: 6.4x
bins= 257: 7.0x
bins= 513: 6.8x
`wide` is a portable-SIMD wrapper; we pull it in only on the
WASM+SIMD128 target via `[target.'cfg(...)'.dependencies]` so it
doesn't affect non-WASM builds or wasm32 without +simd128.
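A sketch of the gate plus the same stand-in update from above, four lanes at a time (module and function names here are hypothetical, not the PR's actual layout):

```rust
// Only compiled for wasm32 with +simd128, so other targets never see `wide`.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
mod simd128 {
    use wide::f32x4;

    /// Four bins per call: phasor rotate + EWMA, mirroring the scalar
    /// stand-in above lane-for-lane.
    pub fn update_bins4(re: &mut f32x4, im: &mut f32x4,
                        cos_w: f32x4, sin_w: f32x4,
                        mag: &mut f32x4, x: f32, alpha: f32) {
        let new_re = *re * cos_w - *im * sin_w;
        let new_im = *re * sin_w + *im * cos_w;
        *re = new_re;
        *im = new_im;
        let (a, xv) = (f32x4::splat(alpha), f32x4::splat(x));
        *mag = *mag + a * (xv * new_re - *mag);
    }
}
```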
Loads / stores use `core::ptr::read_unaligned` / `write_unaligned`
with `f32x4` casts — the `f32x4::new([a,b,c,d])` array-literal
path generates per-lane inserts and defeats lowering to single
128-bit memory ops.
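A sketch of that load/store pattern, assuming small helpers of this shape (names are illustrative; bounds are the caller's responsibility, as in any unchecked SIMD inner loop):

```rust
use wide::f32x4;

/// Load 4 consecutive f32 values starting at `slice[i]` as one 128-bit vector.
/// The pointer cast + read_unaligned lowers to a single v128.load, whereas
/// f32x4::new([a, b, c, d]) tends to produce per-lane inserts.
#[inline(always)]
unsafe fn load4(slice: &[f32], i: usize) -> f32x4 {
    debug_assert!(i + 4 <= slice.len());
    core::ptr::read_unaligned(slice.as_ptr().add(i) as *const f32x4)
}

/// Store one 128-bit vector back to `slice[i..i + 4]` (same caveats).
#[inline(always)]
unsafe fn store4(slice: &mut [f32], i: usize, v: f32x4) {
    debug_assert!(i + 4 <= slice.len());
    core::ptr::write_unaligned(slice.as_mut_ptr().add(i) as *mut f32x4, v);
}
```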
Also amortises the stabilisation modulo check: `process_samples`
used to test `sample_count % STABILIZE_EVERY == 0` after every
sample; now it batches samples between stabilisations, keeping
the hot loop slightly tighter. Independent of SIMD, this gave
a few percent on its own in native benches.
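Roughly, the batching change looks like this (a sketch with placeholder types, a placeholder `STABILIZE_EVERY`, and the partial-chunk bookkeeping across calls elided):

```rust
/// Placeholder types: not the crate's real signatures.
struct Bank {
    sample_count: u64,
    // per-bin state elided
}

impl Bank {
    fn process_sample(&mut self, _s: f32) { /* per-bin EWMA + rotate */ }
    fn stabilize(&mut self) { /* re-normalise phasors */ }

    fn process_samples(&mut self, samples: &[f32]) {
        const STABILIZE_EVERY: usize = 1024; // placeholder period
        for chunk in samples.chunks(STABILIZE_EVERY) {
            for &s in chunk {
                self.process_sample(s); // hot loop: no `%` test per sample
                self.sample_count += 1;
            }
            self.stabilize(); // once per chunk instead of a per-sample check
        }
    }
}
```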
All 21 existing bank + resonator tests pass on the scalar fallback
(verified on x86_64 Windows). WASM SIMD path compiles cleanly with
`RUSTFLAGS="-C target-feature=+simd128" cargo check --target
wasm32-unknown-unknown`.
Signed-off-by: Pengo Wray <me@pengowray.com>
Adds explicit-SIMD companions to the scalar per-sample hot loop, in
the same style as the existing WASM SIMD128 path but for x86_64:
- process_sample_avx2: 8 bins / iter via __m256 + vfmadd231ps
(target_feature = "avx2,fma")
- process_sample_avx512: 16 bins / iter via __m512
(target_feature = "avx512f")
Both are #[doc(hidden)] pub unsafe methods, compiled unconditionally
on x86_64 via #[target_feature] so a single bench binary can compare
all three paths. The caller is responsible for checking CPU support
with is_x86_feature_detected! before invoking.
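The general shape is roughly the following; the kernel here is an axpy-style stand-in, not the bank's actual per-bin update, and the function name is made up:

```rust
/// Compiled unconditionally on x86_64; the caller must have verified
/// is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma").
#[cfg(target_arch = "x86_64")]
#[doc(hidden)]
#[target_feature(enable = "avx2,fma")]
pub unsafe fn axpy_avx2(dst: &mut [f32], src: &[f32], a: f32) {
    use core::arch::x86_64::*;
    let n = dst.len().min(src.len());
    let va = _mm256_set1_ps(a);
    let mut i = 0;
    // 8 lanes per iteration via fused multiply-add.
    while i + 8 <= n {
        let d = _mm256_loadu_ps(dst.as_ptr().add(i));
        let s = _mm256_loadu_ps(src.as_ptr().add(i));
        _mm256_storeu_ps(dst.as_mut_ptr().add(i), _mm256_fmadd_ps(va, s, d));
        i += 8;
    }
    // Scalar tail for n % 8 != 0.
    while i < n {
        dst[i] = a.mul_add(src[i], dst[i]);
        i += 1;
    }
}
```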
The bench (benches/bank.rs) now has three groups — bank/scalar,
bank/avx2, bank/avx512 — and uses runtime feature detection to skip
groups the host CPU can't run. `just bench-avx` wraps the invocation
with -C target-cpu=native so the SCALAR path gets LLVM's widest
auto-vectorisation available, which is the fair baseline for
comparing against the hand-rolled paths.
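A sketch of the detection guard such a bench group might use (group and function names are placeholders, not the bench file's exact code):

```rust
#[cfg(target_arch = "x86_64")]
fn bench_avx2(c: &mut criterion::Criterion) {
    // Skip the whole group if the host can't execute the kernel.
    if !(is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma")) {
        eprintln!("skipping bank/avx2: host CPU lacks avx2+fma");
        return;
    }
    let mut group = c.benchmark_group("bank/avx2");
    // group.bench_function(...) would call the unsafe AVX2 kernel here.
    group.finish();
}
```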
Correctness: new avx2_matches_scalar / avx512_matches_scalar tests
compare each SIMD backend against the scalar loop across a range of
bin counts that exercise both the vector body and scalar tail
(including n % 8 != 0 and n % 16 != 0). Tolerance is a relative
1e-4 to account for FMA vs separate mul+add rounding.
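The comparison boils down to something like this helper (a sketch; the helper name and the small absolute floor for near-zero bins are assumptions, the 1e-4 relative tolerance is the one quoted above):

```rust
/// Assert every SIMD output matches the scalar reference within a relative
/// tolerance of 1e-4 (FMA vs separate mul+add rounding), with a tiny
/// absolute floor so near-zero bins don't fail spuriously.
fn assert_matches_scalar(simd: &[f32], scalar: &[f32]) {
    assert_eq!(simd.len(), scalar.len());
    for (i, (&a, &b)) in simd.iter().zip(scalar.iter()).enumerate() {
        let tol = 1e-4_f32 * b.abs().max(1e-6);
        assert!(
            (a - b).abs() <= tol,
            "bin {i}: simd={a}, scalar={b}, tol={tol}"
        );
    }
}
```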
Quick measurement on Ryzen 5 3600 (Zen 2, no AVX-512) at 264 bins,
44.1 kHz, 1s signal:
bank/scalar/264 5.50 ms 8.02 Melem/s
bank/avx2/264 3.85 ms 11.47 Melem/s (1.43x)
bank/avx512/264 skipped — CPU lacks avx512f
The ~1.4x win from explicit AVX2 over -C target-cpu=native scalar
contradicts the comment in bank.rs claiming auto-vec matches; worth
re-checking at the other bin sizes and on a more modern x86_64.
AVX-512 speedup is untested here (no hardware); branch exists so it
can be benched on a Zen 4 / Ice Lake+ / Sapphire Rapids box or a
cloud VM.
At the bench bin counts 88, 264, 440 the AVX-512 path was spending ~35% of total sample time in the up-to-15-element scalar tail (bins 80-87 at n=88, bins 256-263 at n=264, bins 432-439 at n=440). This cost ~40 ns/sample on top of a ~70 ns SIMD body and was the main reason AVX-512 lost to AVX2 at small bin counts on c7i.large.

Replace the scalar tail with one additional AVX-512F iteration gated by a k-mask of `(1 << tail) - 1`. Loads use `_mm512_maskz_loadu_ps` (fault suppression on masked-off lanes, so we don't read past the Vec); stores use `_mm512_mask_storeu_ps` so the garbage from zero-loaded lanes never hits the buffers. No change to the 880-bin case, which is already a multiple of 16 and has zero tail.

Expected improvement at 88/264/440: the tail goes from ~40 ns to one SIMD-body-width iteration (~5 ns), cutting per-sample cost by 30-35% on those sizes. The avx512_matches_scalar test already covers tail lengths via n_bins in [1, 8, 15, 16, 17, 23, 64, 88]; the 1/8/15 cases now exercise the "no SIMD body, masked tail only" path directly.
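As a sketch of the k-mask pattern (an axpy-style stand-in rather than the bank's real kernel; the function name is made up):

```rust
/// Caller must have verified is_x86_feature_detected!("avx512f").
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn axpy_avx512_masked_tail(dst: &mut [f32], src: &[f32], a: f32) {
    use core::arch::x86_64::*;
    let n = dst.len().min(src.len());
    let va = _mm512_set1_ps(a);
    let mut i = 0;
    // Full-width body: 16 lanes per iteration.
    while i + 16 <= n {
        let d = _mm512_loadu_ps(dst.as_ptr().add(i));
        let s = _mm512_loadu_ps(src.as_ptr().add(i));
        _mm512_storeu_ps(dst.as_mut_ptr().add(i), _mm512_fmadd_ps(va, s, d));
        i += 16;
    }
    let tail = n - i;
    if tail > 0 {
        // k-mask with the low `tail` bits set; masked-off lanes are neither
        // read (fault suppression) nor written.
        let k: __mmask16 = ((1u32 << tail) - 1) as __mmask16;
        let d = _mm512_maskz_loadu_ps(k, dst.as_ptr().add(i));
        let s = _mm512_maskz_loadu_ps(k, src.as_ptr().add(i));
        _mm512_mask_storeu_ps(dst.as_mut_ptr().add(i), k, _mm512_fmadd_ps(va, s, d));
    }
}
```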
Criterion was warning about not hitting 50 samples within its 5 s default target for the larger bin counts (17 ms/iter at 880 bins × 50 samples plus warmup just barely exceeds 5 s). Numbers were still statistically fine but the log was noisy. 10 s covers the slowest (scalar/880) with headroom for all three backends, and silences the warnings without changing the method.
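Assuming the stock Criterion config hook, the change amounts to something like:

```rust
use std::time::Duration;
use criterion::Criterion;

// Raise the measurement window from the 5 s default to 10 s so the slowest
// group (scalar/880) fits 50 samples without warnings.
fn config() -> Criterion {
    Criterion::default().measurement_time(Duration::from_secs(10))
}
// criterion_group! { name = benches; config = config(); targets = /* ... */ }
```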
Adds a `Backend` enum (Scalar / Avx2 / Avx512) and a `backend` field
on `ResonatorBank`. `ResonatorBank::new` now calls `Backend::detect`
to auto-select the widest backend the host CPU supports at runtime,
and `process_sample` / `process_samples` dispatch via a match on
`self.backend`. Callers get the best-available SIMD path without
needing to know which one their CPU supports — the existing public
API (`process_sample`, `process_samples`) is unchanged for the
caller, but now runs AVX-512F on Sapphire Rapids / Zen 4, AVX2+FMA
on Haswell / Zen+, and the scalar loop elsewhere.
API additions (x86_64 only):
- `pub enum Backend { Scalar, Avx2, Avx512 }`
- `Backend::detect()` — widest supported on the host
- `Backend::is_supported()` — check a specific variant
- `ResonatorBank::backend()` — getter for the active backend
- `ResonatorBank::set_backend(Backend) -> Result<(), Backend>` —
override, errors if unsupported (useful for tests, or to avoid
AVX-512 frequency throttling on sustained workloads)
- `process_sample_scalar` (#[doc(hidden)]) — forces the scalar path
regardless of `backend`, used by the bench to measure scalar
throughput without the dispatch match in the way
Dispatch is at the `process_sample` / `process_samples` boundary,
not inside the inner loops: the match runs once per sample (or
once per batch for block processing), and the branch is predictable
because `self.backend` is set once at construction. The inner
kernels stay `#[target_feature]`-gated and get inlined within their
respective arms by LLVM.
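Putting the pieces together, the dispatch shape is roughly the following (a sketch with stubbed kernels and elided per-bin state, not the crate's actual code):

```rust
#[cfg(target_arch = "x86_64")]
mod dispatch_sketch {
    #[derive(Clone, Copy, PartialEq, Eq, Debug)]
    pub enum Backend { Scalar, Avx2, Avx512 }

    impl Backend {
        /// Widest backend the host CPU supports at runtime.
        pub fn detect() -> Self {
            if is_x86_feature_detected!("avx512f") {
                Backend::Avx512
            } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
                Backend::Avx2
            } else {
                Backend::Scalar
            }
        }
    }

    pub struct ResonatorBank {
        backend: Backend,
        // per-bin state elided
    }

    impl ResonatorBank {
        pub fn new(_n_bins: usize) -> Self {
            Self { backend: Backend::detect() }
        }

        pub fn backend(&self) -> Backend { self.backend }

        pub fn process_sample(&mut self, x: f32) {
            // One predictable branch per sample; `backend` is fixed at construction.
            match self.backend {
                Backend::Scalar => self.process_sample_scalar(x),
                // SAFETY: detect()/set_backend() verified the required CPU features.
                Backend::Avx2 => unsafe { self.process_sample_avx2(x) },
                Backend::Avx512 => unsafe { self.process_sample_avx512(x) },
            }
        }

        fn process_sample_scalar(&mut self, _x: f32) { /* scalar loop */ }

        #[target_feature(enable = "avx2,fma")]
        unsafe fn process_sample_avx2(&mut self, _x: f32) { /* 8 bins / iter */ }

        #[target_feature(enable = "avx512f")]
        unsafe fn process_sample_avx512(&mut self, _x: f32) { /* 16 bins / iter */ }
    }
}
```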
The bench adds a `bank/dispatch` group using the default (auto-
dispatched) API; the existing forced `bank/scalar`, `bank/avx2`,
`bank/avx512` groups stay for direct-backend comparison. Expected
delta between `bank/dispatch` and the forced backend matching
`Backend::detect()` is small (the dispatch match).
Tests:
- `default_backend_is_widest_supported` — `new` picks `detect()`
- `set_backend_scalar_always_ok`
- `set_backend_unsupported_errors` — skipped if the host supports every backend
- `dispatched_matches_forced_scalar` — end-to-end dispatch
correctness across 33 bins × 1024 samples
I was going to just make this a reply to @phayes but it's basically a PR now.
@jhartquist Feel free to merge this or otherwise implement it your own way.
@phayes said in #1:
My computers are too old to support AVX-512, so I had to benchmark on AWS. There's some performance improvements to be had. Not nearly as much as the WASM SIMD tweaks though.
This patch implements AVX2 and AVX-512F, with runtime dispatch so it hopefully doesn't require separate binaries.
Claude Code summary of results:
AVX2 / AVX-512F + runtime dispatch on x86_64
Branch: avx-bench (4 commits on top of wasm-simd-pr).
What's in the branch
- AVX2 path: `__m256` + `vfmadd231ps`, `#[target_feature(enable = "avx2,fma")]` so it compiles on any x86_64 build.
- AVX-512F path: `__m512`. The up-to-15-bin tail uses a masked SIMD iteration (`_mm512_maskz_loadu_ps` + `_mm512_mask_storeu_ps` with k-mask = `(1 << tail) - 1`). Fault suppression on masked-off lanes means we never read past the `Vec` end, and the mask-store leaves the bytes past the tail untouched. This replaces what was a scalar loop, which turned out to be the largest single perf win on small bin counts (n=88 AVX-512 time fell 48%).
- Runtime dispatch: a `Backend::{Scalar, Avx2, Avx512}` enum with `Backend::detect()` (runtime `is_x86_feature_detected!` check), stored on `ResonatorBank`. `new()` auto-selects the widest supported; `set_backend(..)` overrides; `backend()` inspects. Callers don't change: `bank.process_sample(s)` now runs AVX-512 on Sapphire Rapids / Zen 4, AVX2+FMA on Haswell / Zen 2-3, and scalar elsewhere. Dispatch is at the `process_sample` / `process_samples` boundary (a match on a single field set at construction, so the branch predictor locks onto it after the first call).
- Tests: `avx2_matches_scalar` and `avx512_matches_scalar` compare each backend against the scalar reference across bin counts that exercise both vector body and tail, including n=1, n=8, n=15 for the AVX-512 "masked tail only, no body" case, plus `dispatched_matches_forced_scalar` end-to-end. All 27 tests pass on SPR.
Setup
AWS EC2 `c7i.large` (Intel Sapphire Rapids, 2 vCPU, shared tenancy), Amazon Linux 2023, rustc 1.95 stable, `RUSTFLAGS="-C target-cpu=native"`. The scalar column is compiled with the native target, so LLVM is free to auto-vectorise to AVX-512. Shared-tenancy variance is ~5-15% between otherwise-identical runs (criterion reports spurious "Performance regressed/improved" from background contention); headline numbers below are single-run medians.
Results (44.1 kHz × 1 s signal)
Peak throughput 32.6 Melem/s at n=88 on the auto-dispatch path.
`bank/dispatch` backend = Avx512, confirmed at runtime. Dispatch lands within noise of forced AVX-512 (sometimes faster, sometimes slower across sizes; that's the shared-instance noise floor, not a real signal). The per-sample match cost is statistically zero at this kernel granularity.
Observations
Explicit AVX2 / AVX-512 beats the auto-vectorised scalar baseline even with `-C target-cpu=native`. The existing comment in `bank.rs` ("explicit SIMD matches or slightly regresses vs auto-vec on x86_64") was written against 128-bit / SSE and does not hold at 256-bit or 512-bit width for this kernel.
Why the win is smaller than the WASM SIMD PR
The WASM SIMD128 PR reports 6–8× speedup over scalar; here we see 1.6–3.4×. The delta is almost entirely in the baseline, not the SIMD path:
With `-C target-cpu=native`, LLVM emits AVX-512 instructions for the "scalar" loop (imperfectly, but definitely wide SIMD), so the x86 "scalar" figure is a partly-vectorised baseline and explicit SIMD is improving on an already non-trivial floor. On WASM, engines' JITs (Liftoff / V8 TurboFan / SpiderMonkey Ion) are far more conservative about auto-vectorising f32 loops, so the WASM "scalar" baseline runs much closer to truly one lane at a time. Absolute throughput is still dramatically higher on native x86_64 (tens of Melem/s on a 2-vCPU VM); it's the relative improvement over scalar that's smaller, because the scalar path on x86 wasn't as pessimised to begin with.
Not tested
- Zen 4 (e.g. AWS `c7a`): AVX-512 implemented as double-pumped 256-bit units. Expect dispatch ≈ AVX-512 ≈ close to AVX2 (wins from front-end decode, not FMA throughput).
- Ice Lake (e.g. AWS `c6i`): native 512-bit, more AVX-512 frequency throttling on sustained loads. Expected to land between SPR and Zen 4.
- A dedicated or larger instance (`c7i.2xlarge`+): would remove the shared-tenancy noise.