
Avx bench #4

Open
pengowray wants to merge 5 commits into jhartquist:main from pengowray:avx-bench

Conversation

@pengowray
Contributor

I was going to just make this a reply to @phayes but it's basically a PR now.

@jhartquist Feel free to merge this or otherwise implement it your own way.

@phayes said in #1:

> Could there possibly be a better result for x64 if we used a wider SIMD? Maybe AVX (8 at once) or AVX-512 (16 at once)?

My computers are too old to support AVX-512, so I had to benchmark on AWS. There are some performance improvements to be had, though not nearly as much as from the WASM SIMD tweaks.

This patch implements AVX2 and AVX-512F, with runtime dispatch so it hopefully doesn't require separate binaries.

Claude Code summary of results:

AVX2 / AVX-512F + runtime dispatch on x86_64

Branch: avx-bench (4 commits on top of wasm-simd-pr).

What's in the branch

  1. Explicit AVX2+FMA path — 8 bins/iter via __m256 + vfmadd231ps, gated with #[target_feature(enable = "avx2,fma")] so it compiles on any x86_64 build.
  2. Explicit AVX-512F path — 16 bins/iter via __m512. The up-to-15-bin tail uses a masked SIMD iteration (_mm512_maskz_loadu_ps + _mm512_mask_storeu_ps with k-mask = (1 << tail) - 1). Fault suppression on masked-off lanes means we never read past the end of the Vec, and the mask-store leaves bytes past the tail untouched. Replacing the former scalar tail loop turned out to be the largest single perf win at small bin counts (AVX-512 time at n=88 fell 48%).
  3. Runtime dispatch — a Backend::{Scalar, Avx2, Avx512} enum with Backend::detect() (a runtime is_x86_feature_detected! check), stored on ResonatorBank. new() auto-selects the widest supported; set_backend(..) overrides; backend() inspects. Callers don't change — bank.process_sample(s) now runs AVX-512 on Sapphire Rapids / Zen 4, AVX2+FMA on Haswell / Zen 2-3, and scalar elsewhere. Dispatch sits at the process_sample / process_samples boundary: a match on a single field set at construction, so the branch predictor locks onto it after the first call. A sketch follows this list.
  4. Correctness — avx2_matches_scalar and avx512_matches_scalar compare each backend against the scalar reference across bin counts that exercise both the vector body and the tail, including n=1, n=8, and n=15 for the AVX-512 "masked tail only, no body" case, plus dispatched_matches_forced_scalar end-to-end. All 27 tests pass on SPR.
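A minimal sketch of the dispatch shape (the real code is in bank.rs; the kernel methods are the #[target_feature] functions described above):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Backend { Scalar, Avx2, Avx512 }

impl Backend {
    /// Widest backend the host CPU supports, probed at runtime.
    pub fn detect() -> Self {
        if is_x86_feature_detected!("avx512f") {
            Backend::Avx512
        } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            Backend::Avx2
        } else {
            Backend::Scalar
        }
    }
}

impl ResonatorBank {
    pub fn process_sample(&mut self, s: f32) {
        // One predictable branch per sample: self.backend is set once at
        // construction, so the predictor locks onto it immediately.
        match self.backend {
            Backend::Scalar => self.process_sample_scalar(s),
            // SAFETY: these variants are only stored after the matching
            // is_x86_feature_detected! probe succeeded.
            Backend::Avx2 => unsafe { self.process_sample_avx2(s) },
            Backend::Avx512 => unsafe { self.process_sample_avx512(s) },
        }
    }
}
```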

Setup

AWS EC2 c7i.large (Intel Sapphire Rapids, 2 vCPU, shared tenancy), Amazon Linux 2023, rustc 1.95 stable, RUSTFLAGS="-C target-cpu=native". Scalar column compiled with native target — LLVM is free to auto-vectorise to AVX-512.

Shared-tenancy variance is ~5–15% between otherwise-identical runs (criterion reports spurious "Performance regressed/improved" from background contention); headline numbers below are single-run medians.

Results (44.1 kHz × 1 s signal)

| n_bins | scalar (auto-vec) | AVX2+FMA | AVX-512F | auto-dispatch | speedup vs scalar |
|-------:|------------------:|---------:|---------:|--------------:|------------------:|
| 88 | 4.77 ms | 1.84 ms | 1.74 ms | 1.40 ms | 3.4× |
| 264 | 7.48 ms | 5.05 ms | 4.33 ms | 5.20 ms | 1.7× |
| 440 | 11.74 ms | 8.28 ms | 8.71 ms | 7.40 ms | 1.6× |
| 880 | 20.52 ms | 17.83 ms | 13.16 ms | 14.05 ms | 1.6× |

Peak throughput is 32.6 Melem/s at n=88 on the auto-dispatch path; the bank/dispatch group confirms backend = Avx512 at runtime.

Dispatch lands within noise of forced AVX-512 (sometimes faster, sometimes slower across sizes — that's the shared-instance noise floor, not a real signal). The per-sample match cost is statistically zero at this kernel granularity.

Observations

  1. Explicit SIMD beats LLVM auto-vec by 1.6–3.4× even with -C target-cpu=native. The existing comment in bank.rs — "explicit SIMD matches or slightly regresses vs auto-vec on x86_64" — was written against 128-bit / SSE and does not hold at 256-bit or 512-bit width for this kernel.
  2. Biggest win at the smallest bin count (3.4× at n=88). Small banks fit in L1 / the register file, and auto-vec's per-bin setup overhead weighs heaviest at small sizes.
  3. AVX-512 vs AVX2 scales sub-linearly — ~1.3× best case, not 2×. The hot loop has 10 loads + 6 stores per chunk; SPR has 3 load + 2 store ports per cycle, so it's memory-port bound, not FMA-bound. Wider lanes don't help once the memory pipe is saturated.

Why the win is smaller than the WASM SIMD PR

The WASM SIMD128 PR reports 6–8× speedup over scalar; here we see 1.6–3.4×. The delta is almost entirely in the baseline, not the SIMD path:

  1. The x86 scalar baseline is already partly-vectorised. With -C target-cpu=native, LLVM emits AVX-512 instructions for the "scalar" loop — imperfectly, but definitely emitting wide SIMD. So x86 "scalar" is a partly-vectorised baseline, and explicit SIMD is improving on an already-non-trivial floor. On WASM, engines' JITs (Liftoff / V8 TurboFan / SpiderMonkey Ion) are far more conservative about auto-vectorising f32 loops — the WASM "scalar" baseline runs closer to truly one lane at a time.
  2. Width gains don't compound. WASM SIMD128 is 1→4 lanes (4× ceiling). x86 here is 8→16 lanes (2× ceiling, AVX2 to AVX-512). Doubling an already-wide baseline is less dramatic than quadrupling a scalar one.
  3. This kernel is memory-port bound on native x86. AVX-512 vs AVX2 gives ~1.3× at saturation because SPR's 3 load + 2 store ports per cycle cap memory throughput regardless of FMA width. On WASM, the "ports" (virtual execution pipeline) are simpler, so wider SIMD has more room to help.

Absolute throughput is still dramatically higher on native x86_64 (tens of Melem/s on a 2-vCPU VM) — it's the relative improvement over scalar that's smaller, because the scalar path on x86 wasn't as pessimised to begin with.

Not tested

  • Zen 4 (c7a): AVX-512 is implemented as double-pumped 256-bit units. Expect dispatch ≈ AVX-512 ≈ AVX2 on Zen 4 (wins come from front-end decode, not FMA throughput).
  • Ice Lake (c6i): native 512-bit, more AVX-512 frequency throttling on sustained loads. Expected between SPR and Zen 4.
  • Dedicated / c7i.2xlarge+: would remove the shared-tenancy noise.

The per-sample update in ResonatorBank::process_sample is O(n_bins)
of independent per-bin work (EWMA + phasor rotate). On native
targets LLVM auto-vectorises this to SSE2 / NEON cleanly, so there's
no speedup to be had from explicit SIMD there. On WASM, however,
auto-vectorisation to SIMD128 is not reliable, and the default
scalar output leaves significant throughput on the table.

This adds a WASM-SIMD128-only explicit-SIMD path via `wide::f32x4`,
cfg-gated to `target_arch = "wasm32", target_feature = "simd128"`.
Other targets keep the upstream scalar loop unchanged.
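The gate itself is small (method names hypothetical; the scalar path is the unchanged upstream loop):

    impl ResonatorBank {
        pub fn process_sample(&mut self, s: f32) {
            #[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
            self.process_sample_simd128(s);
            #[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
            self.process_sample_scalar(s);
        }
    }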

Speedup measured in-browser (Firefox 130, Chrome 131) on a
log-spaced bank at 48 kHz sample rate:

    bins=  65: 7.9x
    bins= 129: 6.4x
    bins= 257: 7.0x
    bins= 513: 6.8x

`wide` is a portable-SIMD wrapper; we pull it in only on the
WASM+SIMD128 target via `[target.'cfg(...)'.dependencies]` so it
doesn't affect non-WASM builds or wasm32 without +simd128.
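In Cargo.toml that looks roughly like this (version illustrative):

    [target.'cfg(all(target_arch = "wasm32", target_feature = "simd128"))'.dependencies]
    wide = "0.7"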

Loads / stores use `core::ptr::read_unaligned` / `write_unaligned`
with `f32x4` casts — the `f32x4::new([a,b,c,d])` array-literal
path generates per-lane inserts and defeats lowering to single
128-bit memory ops.
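Sketch of the helpers this implies (names hypothetical; callers must pass pointers to at least four valid f32s):

    use wide::f32x4;

    #[inline(always)]
    unsafe fn load4(p: *const f32) -> f32x4 {
        // One unaligned 128-bit load; lowers to a single v128.load.
        core::ptr::read_unaligned(p as *const f32x4)
    }

    #[inline(always)]
    unsafe fn store4(p: *mut f32, v: f32x4) {
        // One unaligned 128-bit store; lowers to a single v128.store.
        core::ptr::write_unaligned(p as *mut f32x4, v)
    }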

Also amortises the stabilisation modulo check: `process_samples`
used to test `sample_count % STABILIZE_EVERY == 0` after every
sample; now it batches samples between stabilisations, keeping
the hot loop slightly tighter. Independent of SIMD, this gave
a few percent on its own in native benches.
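The batching shape, as a sketch (field and constant names hypothetical, inside the bank's impl):

    pub fn process_samples(&mut self, samples: &[f32]) {
        let mut i = 0;
        while i < samples.len() {
            // Samples remaining until the next stabilisation point.
            let until = STABILIZE_EVERY - (self.sample_count % STABILIZE_EVERY);
            let end = (i + until).min(samples.len());
            for &s in &samples[i..end] {
                self.process_sample(s); // hot loop: no modulo check inside
            }
            self.sample_count += end - i;
            if self.sample_count % STABILIZE_EVERY == 0 {
                self.stabilize();
            }
            i = end;
        }
    }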

All 21 existing bank + resonator tests pass on the scalar fallback
(verified on x86_64 Windows). The WASM SIMD path compiles cleanly with
`RUSTFLAGS="-C target-feature=+simd128" cargo check --target wasm32-unknown-unknown`.

Signed-off-by: Pengo Wray <me@pengowray.com>

Adds explicit-SIMD companions to the scalar per-sample hot loop, in
the same style as the existing WASM SIMD128 path but for x86_64:

  - process_sample_avx2: 8 bins / iter via __m256 + vfmadd231ps
    (target_feature = "avx2,fma")
  - process_sample_avx512: 16 bins / iter via __m512
    (target_feature = "avx512f")

Both are #[doc(hidden)] pub unsafe methods, compiled unconditionally
on x86_64 via #[target_feature] so a single bench binary can compare
all three paths. The caller is responsible for checking CPU support
with is_x86_feature_detected! before invoking.
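Reduced to the intrinsics pattern, the AVX2 path looks like this (a sketch over plain slices; the real method updates the bank's per-bin state and handles the tail):

    use core::arch::x86_64::*;

    /// y[i] += a * x[i], eight lanes per iteration; scalar tail elided.
    /// Caller must have verified avx2+fma support before invoking.
    #[doc(hidden)]
    #[target_feature(enable = "avx2,fma")]
    pub unsafe fn fma_accumulate_avx2(a: f32, x: &[f32], y: &mut [f32]) {
        let va = _mm256_set1_ps(a);
        for (xc, yc) in x.chunks_exact(8).zip(y.chunks_exact_mut(8)) {
            let vx = _mm256_loadu_ps(xc.as_ptr());
            let vy = _mm256_loadu_ps(yc.as_ptr());
            // _mm256_fmadd_ps lowers to vfmadd231ps in this pattern.
            _mm256_storeu_ps(yc.as_mut_ptr(), _mm256_fmadd_ps(va, vx, vy));
        }
    }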

The bench (benches/bank.rs) now has three groups — bank/scalar,
bank/avx2, bank/avx512 — and uses runtime feature detection to skip
groups the host CPU can't run. `just bench-avx` wraps the invocation
with -C target-cpu=native so the SCALAR path gets LLVM's widest
auto-vectorisation available, which is the fair baseline for
comparing against the hand-rolled paths.
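The skip looks roughly like this (group, bank, and black_box come from the surrounding bench code):

    if is_x86_feature_detected!("avx512f") {
        group.bench_function("avx512", |b| {
            b.iter(|| unsafe { bank.process_sample_avx512(black_box(0.5)) });
        });
    } else {
        eprintln!("skipping bank/avx512: host CPU lacks avx512f");
    }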

Correctness: new avx2_matches_scalar / avx512_matches_scalar tests
compare each SIMD backend against the scalar loop across a range of
bin counts that exercise both the vector body and scalar tail
(including n % 8 != 0 and n % 16 != 0). Tolerance is a relative
1e-4 to account for FMA vs separate mul+add rounding.
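The check is essentially this (the near-zero floor here is an assumption, not necessarily what the tests use):

    fn assert_close(simd: f32, scalar: f32) {
        // Relative 1e-4, with a small absolute floor for near-zero values.
        let tol = 1e-4_f32 * scalar.abs().max(1e-6);
        assert!((simd - scalar).abs() <= tol, "simd={simd} scalar={scalar}");
    }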

Quick measurement on Ryzen 5 3600 (Zen 2, no AVX-512) at 264 bins,
44.1 kHz, 1s signal:

    bank/scalar/264   5.50 ms    8.02 Melem/s
    bank/avx2/264     3.85 ms   11.47 Melem/s   (1.43x)
    bank/avx512/264   skipped — CPU lacks avx512f

The ~1.4x win from explicit AVX2 over -C target-cpu=native scalar
contradicts the comment in bank.rs claiming auto-vec matches; worth
re-checking at the other bin sizes and on a more modern x86_64.
AVX-512 speedup is untested here (no hardware); branch exists so it
can be benched on a Zen 4 / Ice Lake+ / Sapphire Rapids box or a
cloud VM.

At the bench bin counts 88, 264, and 440, the AVX-512 path was spending
~35% of total sample time in the up-to-15-element scalar tail
(bin 80-87 at n=88, bin 256-263 at n=264, bin 432-439 at n=440).
This cost ~40 ns/sample on top of a ~70 ns SIMD body and was the
main reason AVX-512 lost to AVX2 at small bin counts on c7i.large.

Replace the scalar tail with one additional AVX-512F iteration
gated by a k-mask of `(1 << tail) - 1`. Loads use
`_mm512_maskz_loadu_ps` (fault suppression on masked-off lanes, so
we don't read past the Vec); stores use `_mm512_mask_storeu_ps`
so the garbage from zero-loaded lanes never hits the buffers.
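Sketched on a stand-alone buffer (the real code applies this to the bank's state arrays):

    use core::arch::x86_64::*;

    /// Scale the 0..=15 elements past `body_end` in one masked iteration.
    #[target_feature(enable = "avx512f")]
    unsafe fn scale_tail_avx512(buf: &mut [f32], g: f32, body_end: usize) {
        let tail = buf.len() - body_end;
        debug_assert!(tail < 16);
        if tail == 0 {
            return;
        }
        let k: __mmask16 = (1u16 << tail) - 1; // select only live lanes
        let p = buf.as_mut_ptr().add(body_end);
        // Fault suppression: masked-off lanes are never read, so there is
        // no access past the end of the Vec.
        let v = _mm512_maskz_loadu_ps(k, p);
        let r = _mm512_mul_ps(v, _mm512_set1_ps(g));
        // Masked store: bytes beyond the tail are left untouched.
        _mm512_mask_storeu_ps(p, k, r);
    }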

No change to the 880-bin case (which is already a multiple of 16
and has zero tail). Expected improvement at 88/264/440: tail goes
from ~40 ns to one SIMD-body-width iteration (~5 ns), cutting
per-sample cost by 30-35% on those sizes.

avx512_matches_scalar test already covers tail lengths via
n_bins in [1, 8, 15, 16, 17, 23, 64, 88]; the 1/8/15 cases now
exercise the "no SIMD body, masked tail only" path directly.

Criterion was warning about not hitting 50 samples within its 5 s
default target for the larger bin counts (17 ms/iter at 880 bins ×
50 samples plus warmup just barely exceeds 5 s). Numbers were still
statistically fine but the log was noisy.

10 s covers the slowest (scalar/880) with headroom for all three
backends, and silences the warnings without changing the method.
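Concretely, one builder call per group (c is the bench's &mut Criterion):

    use std::time::Duration;

    let mut group = c.benchmark_group("bank");
    group.measurement_time(Duration::from_secs(10)); // default is 5 s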
Adds a `Backend` enum (Scalar / Avx2 / Avx512) and a `backend` field
on `ResonatorBank`. `ResonatorBank::new` now calls `Backend::detect`
to auto-select the widest backend the host CPU supports at runtime,
and `process_sample` / `process_samples` dispatch via a match on
`self.backend`. Callers get the best-available SIMD path without
needing to know which one their CPU supports — the existing public
API (`process_sample`, `process_samples`) is unchanged for the
caller, but now runs AVX-512F on Sapphire Rapids / Zen 4, AVX2+FMA
on Haswell / Zen+, and the scalar loop elsewhere.

API additions (x86_64 only):

  - `pub enum Backend { Scalar, Avx2, Avx512 }`
  - `Backend::detect()` — widest supported on the host
  - `Backend::is_supported()` — check a specific variant
  - `ResonatorBank::backend()` — getter for the active backend
  - `ResonatorBank::set_backend(Backend) -> Result<(), Backend>` —
    override, errors if unsupported (useful for tests, or to avoid
    AVX-512 frequency throttling on sustained workloads)
  - `process_sample_scalar` (#[doc(hidden)]) — forces the scalar path
    regardless of `backend`, used by the bench to measure scalar
    throughput without the dispatch match in the way
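Hypothetical usage, e.g. from a test (construction elided):

    fn force_scalar(bank: &mut ResonatorBank) {
        assert_eq!(bank.backend(), Backend::detect()); // new() picked widest
        bank.set_backend(Backend::Scalar)
            .expect("Scalar is always supported");
        assert_eq!(bank.backend(), Backend::Scalar);
    }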

Dispatch is at the `process_sample` / `process_samples` boundary,
not inside the inner loops: the match runs once per sample (or
once per batch for block processing), and the branch is predictable
because `self.backend` is set once at construction. The inner
kernels stay `#[target_feature]`-gated and get inlined within their
respective arms by LLVM.

The bench adds a `bank/dispatch` group using the default (auto-
dispatched) API; the existing forced `bank/scalar`, `bank/avx2`,
`bank/avx512` groups stay for direct-backend comparison. Expected
delta between `bank/dispatch` and the forced backend matching
`Backend::detect()` is small (the dispatch match).

Tests:
  - `default_backend_is_widest_supported` — `new` picks `detect()`
  - `set_backend_scalar_always_ok`
  - `set_backend_unsupported_errors` — skipped if the host supports all backends
  - `dispatched_matches_forced_scalar` — end-to-end dispatch
    correctness across 33 bins × 1024 samples
