docs: add hardware-acceleration section calling out AMD Ryzen / EPYC

kiritigowda · cursoragent · kiritigowda · commit ecc868ff9cb2 · 2026-05-09T16:12:34.000-07:00
The Cargo features table now covers both crates that ship SIMD
kernels (openvx-core for the FFI graph executor's vxAdd / vxSubtract
/ vxBox3x3 / vxGaussian3x3 / vxColorConvert callbacks, and
openvx-vision for the direct Rust-API kernels). The example build
command is updated to use the matching pair on both crates so the
FFI graph path and the Rust API path both pick up the SIMD kernels.

A new "Hardware acceleration" section calls out that rustVX is tuned
and validated on AMD Ryzen / EPYC (Zen 2+, i.e. Ryzen 3000 and newer,
plus the matching EPYC Rome / Milan / Genoa parts) — the same family
GitHub's Linux runners run on, and the silicon every benchmark number
in this README's "Benchmark &amp; compare" job summary was measured on.
The note also documents:

  * The runtime AVX2 -&gt; SSE2 -&gt; scalar dispatch via
    is_x86_feature_detected!, in openvx-core::simd_kernels (FFI path)
    and openvx-vision::x86_64_simd (Rust path).
  * The recommended `RUSTFLAGS=-C target-cpu=x86-64-v3` plus matching
    --features pair for a manual build, matching exactly what the CI
    workflow emits when it auto-detects an AVX2 host.
  * That the same fast paths run at parity on any AVX2-capable Intel
    CPU (Haswell+); AMD is called out because it's the validation
    silicon, not the only supported silicon.
  * Fallbacks: SSE2-only x86_64 hosts pin to x86-64-v2, AArch64 hosts
    take the neon path, and feature-less builds use the slice-iter
    scalar path.

Co-authored-by: Cursor &lt;cursoragent@cursor.com&gt;
diff --git a/README.md b/README.md
@@ -74,21 +74,46 @@ The standard OpenVX 1.3 C headers are bundled in [`include/VX/`](include/VX/) an
 
 ### Cargo features
 
-The vision kernel crate exposes opt-in performance features:
+Both `openvx-core` (host of the C-API kernel callbacks the OpenVX graph executor invokes) and `openvx-vision` (host of the public Rust API kernels) expose a matching opt-in feature set:
 
 | Feature | Effect |
 |---------|--------|
 | `simd` | Enables architecture-neutral SIMD code paths |
 | `sse2` / `avx2` | x86_64 SIMD back-ends (imply `simd`) |
 | `neon` | AArch64 SIMD back-end (implies `simd`) |
-| `parallel` | Enables Rayon-based multi-threaded kernels |
+| `parallel` (`openvx-vision` only) | Enables Rayon-based multi-threaded kernels |
 
-Build with one or more features, e.g.:
+Build with the matching pair on each crate so the FFI graph path and the direct Rust API path both pick up the SIMD kernels:
 
 ```bash
-cargo build --release -p openvx-ffi --features "openvx-vision/avx2 openvx-vision/parallel"
+cargo build --release -p openvx-ffi \
+  --features "openvx-core/sse2 openvx-core/avx2 openvx-vision/sse2 openvx-vision/avx2"
 ```
 
+### Hardware acceleration
+
+rustVX is **tuned and validated on modern AMD x86_64 silicon** — specifically AMD Ryzen (Zen 2+, *3000-series and newer*) and the matching AMD EPYC (Rome / Milan / Genoa) server parts. CI runs against the AMD EPYC 7763 (Milan) and EPYC 9V74 (Genoa) hosts in GitHub's Linux runner pool, and every published benchmark number in the *Benchmark & compare* job summary comes from those AMD CPUs.
+
+#### Dispatch
+
+Runtime SIMD selection happens via `is_x86_feature_detected!` inside `openvx-core::simd_kernels` (FFI graph path: `vxAdd`, `vxSubtract`, `vxBox3x3`, `vxGaussian3x3`, `vxColorConvert`) and `openvx-vision::x86_64_simd` (direct Rust API). The ordering on x86_64 is **AVX2 → SSE2 → tight scalar slice loop**. Builds without the SIMD features (or non-x86_64 / non-aarch64 targets) compile only the scalar fallback, which is itself slice-iter-based and ~50× faster than naive per-pixel kernel implementations.
+
+#### Recommended build for AMD Ryzen / EPYC
+
+On any Zen 2+ host (Ryzen 3000, 4000, 5000, 7000, 9000; Threadripper 3000+; EPYC Rome, Milan, Genoa), the project's CI workflow auto-detects host CPU flags and emits the right combination, but for a manual build the recipe is:
+
+```bash
+RUSTFLAGS="-C target-cpu=x86-64-v3" \
+  cargo build --release -p openvx-ffi \
+    --features "openvx-core/sse2 openvx-core/avx2 \
+                openvx-vision/sse2 openvx-vision/avx2"
+```
+
+`-C target-cpu=x86-64-v3` is the portable microarch level that matches every Zen 2+ Ryzen / EPYC and every Intel Haswell+ — it gives the compiler licence to auto-vectorize with SSE4.2 / AVX / AVX2 / BMI1+2 / FMA / F16C in the rest of the workspace, on top of the hand-tuned intrinsic kernels gated by the `sse2` / `avx2` Cargo features. This is the exact configuration the `.github/workflows/conformance.yml` build job uses for the published benchmark numbers.
+
+> [!NOTE]
+> The same fast paths run on any AVX2-capable Intel CPU (Haswell and later) at parity — the `is_x86_feature_detected!` dispatch doesn't read the vendor string, only the feature flags. AMD Ryzen / EPYC is called out here because it's the silicon the CI runs on and the silicon every benchmark figure in this README and the *Benchmark & compare* job summary was measured on. On older x86_64 silicon (pre-AVX2), the build automatically selects the SSE2 path and `-C target-cpu=x86-64-v2`. AArch64 hosts (Apple Silicon, AWS Graviton, etc.) take the `openvx-*/neon` path with no `target-cpu` override.
+
 ## Using rustVX from a C application
 
 `libopenvx_ffi` exports the full `vx*` / `vxu*` symbol set defined by the standard OpenVX headers, so existing OpenVX code links against it with no source changes. A minimal example: