Package x86_64 releases via cargo-multivers (v1/v2/v4)

tfenne · tfenne · commit 2e971a99308f · 2026-04-21T15:50:48.000-06:00
Ship a single x86_64 binary that embeds three CPU-specific builds and
dispatches to the best match at startup, rather than picking one floor
and cutting off everyone below it. Variants:

  - x86-64    : SSE2 baseline, any 64-bit x86 CPU (2003+)
  - x86-64-v2 : SSE4.2 + POPCNT (2008+), captures nearly all of the codegen win
  - x86-64-v4 : AVX-512F/BW/CD/DQ/VL (2017+), ~1% additional win
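
For reviewers unfamiliar with the dispatch step, the selection rule for the
variants listed above can be sketched roughly as follows. This is an
illustration only, not cargo-multivers' actual code: `VARIANTS`,
`pick_variant`, and `detect_host_features` are hypothetical names, and the
feature lists are a simplified subset of each level (the real v2 level also
requires SSE3, SSSE3, SSE4.1 and CMPXCHG16B, and v4 implies everything in v3).

```rust
/// The three shipped variants, ordered from most to least demanding.
/// ASSUMPTION: each feature list is an illustrative subset of the real level.
const VARIANTS: &[(&str, &[&str])] = &[
    ("x86-64-v4", &["avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"]),
    ("x86-64-v2", &["sse4.2", "popcnt"]),
    ("x86-64", &[]), // SSE2 baseline: always eligible on x86_64
];

/// Returns the first (most demanding) variant whose required features all
/// appear in `host_features`, falling back to the baseline.
fn pick_variant(host_features: &[&str]) -> &'static str {
    VARIANTS
        .iter()
        .find(|(_, required)| required.iter().all(|f| host_features.contains(f)))
        .map(|(name, _)| *name)
        .unwrap_or("x86-64")
}

/// Probes the running CPU for the features used in `VARIANTS`.
#[cfg(target_arch = "x86_64")]
fn detect_host_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    if std::arch::is_x86_feature_detected!("sse4.2") { found.push("sse4.2"); }
    if std::arch::is_x86_feature_detected!("popcnt") { found.push("popcnt"); }
    if std::arch::is_x86_feature_detected!("avx512f") { found.push("avx512f"); }
    if std::arch::is_x86_feature_detected!("avx512bw") { found.push("avx512bw"); }
    if std::arch::is_x86_feature_detected!("avx512cd") { found.push("avx512cd"); }
    if std::arch::is_x86_feature_detected!("avx512dq") { found.push("avx512dq"); }
    if std::arch::is_x86_feature_detected!("avx512vl") { found.push("avx512vl"); }
    found
}
```

On an x86_64 host, `pick_variant(&detect_host_features())` names the variant
this sketch would select. A CPU with AVX2 but no AVX-512 lands on x86-64-v2,
since no v3 variant is shipped.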

v3 is skipped — our Granite Rapids bench showed v2 and v3 are within
measurement noise on chelae's workload. The historical "x86-64-v3 wins
6%" result is really a v1→v2 POPCNT/SSE4.2 win; v2→v3 contributes ~0.

Delta compression (gdelta+lz4) keeps the combined binary at ~3.7 MB vs
~2.9 MB for a single variant, with ~0.2 s added at startup for
decompression plus memfd_create + exec. Negligible for batch FASTQ work.

Infrastructure changes:
  - Cargo.toml: add [profile.dist] and [package.metadata.multivers.x86_64]
  - .cargo/config.toml is now the dev default (target-cpu=native)
  - .cargo/config-portable.toml is the release-time floor (plain x86-64),
    swapped in before `cargo multivers --profile dist`
  - Drop ensure_avx2_or_die from src/bin/main.rs: multivers only launches
    a variant the CPU can run, so the AVX2 probe would now wrongly reject
    the pre-AVX2 (v1/v2) users we're explicitly trying to support

aarch64 stays a single binary. Benchmarks showed Neoverse-specific tuning
gains 1-2% at most with near-zero cross-tuning penalty, and the
Graviton3→Graviton4 generational jump dwarfs tuning anyway. Multivers
infrastructure isn't justified on aarch64.

Release workflow changes (GH Actions) to come in a follow-up.
diff --git a/.cargo/config-portable.toml b/.cargo/config-portable.toml
@@ -0,0 +1,26 @@
+# Portable baseline config used by the release workflow for sanity builds and
+# as the fallback floor under `cargo multivers`. Swap this in place of
+# `.cargo/config.toml` before release packaging:
+#
+#   mv .cargo/config-portable.toml .cargo/config.toml
+#   cargo multivers --profile dist
+#
+# `x86-64` is the baseline AMD64 target: SSE2 only, runs on any 64-bit x86 CPU.
+# This is what we want for non-multivers sanity builds (`cargo build --profile
+# dist`, CI runners with unknown CPU features) — it will run anywhere. The
+# `cargo multivers` step then compiles additional v2 and v4 variants on top
+# per `[package.metadata.multivers.x86_64]` in Cargo.toml.
+#
+# aarch64 targets are untouched here. `cargo multivers` is x86_64-only per
+# our benchmarks (Neoverse-specific tuning gave only ~1-2% over generic
+# ARMv8-A, not worth the variant infrastructure). Apple Silicon / Graviton
+# release builds use a single binary with whatever target-cpu we settle on;
+# see the release workflow.
+[target.x86_64-unknown-linux-gnu]
+rustflags = ["-C", "target-cpu=x86-64"]
+
+[target.x86_64-unknown-linux-musl]
+rustflags = ["-C", "target-cpu=x86-64"]
+
+[target.x86_64-apple-darwin]
+rustflags = ["-C", "target-cpu=x86-64"]
diff --git a/.cargo/config.toml b/.cargo/config.toml
@@ -1,26 +1,6 @@
-# Baseline x86_64 release builds to x86-64-v3 (AVX2 + BMI1/BMI2 + FMA).
-#
-# AVX2 is present on effectively every x86_64 CPU shipped since 2013 (Intel
-# Haswell) / 2017 (AMD Zen) / 2021 (Intel Atom Gracemont+), and on essentially
-# 100% of cloud compute available today. Building to x86-64-v3 recovers ~6%
-# wall-time vs. the default x86-64 (SSE2) baseline on AVX2 hosts — per our
-# Granite Rapids EC2 benchmark — and costs nothing on them.
-#
-# Hosts without AVX2 (pre-Haswell Intel, pre-Zen AMD, pre-Gracemont Atom) will
-# SIGILL on illegal instruction. `src/bin/main.rs` runs a CPUID check at the
-# top of `main()` and prints a friendly error instead, though that check is
-# best-effort — if the Rust runtime emits AVX2 ops before `main()` runs, the
-# SIGILL comes first. Users on old hardware should rebuild with the default
-# portable baseline via `RUSTFLAGS="-C target-cpu=x86-64" cargo build --release`.
-#
-# aarch64 targets (Apple Silicon, AWS Graviton) are untouched and keep their
-# native NEON baseline.
-
-[target.x86_64-unknown-linux-gnu]
-rustflags = ["-C", "target-cpu=x86-64-v3"]
-
-[target.x86_64-unknown-linux-musl]
-rustflags = ["-C", "target-cpu=x86-64-v3"]
-
-[target.x86_64-apple-darwin]
-rustflags = ["-C", "target-cpu=x86-64-v3"]
+# Dev config: build for the local machine's full feature set so local runs
+# are maximally fast. Release artifacts are built via a separate pathway
+# (see `.cargo/config-portable.toml` and `[package.metadata.multivers.x86_64]`
+# in `Cargo.toml`); swap in the portable config before release packaging.
+[build]
+rustflags = ["-C", "target-cpu=native"]
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -50,11 +50,36 @@ Pinned to Rust 1.95 via `rust-toolchain.toml`. Format settings: `max_width = 100
 
 ## Build Targeting
 
-x86_64 release builds target `x86-64-v3` (AVX2 + BMI1/BMI2 + FMA) via `.cargo/config.toml`. This was benchmarked on an EC2 Granite Rapids instance and recovered ~6% wall time over the default x86-64 (SSE2) baseline on AVX2-capable hardware; it costs nothing on hardware that has AVX2. AVX2 is universal on anything ≥ 2013 Intel Haswell / ≥ 2017 AMD Zen / ≥ 2021 Intel Gracemont Atom, which covers essentially all cloud compute.
+### x86_64: cargo-multivers
 
-For older hardware (pre-Haswell Intel etc.) users should rebuild with `RUSTFLAGS="-C target-cpu=x86-64" cargo build --release`. If an AVX2-built binary is nonetheless run on a non-AVX2 CPU, `src/bin/main.rs`'s `ensure_avx2_or_die` probes CPUID before any SIMD code path runs and exits with a friendly error rather than `SIGILL`.
+x86_64 release binaries are packaged via [`cargo-multivers`](https://github.com/ronnychevalier/cargo-multivers) into a single launcher that embeds three CPU-specific builds and dispatches to the best match at startup. See `[package.metadata.multivers.x86_64]` in `Cargo.toml`:
 
-aarch64 targets (Apple Silicon, AWS Graviton) keep their native NEON baseline.
+```toml
+cpus = ["x86-64", "x86-64-v2", "x86-64-v4"]
+```
+
+- `x86-64` — SSE2 baseline, any 64-bit x86 (2003+)
+- `x86-64-v2` — SSE4.2 + POPCNT (2008+). Captures nearly all of the scalar codegen win
+- `x86-64-v4` — AVX-512F/BW/CD/DQ/VL (2017+ server / 2022+ consumer). ~1% additional win
+
+We intentionally skip `x86-64-v3`: our Granite Rapids benchmark showed v2 and v3 within measurement noise on chelae's workload. The historical "x86-64-v3 wins 6% over baseline" finding is actually a v1→v2 win; v2→v3 contributes ~0. Including v3 would bloat the binary without buying anything.
+
+Variants are compressed with `gdelta` delta encoding plus lz4. Total binary is ~3.7 MB (vs ~2.9 MB for a single-variant build). Startup adds ~0.2 s for decompression + `memfd_create + exec`, which is negligible for chelae's batch workload.
+
+The cargo-multivers runner sorts variants by feature count descending and picks the first match — so v4 runs on capable hardware, falling back to v2 on pre-AVX-512 systems and v1 on pre-SSE4.2 systems.
+
+### Two-config pattern
+
+- `.cargo/config.toml` — dev default, `target-cpu=native` (fastest local runs).
+- `.cargo/config-portable.toml` — release-time portable baseline (`target-cpu=x86-64`). Swap in via `mv` before `cargo multivers --profile dist`.
+
+### Release profile
+
+`[profile.dist]` in `Cargo.toml` inherits `release` with `incremental = false` for deterministic delta compression across multivers variants. Use `cargo multivers --profile dist` for x86_64 releases and `cargo build --profile dist` for aarch64 releases.
+
+### aarch64
+
+Single binary, no multivers. Benchmarks showed Neoverse-specific tuning yields only ~1-2% over generic ARMv8-A with near-zero cross-tuning penalty, so multivers infrastructure isn't justified. The generational upgrade (Graviton3 → Graviton4 = +24%) dwarfs any tuning delta anyway. Release build target-cpu is whatever we land on post-benchmarking; the dev default (`target-cpu=native`) is fine locally.
 
 ## Architecture
 
diff --git a/Cargo.toml b/Cargo.toml
@@ -19,6 +19,12 @@ categories = ["science"]
 lto = "fat"
 codegen-units = 1
 
+# Release packaging profile for cargo-multivers. Identical codegen to `release`
+# but with incremental off so delta-compression across variants is deterministic.
+[profile.dist]
+inherits = "release"
+incremental = false
+
 # Profile that matches release perf closely but keeps symbol names for samply/perf.
 # Use via `cargo build --profile bench-prof`.
 [profile.bench-prof]
@@ -27,6 +33,23 @@ lto = "thin"
 debug = 1
 strip = false
 
+# x86_64 release binaries are packaged via `cargo multivers` into a single
+# launcher that embeds three CPU variants and dispatches to the best match at
+# startup. The launcher is tiny (~1 KB) and the embedded variants are
+# delta-compressed (gdelta + lz4); total binary ~3.7 MB.
+#
+# Variants (order doesn't matter — cargo-multivers sorts by feature count at
+# link time; the runner picks the highest-feature match at startup):
+#   - x86-64    : baseline SSE2, runs on any 64-bit x86 CPU (2003+)
+#   - x86-64-v2 : SSE4.2 + POPCNT (2008+); captures most codegen win
+#   - x86-64-v4 : AVX-512F/BW/CD/DQ/VL (2017+ server / 2022+ consumer)
+#
+# We skip v3 because our workload shows no measurable delta between v2 and
+# v3 — the v1→v2 jump (POPCNT, SSE4.2) carries essentially all of the
+# historical "x86-64-v3 wins 6% over baseline" benefit.
+[package.metadata.multivers.x86_64]
+cpus = ["x86-64", "x86-64-v2", "x86-64-v4"]
+
 [lib]
 name = "chelae_lib"
 path = "src/lib/mod.rs"
diff --git a/README.md b/README.md
@@ -229,15 +229,17 @@ cargo build --release
 
 ## Build Targeting and Portability
 
-Release builds target `x86-64-v3` on x86_64 platforms (Linux GNU, Linux musl, macOS), which emits AVX2, BMI1/BMI2, FMA, etc. AVX2 is universal on anything Intel Haswell (2013) / AMD Zen (2017) / Intel Gracemont Atom (2021) or newer — effectively all cloud compute. aarch64 targets (Apple Silicon, AWS Graviton) keep their native NEON baseline.
+x86_64 release binaries ship as a single `cargo multivers` launcher that embeds three CPU-specific builds and picks the best match at startup:
 
-Older hardware can be accommodated by rebuilding with a portable baseline:
+- `x86-64` — SSE2 baseline, runs on any 64-bit x86 CPU (2003+)
+- `x86-64-v2` — SSE4.2 + POPCNT (2008+); captures nearly all of the historical "v3 wins 6%" codegen benefit
+- `x86-64-v4` — AVX-512F/BW/CD/DQ/VL for Ice Lake / Sapphire Rapids / Granite Rapids / Zen 4+
 
-```console
-RUSTFLAGS="-C target-cpu=x86-64" cargo build --release
-```
+The combined binary (launcher plus embedded variants) is ~3.7 MB and adds ~0.2 s of startup for decompression + `memfd_create + exec`. v3 is intentionally skipped: on chelae's workload v2 and v3 are within measurement noise, and v4 picks up what little additional win AVX-512 gives (~1% on our benchmarks).
+
+aarch64 release binaries (Apple Silicon, AWS Graviton, GCP Axion, Azure Cobalt) are a single build with generic ARMv8-A / NEON baseline. Benchmarks showed Neoverse-specific tuning yields only ~1-2% over generic and cross-tuning penalty is near zero, so multivers isn't worth the complexity on aarch64.
 
-If an AVX2-built binary is run on a non-AVX2 CPU, `chelae` will print an error and exit 1 at startup rather than crashing mid-run with `SIGILL`.
+For local development, `cargo build --release` uses `target-cpu=native` (see `.cargo/config.toml`) for fastest local runs.
 
 ## Developing
 
diff --git a/src/bin/main.rs b/src/bin/main.rs
@@ -1,7 +1,6 @@
 //! `chelae` — a FASTQ trimming and filtering toolkit. This file is the CLI entry
 //! point; it dispatches to the set of subcommands via the [`Command`] trait and
-//! `enum_dispatch`. Installs `mimalloc` as the global allocator and, on x86_64, runs
-//! a startup CPU check against the AVX2 baseline that `.cargo/config.toml` requires.
+//! `enum_dispatch`. Installs `mimalloc` as the global allocator.
 
 extern crate core;
 
@@ -40,52 +39,11 @@ enum Subcommand {
     Trim(Trim),
 }
 
-/// Best-effort guard against running an AVX2-compiled binary on a CPU that lacks
-/// AVX2 (pre-2013 Intel Haswell / pre-2017 AMD Zen / pre-2021 Intel Gracemont
-/// Atom). The crate's `.cargo/config.toml` targets `x86-64-v3` for x86_64
-/// release builds, which emits AVX2 instructions throughout the binary; hitting
-/// one of those on an older CPU produces `SIGILL` with no explanation.
-///
-/// This runs at the top of `main()` and probes CPUID directly (the `cpuid`
-/// instruction carries no SIMD baggage, so the check itself is safe on any
-/// x86_64 CPU). If AVX2 is missing we print a clear message and exit 1.
-///
-/// **Caveat:** the guard only covers code that runs after `main()` starts. If
-/// the Rust runtime emits AVX2 ops during pre-main startup (TLS setup, allocator
-/// init, etc.), the `SIGILL` will still beat us. In practice Rust's pre-main
-/// path is small enough that we don't observe this, but we can't guarantee it.
-#[cfg(target_arch = "x86_64")]
-fn ensure_avx2_or_die() {
-    // `__cpuid`/`__cpuid_count` were stabilized as safe in Rust 1.89 — the `cpuid`
-    // instruction is unconditionally available on every x86_64 CPU and has no memory
-    // side effects.
-    use std::arch::x86_64::{__cpuid, __cpuid_count};
-    // Leaf 1: ECX bit 27 = OSXSAVE (OS supports xsave, required to use YMM state);
-    // bit 28 = AVX. Leaf 7 sub-leaf 0: EBX bit 5 = AVX2.
-    let l1 = __cpuid(1);
-    let osxsave = (l1.ecx >> 27) & 1 != 0;
-    let avx = (l1.ecx >> 28) & 1 != 0;
-    let l7 = __cpuid_count(7, 0);
-    let avx2 = (l7.ebx >> 5) & 1 != 0;
-    if !(osxsave && avx && avx2) {
-        eprintln!(
-            "error: this chelae binary was built for x86-64-v3 (AVX2+) but this\n\
-             CPU does not report AVX2 support. Required features: AVX, AVX2,\n\
-             OSXSAVE. Rebuild from source with a portable baseline:\n\
-             \n\
-             \tRUSTFLAGS=\"-C target-cpu=x86-64\" cargo build --release\n"
-        );
-        std::process::exit(1);
-    }
-}
-
-/// Process entry point. Runs the AVX2 guard (x86_64 only), initializes env_logger with
-/// a default of `info` level, parses argv, and dispatches to the selected subcommand's
-/// `execute()`. Errors propagate out as `anyhow::Error` for the runtime's default
-/// `eprintln!` + nonzero-exit handling.
+/// Process entry point. Initializes env_logger with a default of `info` level,
+/// parses argv, and dispatches to the selected subcommand's `execute()`. Errors
+/// propagate out as `anyhow::Error` for the runtime's default `eprintln!` +
+/// nonzero-exit handling.
 fn main() -> Result<()> {
-    #[cfg(target_arch = "x86_64")]
-    ensure_avx2_or_die();
     env_logger::Builder::from_env(Env::default().default_filter_or("info")).init();
     let args: Args = Args::parse();
     args.subcommand.execute()