Commit 2e971a9

Package x86_64 releases via cargo-multivers (v1/v2/v4)
Ship a single x86_64 binary that embeds three CPU-specific builds and dispatches to the best match at startup, rather than picking one floor and cutting off everyone below it.

Variants:
- x86-64    : SSE2 baseline, any 64-bit x86 CPU (2003+)
- x86-64-v2 : SSE4.2 + POPCNT (2008+), captures nearly all codegen win
- x86-64-v4 : AVX-512F/BW/CD/DQ/VL (2017+), ~1% additional win

v3 is skipped — our Granite Rapids bench showed v2 and v3 are within measurement noise on chelae's workload. The historical "x86-64-v3 wins 6%" result is really a v1→v2 POPCNT/SSE4.2 win; v2→v3 contributes ~0.

Delta compression (gdelta+lz4) keeps the combined binary at ~3.7 MB vs ~2.9 MB for a single variant, with ~0.2 s added startup for decompress + memfd_create+exec. Irrelevant for batch FASTQ work.

Infrastructure changes:
- Cargo.toml: add [profile.dist] and [package.metadata.multivers.x86_64]
- .cargo/config.toml is now the dev default (target-cpu=native)
- .cargo/config-portable.toml is the release-time floor (plain x86-64), swapped in before `cargo multivers --profile dist`
- Drop ensure_avx2_or_die from main.rs — multivers picks the right variant by construction so the probe would now false-positive on the v1 users we're explicitly trying to support

aarch64 stays a single binary. Benchmarks showed Neoverse-specific tuning is 1-2% max with near-zero cross-tuning penalty, and the Graviton3→G4 generational jump dwarfs tuning anyway. Multivers infrastructure isn't justified on aarch64.

Release workflow changes (GH Actions) to come in a follow-up.
1 parent 2a76e4f commit 2e971a9

6 files changed

Lines changed: 96 additions & 82 deletions

.cargo/config-portable.toml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# Portable baseline config used by the release workflow for sanity builds and
+# as the fallback floor under `cargo multivers`. Swap this in place of
+# `.cargo/config.toml` before release packaging:
+#
+# mv .cargo/config-portable.toml .cargo/config.toml
+# cargo multivers --profile dist
+#
+# `x86-64` is the baseline AMD64 target: SSE2 only, runs on any 64-bit x86 CPU.
+# This is what we want for non-multivers sanity builds (`cargo build --profile
+# dist`, CI runners with unknown CPU features) — it will run anywhere. The
+# `cargo multivers` step then compiles additional v2 and v4 variants on top
+# per `[package.metadata.multivers.x86_64]` in Cargo.toml.
+#
+# aarch64 targets are untouched here. `cargo multivers` is x86_64-only per
+# our benchmarks (Neoverse-specific tuning gave only ~1-2% over generic
+# ARMv8-A, not worth the variant infrastructure). Apple Silicon / Graviton
+# release builds use a single binary with whatever target-cpu we settle on;
+# see the release workflow.
+[target.x86_64-unknown-linux-gnu]
+rustflags = ["-C", "target-cpu=x86-64"]
+
+[target.x86_64-unknown-linux-musl]
+rustflags = ["-C", "target-cpu=x86-64"]
+
+[target.x86_64-apple-darwin]
+rustflags = ["-C", "target-cpu=x86-64"]
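
Review note: since the release floor now depends on which config file is in place at build time, it may help to have a quick way to see what a given build was compiled with. The following is a minimal sketch, not part of this commit; `compiled_features` is a hypothetical helper name. It only reports compile-time `target_feature` flags, so a build made with `target-cpu=x86-64` should list `sse2` alone, while a `target-cpu=native` dev build will typically list more.

```rust
/// Hypothetical helper: lists which of a few x86 features were enabled at
/// compile time (i.e. by `-C target-cpu=...`), not what the host CPU supports.
fn compiled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(target_feature = "sse2") {
        enabled.push("sse2");
    }
    if cfg!(target_feature = "sse4.2") {
        enabled.push("sse4.2");
    }
    if cfg!(target_feature = "popcnt") {
        enabled.push("popcnt");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    if cfg!(target_feature = "avx512f") {
        enabled.push("avx512f");
    }
    enabled
}

fn main() {
    // On non-x86 targets (e.g. aarch64) this prints an empty list.
    println!("compiled-in target features: {:?}", compiled_features());
}
```

Running this under both configs is a cheap sanity check that the portable config was actually swapped in before packaging.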

.cargo/config.toml

Lines changed: 6 additions & 26 deletions
@@ -1,26 +1,6 @@
-# Baseline x86_64 release builds to x86-64-v3 (AVX2 + BMI1/BMI2 + FMA).
-#
-# AVX2 is present on effectively every x86_64 CPU shipped since 2013 (Intel
-# Haswell) / 2017 (AMD Zen) / 2021 (Intel Atom Gracemont+), and on essentially
-# 100% of cloud compute available today. Building to x86-64-v3 recovers ~6%
-# wall-time vs. the default x86-64 (SSE2) baseline on AVX2 hosts — per our
-# Granite Rapids EC2 benchmark — and costs nothing on them.
-#
-# Hosts without AVX2 (pre-Haswell Intel, pre-Zen AMD, pre-Gracemont Atom) will
-# SIGILL on illegal instruction. `src/bin/main.rs` runs a CPUID check at the
-# top of `main()` and prints a friendly error instead, though that check is
-# best-effort — if the Rust runtime emits AVX2 ops before `main()` runs, the
-# SIGILL comes first. Users on old hardware should rebuild with the default
-# portable baseline via `RUSTFLAGS="-C target-cpu=x86-64" cargo build --release`.
-#
-# aarch64 targets (Apple Silicon, AWS Graviton) are untouched and keep their
-# native NEON baseline.
-
-[target.x86_64-unknown-linux-gnu]
-rustflags = ["-C", "target-cpu=x86-64-v3"]
-
-[target.x86_64-unknown-linux-musl]
-rustflags = ["-C", "target-cpu=x86-64-v3"]
-
-[target.x86_64-apple-darwin]
-rustflags = ["-C", "target-cpu=x86-64-v3"]
+# Dev config: build for the local machine's full feature set so local runs
+# are maximally fast. Release artifacts are built via a separate pathway
+# (see `.cargo/config-portable.toml` and `[package.metadata.multivers.x86_64]`
+# in `Cargo.toml`); swap in the portable config before release packaging.
+[build]
+rustflags = ["-C", "target-cpu=native"]

CLAUDE.md

Lines changed: 28 additions & 3 deletions
@@ -50,11 +50,36 @@ Pinned to Rust 1.95 via `rust-toolchain.toml`. Format settings: `max_width = 100
 
 ## Build Targeting
 
-x86_64 release builds target `x86-64-v3` (AVX2 + BMI1/BMI2 + FMA) via `.cargo/config.toml`. This was benchmarked on an EC2 Granite Rapids instance and recovered ~6% wall time over the default x86-64 (SSE2) baseline on AVX2-capable hardware; it costs nothing on hardware that has AVX2. AVX2 is universal on anything ≥ 2013 Intel Haswell / ≥ 2017 AMD Zen / ≥ 2021 Intel Gracemont Atom, which covers essentially all cloud compute.
+### x86_64: cargo-multivers
 
-For older hardware (pre-Haswell Intel etc.) users should rebuild with `RUSTFLAGS="-C target-cpu=x86-64" cargo build --release`. If an AVX2-built binary is nonetheless run on a non-AVX2 CPU, `src/bin/main.rs`'s `ensure_avx2_or_die` probes CPUID before any SIMD code path runs and exits with a friendly error rather than `SIGILL`.
+x86_64 release binaries are packaged via [`cargo-multivers`](https://github.com/ronnychevalier/cargo-multivers) into a single launcher that embeds three CPU-specific builds and dispatches to the best match at startup. See `[package.metadata.multivers.x86_64]` in `Cargo.toml`:
 
-aarch64 targets (Apple Silicon, AWS Graviton) keep their native NEON baseline.
+```toml
+cpus = ["x86-64", "x86-64-v2", "x86-64-v4"]
+```
+
+- `x86-64` — SSE2 baseline, any 64-bit x86 (2003+)
+- `x86-64-v2` — SSE4.2 + POPCNT (2008+). Captures nearly all of the scalar codegen win
+- `x86-64-v4` — AVX-512F/BW/CD/DQ/VL (2017+ server / 2022+ consumer). ~1% additional win
+
+We intentionally skip `x86-64-v3`: our Granite Rapids benchmark showed v2 and v3 within measurement noise on chelae's workload. The historical "x86-64-v3 wins 6% over baseline" finding is actually a v1→v2 win; v2→v3 contributes ~0. Including v3 would bloat the binary without buying anything.
+
+Variants are delta-compressed (`gdelta`) + lz4. Total binary is ~3.7 MB (vs ~2.9 MB for a single-variant build). Startup adds ~0.2 s for decompression + `memfd_create + exec` — negligible for chelae's batch workload.
+
+The cargo-multivers runner sorts variants by feature count descending and picks the first match — so v4 runs on capable hardware, falling back to v2 on pre-AVX-512 systems and v1 on pre-SSE4.2 systems.
+
+### Two-config pattern
+
+- `.cargo/config.toml` — dev default, `target-cpu=native` (fastest local runs).
+- `.cargo/config-portable.toml` — release-time portable baseline (`target-cpu=x86-64`). Swap in via `mv` before `cargo multivers --profile dist`.
+
+### Release profile
+
+`[profile.dist]` in `Cargo.toml` inherits `release` with `incremental = false` for deterministic delta compression across multivers variants. Use `cargo multivers --profile dist` for x86_64 releases and `cargo build --profile dist` for aarch64 releases.
+
+### aarch64
+
+Single binary, no multivers. Benchmarks showed Neoverse-specific tuning yields only ~1-2% over generic ARMv8-A with near-zero cross-tuning penalty, so multivers infrastructure isn't justified. The generational upgrade (Graviton3 → Graviton4 = +24%) dwarfs any tuning delta anyway. Release build target-cpu is whatever we land on post-benchmarking; the dev default (`target-cpu=native`) is fine locally.
 
 ## Architecture
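
Review note: the selection rule described in the new CLAUDE.md text (sort variants by feature count, descending, then take the first one the host satisfies) can be modelled in a few lines. This is an illustrative sketch only and is not cargo-multivers' actual code; `Variant`, `embedded_variants`, and `pick_variant` are hypothetical names, and the feature lists are abbreviated stand-ins for the full microarchitecture-level definitions.

```rust
/// Hypothetical model of one embedded build: a name plus the features it needs.
struct Variant {
    name: &'static str,
    required: &'static [&'static str],
}

// Abbreviated feature sets, for illustration only.
const V1: &[&str] = &["sse2"];
const V2: &[&str] = &["sse2", "sse3", "ssse3", "sse4.1", "sse4.2", "popcnt"];
const V4: &[&str] = &[
    "sse2", "sse3", "ssse3", "sse4.1", "sse4.2", "popcnt",
    "avx", "avx2", "avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl",
];

fn embedded_variants() -> Vec<Variant> {
    vec![
        Variant { name: "x86-64", required: V1 },
        Variant { name: "x86-64-v2", required: V2 },
        Variant { name: "x86-64-v4", required: V4 },
    ]
}

/// Sorts by required-feature count (descending) and returns the first variant
/// whose requirements are all present in `host`.
fn pick_variant<'a>(variants: &'a [Variant], host: &[&str]) -> Option<&'a Variant> {
    let mut sorted: Vec<&Variant> = variants.iter().collect();
    sorted.sort_by(|a, b| b.required.len().cmp(&a.required.len()));
    sorted
        .into_iter()
        .find(|v| v.required.iter().all(|f| host.contains(f)))
}

fn main() {
    let variants = embedded_variants();
    // A host with SSE4.2 + POPCNT but no AVX-512 falls through v4 to v2.
    let host = ["sse2", "sse3", "ssse3", "sse4.1", "sse4.2", "popcnt"];
    match pick_variant(&variants, &host) {
        Some(v) => println!("selected {}", v.name),
        None => println!("no compatible variant"),
    }
}
```

Because the sort key is the feature count, the declaration order of `cpus` does not affect the result, which matches the comment in `Cargo.toml`.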

Cargo.toml

Lines changed: 23 additions & 0 deletions
@@ -19,6 +19,12 @@ categories = ["science"]
 lto = "fat"
 codegen-units = 1
 
+# Release packaging profile for cargo-multivers. Identical codegen to `release`
+# but with incremental off so delta-compression across variants is deterministic.
+[profile.dist]
+inherits = "release"
+incremental = false
+
 # Profile that matches release perf closely but keeps symbol names for samply/perf.
 # Use via `cargo build --profile bench-prof`.
 [profile.bench-prof]
@@ -27,6 +33,23 @@ lto = "thin"
 debug = 1
 strip = false
 
+# x86_64 release binaries are packaged via `cargo multivers` into a single
+# launcher that embeds three CPU variants and dispatches to the best match at
+# startup. The launcher is tiny (~1 KB) and the embedded variants are
+# delta-compressed (gdelta + lz4); total binary ~3.7 MB.
+#
+# Variants (order doesn't matter — cargo-multivers sorts by feature count at
+# link time; the runner picks the highest-feature match at startup):
+# - x86-64 : baseline SSE2, runs on any 64-bit x86 CPU (2003+)
+# - x86-64-v2 : SSE4.2 + POPCNT (2008+); captures most codegen win
+# - x86-64-v4 : AVX-512F/BW/CD/DQ/VL (2017+ server / 2022+ consumer)
+#
+# We skip v3 because our workload shows no measurable delta between v2 and
+# v3 — the v1→v2 jump (POPCNT, SSE4.2) carries essentially all of the
+# historical "x86-64-v3 wins 6% over baseline" benefit.
+[package.metadata.multivers.x86_64]
+cpus = ["x86-64", "x86-64-v2", "x86-64-v4"]
+
 [lib]
 name = "chelae_lib"
 path = "src/lib/mod.rs"
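
Review note: the size figures quoted in the comment above imply a fairly small cost for the two extra variants. The arithmetic below is a rough sketch using only the approximate numbers from this commit (~3.7 MB combined, ~2.9 MB single variant). The "naive" figure assumes all three variants are about the same size and simply concatenated; that is an assumption for comparison, not a measurement.

```rust
// Sizes in units of 0.1 MB so the integer arithmetic stays exact.
const SINGLE_VARIANT: u32 = 29; // ~2.9 MB, from the commit message
const COMBINED: u32 = 37; // ~3.7 MB, from the commit message
const VARIANT_COUNT: u32 = 3;

/// Extra size paid for embedding the additional variants.
fn overhead_tenths() -> u32 {
    COMBINED - SINGLE_VARIANT
}

/// Size if the variants were concatenated uncompressed
/// (assumes each is roughly the size of a single-variant build).
fn naive_tenths() -> u32 {
    SINGLE_VARIANT * VARIANT_COUNT
}

fn main() {
    println!(
        "delta-compressed overhead: ~{:.1} MB",
        f64::from(overhead_tenths()) / 10.0
    );
    println!(
        "naive concatenation: ~{:.1} MB",
        f64::from(naive_tenths()) / 10.0
    );
}
```

Under these assumptions the two extra variants cost roughly 0.8 MB in total, against about 8.7 MB for three uncompressed copies.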

README.md

Lines changed: 8 additions & 6 deletions
@@ -229,15 +229,17 @@ cargo build --release
 
 ## Build Targeting and Portability
 
-Release builds target `x86-64-v3` on x86_64 platforms (Linux GNU, Linux musl, macOS), which emits AVX2, BMI1/BMI2, FMA, etc. AVX2 is universal on anything Intel Haswell (2013) / AMD Zen (2017) / Intel Gracemont Atom (2021) or newer — effectively all cloud compute. aarch64 targets (Apple Silicon, AWS Graviton) keep their native NEON baseline.
+x86_64 release binaries ship as a single `cargo multivers` launcher that embeds three CPU-specific builds and picks the best match at startup:
 
-Older hardware can be accommodated by rebuilding with a portable baseline:
+- `x86-64` — SSE2 baseline, runs on any 64-bit x86 CPU (2003+)
+- `x86-64-v2` — SSE4.2 + POPCNT (2008+); captures nearly all of the historical "v3 wins 6%" codegen benefit
+- `x86-64-v4` — AVX-512F/BW/CD/DQ/VL for Ice Lake / Sapphire Rapids / Granite Rapids / Zen 4+
 
-```console
-RUSTFLAGS="-C target-cpu=x86-64" cargo build --release
-```
+The launcher is ~3.7 MB total and adds ~0.2 s of startup for decompression + `memfd_create + exec`. v3 is intentionally skipped — on chelae's workload v2 and v3 are within measurement noise, and v4 picks up what little additional win AVX-512 gives (~1% on our benchmarks).
+
+aarch64 release binaries (Apple Silicon, AWS Graviton, GCP Axion, Azure Cobalt) are a single build with generic ARMv8-A / NEON baseline. Benchmarks showed Neoverse-specific tuning yields only ~1-2% over generic and cross-tuning penalty is near zero, so multivers isn't worth the complexity on aarch64.
 
-If an AVX2-built binary is run on a non-AVX2 CPU, `chelae` will print an error and exit 1 at startup rather than crashing mid-run with `SIGILL`.
+For local development, `cargo build --release` uses `target-cpu=native` (see `.cargo/config.toml`) for fastest local runs.
 
 ## Developing

src/bin/main.rs

Lines changed: 5 additions & 47 deletions
@@ -1,7 +1,6 @@
 //! `chelae` — a FASTQ trimming and filtering toolkit. This file is the CLI entry
 //! point; it dispatches to the set of subcommands via the [`Command`] trait and
-//! `enum_dispatch`. Installs `mimalloc` as the global allocator and, on x86_64, runs
-//! a startup CPU check against the AVX2 baseline that `.cargo/config.toml` requires.
+//! `enum_dispatch`. Installs `mimalloc` as the global allocator.
 
 extern crate core;
 
@@ -40,52 +39,11 @@ enum Subcommand {
     Trim(Trim),
 }
 
-/// Best-effort guard against running an AVX2-compiled binary on a CPU that lacks
-/// AVX2 (pre-2013 Intel Haswell / pre-2017 AMD Zen / pre-2021 Intel Gracemont
-/// Atom). The crate's `.cargo/config.toml` targets `x86-64-v3` for x86_64
-/// release builds, which emits AVX2 instructions throughout the binary; hitting
-/// one of those on an older CPU produces `SIGILL` with no explanation.
-///
-/// This runs at the top of `main()` and probes CPUID directly (the `cpuid`
-/// instruction carries no SIMD baggage, so the check itself is safe on any
-/// x86_64 CPU). If AVX2 is missing we print a clear message and exit 1.
-///
-/// **Caveat:** the guard only covers code that runs after `main()` starts. If
-/// the Rust runtime emits AVX2 ops during pre-main startup (TLS setup, allocator
-/// init, etc.), the `SIGILL` will still beat us. In practice Rust's pre-main
-/// path is small enough that we don't observe this, but we can't guarantee it.
-#[cfg(target_arch = "x86_64")]
-fn ensure_avx2_or_die() {
-    // `__cpuid`/`__cpuid_count` were stabilized as safe in Rust 1.89 — the `cpuid`
-    // instruction is unconditionally available on every x86_64 CPU and has no memory
-    // side effects.
-    use std::arch::x86_64::{__cpuid, __cpuid_count};
-    // Leaf 1: ECX bit 27 = OSXSAVE (OS supports xsave, required to use YMM state);
-    // bit 28 = AVX. Leaf 7 sub-leaf 0: EBX bit 5 = AVX2.
-    let l1 = __cpuid(1);
-    let osxsave = (l1.ecx >> 27) & 1 != 0;
-    let avx = (l1.ecx >> 28) & 1 != 0;
-    let l7 = __cpuid_count(7, 0);
-    let avx2 = (l7.ebx >> 5) & 1 != 0;
-    if !(osxsave && avx && avx2) {
-        eprintln!(
-            "error: this chelae binary was built for x86-64-v3 (AVX2+) but this\n\
-             CPU does not report AVX2 support. Required features: AVX, AVX2,\n\
-             OSXSAVE. Rebuild from source with a portable baseline:\n\
-             \n\
-             \tRUSTFLAGS=\"-C target-cpu=x86-64\" cargo build --release\n"
-        );
-        std::process::exit(1);
-    }
-}
-
-/// Process entry point. Runs the AVX2 guard (x86_64 only), initializes env_logger with
-/// a default of `info` level, parses argv, and dispatches to the selected subcommand's
-/// `execute()`. Errors propagate out as `anyhow::Error` for the runtime's default
-/// `eprintln!` + nonzero-exit handling.
+/// Process entry point. Initializes env_logger with a default of `info` level,
+/// parses argv, and dispatches to the selected subcommand's `execute()`. Errors
+/// propagate out as `anyhow::Error` for the runtime's default `eprintln!` +
+/// nonzero-exit handling.
 fn main() -> Result<()> {
-    #[cfg(target_arch = "x86_64")]
-    ensure_avx2_or_die();
     env_logger::Builder::from_env(Env::default().default_filter_or("info")).init();
     let args: Args = Args::parse();
     args.subcommand.execute()
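
Review note: with `ensure_avx2_or_die` gone, there is no longer anything in the binary that reports what the host CPU supports. If that is ever wanted for debugging (for example, to confirm which variant the launcher should have chosen), the standard library's `is_x86_feature_detected!` macro is a simpler route than raw CPUID. The sketch below is not part of this commit; `Level` and `host_level` are hypothetical names, and the feature checks are a simplification that omits features the macro does not expose or that I have left out for brevity (for example LAHF/SAHF and MOVBE).

```rust
/// Hypothetical classification matching the three shipped variants.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    V1,
    V2,
    V4,
}

/// Returns the highest shipped level the host appears to support,
/// or `None` when not running on x86_64.
#[cfg(target_arch = "x86_64")]
fn host_level() -> Option<Level> {
    use std::arch::is_x86_feature_detected;

    let v2 = is_x86_feature_detected!("sse3")
        && is_x86_feature_detected!("ssse3")
        && is_x86_feature_detected!("sse4.1")
        && is_x86_feature_detected!("sse4.2")
        && is_x86_feature_detected!("popcnt")
        && is_x86_feature_detected!("cmpxchg16b");

    // x86-64-v4 builds on v3, so the v3 feature set is checked as well.
    let v3 = v2
        && is_x86_feature_detected!("avx")
        && is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("bmi1")
        && is_x86_feature_detected!("bmi2")
        && is_x86_feature_detected!("f16c")
        && is_x86_feature_detected!("fma")
        && is_x86_feature_detected!("lzcnt");

    let v4 = v3
        && is_x86_feature_detected!("avx512f")
        && is_x86_feature_detected!("avx512bw")
        && is_x86_feature_detected!("avx512cd")
        && is_x86_feature_detected!("avx512dq")
        && is_x86_feature_detected!("avx512vl");

    Some(if v4 {
        Level::V4
    } else if v2 {
        Level::V2
    } else {
        Level::V1
    })
}

#[cfg(not(target_arch = "x86_64"))]
fn host_level() -> Option<Level> {
    None
}

fn main() {
    match host_level() {
        Some(level) => println!("host supports up to {:?}", level),
        None => println!("not an x86_64 host"),
    }
}
```

A v3-only host (AVX2 but no AVX-512) is reported as `V2` here, mirroring the fact that no v3 variant is shipped.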
