Skip to content

Commit a72bedb

Browse files
Perf/shm latency and compiler optimizations (#116)
* perf: SHM latency optimizations — close gap with reference C implementation - Add .cargo/config.toml with target-cpu=native for optimal codegen - Replace nix crate clock_gettime wrapper with direct libc call - Capture receive timestamp inside receive_blocking() immediately after condvar wake, matching C measurement point - Eliminate redundant zero-fill: vec![0u8;N] → Vec::with_capacity + set_len - Replace per-byte ring buffer copies with copy_nonoverlapping in blocking path - Add #[inline] hints to all hot-path SHM functions - Server loop uses transport-captured timestamp when available Reduces SHM direct mode mean latency by ~12.5% (23.4µs → 20.5µs), narrowing gap vs reference C benchmark from 25% to ~10%. Co-authored-by: Cursor <cursoragent@cursor.com> * perf: clarify receive timestamp comment in SHM-direct Update comment on the receive-side clock_gettime call to better describe why it's captured inside the mutex (matches reference C approach for accurate latency measurement). Co-authored-by: Cursor <cursoragent@cursor.com> * docs: add detailed PERF comments to all optimized code paths Document the rationale behind each optimization with inline comments: - .cargo/config.toml: explain target-cpu=native and portability note - mod.rs: explain direct libc vs nix crate clock_gettime, receive_time_ns field - shared_memory_direct.rs: send/receive timestamp placement, zero-fill elimination - shared_memory_blocking.rs: bulk copy_nonoverlapping vs byte-by-byte with before/after - shared_memory.rs: inline hints on ring buffer hot-path functions - main.rs: transport-level timestamp preference in both server loops Co-authored-by: Cursor <cursoragent@cursor.com> * style: fix cargo fmt formatting issues - Collapse short copy_nonoverlapping calls to single line in shared_memory_blocking.rs - Remove extra blank line in shared_memory_direct.rs Co-authored-by: Cursor <cursoragent@cursor.com> * fix: address PR review — conditional timestamp placement and clock_gettime error handling Move SHM-direct receive timestamp inside/outside mutex based on --send-delay: latency benchmarks (send-delay > 0) capture inside the mutex for accuracy matching the reference C implementation; throughput benchmarks (no send-delay) capture after mutex unlock to eliminate the 22-31% regression at small message sizes. The flag is derived automatically with no new user-facing CLI options. Add debug_assert! on all raw clock_gettime return values as cheap insurance against silent failures. Remove .cargo/config.toml (target-cpu=native) to restore binary portability across CPU variants. Co-authored-by: Cursor <cursoragent@cursor.com> * docs+test: add precise_timestamps tests, CPU-optimized build docs, and target-cpu=native rationale - Add 3 unit tests for BlockingSharedMemoryDirect::with_precise_timestamps(): constructor flag verification (true/false) and end-to-end receive with precise_timestamps=true exercising the inside-mutex timestamp code path - Add factory test verifying send_delay variants (None, ZERO, 10ms) are accepted when creating SHM-direct transports - Document SHM-direct conditional timestamp placement in README: adaptive inside/outside-mutex receive timestamp based on --send-delay, with latency vs throughput tradeoff explanation (22-31% regression context) - Document CPU-optimized builds in README: rationale for removing .cargo/config.toml (portability, CI cross-compilation risks across NXP S32G/Qualcomm Ride SX4/Renesas R-Car S4), on-target builds with RUSTFLAGS="-C target-cpu=native", per-platform cross-compile examples - Update CONFIG.md Rust Compiler Optimizations section with callout explaining why target-cpu=native must not be in repo-wide config, add cross-compile example and link to README - Fix pre-existing clippy lint: map_or -> is_some_and on send_delay wiring - All tests passing, clippy clean, cargo fmt applied AI-assisted-by: Claude Opus 4 (Anthropic) --------- Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 50db84c commit a72bedb

8 files changed

Lines changed: 436 additions & 61 deletions

File tree

CONFIG.md

Lines changed: 17 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -334,10 +334,26 @@ sudo sysctl -p
334334
### Application-Level Tuning
335335

336336
#### Rust Compiler Optimizations
337+
338+
> **Important:** This repo does **not** ship a `.cargo/config.toml` with
339+
> `target-cpu=native`. That setting would apply to every build —
340+
> including CI — producing non-portable binaries. When CI
341+
> cross-compiles for ARM on x86 runners, `target-cpu=native`
342+
> silently optimizes for the build host, not the target, and may
343+
> emit instructions the deployment CPU does not support (e.g.,
344+
> ARMv8.2+ on a Cortex-A53). Apply CPU tuning explicitly at build
345+
> time instead. See the README's
346+
> [CPU-Optimized Builds](README.md#cpu-optimized-builds) section
347+
> for per-platform examples.
348+
337349
```bash
338-
# Maximum optimization
350+
# On-target build (uses every instruction the local CPU supports)
339351
RUSTFLAGS="-C target-cpu=native -C opt-level=3" cargo build --release
340352

353+
# Cross-compile for a specific ARM CPU
354+
RUSTFLAGS="-C target-cpu=cortex-a53 -C opt-level=3" cargo build \
355+
--release --target aarch64-unknown-linux-gnu
356+
341357
# Link-time optimization
342358
RUSTFLAGS="-C lto=fat" cargo build --release
343359

README.md

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -205,6 +205,16 @@ send-wait-receive per message).
205205
3. **Client Side**: Timestamp captured after receiving response
206206
4. **Latency Calculation**: Total elapsed time from send to receive
207207

208+
#### SHM-Direct Conditional Timestamp Placement
209+
210+
The `--shm-direct` transport uses **adaptive receive timestamp placement** based on the test type, controlled automatically by `--send-delay`:
211+
212+
- **Latency-focused tests** (`--send-delay > 0`): The receive timestamp is captured **inside the mutex**, immediately after the condvar wake-up. This matches the reference C SHM implementation and excludes payload copy, allocation, and mutex unlock from measured latency (~5–10 µs savings). The send-delay between messages dwarfs any additional mutex contention.
213+
214+
- **Throughput-focused tests** (no `--send-delay`): The receive timestamp is captured **after the mutex unlock**, keeping the critical section minimal. This avoids a 22–31% throughput regression at small message sizes caused by the extra `clock_gettime` call inside the mutex.
215+
216+
This behavior is fully automatic — no additional CLI flags are needed. The `--send-delay` flag is sufficient to signal intent: if you're pacing messages for latency measurement, you get the most accurate timestamps; if you're saturating the pipe for throughput, you get maximum performance.
217+
208218
#### Streaming Output Columns
209219

210220
The per-message streaming output (JSON and CSV) contains the
@@ -358,6 +368,37 @@ cargo build --release
358368

359369
The optimized binary will be available at `target/release/ipc-benchmark`.
360370

371+
### CPU-Optimized Builds
372+
373+
By default, `cargo build --release` produces **portable binaries** that run on any CPU in the target architecture family (e.g., generic `aarch64`). This is intentional — the repo does not ship a `.cargo/config.toml` with `target-cpu=native` because that setting would silently affect every build, including CI, producing non-portable binaries that may use instructions unsupported on the deployment target.
374+
375+
This matters especially for cross-platform ARM development. If CI runs on AWS Graviton (Neoverse-N1) but the target is an NXP S32G (Cortex-A53), a `target-cpu=native` binary built on Graviton could use ARMv8.2+ instructions that the Cortex-A53 does not support, causing illegal-instruction crashes at runtime.
376+
377+
**When building directly on target hardware**, enable CPU-specific optimizations at build time:
378+
379+
```bash
380+
# On-target build: let the compiler use every instruction the local CPU supports
381+
RUSTFLAGS="-C target-cpu=native" cargo build --release
382+
```
383+
384+
**When cross-compiling in CI**, specify the exact CPU target per platform:
385+
386+
```bash
387+
# NXP S32G (Cortex-A53)
388+
RUSTFLAGS="-C target-cpu=cortex-a53" cargo build --release \
389+
--target aarch64-unknown-linux-gnu
390+
391+
# Qualcomm Ride SX4 (Cortex-A78AE) — use the closest supported LLVM target
392+
RUSTFLAGS="-C target-cpu=cortex-a78" cargo build --release \
393+
--target aarch64-unknown-linux-gnu
394+
395+
# Renesas R-Car S4 (Cortex-A76)
396+
RUSTFLAGS="-C target-cpu=cortex-a76" cargo build --release \
397+
--target aarch64-unknown-linux-gnu
398+
```
399+
400+
The performance-critical code paths in this project (timestamp placement, bulk copies, zero-fill elimination, direct `libc::clock_gettime`) are pure code optimizations that do not depend on `target-cpu`. They provide the bulk of the latency improvement regardless of CPU target. The `target-cpu` flag adds a smaller, incremental gain from SIMD auto-vectorization and instruction scheduling tuned for the specific microarchitecture.
401+
361402
### Quick Start
362403

363404
```bash

src/benchmark_blocking.rs

Lines changed: 22 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -477,6 +477,13 @@ impl BlockingBenchmarkRunner {
477477
.arg(self.config.pmq_priority.to_string());
478478
}
479479

480+
// Forward send-delay to server so SHM-direct can enable precise
481+
// (inside-mutex) timestamps for latency-focused benchmarks.
482+
if let Some(delay) = self.config.send_delay {
483+
let micros = delay.as_micros();
484+
cmd.arg("--send-delay").arg(format!("{micros}us"));
485+
}
486+
480487
// Add latency file path if provided (for true IPC measurement)
481488
if let Some(path) = latency_file_path {
482489
cmd.arg("--internal-latency-file").arg(path);
@@ -813,8 +820,11 @@ impl BlockingBenchmarkRunner {
813820
/// - `Ok(())`: Warmup completed successfully
814821
/// - `Err(anyhow::Error)`: Warmup failed
815822
fn run_warmup(&self, transport_config: &TransportConfig) -> Result<()> {
816-
let mut client_transport =
817-
BlockingTransportFactory::create(&self.mechanism, self.args.shm_direct)?;
823+
let mut client_transport = BlockingTransportFactory::create(
824+
&self.mechanism,
825+
self.args.shm_direct,
826+
self.config.send_delay,
827+
)?;
818828

819829
// --- Server Process Spawning ---
820830
let (mut server_process, mut pipe_reader) = self.spawn_server_process(transport_config)?;
@@ -1000,8 +1010,11 @@ impl BlockingBenchmarkRunner {
10001010
metrics_collector: &mut MetricsCollector,
10011011
mut results_manager: Option<&mut crate::results_blocking::BlockingResultsManager>,
10021012
) -> Result<()> {
1003-
let mut client_transport =
1004-
BlockingTransportFactory::create(&self.mechanism, self.args.shm_direct)?;
1013+
let mut client_transport = BlockingTransportFactory::create(
1014+
&self.mechanism,
1015+
self.args.shm_direct,
1016+
self.config.send_delay,
1017+
)?;
10051018

10061019
// Create a temporary file for server to write latencies
10071020
let latency_file_path = std::env::temp_dir()
@@ -1181,8 +1194,11 @@ impl BlockingBenchmarkRunner {
11811194
metrics_collector: &mut MetricsCollector,
11821195
mut results_manager: Option<&mut crate::results_blocking::BlockingResultsManager>,
11831196
) -> Result<()> {
1184-
let mut client_transport =
1185-
BlockingTransportFactory::create(&self.mechanism, self.args.shm_direct)?;
1197+
let mut client_transport = BlockingTransportFactory::create(
1198+
&self.mechanism,
1199+
self.args.shm_direct,
1200+
self.config.send_delay,
1201+
)?;
11861202

11871203
// --- Server Process Spawning ---
11881204
let (mut server_process, mut pipe_reader) = self.spawn_server_process(transport_config)?;

0 commit comments

Comments
 (0)