CPU affinity settings reduce latency jitter in benchmark measurements.
The CPU affinity system partitions physical cores between LoadGen (main process) and Workers. Each process gets all hyperthreads (SMT siblings) of its assigned physical cores to prevent cross-process cache thrashing.
Key concepts:
- Physical core isolation: LoadGen and workers never share physical cores
- Hyperthread grouping: Each process gets all logical CPUs of its physical cores
- Performance-based ranking: Fastest cores assigned to LoadGen first
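To make the hyperthread-grouping idea concrete, here is a minimal sketch (not the library's actual implementation) that groups logical CPUs by physical core using the Linux sysfs topology files and pins the current process to all SMT siblings of a chosen set of physical cores:

```python
import os
from collections import defaultdict

def physical_core_groups() -> dict[tuple[int, int], list[int]]:
    """Group logical CPUs by physical core via Linux sysfs topology.

    Returns {(package_id, core_id): [logical_cpu, ...]} so all SMT siblings
    of a physical core can be assigned to one process together.
    """
    groups: dict[tuple[int, int], list[int]] = defaultdict(list)
    for cpu in sorted(os.sched_getaffinity(0)):
        base = f"/sys/devices/system/cpu/cpu{cpu}/topology"
        with open(f"{base}/physical_package_id") as f:
            pkg = int(f.read())
        with open(f"{base}/core_id") as f:
            core = int(f.read())
        groups[(pkg, core)].append(cpu)
    return groups

def pin_to_physical_cores(core_keys, groups) -> None:
    """Pin the current process to every SMT sibling of the given physical cores."""
    cpus = {cpu for key in core_keys for cpu in groups[key]}
    os.sched_setaffinity(0, cpus)  # Linux-only

if __name__ == "__main__":
    groups = physical_core_groups()
    # Example: give this process the first two physical cores (all their hyperthreads)
    pin_to_physical_cores(list(groups)[:2], groups)
    print("pinned to logical CPUs:", sorted(os.sched_getaffinity(0)))
```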
| Setting | Location | Default | Purpose |
|---|---|---|---|
| `cpu_affinity` | Top-level | `-1` | Pin loadgen and worker processes to CPU cores |
Values:
- `-1` (auto): Physical core isolation with SMT siblings, fastest cores to loadgen
- `list[int]`: Use specific cores (shared by loadgen and workers)
- `null`: Disabled
```yaml
cpu_affinity: -1                             # Auto: physical core isolation with SMT siblings
# cpu_affinity: [4, 5, 6, 7, 8, 9, 10, 11]   # Explicit cores
# cpu_affinity: null                         # Disabled
```

Auto mode allocation (default 6 physical cores for loadgen):
- 1 core: Session thread (scheduler, busy-wait timing)
- 1 core: Event loop thread (uvloop, response handling)
- 4 cores: ZMQ I/O threads
- Remaining physical cores: Workers (one per core with all SMT siblings)
- Linux only: Uses `os.sched_setaffinity()` and sysfs for topology detection
- Non-Linux: Affinity settings are skipped with a warning
- Performance ranking: Uses ACPI CPPC `highest_perf`, ARM `cpu_capacity`, or `cpuinfo_max_freq` (in order of preference)
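As a rough illustration of that fallback chain (an assumed approach, not the library's code), the sketch below ranks logical CPUs by reading those sysfs files in order of preference:

```python
from pathlib import Path

# Per-CPU sysfs files consulted in order of preference (Linux only).
# In practice the same source should be used for every CPU so scores are
# comparable; the per-CPU fallback here is just for brevity.
_RANK_FILES = (
    "acpi_cppc/highest_perf",    # ACPI CPPC highest performance value
    "cpu_capacity",              # ARM capacity value
    "cpufreq/cpuinfo_max_freq",  # Fallback: maximum frequency
)

def cpu_perf_score(cpu: int) -> int:
    """Return a relative performance score for one logical CPU, or 0 if unknown."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}")
    for rel in _RANK_FILES:
        try:
            return int((base / rel).read_text())
        except (FileNotFoundError, ValueError, PermissionError):
            continue
    return 0

def rank_cpus(cpus):
    """Sort logical CPUs fastest-first so the best cores go to LoadGen."""
    return sorted(cpus, key=cpu_perf_score, reverse=True)

if __name__ == "__main__":
    import os
    print(rank_cpus(os.sched_getaffinity(0)))
```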
Optimal worker count depends on your workload — prompt size, streaming mode, and connection count all affect throughput. Use the benchmark script to sweep worker counts against your expected prompt lengths and pick the configuration that maximizes recv rate.
```bash
python -m inference_endpoint.utils.benchmark_httpclient --full -d 5
python -m inference_endpoint.utils.benchmark_httpclient --full -d 5 --stream
```

Runs all common worker counts against a range of prompt lengths (CPU pinning is on by default). Produces a plot at `/tmp/sweep_*.png` showing send/recv rate per configuration, with shaded variation bands and a stall% overlay.
With --stream, the full sweep also varies stream interval (0%, 50%, 100% of prompt length) and adds an SSE-pkts/s subplot. Streaming typically requires more workers to sustain the same recv rate because each response involves many SSE events that must be parsed individually.
```bash
# Sweep workers for a specific prompt length
python -m inference_endpoint.utils.benchmark_httpclient -w 1:16 -l 4096 -d 10

# Sweep workers with explicit values
python -m inference_endpoint.utils.benchmark_httpclient -w 1,2,4,8,12,16 -l 4096 -d 10

# Cartesian product: workers x prompt lengths
python -m inference_endpoint.utils.benchmark_httpclient -w 1:16::8 -l 128,1024,8192 -d 5

# Streaming: sweep workers with a fixed stream interval (chars per SSE event)
python -m inference_endpoint.utils.benchmark_httpclient -w 1:16 -l 4096 --stream --stream-interval 100 -d 5

# Streaming: sweep stream intervals (total events = ceil(output_length / interval))
python -m inference_endpoint.utils.benchmark_httpclient -w 8 --stream --stream-interval 1,50,500 -d 5
```

- Send Rate: requests/s the client can issue. Higher is better.
- Recv Rate: responses/s received. This is the effective throughput.
- SSE-pkts/s: SSE events received per second (streaming mode only). Derived from `recv_rate * events_per_response`. Use this to gauge how the client handles high packet rates at different stream intervals.
- Stall%: fraction of send time spent blocked on back-pressure (inflight limit). High stall% indicates client-side overhead: the client can't process responses fast enough to make room for new sends. The target server (MaxThroughputServer) returns pre-built responses with no compute, so stall is purely client overhead.
- Variation bands: shaded region shows min/max per-second rate during each run. Wide bands indicate instability.
Pick the worker count where recv rate peaks and stall% is low.
For streaming workloads, also watch SSE-pkts/s — a small stream interval (fine-grained events) dramatically increases packet rate and may require more workers to keep up. If SSE-pkts/s plateaus while recv rate drops, the client is bottlenecked on SSE parsing overhead.
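For a rough sense of scale, the packet rate implied by a given stream interval can be computed as below (illustrative numbers, not measured results):

```python
import math

# Illustrative numbers, not measured results.
output_length = 4096     # chars per response
stream_interval = 50     # chars per SSE event
recv_rate = 500          # responses/s sustained by the client

events_per_response = math.ceil(output_length / stream_interval)  # 82 events
sse_pkts_per_s = recv_rate * events_per_response                  # 41,000 events/s
print(events_per_response, sse_pkts_per_s)
```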
The ZMQ transport uses a pre-allocated receive buffer (bytearray) for zero-copy message deserialization. If a serialized message exceeds this buffer, the worker crashes with:
```
RuntimeError: ZMQ message truncated (18874368 > 16777216 bytes). Increase client.transport.recv_buffer_size in config.
```
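The pattern behind that error looks roughly like the following sketch, assuming a recent pyzmq that provides `Socket.recv_into` and a hypothetical IPC endpoint path (this is not the library's actual transport code):

```python
import zmq

RECV_BUFFER_SIZE = 16 * 1024 * 1024  # matches the default recv_buffer_size (16 MB)

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.PULL)
sock.connect("ipc:///tmp/example.sock")  # hypothetical endpoint

# Pre-allocated application buffer: messages are received into it in place,
# so no per-message allocation happens on the hot path.
buf = bytearray(RECV_BUFFER_SIZE)

nbytes = sock.recv_into(buf)  # returns the full message size (zmq_recv semantics)
if nbytes > len(buf):
    # The message was truncated to fit the buffer; the real payload was larger.
    raise RuntimeError(
        f"ZMQ message truncated ({nbytes} > {len(buf)} bytes). "
        "Increase client.transport.recv_buffer_size in config."
    )
payload = memoryview(buf)[:nbytes]  # zero-copy view for deserialization
```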
| Setting | Default | Description |
|---|---|---|
| `recv_buffer_size` | 16 MB | Application receive buffer per socket |
| `send_buffer_size` | 16 MB | Kernel send buffer hint (advisory on IPC) |
When to increase: Multimodal workloads with large base64-encoded images in the request payload. A single VLM request with a high-resolution image can easily exceed 16 MB after msgspec serialization.
When the default is fine: Text-only workloads. A 32K-token prompt serializes to ~150 KB — well within the 16 MB buffer.
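One rough way to check your own payloads is to encode a representative request with msgspec and compare the result to the buffer size. This is a sketch; the request fields below are hypothetical, not the client's actual schema:

```python
import base64
import msgspec

RECV_BUFFER_SIZE = 16 * 1024 * 1024  # default 16 MB

# Hypothetical request shape, for illustration only.
request = {
    "prompt": "Describe this image.",
    "images": [base64.b64encode(b"\x00" * (12 * 1024 * 1024)).decode()],  # ~16 MB after base64
}

encoded = msgspec.msgpack.encode(request)
print(f"serialized size: {len(encoded) / 1e6:.1f} MB")
if len(encoded) > RECV_BUFFER_SIZE:
    print("increase client.transport.recv_buffer_size")
```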
```yaml
settings:
  client:
    transport:
      type: zmq
      recv_buffer_size: 67108864   # 64 MB for large multimodal payloads
      send_buffer_size: 67108864
```

Note: `recv_buffer_size` sets the application-level `recv_into` buffer, not a kernel limit. IPC (Unix domain) sockets ignore SO_RCVBUF/SO_SNDBUF; the OS handles arbitrarily large messages regardless. The `send_buffer_size` is passed to `zmq.SNDBUF` as a kernel hint but has no effect on IPC transport.
Two built-in servers for benchmarking without a real GPU endpoint.
Returns identical pre-compiled responses instantly — zero compute, pure client roofline.
```bash
python -m inference_endpoint.testing.max_throughput_server --port 12345 --stats
python -m inference_endpoint.testing.max_throughput_server --stream --stream-interval 50 --stats
```

| Flag | Default | Description |
|---|---|---|
| `--output-length` | 4000 | Characters in response |
| `--stream` | off | SSE streaming mode |
| `--stream-interval` | 1 | Characters per SSE event |
| `--num-workers` | 4 | Server worker processes |
Realistic LLM simulation with per-request variable output lengths, TTFT, and TPOT.
Two mutually exclusive timing modes:
- Response-rate mode (`--response-rate-mean`): per-worker token bucket controls global throughput
- Inter-token mode (`--inter-token-latency`): per-token generation time (TPOT) in ms. Inter-SSE-event delay = TPOT × stream_interval
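As a rough worked example, using the same values as the streaming example below and assuming every event after the first is delayed by TPOT × stream_interval:

```python
# Illustrative arithmetic for inter-token mode (values match the streaming example below).
inter_token_latency_ms = 15   # --inter-token-latency (TPOT)
stream_interval = 10          # --stream-interval (chars per SSE event)
first_chunk_latency_s = 1.5   # --first-chunk-latency (TTFT)
output_length = 1000          # --output-len-mean default

inter_event_delay_ms = inter_token_latency_ms * stream_interval         # 150 ms per SSE event
events = -(-output_length // stream_interval)                           # ceil -> 100 events
total_s = first_chunk_latency_s + events * inter_event_delay_ms / 1000  # roughly 16.5 s per response
print(inter_event_delay_ms, events, total_s)
```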
```bash
# Non-streaming with response-rate control
python -m inference_endpoint.testing.variable_throughput_server --stats \
    --response-rate-mean 1000

# Streaming with TPOT + TTFT
python -m inference_endpoint.testing.variable_throughput_server --stream --stats \
    --inter-token-latency 15 --first-chunk-latency 1.5 --stream-interval 10

# With jitter
python -m inference_endpoint.testing.variable_throughput_server --stream --stats \
    --response-rate-mean 50 --response-rate-spread 0.2 \
    --first-chunk-latency 0.5 --first-chunk-spread 0.2
```

| Flag | Default | Description |
|---|---|---|
| `--output-len-mean` | 1000 | Mean output length (chars) |
| `--output-len-spread` | 0.3 | CoV for output length (lognormal) |
| `--response-rate-mean` | 0 | Global throughput (resp/sec). Mutually exclusive with `--inter-token-latency` |
| `--inter-token-latency` | 0 | Per-token delay in ms (TPOT). Mutually exclusive with `--response-rate-mean` |
| `--first-chunk-latency` | 0 | Mean TTFT in seconds |
| `--first-chunk-spread` | 0.2 | CoV for TTFT |
| `--stream-interval` | 1 | Chars per SSE event |
| `--max-concurrency` | 0 | Max concurrent requests (0 = unlimited) |
| `--num-workers` | 10 | Server worker processes |