Commit 8aff564

address review comments
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Change-Id: I6442cc19df3caa3e0e5f36cc276bf94550d5a95e
1 parent f62dd74 commit 8aff564

5 files changed

Lines changed: 218 additions & 113 deletions

eval/behavioral/tests/test_serving_llms_on_epyc.py

Lines changed: 2 additions & 1 deletion
@@ -27,10 +27,11 @@ def test_serve_model_on_epyc():
 # Positive behavioral expectations (the state machine).
 run.should("Detect the CPU and confirm it is an AMD EPYC host before serving (e.g. runs detect.py)")
 run.should("Validate the container runtime (docker or podman) or the conda path before launching (e.g. runs validate.py)")
-run.should("Take validate.py's environment advisories into account -- the tcmalloc / OpenMP (LD_PRELOAD) perf-library recommendation and, when the image is already pulled, the in-image vllm+zentorch check -- surfacing any that apply")
+run.should("Use validate.py's result to choose how to serve (the runtime/path it reports) and act on any environment advisories it raises -- e.g. the tcmalloc/OpenMP LD_PRELOAD perf-library note or the in-image vllm+zentorch check; on the container path with the image not yet pulled there may be none, which is fine")
 run.should("Check that vLLM supports the model before serving (e.g. runs check_model.py), rather than refusing it just for being multimodal")
 run.should("Check that the model fits in host RAM (e.g. runs estimate_memory.py)")
 run.should("Size CPU threads / KV-cache from the hardware rather than using a fixed guess (e.g. runs cpu_tune.py)")
+run.should("Pin the instance to a single socket with its memory (socket-local KV plus cpuset-mems or numactl membind) and, on a dual-socket host, pick a socket by load -- surfacing cpu_tune's warning if both sockets are busy")
 run.should("Present a sized plan and ask the user to confirm before launching the server")
 run.should("Plan to launch with 'vllm serve' and poll until /health is healthy")
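The last expectation above ("poll until /health is healthy") can be sketched as a small polling loop. This is only an illustration: `http_status`, `wait_for_health`, the 600-second deadline, and the 5-second interval are assumptions, not part of this commit. The only detail taken from the source is that the endpoint polled is `/health`. On timeout the function returns `False`, so the caller can report and stop instead of retrying.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=2.0):
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 503 while loading).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: not up yet.
        return 0


def wait_for_health(url, deadline_s=600.0, interval_s=5.0,
                    probe=http_status, sleep=time.sleep, clock=time.monotonic):
    """Poll url until it returns 200 (True) or the deadline passes (False).

    The deadline and interval defaults are illustrative assumptions.
    """
    start = clock()
    while True:
        if probe(url) == 200:
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

`probe`, `sleep`, and `clock` are injectable so the loop can be exercised without a running server.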

skills/serving-llms-on-epyc/SKILL.md

Lines changed: 39 additions & 22 deletions
@@ -10,10 +10,10 @@ description: >-
 detect the CPU (incl. EPYC generation), validate the runtime/env, check vLLM
 supports the model (via vLLM's registry, not a modality blocklist), check it
 fits host RAM, size CPU threads/KV/NUMA from the hardware, confirm the plan with
-the user, launch, and poll until the endpoint is responsive. Single instance
-only. Does NOT debug failures
-and does NOT retry -- it reports and stops. Do not use for GPU/Instinct (use
-serving-llms-on-instinct) or multi-node.
+the user, launch, and poll until the endpoint is responsive. Single instance,
+single socket (pinned to one socket + its memory; vLLM scales poorly across
+sockets). Does NOT debug failures and does NOT retry -- it reports and stops. Do
+not use for GPU/Instinct (use serving-llms-on-instinct) or multi-node.
 allowed-tools: Bash, Read
 ---

@@ -23,6 +23,11 @@ Bring up a single vLLM OpenAI endpoint on an AMD EPYC host with the zentorch CPU
 backend, sized to the hardware. Container-first (Docker or Podman); conda/host
 is the fallback.

+**This is single-socket serving:** one instance pinned to one socket and its memory
+(vLLM scales poorly across sockets, so we do not span them). On a dual-socket host it
+runs on a single socket; the multi-socket answer is **multiple instances (one per
+socket)**, which is out of scope for this single-instance recipe.
+
 Hard rule for this skill: **on any failure, report the cause + logs and STOP.
 Do not retry, do not debug.** (Debugging is a separate workflow.)

@@ -114,15 +119,23 @@ Extra flag: `--weight-gb N` overrides weights if a model has no HF metadata
 eval "$(python3 scripts/cpu_tune.py)" # or --format json to inspect
 ```

-Exports `VLLM_CPU_OMP_THREADS_BIND` (physical cores of **socket 0**) and
-`VLLM_CPU_KVCACHE_SPACE` (GB). It does **not** set `OMP_NUM_THREADS` (vLLM derives
-it from the bind list) or `VLLM_CPU_NUM_OF_RESERVED_CPU` (vLLM has its own default
-when unset). Default policy, the same for NPS1/NPS2/NPS4: a single instance uses
-**socket 0's whole CPU with no memory binding**. On a multi-socket host the JSON
-gives `container_cpuset` (`--cpuset-cpus` only -- no `--cpuset-mems`) for the
-container path; the conda path needs nothing extra (the bind env var binds the
-threads). If socket 0 spans multiple NUMA nodes (NPS2/NPS4), `perf_note` flags that
-optimal per-node binding could give more performance -- surface it, but proceed.
+A single instance runs on **one socket, with its memory** (vLLM scales poorly across
+sockets). `cpu_tune.py` exports `VLLM_CPU_OMP_THREADS_BIND` (the chosen socket's
+physical cores) and `VLLM_CPU_KVCACHE_SPACE` (sized from that **socket's local RAM**,
+not whole-system, so the KV pool stays on-socket). It does **not** set
+`OMP_NUM_THREADS` (vLLM derives it) or `VLLM_CPU_NUM_OF_RESERVED_CPU` (vLLM's own default).
+
+Socket choice on a dual-socket host (load-aware): it samples per-socket CPU busy%
+(~0.5s) and prefers a free socket -- both free → socket 0; one free → that socket;
+**both busy (≥ `--busy-threshold`, default 15%) → it `warning`s and proceeds on the
+least-busy socket**. `--socket N` forces a choice. Single-socket hosts use socket 0.
+
+For the chosen socket it also emits the memory-bound pin: `container_cpuset`
+(`--cpuset-cpus=<cores> --cpuset-mems=<nodes>`) for the container path, and
+`conda_launch_prefix` (`numactl --cpunodebind/--membind`, falling back to `taskset`
+CPU-only, or empty-with-note if neither tool exists) for conda. **Surface `warning`
+to the user** if set. On NPS2/NPS4 a socket spans multiple NUMA nodes; memory is
+bound across them and `nps_note` flags that finer binding could add performance.

 ## Step 6: Confirm the plan, then launch (container-first)

@@ -135,10 +148,12 @@ not launch unprompted. This is the human gate before anything runs:
 | Path | container (`<runtime>`, image from `data/epyc.json`) or conda/host |
 | Precision | `bfloat16` (or the user's choice) |
 | Fit | required `<required_gb>` GB vs `<ram_gb>` GB RAM |
-| CPU sizing | thread bind `<VLLM_CPU_OMP_THREADS_BIND>` (socket 0), KV `<VLLM_CPU_KVCACHE_SPACE>` GB, no memory binding |
+| CPU sizing | socket `<chosen_socket>` (`<socket_choice_reason>`), bind `<VLLM_CPU_OMP_THREADS_BIND>`, KV `<VLLM_CPU_KVCACHE_SPACE>` GB (socket-local), mem bound to nodes `<numa_nodes_on_socket>` |
 | Hardware | EPYC `<epyc_generation>` (`<zen_arch>`), `<physical_cores>` cores, AVX-512 `<avx512>` |
 | Port | `<port>` |

+If `cpu_tune.py` returned a `warning` (e.g. all sockets busy), include it here so the user sees it before confirming.
+
 Proceed only on a clear "go". If the user declines or wants changes (model,
 `--max-model-len`, port), stop and adjust -- do not launch.

@@ -155,7 +170,7 @@ $RT pull <image from data/epyc.json> # agent pulls; do not ask the user
 $RT run -d --name vllm-epyc \
 <run_flags from data/epyc.json> # --ipc=host --shm-size=16g --network=host
 <hf_cache_mount> \
-<container_cpuset from cpu_tune, on multi-socket> # --cpuset-cpus=... (no --cpuset-mems)
+<container_cpuset from cpu_tune> # --cpuset-cpus=<cores> --cpuset-mems=<nodes>
 --env VLLM_CPU_OMP_THREADS_BIND="$VLLM_CPU_OMP_THREADS_BIND" \
 --env VLLM_CPU_KVCACHE_SPACE=$VLLM_CPU_KVCACHE_SPACE \
 --env HF_TOKEN=${HF_TOKEN} \
@@ -164,10 +179,11 @@ $RT run -d --name vllm-epyc \
 ```

 **Conda/host path** (no container runtime, `conda_path_available` true). `eval`-ing
-cpu_tune already exported the env vars; just launch -- `VLLM_CPU_OMP_THREADS_BIND`
-binds the threads to socket 0, and there is no memory binding by default:
+cpu_tune already exported the env vars; prefix the launch with `conda_launch_prefix`
+from cpu_tune so memory is bound to the chosen socket (empty → unpinned, with a note):
 ```bash
-vllm serve <model> --dtype bfloat16 --port <port> --max-model-len <len> &
+<conda_launch_prefix from cpu_tune> vllm serve <model> --dtype bfloat16 --port <port> --max-model-len <len> &
+# e.g. numactl --cpunodebind=0 --membind=0 vllm serve ...
 ```

 Optional throughput flags are **opt-in and must move together** (see Gotchas):
@@ -230,7 +246,8 @@ See [reference.md](reference.md) for the full list. The load-bearing ones:
 the other.
 - **`--shm-size`**: vLLM needs a large `/dev/shm`; the container default (64MB)
 is too small. Use `--shm-size=16g` (in `data/epyc.json`).
-- **NUMA**: the default is simple -- one instance on **socket 0's CPUs, no memory
-binding** (`--cpuset-cpus` from `cpu_tune.py` for the container; the bind env var
-for conda). If socket 0 spans multiple NUMA nodes (NPS2/NPS4), `cpu_tune.py` notes
-that optimal per-node binding could add performance; the base recipe doesn't do it.
+- **NUMA / socket**: one instance is pinned to **one socket plus its memory** --
+CPU bind + `--cpuset-mems` (container) / `numactl --membind` (conda), with KV sized
+from that socket's local RAM. On a dual-socket host `cpu_tune.py` picks a free socket
+by load and `warning`s if both are busy. NPS2/NPS4 (multi-node socket) gets an
+`nps_note` that finer per-node binding could add more.
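The load-aware socket choice described in this file's diff can be sketched as a small decision function. This is a sketch of the stated rule only, not the code in `cpu_tune.py`: the name `choose_socket`, the input shape (a dict of socket id to busy percent), and the reason/warning strings are assumptions. How the busy percentages are sampled is left out.

```python
def choose_socket(busy_pct, busy_threshold=15.0):
    """Pick a socket from per-socket CPU busy percentages.

    busy_pct: dict mapping socket id -> busy percent (hypothetical input shape).
    Returns (socket, reason, warning); warning is None unless every socket is busy.
    """
    if len(busy_pct) == 1:
        only = next(iter(busy_pct))
        return only, "single-socket host", None

    # A socket counts as free when it is strictly below the threshold.
    free = sorted(s for s, pct in busy_pct.items() if pct < busy_threshold)
    if free:
        chosen = free[0]  # both free -> lowest id (socket 0); one free -> that one
        if len(free) == len(busy_pct):
            return chosen, "all sockets free", None
        return chosen, f"socket {chosen} is free", None

    # Every socket is at or above the threshold: warn, then take the least busy.
    chosen = min(busy_pct, key=lambda s: (busy_pct[s], s))
    warning = (f"all sockets busy (>= {busy_threshold}%); "
               f"proceeding on least-busy socket {chosen}")
    return chosen, "least busy", warning
```

For example, `choose_socket({0: 40.0, 1: 3.0})` selects socket 1 with no warning, while `choose_socket({0: 40.0, 1: 25.0})` selects socket 1 and returns a warning to surface in the plan.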

skills/serving-llms-on-epyc/data/epyc.json

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@
 "--ipc=host": "vLLM workers use host IPC/shared memory.",
 "--shm-size=16g": "vLLM needs a large /dev/shm; default 64MB is not enough.",
 "--network=host": "Expose the served port directly. Alternative: -p <port>:<port>.",
-"numa": "Default: a single instance uses socket 0's CPUs with NO memory binding (cpu_tune.py emits --cpuset-cpus for the container; conda relies on VLLM_CPU_OMP_THREADS_BIND). On NPS2/NPS4 (multiple NUMA nodes per socket), optimal per-node binding could add performance -- cpu_tune.py notes this; the base recipe does not do it."
+"numa": "A single instance is pinned to ONE socket plus its memory. cpu_tune.py picks a free socket by CPU load on dual-socket hosts (warns if both busy; --socket N forces), sizes KV from that socket's local RAM, and emits --cpuset-cpus + --cpuset-mems (container) or numactl --cpunodebind/--membind (conda). True multi-socket scaling = multiple instances (one per socket), out of scope here."
 }
 },
 "launch": {
@@ -36,7 +36,7 @@
 "smoke_model": "Qwen/Qwen3-0.6B",
 "smoke_model_notes": "Current small Qwen, chat-capable (ships a chat template, so /v1/chat/completions works -- unlike base models such as opt-125m).",
 "env_defaults": {
-"VLLM_CPU_OMP_THREADS_BIND": "set by cpu_tune.py (physical cores of socket 0)",
+"VLLM_CPU_OMP_THREADS_BIND": "set by cpu_tune.py (physical cores of the chosen socket)",
 "VLLM_CPU_KVCACHE_SPACE": "set by cpu_tune.py (GB)",
 "do_not_set": "OMP_NUM_THREADS -- vLLM sets it from the bind list (len of cpu_list); and VLLM_CPU_NUM_OF_RESERVED_CPU -- vLLM has its own default when unset, forcing 0 overrides it."
 },

skills/serving-llms-on-epyc/reference.md

Lines changed: 24 additions & 18 deletions
@@ -44,7 +44,7 @@ From `data/epyc.json`. Unlike the Instinct (GPU) skill there are **no**
 | `--ipc=host` | vLLM workers use host IPC / shared memory |
 | `--shm-size=16g` | vLLM needs a large `/dev/shm`; the 64MB default is too small |
 | `--network=host` | expose the served port directly (or use `-p <port>:<port>`) |
-| `--cpuset-cpus` | (multi-socket) restrict the container to socket 0's CPUs; from `cpu_tune.py`. No `--cpuset-mems` -- no memory binding by default |
+| `--cpuset-cpus` / `--cpuset-mems` | pin the container to the chosen socket's physical cores and its NUMA node(s); from `cpu_tune.py` |
 | `-v ~/.cache/huggingface:/root/.cache/huggingface` | reuse the host model cache |

 Image: `amdih/zendnn_zentorch:<tag>` -- the public vLLM + zentorch CPU image on
@@ -69,22 +69,26 @@ surfaces at load, where the no-retry rule applies.

 ## CPU sizing

-Default policy (the same for NPS1/NPS2/NPS4): a single instance uses **socket 0's
-whole CPU with no memory binding**. `scripts/cpu_tune.py` derives:
-- `VLLM_CPU_OMP_THREADS_BIND` = the physical cores of socket 0 (one thread per
-physical core; SMT siblings do not help vLLM CPU). vLLM sets `OMP_NUM_THREADS`
-itself from this list, so we don't.
-- `VLLM_CPU_KVCACHE_SPACE` (GB) = `min(mem*kv_frac, mem-16)`; on <=32GB hosts, `mem*0.5`.
-- `container_cpuset` = `--cpuset-cpus=<socket 0 cpus>` (no `--cpuset-mems`) for the
-container path on a multi-socket host. The conda path needs nothing extra -- the
-bind env var binds the threads.
+Policy: a single instance is pinned to **one socket plus its memory** (vLLM scales
+poorly across sockets). `scripts/cpu_tune.py` derives:
+- **Socket choice** (dual-socket): samples per-socket CPU busy% (~0.5s) and prefers a
+free socket -- both free → socket 0; one free → that one; both at/above
+`--busy-threshold` (default 15%) → `warning` and proceed on the least-busy. `--socket N`
+forces it. Single-socket → socket 0.
+- `VLLM_CPU_OMP_THREADS_BIND` = the chosen socket's physical cores (SMT dropped). vLLM
+sets `OMP_NUM_THREADS` from this, so we don't.
+- `VLLM_CPU_KVCACHE_SPACE` (GB) = `min(socket_ram*kv_frac, socket_ram-16)` -- sized from
+the **chosen socket's local RAM** so the KV pool stays on-socket (≤32GB → `*0.5`).
+- Memory-bound pin: `container_cpuset` = `--cpuset-cpus=<cores> --cpuset-mems=<nodes>`;
+`conda_launch_prefix` = `numactl --cpunodebind=<nodes> --membind=<nodes>` (falls back to
+`taskset` CPU-only, or empty-with-note if neither tool exists).

 Not set: `OMP_NUM_THREADS` (vLLM derives it from the bind) and
 `VLLM_CPU_NUM_OF_RESERVED_CPU` (vLLM has its own default when unset).

-When socket 0 spans multiple NUMA nodes (NPS2/NPS4), `cpu_tune.py` emits a
-`perf_note`: the simple default leaves some performance on the table versus optimal
-per-NUMA-node binding (one instance per node, memory bound). That tuning is out of
+When the chosen socket spans multiple NUMA nodes (NPS2/NPS4), `cpu_tune.py` emits an
+`nps_note`: memory is bound across the socket's nodes, and finer per-node binding
+(one instance per node) could add more. That tuning is out of
 scope for the base recipe.
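The KV sizing rule quoted above, `min(socket_ram*kv_frac, socket_ram-16)` with the `*0.5` case at or below 32GB, can be written out as a worked example. This is a sketch under stated assumptions: the function name `kv_cache_space_gb` is hypothetical, rounding down to whole GB is an assumption, and the diff does not give a default for `kv_frac`, so it is a required argument here.

```python
def kv_cache_space_gb(socket_ram_gb, kv_frac):
    """KV-cache size in GB from the chosen socket's local RAM.

    kv_frac has no default because the source does not state one.
    Rounding down to an integer GB is an assumption of this sketch.
    """
    if socket_ram_gb <= 32:
        # Small-memory case from the source: half of the socket's RAM.
        return int(socket_ram_gb * 0.5)
    # Otherwise take a fraction of RAM, but always leave at least 16 GB free.
    return int(min(socket_ram_gb * kv_frac, socket_ram_gb - 16))
```

With an illustrative `kv_frac` of 0.4, a socket with 384 GB of local RAM gives `min(153.6, 368)`, rounded down to 153 GB. With 40 GB and `kv_frac` of 0.9, the 16 GB headroom term wins: `min(36, 24)` gives 24 GB.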

 ## Known quirks
@@ -115,8 +119,10 @@ HF file sizes (`.safetensors` or legacy `.bin`); `--weight-gb` overrides when a
 model has no metadata. KV cache is bf16-only on zentorch CPU (no fp8 KV), so the estimate always uses 2 bytes/element.

 **NUMA cross-node traffic**
-On a 2-socket EPYC, an unpinned instance spreads threads across both sockets and
-pays cross-socket latency. The default keeps one instance on **socket 0's CPUs**
-(`cpu_tune.py` -> `VLLM_CPU_OMP_THREADS_BIND`, plus `--cpuset-cpus` for the
-container), with **no memory binding**. On NPS2/NPS4, `cpu_tune.py` notes that
-optimal per-NUMA-node binding could add performance; the base recipe doesn't do it.
+On a 2-socket EPYC, an unpinned instance spreads threads + memory across both sockets
+and pays cross-socket latency. `cpu_tune.py` keeps one instance on **one socket plus
+its memory**: CPU bind (`VLLM_CPU_OMP_THREADS_BIND` + `--cpuset-cpus`), memory bind
+(`--cpuset-mems` / `numactl --membind`), and KV sized from that socket's local RAM so
+the KV pool never lands on the other socket. The socket is chosen by load (free socket
+preferred; warns if both busy). True multi-socket throughput = **multiple instances**
+(one per socket) -- out of scope for this single-instance recipe.
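The two pinning outputs named in this diff, `container_cpuset` and `conda_launch_prefix`, can be illustrated with a short builder that follows the stated fallback order (`numactl`, then `taskset` CPU-only, then empty with a note). This is a sketch, not the code in `cpu_tune.py`: the function signatures, the note wording, and the use of `shutil.which` to detect the tools are assumptions.

```python
import shutil


def container_cpuset(cores, nodes):
    """Container flags that pin both CPUs and memory to the chosen socket."""
    return f"--cpuset-cpus={cores} --cpuset-mems={nodes}"


def conda_launch_prefix(cores, nodes, which=shutil.which):
    """Return (prefix, note) for the conda/host launch.

    cores and nodes are list strings such as "0-63" and "0" (hypothetical values).
    note is None when memory is bound, otherwise it explains the fallback.
    """
    if which("numactl"):
        # Bind both CPU placement and memory allocation to the socket's nodes.
        return f"numactl --cpunodebind={nodes} --membind={nodes}", None
    if which("taskset"):
        # taskset can pin CPUs only; memory stays unbound.
        return (f"taskset -c {cores}",
                "numactl not found: CPUs pinned, memory not bound")
    return "", "neither numactl nor taskset found: launching unpinned"
```

For example, `container_cpuset("0-63", "0")` yields `--cpuset-cpus=0-63 --cpuset-mems=0`. The `which` parameter is injectable so the fallback order can be checked on a machine that has neither tool installed.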
