address review comments

amd-lalithnc · amd-lalithnc · commit 8aff5648beb7 · 2026-06-30T08:38:39.000-06:00
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Change-Id: I6442cc19df3caa3e0e5f36cc276bf94550d5a95e
diff --git a/eval/behavioral/tests/test_serving_llms_on_epyc.py b/eval/behavioral/tests/test_serving_llms_on_epyc.py
@@ -27,10 +27,11 @@ def test_serve_model_on_epyc():
         # Positive behavioral expectations (the state machine).
         run.should("Detect the CPU and confirm it is an AMD EPYC host before serving (e.g. runs detect.py)")
         run.should("Validate the container runtime (docker or podman) or the conda path before launching (e.g. runs validate.py)")
-        run.should("Take validate.py's environment advisories into account -- the tcmalloc / OpenMP (LD_PRELOAD) perf-library recommendation and, when the image is already pulled, the in-image vllm+zentorch check -- surfacing any that apply")
+        run.should("Use validate.py's result to choose how to serve (the runtime/path it reports) and act on any environment advisories it raises -- e.g. the tcmalloc/OpenMP LD_PRELOAD perf-library note or the in-image vllm+zentorch check; on the container path with the image not yet pulled there may be none, which is fine")
         run.should("Check that vLLM supports the model before serving (e.g. runs check_model.py), rather than refusing it just for being multimodal")
         run.should("Check that the model fits in host RAM (e.g. runs estimate_memory.py)")
         run.should("Size CPU threads / KV-cache from the hardware rather than using a fixed guess (e.g. runs cpu_tune.py)")
+        run.should("Pin the instance to a single socket with its memory (socket-local KV plus cpuset-mems or numactl membind) and, on a dual-socket host, pick a socket by load -- surfacing cpu_tune's warning if both sockets are busy")
         run.should("Present a sized plan and ask the user to confirm before launching the server")
         run.should("Plan to launch with 'vllm serve' and poll until /health is healthy")
 
diff --git a/skills/serving-llms-on-epyc/SKILL.md b/skills/serving-llms-on-epyc/SKILL.md
@@ -10,10 +10,10 @@ description: >-
   detect the CPU (incl. EPYC generation), validate the runtime/env, check vLLM
   supports the model (via vLLM's registry, not a modality blocklist), check it
   fits host RAM, size CPU threads/KV/NUMA from the hardware, confirm the plan with
-  the user, launch, and poll until the endpoint is responsive. Single instance
-  only. Does NOT debug failures
-  and does NOT retry -- it reports and stops. Do not use for GPU/Instinct (use
-  serving-llms-on-instinct) or multi-node.
+  the user, launch, and poll until the endpoint is responsive. Single instance,
+  single socket (pinned to one socket + its memory; vLLM scales poorly across
+  sockets). Does NOT debug failures and does NOT retry -- it reports and stops. Do
+  not use for GPU/Instinct (use serving-llms-on-instinct) or multi-node.
 allowed-tools: Bash, Read
 ---
 
@@ -23,6 +23,11 @@ Bring up a single vLLM OpenAI endpoint on an AMD EPYC host with the zentorch CPU
 backend, sized to the hardware. Container-first (Docker or Podman); conda/host
 is the fallback.
 
+**This is single-socket serving:** one instance pinned to one socket and its memory
+(vLLM scales poorly across sockets, so we do not span them). On a dual-socket host it
+runs on a single socket; the multi-socket answer is **multiple instances (one per
+socket)**, which is out of scope for this single-instance recipe.
+
 Hard rule for this skill: **on any failure, report the cause + logs and STOP.
 Do not retry, do not debug.** (Debugging is a separate workflow.)
 
@@ -114,15 +119,23 @@ Extra flag: `--weight-gb N` overrides weights if a model has no HF metadata
 eval "$(python3 scripts/cpu_tune.py)"      # or --format json to inspect
 ```
 
-Exports `VLLM_CPU_OMP_THREADS_BIND` (physical cores of **socket 0**) and
-`VLLM_CPU_KVCACHE_SPACE` (GB). It does **not** set `OMP_NUM_THREADS` (vLLM derives
-it from the bind list) or `VLLM_CPU_NUM_OF_RESERVED_CPU` (vLLM has its own default
-when unset). Default policy, the same for NPS1/NPS2/NPS4: a single instance uses
-**socket 0's whole CPU with no memory binding**. On a multi-socket host the JSON
-gives `container_cpuset` (`--cpuset-cpus` only -- no `--cpuset-mems`) for the
-container path; the conda path needs nothing extra (the bind env var binds the
-threads). If socket 0 spans multiple NUMA nodes (NPS2/NPS4), `perf_note` flags that
-optimal per-node binding could give more performance -- surface it, but proceed.
+A single instance runs on **one socket, with its memory** (vLLM scales poorly across
+sockets). `cpu_tune.py` exports `VLLM_CPU_OMP_THREADS_BIND` (the chosen socket's
+physical cores) and `VLLM_CPU_KVCACHE_SPACE` (sized from that **socket's local RAM**,
+not whole-system, so the KV pool stays on-socket). It does **not** set
+`OMP_NUM_THREADS` (vLLM derives it) or `VLLM_CPU_NUM_OF_RESERVED_CPU` (vLLM's own default).
+
+Socket choice on a dual-socket host (load-aware): it samples per-socket CPU busy%
+(~0.5s) and prefers a free socket -- both free → socket 0; one free → that socket;
+**both busy (≥ `--busy-threshold`, default 15%) → it `warning`s and proceeds on the
+least-busy socket**. `--socket N` forces a choice. Single-socket hosts use socket 0.
+
+For the chosen socket it also emits the memory-bound pin: `container_cpuset`
+(`--cpuset-cpus=<cores> --cpuset-mems=<nodes>`) for the container path, and
+`conda_launch_prefix` (`numactl --cpunodebind/--membind`, falling back to `taskset`
+CPU-only, or empty-with-note if neither tool exists) for conda. **Surface `warning`
+to the user** if set. On NPS2/NPS4 a socket spans multiple NUMA nodes; memory is
+bound across them and `nps_note` flags that finer binding could add performance.
 
 ## Step 6: Confirm the plan, then launch (container-first)
 
@@ -135,10 +148,12 @@ not launch unprompted. This is the human gate before anything runs:
 | Path | container (`<runtime>`, image from `data/epyc.json`) or conda/host |
 | Precision | `bfloat16` (or the user's choice) |
 | Fit | required `<required_gb>` GB vs `<ram_gb>` GB RAM |
-| CPU sizing | thread bind `<VLLM_CPU_OMP_THREADS_BIND>` (socket 0), KV `<VLLM_CPU_KVCACHE_SPACE>` GB, no memory binding |
+| CPU sizing | socket `<chosen_socket>` (`<socket_choice_reason>`), bind `<VLLM_CPU_OMP_THREADS_BIND>`, KV `<VLLM_CPU_KVCACHE_SPACE>` GB (socket-local), mem bound to nodes `<numa_nodes_on_socket>` |
 | Hardware | EPYC `<epyc_generation>` (`<zen_arch>`), `<physical_cores>` cores, AVX-512 `<avx512>` |
 | Port | `<port>` |
 
+If `cpu_tune.py` returned a `warning` (e.g. all sockets busy), include it here so the user sees it before confirming.
+
 Proceed only on a clear "go". If the user declines or wants changes (model,
 `--max-model-len`, port), stop and adjust -- do not launch.
 
@@ -155,7 +170,7 @@ $RT pull <image from data/epyc.json>          # agent pulls; do not ask the user
 $RT run -d --name vllm-epyc \
   <run_flags from data/epyc.json>            # --ipc=host --shm-size=16g --network=host
   <hf_cache_mount> \
-  <container_cpuset from cpu_tune, on multi-socket>   # --cpuset-cpus=... (no --cpuset-mems)
+  <container_cpuset from cpu_tune>             # --cpuset-cpus=<cores> --cpuset-mems=<nodes>
   --env VLLM_CPU_OMP_THREADS_BIND="$VLLM_CPU_OMP_THREADS_BIND" \
   --env VLLM_CPU_KVCACHE_SPACE=$VLLM_CPU_KVCACHE_SPACE \
   --env HF_TOKEN=${HF_TOKEN} \
@@ -164,10 +179,11 @@ $RT run -d --name vllm-epyc \
 ```
 
 **Conda/host path** (no container runtime, `conda_path_available` true). `eval`-ing
-cpu_tune already exported the env vars; just launch -- `VLLM_CPU_OMP_THREADS_BIND`
-binds the threads to socket 0, and there is no memory binding by default:
+cpu_tune already exported the env vars; prefix the launch with `conda_launch_prefix`
+from cpu_tune so memory is bound to the chosen socket (empty → unpinned, with a note):
 ```bash
-vllm serve <model> --dtype bfloat16 --port <port> --max-model-len <len> &
+<conda_launch_prefix from cpu_tune> vllm serve <model> --dtype bfloat16 --port <port> --max-model-len <len> &
+# e.g. numactl --cpunodebind=0 --membind=0 vllm serve ...
 ```
 
 Optional throughput flags are **opt-in and must move together** (see Gotchas):
@@ -230,7 +246,8 @@ See [reference.md](reference.md) for the full list. The load-bearing ones:
   the other.
 - **`--shm-size`**: vLLM needs a large `/dev/shm`; the container default (64MB)
   is too small. Use `--shm-size=16g` (in `data/epyc.json`).
-- **NUMA**: the default is simple -- one instance on **socket 0's CPUs, no memory
-  binding** (`--cpuset-cpus` from `cpu_tune.py` for the container; the bind env var
-  for conda). If socket 0 spans multiple NUMA nodes (NPS2/NPS4), `cpu_tune.py` notes
-  that optimal per-node binding could add performance; the base recipe doesn't do it.
+- **NUMA / socket**: one instance is pinned to **one socket plus its memory** --
+  CPU bind + `--cpuset-mems` (container) / `numactl --membind` (conda), with KV sized
+  from that socket's local RAM. On a dual-socket host `cpu_tune.py` picks a free socket
+  by load and `warning`s if both are busy. NPS2/NPS4 (multi-node socket) gets an
+  `nps_note` that finer per-node binding could add more.
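
Review note: the load-aware socket choice described in the SKILL.md hunks above can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `cpu_tune.py`: the names `choose_socket` and `SocketChoice` are hypothetical, and the per-socket busy percentages are passed in rather than sampled.

```python
from typing import NamedTuple, Optional, Sequence


class SocketChoice(NamedTuple):
    socket: int
    reason: str
    warning: Optional[str]  # set only when every socket is busy


def choose_socket(
    busy_pct: Sequence[float],
    busy_threshold: float = 15.0,  # documented default of --busy-threshold
    forced: Optional[int] = None,  # mirrors --socket N
) -> SocketChoice:
    """Pick one socket from per-socket CPU busy percentages (hypothetical helper)."""
    if forced is not None:
        return SocketChoice(forced, "forced via --socket", None)
    if len(busy_pct) == 1:
        return SocketChoice(0, "single-socket host", None)

    # A socket counts as free when it is strictly below the threshold.
    free = [i for i, pct in enumerate(busy_pct) if pct < busy_threshold]
    if len(free) == len(busy_pct):
        return SocketChoice(0, "all sockets free", None)
    if free:
        return SocketChoice(free[0], "free socket preferred", None)

    # Every socket is at or above the threshold: warn, then use the least busy.
    least = min(range(len(busy_pct)), key=lambda i: busy_pct[i])
    warning = (
        f"all sockets are at or above {busy_threshold}% busy; "
        f"proceeding on least-busy socket {least}"
    )
    return SocketChoice(least, "least-busy socket", warning)
```

The `warning` field is what the plan table would surface to the user before they confirm.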
diff --git a/skills/serving-llms-on-epyc/data/epyc.json b/skills/serving-llms-on-epyc/data/epyc.json
@@ -14,7 +14,7 @@
       "--ipc=host": "vLLM workers use host IPC/shared memory.",
       "--shm-size=16g": "vLLM needs a large /dev/shm; default 64MB is not enough.",
       "--network=host": "Expose the served port directly. Alternative: -p <port>:<port>.",
-      "numa": "Default: a single instance uses socket 0's CPUs with NO memory binding (cpu_tune.py emits --cpuset-cpus for the container; conda relies on VLLM_CPU_OMP_THREADS_BIND). On NPS2/NPS4 (multiple NUMA nodes per socket), optimal per-node binding could add performance -- cpu_tune.py notes this; the base recipe does not do it."
+      "numa": "A single instance is pinned to ONE socket plus its memory. cpu_tune.py picks a free socket by CPU load on dual-socket hosts (warns if both busy; --socket N forces), sizes KV from that socket's local RAM, and emits --cpuset-cpus + --cpuset-mems (container) or numactl --cpunodebind/--membind (conda). True multi-socket scaling = multiple instances (one per socket), out of scope here."
     }
   },
   "launch": {
@@ -36,7 +36,7 @@
   "smoke_model": "Qwen/Qwen3-0.6B",
   "smoke_model_notes": "Current small Qwen, chat-capable (ships a chat template, so /v1/chat/completions works -- unlike base models such as opt-125m).",
   "env_defaults": {
-    "VLLM_CPU_OMP_THREADS_BIND": "set by cpu_tune.py (physical cores of socket 0)",
+    "VLLM_CPU_OMP_THREADS_BIND": "set by cpu_tune.py (physical cores of the chosen socket)",
     "VLLM_CPU_KVCACHE_SPACE": "set by cpu_tune.py (GB)",
     "do_not_set": "OMP_NUM_THREADS -- vLLM sets it from the bind list (len of cpu_list); and VLLM_CPU_NUM_OF_RESERVED_CPU -- vLLM has its own default when unset, forcing 0 overrides it."
   },
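
Review note: the memory-bound pin described in the `numa` entry above could be assembled roughly as follows. This is a sketch under assumptions, not the shipped script: the function names reuse the JSON field names for readability, tool detection uses `shutil.which`, and the `taskset -c <cores>` form of the CPU-only fallback is a guess at what the fallback looks like.

```python
import shutil
from typing import Callable, Optional, Sequence


def _csv(values: Sequence[int]) -> str:
    """Render a list of core or node ids as a comma-separated string."""
    return ",".join(str(v) for v in values)


def container_cpuset(cores: Sequence[int], nodes: Sequence[int]) -> str:
    """Flags for docker/podman run: pin CPUs and memory to the chosen socket."""
    return f"--cpuset-cpus={_csv(cores)} --cpuset-mems={_csv(nodes)}"


def conda_launch_prefix(
    cores: Sequence[int],
    nodes: Sequence[int],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Command prefix for the conda/host path (hypothetical helper)."""
    if which("numactl"):
        # Preferred: bind both CPU and memory to the socket's NUMA node(s).
        return f"numactl --cpunodebind={_csv(nodes)} --membind={_csv(nodes)}"
    if which("taskset"):
        # Fallback: CPU affinity only, memory stays unbound.
        return f"taskset -c {_csv(cores)}"
    # Neither tool exists: empty prefix; the caller should attach a note.
    return ""
```

Passing `which` as a parameter keeps the fallback order testable on machines without `numactl` installed.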
diff --git a/skills/serving-llms-on-epyc/reference.md b/skills/serving-llms-on-epyc/reference.md
@@ -44,7 +44,7 @@ From `data/epyc.json`. Unlike the Instinct (GPU) skill there are **no**
 | `--ipc=host` | vLLM workers use host IPC / shared memory |
 | `--shm-size=16g` | vLLM needs a large `/dev/shm`; the 64MB default is too small |
 | `--network=host` | expose the served port directly (or use `-p <port>:<port>`) |
-| `--cpuset-cpus` | (multi-socket) restrict the container to socket 0's CPUs; from `cpu_tune.py`. No `--cpuset-mems` -- no memory binding by default |
+| `--cpuset-cpus` / `--cpuset-mems` | pin the container to the chosen socket's physical cores and its NUMA node(s); from `cpu_tune.py` |
 | `-v ~/.cache/huggingface:/root/.cache/huggingface` | reuse the host model cache |
 
 Image: `amdih/zendnn_zentorch:<tag>` -- the public vLLM + zentorch CPU image on
@@ -69,22 +69,26 @@ surfaces at load, where the no-retry rule applies.
 
 ## CPU sizing
 
-Default policy (the same for NPS1/NPS2/NPS4): a single instance uses **socket 0's
-whole CPU with no memory binding**. `scripts/cpu_tune.py` derives:
-- `VLLM_CPU_OMP_THREADS_BIND` = the physical cores of socket 0 (one thread per
-  physical core; SMT siblings do not help vLLM CPU). vLLM sets `OMP_NUM_THREADS`
-  itself from this list, so we don't.
-- `VLLM_CPU_KVCACHE_SPACE` (GB) = `min(mem*kv_frac, mem-16)`; on <=32GB hosts, `mem*0.5`.
-- `container_cpuset` = `--cpuset-cpus=<socket 0 cpus>` (no `--cpuset-mems`) for the
-  container path on a multi-socket host. The conda path needs nothing extra -- the
-  bind env var binds the threads.
+Policy: a single instance is pinned to **one socket plus its memory** (vLLM scales
+poorly across sockets). `scripts/cpu_tune.py` derives:
+- **Socket choice** (dual-socket): samples per-socket CPU busy% (~0.5s) and prefers a
+  free socket -- both free → socket 0; one free → that one; both at/above
+  `--busy-threshold` (default 15%) → `warning` and proceed on the least-busy. `--socket N`
+  forces it. Single-socket → socket 0.
+- `VLLM_CPU_OMP_THREADS_BIND` = the chosen socket's physical cores (SMT dropped). vLLM
+  sets `OMP_NUM_THREADS` from this, so we don't.
+- `VLLM_CPU_KVCACHE_SPACE` (GB) = `min(socket_ram*kv_frac, socket_ram-16)` -- sized from
+  the **chosen socket's local RAM** so the KV pool stays on-socket (≤32GB → `*0.5`).
+- Memory-bound pin: `container_cpuset` = `--cpuset-cpus=<cores> --cpuset-mems=<nodes>`;
+  `conda_launch_prefix` = `numactl --cpunodebind=<nodes> --membind=<nodes>` (falls back to
+  `taskset` CPU-only, or empty-with-note if neither tool exists).
 
 Not set: `OMP_NUM_THREADS` (vLLM derives it from the bind) and
 `VLLM_CPU_NUM_OF_RESERVED_CPU` (vLLM has its own default when unset).
 
-When socket 0 spans multiple NUMA nodes (NPS2/NPS4), `cpu_tune.py` emits a
-`perf_note`: the simple default leaves some performance on the table versus optimal
-per-NUMA-node binding (one instance per node, memory bound). That tuning is out of
+When the chosen socket spans multiple NUMA nodes (NPS2/NPS4), `cpu_tune.py` emits an
+`nps_note`: memory is bound across the socket's nodes, and finer per-node binding
+(one instance per node) could add more. That tuning is out of
 scope for the base recipe.
 
 ## Known quirks
@@ -115,8 +119,10 @@ HF file sizes (`.safetensors` or legacy `.bin`); `--weight-gb` overrides when a
 model has no metadata. KV cache is bf16-only on zentorch CPU (no fp8 KV), so the estimate always uses 2 bytes/element.
 
 **NUMA cross-node traffic**
-On a 2-socket EPYC, an unpinned instance spreads threads across both sockets and
-pays cross-socket latency. The default keeps one instance on **socket 0's CPUs**
-(`cpu_tune.py` -> `VLLM_CPU_OMP_THREADS_BIND`, plus `--cpuset-cpus` for the
-container), with **no memory binding**. On NPS2/NPS4, `cpu_tune.py` notes that
-optimal per-NUMA-node binding could add performance; the base recipe doesn't do it.
+On a 2-socket EPYC, an unpinned instance spreads threads + memory across both sockets
+and pays cross-socket latency. `cpu_tune.py` keeps one instance on **one socket plus
+its memory**: CPU bind (`VLLM_CPU_OMP_THREADS_BIND` + `--cpuset-cpus`), memory bind
+(`--cpuset-mems` / `numactl --membind`), and KV sized from that socket's local RAM so
+the KV pool never lands on the other socket. The socket is chosen by load (free socket
+preferred; warns if both busy). True multi-socket throughput = **multiple instances**
+(one per socket) -- out of scope for this single-instance recipe.
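
Review note: the socket-local KV sizing rule from the "CPU sizing" hunk above, `min(socket_ram*kv_frac, socket_ram-16)` with a `*0.5` rule at or below 32GB, can be written out as a small function. This is a sketch only: `kv_cache_gb` is a hypothetical name, `kv_frac` has no default here because the diff does not state one, and applying the 32GB threshold to the socket's RAM (rather than the whole host's) is an assumption.

```python
def kv_cache_gb(socket_ram_gb: float, kv_frac: float) -> float:
    """Size VLLM_CPU_KVCACHE_SPACE (GB) from the chosen socket's local RAM.

    kv_frac is the fraction of socket RAM offered to the KV pool; its real
    default is not shown in the diff, so callers must pass it explicitly.
    """
    if socket_ram_gb <= 32:
        # Small-memory case: take half, since reserving 16GB would leave too little.
        return socket_ram_gb * 0.5
    # Otherwise take the fraction, but always leave at least 16GB of headroom.
    return min(socket_ram_gb * kv_frac, socket_ram_gb - 16)
```

For example, a socket with 384GB of local RAM and `kv_frac=0.5` yields a 192GB pool, while a 40GB socket with `kv_frac=0.9` is capped at 24GB by the 16GB headroom term.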
diff --git a/skills/serving-llms-on-epyc/scripts/cpu_tune.py b/skills/serving-llms-on-epyc/scripts/cpu_tune.py