 # Positive behavioral expectations (the state machine).
 run.should("Detect the CPU and confirm it is an AMD EPYC host before serving (e.g. runs detect.py)")
 run.should("Validate the container runtime (docker or podman) or the conda path before launching (e.g. runs validate.py)")
-run.should("Take validate.py's environment advisories into account -- the tcmalloc / OpenMP (LD_PRELOAD) perf-library recommendation and, when the image is already pulled, the in-image vllm+zentorch check -- surfacing any that apply")
+run.should("Use validate.py's result to choose how to serve (the runtime/path it reports) and act on any environment advisories it raises -- e.g. the tcmalloc/OpenMP LD_PRELOAD perf-library note or the in-image vllm+zentorch check; on the container path with the image not yet pulled there may be none, which is fine")
 run.should("Check that vLLM supports the model before serving (e.g. runs check_model.py), rather than refusing it just for being multimodal")
 run.should("Check that the model fits in host RAM (e.g. runs estimate_memory.py)")
 run.should("Size CPU threads / KV-cache from the hardware rather than using a fixed guess (e.g. runs cpu_tune.py)")
+run.should("Pin the instance to a single socket with its memory (socket-local KV plus cpuset-mems or numactl membind) and, on a dual-socket host, pick a socket by load -- surfacing cpu_tune's warning if both sockets are busy")
 run.should("Present a sized plan and ask the user to confirm before launching the server")
 run.should("Plan to launch with 'vllm serve' and poll until /health is healthy")
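For readers of this diff, the new pinning expectation can be illustrated with a minimal sketch. It is not cpu_tune.py's actual code: the `Socket` record, `pick_socket`, `pin_args`, and the `busy_at` threshold are hypothetical names chosen for illustration, and the per-socket CPU and memory-node lists are assumed to have been discovered already. The emitted flags (`--cpuset-cpus`, `--cpuset-mems`, `numactl --cpunodebind/--membind`) are the ones the change itself names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Socket:
    """Hypothetical view of one socket; not cpu_tune.py's real data model."""
    index: int   # socket number, as a --socket N style override would use
    cpus: str    # CPU list in cpuset syntax, e.g. "0-63"
    mems: str    # NUMA node list local to this socket, e.g. "0"
    load: float  # fraction of the socket's CPUs that are busy, 0.0 to 1.0


def pick_socket(sockets: List[Socket], force: Optional[int] = None,
                busy_at: float = 0.5) -> Tuple[Socket, Optional[str]]:
    """Pick the least-loaded socket; warn when even that one is busy.

    busy_at is an assumed threshold, chosen only for this sketch.
    """
    if force is not None:
        by_index = {s.index: s for s in sockets}
        if force not in by_index:
            raise ValueError(f"no such socket: {force}")
        return by_index[force], None
    best = min(sockets, key=lambda s: s.load)
    warning = None
    # If the least-loaded socket is busy, every socket is busy.
    if len(sockets) > 1 and best.load >= busy_at:
        warning = "all sockets look busy; expect contention"
    return best, warning


def pin_args(socket: Socket, container: bool = True) -> List[str]:
    """Return the CPU and memory pinning arguments for the chosen socket."""
    if container:
        # docker/podman run flags
        return [f"--cpuset-cpus={socket.cpus}", f"--cpuset-mems={socket.mems}"]
    # conda path: a prefix placed in front of the serve command
    return ["numactl", f"--cpunodebind={socket.mems}", f"--membind={socket.mems}"]
```

On the conda path the returned list would be prepended to the `vllm serve` command; on the container path it would be added to the `docker run` or `podman run` arguments.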
skills/serving-llms-on-epyc/data/epyc.json (2 additions & 2 deletions)
@@ -14,7 +14,7 @@
 "--ipc=host": "vLLM workers use host IPC/shared memory.",
 "--shm-size=16g": "vLLM needs a large /dev/shm; default 64MB is not enough.",
 "--network=host": "Expose the served port directly. Alternative: -p <port>:<port>.",
-"numa": "Default: a single instance uses socket 0's CPUs with NO memory binding (cpu_tune.py emits --cpuset-cpus for the container; conda relies on VLLM_CPU_OMP_THREADS_BIND). On NPS2/NPS4 (multiple NUMA nodes per socket), optimal per-node binding could add performance -- cpu_tune.py notes this; the base recipe does not do it."
+"numa": "A single instance is pinned to ONE socket plus its memory. cpu_tune.py picks a free socket by CPU load on dual-socket hosts (warns if both busy; --socket N forces), sizes KV from that socket's local RAM, and emits --cpuset-cpus + --cpuset-mems (container) or numactl --cpunodebind/--membind (conda). True multi-socket scaling = multiple instances (one per socket), out of scope here."
 }
 },
 "launch": {
@@ -36,7 +36,7 @@
 "smoke_model": "Qwen/Qwen3-0.6B",
 "smoke_model_notes": "Current small Qwen, chat-capable (ships a chat template, so /v1/chat/completions works -- unlike base models such as opt-125m).",
 "env_defaults": {
-"VLLM_CPU_OMP_THREADS_BIND": "set by cpu_tune.py (physical cores of socket 0)",
+"VLLM_CPU_OMP_THREADS_BIND": "set by cpu_tune.py (physical cores of the chosen socket)",
 "VLLM_CPU_KVCACHE_SPACE": "set by cpu_tune.py (GB)",
 "do_not_set": "OMP_NUM_THREADS -- vLLM sets it from the bind list (len of cpu_list); and VLLM_CPU_NUM_OF_RESERVED_CPU -- vLLM has its own default when unset, forcing 0 overrides it."
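As a rough illustration of the `env_defaults` block in epyc.json, the sketch below assembles a launch environment from the two values that cpu_tune.py is said to produce. The `serving_env` helper and its parameters are hypothetical. Stripping inherited copies of the `do_not_set` variables is an assumption about how that note might be honoured, not something this change states.

```python
from typing import Dict, Mapping

# Variables the "do_not_set" note says to leave to vLLM's own defaults.
DO_NOT_SET = ("OMP_NUM_THREADS", "VLLM_CPU_NUM_OF_RESERVED_CPU")


def serving_env(base_env: Mapping[str, str], cpu_list: str,
                kv_cache_gb: int) -> Dict[str, str]:
    """Build the environment for the server process (hypothetical helper).

    cpu_list is the chosen socket's physical cores, e.g. "64-127";
    kv_cache_gb is the KV-cache size in GB. Both are assumed to come
    from cpu_tune.py's output.
    """
    # Assumption: drop inherited values so vLLM falls back to its defaults.
    env = {k: v for k, v in base_env.items() if k not in DO_NOT_SET}
    env["VLLM_CPU_OMP_THREADS_BIND"] = cpu_list
    env["VLLM_CPU_KVCACHE_SPACE"] = str(kv_cache_gb)
    return env
```

A caller might pass `os.environ` as `base_env` and hand the result to `subprocess.Popen(..., env=env)` when starting the server.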