You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add serving-llms-on-epyc walkthrough and EPYC serving fixes
- Add walkthroughs/serving-llms-on-epyc.md (+ README link) for issue #82.
- Re-register the skill in the marketplace (needs a walkthrough to be listed).
- Require AVX-512 (Zen4+): hard gate in Step 1; scope in the description.
- Fix launch: drop --shm-size (conflicts with --ipc=host on podman).
- detect.py: recognize lettered EPYC SKUs (e.g. 9B45 -> Turin/Zen5).
- Note re-run name collision, rootless-podman cpuset, and HF_HOME cache mount.
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Change-Id: Ia2bf2b8f40c2c709f8ad3b3d394a7946d4949b26
Copy file name to clipboardExpand all lines: .claude-plugin/marketplace.json
+5Lines changed: 5 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -24,6 +24,11 @@
24
24
"source": "./skills/magpie-kernel-evaluator",
25
25
"description": "Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces."
26
26
},
27
+
{
28
+
"name": "serving-llms-on-epyc",
29
+
"source": "./skills/serving-llms-on-epyc",
30
+
"description": "Serve LLMs on AMD EPYC CPUs (AVX-512 / Zen4+) with vLLM + zentorch, in a container (Docker/Podman) or conda. Detects the CPU, validates runtime/env, checks vLLM model support and RAM fit, sizes threads/KV and pins one socket + its memory, launches, and verifies. Single instance; reports and stops on failure."
Copy file name to clipboardExpand all lines: .cursor-plugin/marketplace.json
+5Lines changed: 5 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -24,6 +24,11 @@
24
24
"source": "./skills/magpie-kernel-evaluator",
25
25
"description": "Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces."
26
26
},
27
+
{
28
+
"name": "serving-llms-on-epyc",
29
+
"source": "./skills/serving-llms-on-epyc",
30
+
"description": "Serve LLMs on AMD EPYC CPUs (AVX-512 / Zen4+) with vLLM + zentorch, in a container (Docker/Podman) or conda. Detects the CPU, validates runtime/env, checks vLLM model support and RAM fit, sizes threads/KV and pins one socket + its memory, launches, and verifies. Single instance; reports and stops on failure."
Copy file name to clipboardExpand all lines: skills/serving-llms-on-epyc/data/epyc.json
+2-3Lines changed: 2 additions & 3 deletions
Original file line number
Diff line number
Diff line change
@@ -6,13 +6,12 @@
6
6
"comment": "Public vLLM + zentorch CPU image on Docker Hub (amdih/zendnn_zentorch) -- no internal-registry access needed. Tags are vllm_v<ver>_zentorch_v<ver>_<os>_<build>; prefer the newest ubuntu22.04 stable. Both docker and podman are supported; the skill prefers docker and falls back to podman.",
"--ipc=host": "vLLM workers use host IPC/shared memory.",
15
-
"--shm-size=16g": "vLLM needs a large /dev/shm; default 64MB is not enough.",
13
+
"--ipc=host": "vLLM workers need a large /dev/shm; --ipc=host shares the host's (large) shared memory, which covers it. Do NOT also pass --shm-size: podman rejects '--shm-size' together with '--ipc=host' (cannot set shmsize in host IPC namespace), and it is redundant on docker too.",
14
+
"shm_alternative": "If you must isolate IPC (drop --ipc=host), then add --shm-size=16g instead (the 64MB container default is too small for vLLM). Use one or the other, never both.",
16
15
"--network=host": "Expose the served port directly. Alternative: -p <port>:<port>.",
17
16
"numa": "A single instance is pinned to ONE socket plus its memory. cpu_tune.py picks a free socket by CPU load on dual-socket hosts (warns if both busy; --socket N forces), sizes KV from that socket's local RAM, and emits --cpuset-cpus + --cpuset-mems (container) or numactl --cpunodebind/--membind (conda). True multi-socket scaling = multiple instances (one per socket), out of scope here."
Copy file name to clipboardExpand all lines: skills/serving-llms-on-epyc/reference.md
+7-4Lines changed: 7 additions & 4 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -41,8 +41,8 @@ From `data/epyc.json`. Unlike the Instinct (GPU) skill there are **no**
41
41
42
42
| Flag | Why |
43
43
|---|---|
44
-
|`--ipc=host`| vLLM workers use host IPC / shared memory|
45
-
|`--shm-size=16g`|vLLM needs a large `/dev/shm`; the 64MB default is too small |
44
+
|`--ipc=host`| vLLM workers need a large `/dev/shm`; sharing the host IPC namespace provides it. **Do not also pass `--shm-size`** -- podman rejects the combination, and it is redundant on docker|
45
+
|`--shm-size=16g`|**only if you drop `--ipc=host`** (isolated IPC). The 64MB container default is too small for vLLM. Use one or the other, never both|
46
46
|`--network=host`| expose the served port directly (or use `-p <port>:<port>`) |
47
47
|`--cpuset-cpus` / `--cpuset-mems`| pin the container to the chosen socket's physical cores and its NUMA node(s); from `cpu_tune.py`|
48
48
|`-v ~/.cache/huggingface:/root/.cache/huggingface`| reuse the host model cache |
@@ -107,8 +107,11 @@ between the failing and passing runs was `VLLM_USE_AOT_COMPILE`. Never set
107
107
`FREEZING=1` without `VLLM_USE_AOT_COMPILE=0`. The base recipe leaves both unset.
108
108
109
109
**`/dev/shm` too small**
110
-
Without `--shm-size=16g` (or `--ipc=host`), vLLM workers fail to allocate shared
111
-
memory at startup.
110
+
vLLM workers need a large `/dev/shm` or they fail to allocate shared memory at
111
+
startup. The base recipe uses `--ipc=host` (shares the host's large shared memory).
112
+
**Do not combine `--ipc=host` with `--shm-size`** -- podman errors *"cannot set
113
+
shmsize when running in the host IPC Namespace"*, and it is redundant on docker. If
114
+
you drop `--ipc=host`, use `--shm-size=16g` instead -- one or the other, never both.
112
115
113
116
**RAM is the ceiling, not VRAM**
114
117
CPU serving keeps weights + KV cache in system RAM. `estimate_memory.py` checks
The goal of this skill is to teach your AI agent to bring up a vLLM OpenAI-compatible
4
+
endpoint on an **AMD EPYC CPU** host using the zentorch backend — detecting the CPU,
5
+
validating the environment, checking the model fits, sizing the runtime to the
6
+
hardware, launching, and verifying the endpoint responds.
7
+
8
+
**What you'll end up with:** a running `vllm serve` endpoint on your EPYC box (in a
9
+
Docker/Podman container, or a conda env), sized to a single socket and ready to answer
10
+
OpenAI `/v1/chat/completions` requests.
11
+
12
+
## Prerequisites
13
+
14
+
- An **AMD EPYC CPU with AVX-512 support** — i.e. **Zen4+ (Genoa / Bergamo / Siena / Turin) or newer**. This is CPU serving (no GPU required); AVX-512 is required for the zentorch CPU path, and `detect.py` reports it (`avx512`).
15
+
- A container runtime — **Docker** or **Podman** — *or* a conda env with `vllm` + `zentorch` installed.
16
+
- Enough host RAM for the model (weights + KV cache both live in RAM on CPU).
17
+
- A HuggingFace token in `HF_TOKEN`**only** for gated models (Llama, Gemma). The default model (Qwen3) needs none.
18
+
-**Node.js ≥ 18** — required by the `skills` CLI used in Step 2 (`npx skills ...`). Check with `node -v`; on older hosts install a newer Node (e.g. `conda create -n node20 -c conda-forge 'nodejs>=20'`).
19
+
20
+
## Step 1 - Understanding which skills are available
21
+
22
+
* Run `claude "Which skills can you see?" --model sonnet`. You should see a list of skills that does **not** include anything about serving LLMs on EPYC / CPU.
23
+
* Make sure there is no `AGENTS.md` file in your local folder.
24
+
25
+
## Step 2 - Enabling claude to see `serving-llms-on-epyc`
26
+
27
+
* Install the skill with the [`skills` CLI](https://github.com/vercel-labs/skills):
* Run `claude "Which skills can you see?" --model sonnet`. You should see a list of skills that now includes `serving-llms-on-epyc`.
34
+
35
+
## Step 3 - Running the skill
36
+
37
+
Run `claude --model sonnet` on your EPYC host with this prompt:
38
+
39
+
```
40
+
Serve Qwen/Qwen3-0.6B on this AMD EPYC box with vLLM and zentorch.
41
+
```
42
+
43
+
Claude should:
44
+
45
+
1.**Detect the CPU** — confirm it is AMD EPYC and read the generation (Genoa/Turin/…), AVX-512, physical cores, NUMA layout, and RAM.
46
+
2.**Validate the environment** — find an accessible runtime (Docker or Podman, else the conda path), check the image, `HF_TOKEN`, and RAM; report any perf-library advisories.
47
+
3.**Check vLLM supports the model** — verify the architecture against vLLM's model registry (it does not blanket-block multimodal; it rejects non-chat models like embeddings/rerankers).
48
+
4.**Check it fits host RAM** — weights + KV cache + headroom vs available RAM.
49
+
5.**Size the runtime to the hardware** — bind to one socket's physical cores, size the KV cache from that socket's local RAM, and bind memory to that socket (this is **single-socket serving**; vLLM scales poorly across sockets).
50
+
6.**Confirm the plan with you** — present a sized summary (model, path, precision, fit, CPU sizing, port) and wait for you to approve before launching.
51
+
7.**Launch and verify** — pull the public `amdih/zendnn_zentorch` image, run `vllm serve`, poll `/health`, and prove `/v1/chat/completions` works.
52
+
53
+
On any failure it reports the cause + logs and **stops** — it does not retry or start a debugging loop.
54
+
55
+
## Step 4 - Talk to the endpoint
56
+
57
+
Once Claude reports the endpoint is healthy, call it — use the **port from Claude's
***A real workload:** ask for a larger model once the flow is proven, e.g. *"Serve Qwen/Qwen3-8B ..."*. Claude re-checks the RAM fit and re-sizes.
69
+
***Gated models:**`export HF_TOKEN=...` (and accept the model license on HuggingFace), then ask for `meta-llama/Llama-3.1-8B-Instruct`.
70
+
***Pick a socket:** on a dual-socket box Claude picks a free socket by load; you can steer it (*"serve it on socket 1"*).
71
+
72
+
## Step 6 - (Optional) Try to get things done without AMD Skills
73
+
74
+
Remove the added skill and rerun the experiment above. The `skills` CLI installs a
75
+
copy under **both**`.claude/skills/serving-llms-on-epyc`**and**
76
+
`.agents/skills/serving-llms-on-epyc`, so delete both (otherwise the leftover copy
77
+
keeps the skill active and the comparison isn't clean). Without the skill, common
78
+
issues include:
79
+
80
+
* Passing `--device cpu` to `vllm serve` (removed in vLLM ≥ 0.20 with the zentorch plugin) — the server errors out on launch.
81
+
* Guessing at a container image or using a GPU/CUDA image instead of the public CPU `amdih/zendnn_zentorch` one.
82
+
* No hardware-aware sizing — spreading threads across both sockets and sizing the KV cache from whole-system RAM, so the KV pool spills cross-socket and throughput tanks.
83
+
* Launching a model that does not fit host RAM (or an embedding/reranker model that has no chat endpoint) and then looping on the failure.
84
+
* Providing a knowledge article instead of actually bringing up a working endpoint.
0 commit comments