
Commit 40e8397

Add serving-llms-on-epyc walkthrough and EPYC serving fixes
- Add walkthroughs/serving-llms-on-epyc.md (+ README link) for issue #82.
- Re-register the skill in the marketplace (needs a walkthrough to be listed).
- Require AVX-512 (Zen4+): hard gate in Step 1; scope in the description.
- Fix launch: drop --shm-size (conflicts with --ipc=host on podman).
- detect.py: recognize lettered EPYC SKUs (e.g. 9B45 -> Turin/Zen5).
- Note re-run name collision, rootless-podman cpuset, and HF_HOME cache mount.

Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Change-Id: Ia2bf2b8f40c2c709f8ad3b3d394a7946d4949b26
1 parent f505814 commit 40e8397

8 files changed

Lines changed: 145 additions & 23 deletions

File tree

.claude-plugin/marketplace.json

Lines changed: 5 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -24,6 +24,11 @@
2424
"source": "./skills/magpie-kernel-evaluator",
2525
"description": "Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces."
2626
},
27+
{
28+
"name": "serving-llms-on-epyc",
29+
"source": "./skills/serving-llms-on-epyc",
30+
"description": "Serve LLMs on AMD EPYC CPUs (AVX-512 / Zen4+) with vLLM + zentorch, in a container (Docker/Podman) or conda. Detects the CPU, validates runtime/env, checks vLLM model support and RAM fit, sizes threads/KV and pins one socket + its memory, launches, and verifies. Single instance; reports and stops on failure."
31+
},
2732
{
2833
"name": "serving-llms-on-instinct",
2934
"source": "./skills/serving-llms-on-instinct",

.cursor-plugin/marketplace.json

Lines changed: 5 additions & 0 deletions
@@ -24,6 +24,11 @@
2424
"source": "./skills/magpie-kernel-evaluator",
2525
"description": "Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces."
2626
},
27+
{
28+
"name": "serving-llms-on-epyc",
29+
"source": "./skills/serving-llms-on-epyc",
30+
"description": "Serve LLMs on AMD EPYC CPUs (AVX-512 / Zen4+) with vLLM + zentorch, in a container (Docker/Podman) or conda. Detects the CPU, validates runtime/env, checks vLLM model support and RAM fit, sizes threads/KV and pins one socket + its memory, launches, and verifies. Single instance; reports and stops on failure."
31+
},
2732
{
2833
"name": "serving-llms-on-instinct",
2934
"source": "./skills/serving-llms-on-instinct",

skills/serving-llms-on-epyc/SKILL.md

Lines changed: 33 additions & 10 deletions
@@ -5,15 +5,15 @@ description: >-
55
backend, in a container (Docker or Podman) or a conda env. Use whenever the
66
user wants to run, serve, deploy, start, host, or launch an LLM on AMD EPYC,
77
Zen CPU, "vLLM on CPU", "zentorch serving", or "serve a model without a GPU".
8-
Use for "serve Qwen on EPYC", "start a CPU vLLM endpoint", "run an OpenAI
9-
server on my EPYC box", or similar. Handles the full single-instance flow:
8+
Handles the full single-instance flow:
109
detect the CPU (incl. EPYC generation), validate the runtime/env, check vLLM
1110
supports the model (via vLLM's registry, not a modality blocklist), check it
1211
fits host RAM, size CPU threads/KV/NUMA from the hardware, confirm the plan with
1312
the user, launch, and poll until the endpoint is responsive. Single instance,
1413
single socket (pinned to one socket + its memory; vLLM scales poorly across
1514
sockets). Does NOT debug failures and does NOT retry -- it reports and stops. Do
16-
not use for GPU/Instinct (use serving-llms-on-instinct) or multi-node.
15+
not use for GPU/Instinct (use serving-llms-on-instinct), multi-node, or pre-Zen4
16+
EPYC without AVX-512 (Naples/Rome/Milan).
1717
allowed-tools: Bash, Read
1818
---
1919

@@ -52,10 +52,16 @@ python3 scripts/detect.py # add --host user@box for a remote host
5252

5353
Returns `cpu_model`, `is_amd_epyc`, `epyc_generation`
5454
(Naples/Rome/Milan/Genoa/Bergamo/Siena/Turin), `zen_arch`, `avx512`,
55-
`logical_cores`, `physical_cores`, `sockets`, `numa_nodes`, `memory_gb`. If
56-
`is_amd_epyc` is `false`, stop: this skill targets AMD EPYC. (Other x86 may work
57-
but is unsupported here.) Carry `epyc_generation` / `avx512` through the later
58-
phases -- e.g. AVX-512 + bf16 land on Zen4+ (Genoa/Turin), and Turin packs up to
55+
`logical_cores`, `physical_cores`, `sockets`, `numa_nodes`, `memory_gb`.
56+
57+
Two hard gates -- stop if either fails:
58+
- `is_amd_epyc` is `false` -> stop: this skill targets AMD EPYC. (Other x86 may work
59+
but is unsupported here.)
60+
- `avx512` is `false` -> stop: the zentorch CPU path **requires AVX-512**, i.e. Zen4+
61+
(Genoa / Bergamo / Siena / Turin) or newer. Pre-Zen4 EPYC (Naples / Rome / Milan)
62+
is not supported -- say so and stop rather than launching into a load-time failure.
63+
64+
Carry `epyc_generation` / `avx512` through the later phases -- e.g. Turin packs up to
5965
128 cores/socket, which the thread-binding in Step 5 sizes from.
6066

6167
## Step 2: Validate the runtime and environment
@@ -166,9 +172,10 @@ auto-selects the CPU platform and `vllm serve` rejects the flag. Only add it if
166172
including the pull. `RT` is the resolved runtime verbatim:
167173
```bash
168174
RT="<runtime from validate.py: docker | podman>"
175+
$RT rm -f vllm-epyc 2>/dev/null # clear any leftover container from a prior run (name collision otherwise)
169176
$RT pull <image from data/epyc.json> # agent pulls; do not ask the user to
170177
$RT run -d --name vllm-epyc \
171-
<run_flags from data/epyc.json> # --ipc=host --shm-size=16g --network=host
178+
<run_flags from data/epyc.json> # --ipc=host --network=host (NO --shm-size: it conflicts with --ipc=host on podman)
172179
<hf_cache_mount> \
173180
<container_cpuset from cpu_tune> # --cpuset-cpus=<cores> --cpuset-mems=<nodes>
174181
--env VLLM_CPU_OMP_THREADS_BIND="$VLLM_CPU_OMP_THREADS_BIND" \
@@ -244,10 +251,26 @@ See [reference.md](reference.md) for the full list. The load-bearing ones:
244251
zentorch 2.11 (`AssertionError: expected OutputCode, got function`). It only
245252
works with `VLLM_USE_AOT_COMPILE=0` set alongside it. Never set one without
246253
the other.
247-
- **`--shm-size`**: vLLM needs a large `/dev/shm`; the container default (64MB)
248-
is too small. Use `--shm-size=16g` (in `data/epyc.json`).
254+
- **`/dev/shm` — use `--ipc=host`, not `--shm-size`.** vLLM needs a large
255+
`/dev/shm` (the 64MB container default is too small). The base recipe uses
256+
`--ipc=host`, which shares the host's large shared memory. **Do not also pass
257+
`--shm-size`**: podman errors with *"cannot set shmsize when running in the host
258+
IPC Namespace"*, and it is redundant on docker. If you instead isolate IPC (drop
259+
`--ipc=host`), then add `--shm-size=16g` — one or the other, never both.
249260
- **NUMA / socket**: one instance is pinned to **one socket plus its memory** --
250261
CPU bind + `--cpuset-mems` (container) / `numactl --membind` (conda), with KV sized
251262
from that socket's local RAM. On a dual-socket host `cpu_tune.py` picks a free socket
252263
by load and `warning`s if both are busy. NPS2/NPS4 (multi-node socket) gets an
253264
`nps_note` that finer per-node binding could add more.
265+
- **Rootless podman + `--cpuset-cpus`/`--cpuset-mems`**: these are cgroup limits and
266+
may be **ignored or rejected** on rootless podman without cpuset cgroup delegation
267+
(cgroup v1, or v2 without the controller delegated). This is **not fatal**: CPU
268+
thread binding still applies via `VLLM_CPU_OMP_THREADS_BIND` inside the container;
269+
only the container-level memory pin is lost (reduced NUMA locality). If the run
270+
errors specifically on the cpuset flags, drop them and proceed -- do not treat it
271+
as a launch failure.
272+
- **HF cache mount**: the default mounts `~/.cache/huggingface`. If `HF_HOME` points
273+
elsewhere (common on shared hosts, e.g. `/proj/.../vllm`), mount **that** path to
274+
`/root/.cache/huggingface` instead, or the model re-downloads inside the container.
275+
- **Container name reuse**: a leftover `vllm-epyc` from a prior run makes `run` fail
276+
with "name already in use" -- Step 6 clears it first with `$RT rm -f vllm-epyc`.
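
The `/dev/shm` and HF cache notes above can be sketched as a small flag composer. This is an illustration only, not the skill's launch code: the function name `compose_run_flags` and its parameters are hypothetical, while the flag values and the container-side cache path are the ones named in this diff.

```python
import os

# Container-side cache path, as used by hf_cache_mount in data/epyc.json.
CONTAINER_CACHE = "/root/.cache/huggingface"


def compose_run_flags(host_ipc=True, env=None):
    """Hypothetical helper: build container run flags per the notes above."""
    env = os.environ if env is None else env

    # One or the other, never both: podman rejects --shm-size with --ipc=host.
    flags = ["--ipc=host"] if host_ipc else ["--shm-size=16g"]
    flags.append("--network=host")

    # Mount HF_HOME when it is set; otherwise fall back to the default cache.
    cache_dir = env.get("HF_HOME") or os.path.expanduser("~/.cache/huggingface")
    flags += ["-v", f"{cache_dir}:{CONTAINER_CACHE}"]
    return flags
```

Passing `env` explicitly keeps the helper easy to exercise without touching the real environment; with `host_ipc=False` it emits `--shm-size=16g` in place of `--ipc=host`.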

skills/serving-llms-on-epyc/data/epyc.json

Lines changed: 2 additions & 3 deletions
@@ -6,13 +6,12 @@
66
"comment": "Public vLLM + zentorch CPU image on Docker Hub (amdih/zendnn_zentorch) -- no internal-registry access needed. Tags are vllm_v<ver>_zentorch_v<ver>_<os>_<build>; prefer the newest ubuntu22.04 stable. Both docker and podman are supported; the skill prefers docker and falls back to podman.",
77
"run_flags": [
88
"--ipc=host",
9-
"--shm-size=16g",
109
"--network=host"
1110
],
1211
"hf_cache_mount": "-v ~/.cache/huggingface:/root/.cache/huggingface",
1312
"flag_notes": {
14-
"--ipc=host": "vLLM workers use host IPC/shared memory.",
15-
"--shm-size=16g": "vLLM needs a large /dev/shm; default 64MB is not enough.",
13+
"--ipc=host": "vLLM workers need a large /dev/shm; --ipc=host shares the host's (large) shared memory, which covers it. Do NOT also pass --shm-size: podman rejects '--shm-size' together with '--ipc=host' (cannot set shmsize in host IPC namespace), and it is redundant on docker too.",
14+
"shm_alternative": "If you must isolate IPC (drop --ipc=host), then add --shm-size=16g instead (the 64MB container default is too small for vLLM). Use one or the other, never both.",
1615
"--network=host": "Expose the served port directly. Alternative: -p <port>:<port>.",
1716
"numa": "A single instance is pinned to ONE socket plus its memory. cpu_tune.py picks a free socket by CPU load on dual-socket hosts (warns if both busy; --socket N forces), sizes KV from that socket's local RAM, and emits --cpuset-cpus + --cpuset-mems (container) or numactl --cpunodebind/--membind (conda). True multi-socket scaling = multiple instances (one per socket), out of scope here."
1817
}

skills/serving-llms-on-epyc/reference.md

Lines changed: 7 additions & 4 deletions
@@ -41,8 +41,8 @@ From `data/epyc.json`. Unlike the Instinct (GPU) skill there are **no**
4141

4242
| Flag | Why |
4343
|---|---|
44-
| `--ipc=host` | vLLM workers use host IPC / shared memory |
45-
| `--shm-size=16g` | vLLM needs a large `/dev/shm`; the 64MB default is too small |
44+
| `--ipc=host` | vLLM workers need a large `/dev/shm`; sharing the host IPC namespace provides it. **Do not also pass `--shm-size`** -- podman rejects the combination, and it is redundant on docker |
45+
| `--shm-size=16g` | **only if you drop `--ipc=host`** (isolated IPC). The 64MB container default is too small for vLLM. Use one or the other, never both |
4646
| `--network=host` | expose the served port directly (or use `-p <port>:<port>`) |
4747
| `--cpuset-cpus` / `--cpuset-mems` | pin the container to the chosen socket's physical cores and its NUMA node(s); from `cpu_tune.py` |
4848
| `-v ~/.cache/huggingface:/root/.cache/huggingface` | reuse the host model cache |
@@ -107,8 +107,11 @@ between the failing and passing runs was `VLLM_USE_AOT_COMPILE`. Never set
107107
`FREEZING=1` without `VLLM_USE_AOT_COMPILE=0`. The base recipe leaves both unset.
108108

109109
**`/dev/shm` too small**
110-
Without `--shm-size=16g` (or `--ipc=host`), vLLM workers fail to allocate shared
111-
memory at startup.
110+
vLLM workers need a large `/dev/shm` or they fail to allocate shared memory at
111+
startup. The base recipe uses `--ipc=host` (shares the host's large shared memory).
112+
**Do not combine `--ipc=host` with `--shm-size`** -- podman errors *"cannot set
113+
shmsize when running in the host IPC Namespace"*, and it is redundant on docker. If
114+
you drop `--ipc=host`, use `--shm-size=16g` instead -- one or the other, never both.
112115

113116
**RAM is the ceiling, not VRAM**
114117
CPU serving keeps weights + KV cache in system RAM. `estimate_memory.py` checks

skills/serving-llms-on-epyc/scripts/detect.py

Lines changed: 7 additions & 5 deletions
@@ -49,11 +49,13 @@ def _lscpu_field(lscpu_out, label):
4949
def _epyc_generation(model):
5050
"""Map an AMD EPYC model name to (generation, zen_arch).
5151
52-
EPYC numbering encodes the generation: 7xx1=Naples (Zen1), 7xx2=Rome (Zen2),
53-
7xx3=Milan (Zen3), 8xx4=Siena (Zen4c), 97x4=Bergamo (Zen4c), 9xx4=Genoa (Zen4),
54-
9xx5=Turin (Zen5). The agent should carry this through every phase (e.g. AVX-512
55-
+ bf16 land on Zen4+, Turin has up to 128 cores per socket -> thread binding)."""
56-
m = re.search(r"EPYC\s+(\d{4})", model.upper())
52+
EPYC numbering encodes the generation by its first and last digit: 7xx1=Naples
53+
(Zen1), 7xx2=Rome (Zen2), 7xx3=Milan (Zen3), 8xx4=Siena (Zen4c), 97x4=Bergamo
54+
(Zen4c), 9xx4=Genoa (Zen4), 9xx5=Turin (Zen5). Some SKUs carry a letter in the
55+
middle (e.g. 9B45 -> 9__5 -> Turin), so we match 4 alphanumerics whose first and
56+
last chars are digits and key off those. The agent carries this through every
57+
phase (e.g. AVX-512 + bf16 land on Zen4+, Turin has up to 128 cores/socket)."""
58+
m = re.search(r"EPYC\s+(\d[0-9A-Z]{2}\d)", model.upper())
5759
if not m:
5860
return "unknown", "unknown"
5961
num = m.group(1)
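
To illustrate what the relaxed pattern in this hunk accepts, here is a minimal standalone sketch. The regex is the one added above; the mapping logic is reconstructed from the docstring only, and the function name `epyc_generation_sketch` is hypothetical, so the real `_epyc_generation` body (not shown in this hunk) may differ.

```python
import re

# Pattern from the diff: 4 alphanumerics whose first and last chars are digits.
_SKU = re.compile(r"EPYC\s+(\d[0-9A-Z]{2}\d)")


def epyc_generation_sketch(model):
    """Hypothetical re-implementation based on the docstring's numbering table."""
    m = _SKU.search(model.upper())
    if not m:
        return "unknown", "unknown"
    num = m.group(1)
    first, last = num[0], num[-1]

    if first == "7":
        table = {"1": ("Naples", "Zen1"), "2": ("Rome", "Zen2"), "3": ("Milan", "Zen3")}
        return table.get(last, ("unknown", "unknown"))
    if first == "8" and last == "4":
        return "Siena", "Zen4c"
    if first == "9" and last == "4":
        # 97x4 is Bergamo; every other 9xx4 is treated as Genoa.
        return ("Bergamo", "Zen4c") if num[1] == "7" else ("Genoa", "Zen4")
    if first == "9" and last == "5":
        return "Turin", "Zen5"
    return "unknown", "unknown"
```

With the old `\d{4}` pattern a model string such as `AMD EPYC 9B45` would not match at all; with the new one it yields `9B45`, which keys off `9` and `5`.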

walkthroughs/README.md

Lines changed: 2 additions & 1 deletion
@@ -9,4 +9,5 @@ Participatns using other are still encouraged to participate. Just please note t
99
Please choose a skill to get started.
1010

1111
* [local-ai-use](./local-ai-use.md): Teach your agent how to run image generation locally.
12-
* [local-ai-app-integration](./local-ai-app-integration.md): Add a local AI mode to a cloud-only app.
12+
* [local-ai-app-integration](./local-ai-app-integration.md): Add a local AI mode to a cloud-only app.
13+
* [serving-llms-on-epyc](./serving-llms-on-epyc.md): Bring up a vLLM + zentorch LLM endpoint on an AMD EPYC CPU.

walkthroughs/serving-llms-on-epyc.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
1+
# AMD Skills Walkthroughs: `serving-llms-on-epyc`
2+
3+
The goal of this skill is to teach your AI agent to bring up a vLLM OpenAI-compatible
4+
endpoint on an **AMD EPYC CPU** host using the zentorch backend — detecting the CPU,
5+
validating the environment, checking the model fits, sizing the runtime to the
6+
hardware, launching, and verifying the endpoint responds.
7+
8+
**What you'll end up with:** a running `vllm serve` endpoint on your EPYC box (in a
9+
Docker/Podman container, or a conda env), sized to a single socket and ready to answer
10+
OpenAI `/v1/chat/completions` requests.
11+
12+
## Prerequisites
13+
14+
- An **AMD EPYC CPU with AVX-512 support** — i.e. **Zen4+ (Genoa / Bergamo / Siena / Turin) or newer**. This is CPU serving (no GPU required); AVX-512 is required for the zentorch CPU path, and `detect.py` reports it (`avx512`).
15+
- A container runtime (**Docker** or **Podman**), *or* a conda env with `vllm` + `zentorch` installed.
16+
- Enough host RAM for the model (weights + KV cache both live in RAM on CPU).
17+
- A HuggingFace token in `HF_TOKEN` **only** for gated models (Llama, Gemma). The default model (Qwen3) needs none.
18+
- **Node.js ≥ 18** — required by the `skills` CLI used in Step 2 (`npx skills ...`). Check with `node -v`; on older hosts install a newer Node (e.g. `conda create -n node20 -c conda-forge 'nodejs>=20'`).
19+
20+
## Step 1 - Understanding which skills are available
21+
22+
* Run `claude "Which skills can you see?" --model sonnet`. You should see a list of skills that does **not** include anything about serving LLMs on EPYC / CPU.
23+
* Make sure there is no `AGENTS.md` file in your local folder.
24+
25+
## Step 2 - Enabling claude to see `serving-llms-on-epyc`
26+
27+
* Install the skill with the [`skills` CLI](https://github.com/vercel-labs/skills):
28+
29+
```bash
30+
npx skills add amd/skills --skill serving-llms-on-epyc --agent claude-code
31+
```
32+
33+
* Run `claude "Which skills can you see?" --model sonnet`. You should see a list of skills that now includes `serving-llms-on-epyc`.
34+
35+
## Step 3 - Running the skill
36+
37+
Run `claude --model sonnet` on your EPYC host with this prompt:
38+
39+
```
40+
Serve Qwen/Qwen3-0.6B on this AMD EPYC box with vLLM and zentorch.
41+
```
42+
43+
Claude should:
44+
45+
1. **Detect the CPU** — confirm it is AMD EPYC and read the generation (Genoa/Turin/…), AVX-512, physical cores, NUMA layout, and RAM.
46+
2. **Validate the environment** — find an accessible runtime (Docker or Podman, else the conda path), check the image, `HF_TOKEN`, and RAM; report any perf-library advisories.
47+
3. **Check vLLM supports the model** — verify the architecture against vLLM's model registry (it does not blanket-block multimodal; it rejects non-chat models like embeddings/rerankers).
48+
4. **Check it fits host RAM** — weights + KV cache + headroom vs available RAM.
49+
5. **Size the runtime to the hardware** — bind to one socket's physical cores, size the KV cache from that socket's local RAM, and bind memory to that socket (this is **single-socket serving**; vLLM scales poorly across sockets).
50+
6. **Confirm the plan with you** — present a sized summary (model, path, precision, fit, CPU sizing, port) and wait for you to approve before launching.
51+
7. **Launch and verify** — pull the public `amdih/zendnn_zentorch` image, run `vllm serve`, poll `/health`, and prove `/v1/chat/completions` works.
52+
53+
On any failure it reports the cause + logs and **stops** — it does not retry or start a debugging loop.
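
The "poll `/health`, then report and stop" behaviour in item 7 can be sketched as a bounded wait. This is a sketch under assumptions, not the skill's implementation: the names `http_ok` and `wait_until_healthy` are hypothetical, and the attempt count and interval are placeholder values.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not healthy yet.
        return False


def wait_until_healthy(url, attempts=60, interval=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` a bounded number of times; never relaunch or retry the server."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

A caller would do something like `wait_until_healthy("http://localhost:8000/health")` and, on `False`, print the container logs and stop rather than loop.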
54+
55+
## Step 4 - Talk to the endpoint
56+
57+
Once Claude reports the endpoint is healthy, call it — use the **port from Claude's
58+
connection table** (it uses `8000` by default):
59+
60+
```bash
61+
curl -s http://localhost:8000/v1/chat/completions \
62+
-H "Content-Type: application/json" \
63+
-d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Hello"}]}'
64+
```
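
The same request can be made from Python using only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `ask` are hypothetical, the port and model mirror the `curl` call above, and the response parsing assumes the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B", base_url="http://localhost:8000"):
    """Build (but do not send) the POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Call `ask("Hello")` once the endpoint is healthy; pass `base_url=...` if Claude's connection table reports a port other than `8000`.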
65+
66+
## Step 5 - (Optional) Going beyond
67+
68+
* **A real workload:** ask for a larger model once the flow is proven, e.g. *"Serve Qwen/Qwen3-8B ..."*. Claude re-checks the RAM fit and re-sizes.
69+
* **Gated models:** `export HF_TOKEN=...` (and accept the model license on HuggingFace), then ask for `meta-llama/Llama-3.1-8B-Instruct`.
70+
* **Pick a socket:** on a dual-socket box Claude picks a free socket by load; you can steer it (*"serve it on socket 1"*).
71+
72+
## Step 6 - (Optional) Try to get things done without AMD Skills
73+
74+
Remove the added skill and rerun the experiment above. The `skills` CLI installs a
75+
copy under **both** `.claude/skills/serving-llms-on-epyc` **and**
76+
`.agents/skills/serving-llms-on-epyc`, so delete both (otherwise the leftover copy
77+
keeps the skill active and the comparison isn't clean). Without the skill, common
78+
issues include:
79+
80+
* Passing `--device cpu` to `vllm serve` (removed in vLLM ≥ 0.20 with the zentorch plugin) — the server errors out on launch.
81+
* Guessing at a container image or using a GPU/CUDA image instead of the public CPU `amdih/zendnn_zentorch` one.
82+
* No hardware-aware sizing — spreading threads across both sockets and sizing the KV cache from whole-system RAM, so the KV pool spills cross-socket and throughput tanks.
83+
* Launching a model that does not fit host RAM (or an embedding/reranker model that has no chat endpoint) and then looping on the failure.
84+
* Providing a knowledge article instead of actually bringing up a working endpoint.
