5 changes: 5 additions & 0 deletions .claude-plugin/marketplace.json
@@ -24,6 +24,11 @@
"source": "./skills/magpie-kernel-evaluator",
"description": "Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces."
},
{
"name": "serving-llms-on-epyc",
"source": "./skills/serving-llms-on-epyc",
"description": "Serve LLMs on AMD EPYC CPUs (AVX-512 / Zen4+) with vLLM + zentorch, in a container (Docker/Podman) or conda. Detects the CPU, validates runtime/env, checks vLLM model support and RAM fit, sizes threads/KV and pins one socket + its memory, launches, and verifies. Single instance; reports and stops on failure."
},
{
"name": "serving-llms-on-instinct",
"source": "./skills/serving-llms-on-instinct",
5 changes: 5 additions & 0 deletions .cursor-plugin/marketplace.json
@@ -24,6 +24,11 @@
"source": "./skills/magpie-kernel-evaluator",
"description": "Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces."
},
{
"name": "serving-llms-on-epyc",
"source": "./skills/serving-llms-on-epyc",
"description": "Serve LLMs on AMD EPYC CPUs (AVX-512 / Zen4+) with vLLM + zentorch, in a container (Docker/Podman) or conda. Detects the CPU, validates runtime/env, checks vLLM model support and RAM fit, sizes threads/KV and pins one socket + its memory, launches, and verifies. Single instance; reports and stops on failure."
},
{
"name": "serving-llms-on-instinct",
"source": "./skills/serving-llms-on-instinct",
43 changes: 33 additions & 10 deletions skills/serving-llms-on-epyc/SKILL.md
@@ -5,15 +5,15 @@ description: >-
backend, in a container (Docker or Podman) or a conda env. Use whenever the
user wants to run, serve, deploy, start, host, or launch an LLM on AMD EPYC,
Zen CPU, "vLLM on CPU", "zentorch serving", or "serve a model without a GPU".
Use for "serve Qwen on EPYC", "start a CPU vLLM endpoint", "run an OpenAI
server on my EPYC box", or similar. Handles the full single-instance flow:
Handles the full single-instance flow:
detect the CPU (incl. EPYC generation), validate the runtime/env, check vLLM
supports the model (via vLLM's registry, not a modality blocklist), check it
fits host RAM, size CPU threads/KV/NUMA from the hardware, confirm the plan with
the user, launch, and poll until the endpoint is responsive. Single instance,
single socket (pinned to one socket + its memory; vLLM scales poorly across
sockets). Does NOT debug failures and does NOT retry -- it reports and stops. Do
not use for GPU/Instinct (use serving-llms-on-instinct) or multi-node.
not use for GPU/Instinct (use serving-llms-on-instinct), multi-node, or pre-Zen4
EPYC without AVX-512 (Naples/Rome/Milan).
allowed-tools: Bash, Read
---

@@ -52,10 +52,16 @@ python3 scripts/detect.py # add --host user@box for a remote host

Returns `cpu_model`, `is_amd_epyc`, `epyc_generation`
(Naples/Rome/Milan/Genoa/Bergamo/Siena/Turin), `zen_arch`, `avx512`,
`logical_cores`, `physical_cores`, `sockets`, `numa_nodes`, `memory_gb`. If
`is_amd_epyc` is `false`, stop: this skill targets AMD EPYC. (Other x86 may work
but is unsupported here.) Carry `epyc_generation` / `avx512` through the later
phases -- e.g. AVX-512 + bf16 land on Zen4+ (Genoa/Turin), and Turin packs up to
`logical_cores`, `physical_cores`, `sockets`, `numa_nodes`, `memory_gb`.

Two hard gates -- stop if either fails:
- `is_amd_epyc` is `false` -> stop: this skill targets AMD EPYC. (Other x86 may work
but is unsupported here.)
- `avx512` is `false` -> stop: the zentorch CPU path **requires AVX-512**, i.e. Zen4+
(Genoa / Bergamo / Siena / Turin) or newer. Pre-Zen4 EPYC (Naples / Rome / Milan)
is not supported -- say so and stop rather than launching into a load-time failure.

Carry `epyc_generation` / `avx512` through the later phases -- e.g. Turin packs up to
128 cores/socket, which the thread-binding in Step 5 sizes from.
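
The two hard gates above can be sketched as a small check over the fields that `detect.py` returns. This is a minimal illustration, not the skill's own code: the helper name `check_hard_gates` is hypothetical, and it assumes the detection result is available as a plain dict keyed by `is_amd_epyc` and `avx512`.

```python
def check_hard_gates(detected: dict) -> list:
    """Return the reasons to stop; an empty list means both gates pass.

    `detected` is assumed to be the detect.py result as a dict.
    A missing key is treated the same as False (fail closed).
    """
    reasons = []
    if not detected.get("is_amd_epyc", False):
        reasons.append("not an AMD EPYC CPU: this skill targets AMD EPYC")
    if not detected.get("avx512", False):
        reasons.append("no AVX-512: the zentorch CPU path requires Zen4+")
    return reasons


if __name__ == "__main__":
    milan = {"is_amd_epyc": True, "epyc_generation": "Milan", "avx512": False}
    for reason in check_hard_gates(milan):
        print("STOP:", reason)
```

Collecting every failed gate, rather than returning at the first one, lets the agent report all blocking reasons in a single message before it stops.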

## Step 2: Validate the runtime and environment
@@ -166,9 +172,10 @@ auto-selects the CPU platform and `vllm serve` rejects the flag. Only add it if
including the pull. `RT` is the resolved runtime verbatim:
```bash
RT="<runtime from validate.py: docker | podman>"
$RT rm -f vllm-epyc 2>/dev/null # clear any leftover container from a prior run (name collision otherwise)
$RT pull <image from data/epyc.json> # agent pulls; do not ask the user to
$RT run -d --name vllm-epyc \
<run_flags from data/epyc.json> # --ipc=host --shm-size=16g --network=host
<run_flags from data/epyc.json> # --ipc=host --network=host (NO --shm-size: it conflicts with --ipc=host on podman)
<hf_cache_mount> \
<container_cpuset from cpu_tune> # --cpuset-cpus=<cores> --cpuset-mems=<nodes>
--env VLLM_CPU_OMP_THREADS_BIND="$VLLM_CPU_OMP_THREADS_BIND" \
@@ -244,10 +251,26 @@ See [reference.md](reference.md) for the full list. The load-bearing ones:
zentorch 2.11 (`AssertionError: expected OutputCode, got function`). It only
works with `VLLM_USE_AOT_COMPILE=0` set alongside it. Never set one without
the other.
- **`--shm-size`**: vLLM needs a large `/dev/shm`; the container default (64MB)
is too small. Use `--shm-size=16g` (in `data/epyc.json`).
- **`/dev/shm` — use `--ipc=host`, not `--shm-size`.** vLLM needs a large
`/dev/shm` (the 64MB container default is too small). The base recipe uses
`--ipc=host`, which shares the host's large shared memory. **Do not also pass
`--shm-size`**: podman errors with *"cannot set shmsize when running in the host
IPC Namespace"*, and it is redundant on docker. If you instead isolate IPC (drop
`--ipc=host`), then add `--shm-size=16g` — one or the other, never both.
- **NUMA / socket**: one instance is pinned to **one socket plus its memory** --
CPU bind + `--cpuset-mems` (container) / `numactl --membind` (conda), with KV sized
from that socket's local RAM. On a dual-socket host `cpu_tune.py` picks a free socket
by load and `warning`s if both are busy. NPS2/NPS4 (multi-node socket) gets an
`nps_note` that finer per-node binding could add more.
- **Rootless podman + `--cpuset-cpus`/`--cpuset-mems`**: these are cgroup limits and
may be **ignored or rejected** on rootless podman without cpuset cgroup delegation
(cgroup v1, or v2 without the controller delegated). This is **not fatal**: CPU
thread binding still applies via `VLLM_CPU_OMP_THREADS_BIND` inside the container;
only the container-level memory pin is lost (reduced NUMA locality). If the run
errors specifically on the cpuset flags, drop them and proceed -- do not treat it
as a launch failure.
- **HF cache mount**: the default mounts `~/.cache/huggingface`. If `HF_HOME` points
elsewhere (common on shared hosts, e.g. `/proj/.../vllm`), mount **that** path to
`/root/.cache/huggingface` instead, or the model re-downloads inside the container.
- **Container name reuse**: a leftover `vllm-epyc` from a prior run makes `run` fail
with "name already in use" -- Step 6 clears it first with `$RT rm -f vllm-epyc`.
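
For the NUMA / socket note above, the container-level pin comes down to two flags whose values use the cpuset list syntax (for example `0-3,8`). The sketch below shows one way to build them from a list of core ids and NUMA node ids. Both helper names are hypothetical, and the real `cpu_tune.py` may construct these values differently.

```python
def format_cpuset(ids):
    """Collapse integer ids into cpuset list syntax, e.g. [0,1,2,3,8] -> '0-3,8'."""
    ordered = sorted(set(ids))
    ranges = []  # list of [start, end] pairs, inclusive
    for i in ordered:
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1][1] = i          # extend the current run
        else:
            ranges.append([i, i])      # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def cpuset_flags(cores, mem_nodes):
    """Build the two container flags that pin CPUs and memory to one socket."""
    return [
        f"--cpuset-cpus={format_cpuset(cores)}",
        f"--cpuset-mems={format_cpuset(mem_nodes)}",
    ]


if __name__ == "__main__":
    # Illustrative values only: 64 physical cores on socket 0, NUMA node 0.
    print(" ".join(cpuset_flags(range(64), [0])))
```

On rootless podman without cpuset delegation these flags may be rejected, in which case they would simply be left out of the run command, as the note above describes.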
5 changes: 2 additions & 3 deletions skills/serving-llms-on-epyc/data/epyc.json
@@ -6,13 +6,12 @@
"comment": "Public vLLM + zentorch CPU image on Docker Hub (amdih/zendnn_zentorch) -- no internal-registry access needed. Tags are vllm_v<ver>_zentorch_v<ver>_<os>_<build>; prefer the newest ubuntu22.04 stable. Both docker and podman are supported; the skill prefers docker and falls back to podman.",
"run_flags": [
"--ipc=host",
"--shm-size=16g",
"--network=host"
],
"hf_cache_mount": "-v ~/.cache/huggingface:/root/.cache/huggingface",
"flag_notes": {
"--ipc=host": "vLLM workers use host IPC/shared memory.",
"--shm-size=16g": "vLLM needs a large /dev/shm; default 64MB is not enough.",
"--ipc=host": "vLLM workers need a large /dev/shm; --ipc=host shares the host's (large) shared memory, which covers it. Do NOT also pass --shm-size: podman rejects '--shm-size' together with '--ipc=host' (cannot set shmsize in host IPC namespace), and it is redundant on docker too.",
"shm_alternative": "If you must isolate IPC (drop --ipc=host), then add --shm-size=16g instead (the 64MB container default is too small for vLLM). Use one or the other, never both.",
"--network=host": "Expose the served port directly. Alternative: -p <port>:<port>.",
"numa": "A single instance is pinned to ONE socket plus its memory. cpu_tune.py picks a free socket by CPU load on dual-socket hosts (warns if both busy; --socket N forces), sizes KV from that socket's local RAM, and emits --cpuset-cpus + --cpuset-mems (container) or numactl --cpunodebind/--membind (conda). True multi-socket scaling = multiple instances (one per socket), out of scope here."
}
11 changes: 7 additions & 4 deletions skills/serving-llms-on-epyc/reference.md
@@ -41,8 +41,8 @@ From `data/epyc.json`. Unlike the Instinct (GPU) skill there are **no**

| Flag | Why |
|---|---|
| `--ipc=host` | vLLM workers use host IPC / shared memory |
| `--shm-size=16g` | vLLM needs a large `/dev/shm`; the 64MB default is too small |
| `--ipc=host` | vLLM workers need a large `/dev/shm`; sharing the host IPC namespace provides it. **Do not also pass `--shm-size`** -- podman rejects the combination, and it is redundant on docker |
| `--shm-size=16g` | **only if you drop `--ipc=host`** (isolated IPC). The 64MB container default is too small for vLLM. Use one or the other, never both |
| `--network=host` | expose the served port directly (or use `-p <port>:<port>`) |
| `--cpuset-cpus` / `--cpuset-mems` | pin the container to the chosen socket's physical cores and its NUMA node(s); from `cpu_tune.py` |
| `-v ~/.cache/huggingface:/root/.cache/huggingface` | reuse the host model cache |
@@ -107,8 +107,11 @@ between the failing and passing runs was `VLLM_USE_AOT_COMPILE`. Never set
`FREEZING=1` without `VLLM_USE_AOT_COMPILE=0`. The base recipe leaves both unset.

**`/dev/shm` too small**
Without `--shm-size=16g` (or `--ipc=host`), vLLM workers fail to allocate shared
memory at startup.
vLLM workers need a large `/dev/shm` or they fail to allocate shared memory at
startup. The base recipe uses `--ipc=host` (shares the host's large shared memory).
**Do not combine `--ipc=host` with `--shm-size`** -- podman errors *"cannot set
shmsize when running in the host IPC Namespace"*, and it is redundant on docker. If
you drop `--ipc=host`, use `--shm-size=16g` instead -- one or the other, never both.
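
The "one or the other, never both" rule can be checked mechanically before launching. The following is a sketch under the assumption that the run flags are held as a list of strings, as in the `run_flags` array of `data/epyc.json`. The function name `check_shm_flags` and its return strings are hypothetical.

```python
def check_shm_flags(run_flags):
    """Classify a container flag list against the /dev/shm rule.

    Returns 'ok', or a short message describing what to change.
    """
    has_host_ipc = "--ipc=host" in run_flags
    has_shm_size = any(
        f == "--shm-size" or f.startswith("--shm-size=") for f in run_flags
    )
    if has_host_ipc and has_shm_size:
        return "conflict: drop --shm-size (podman rejects it with --ipc=host)"
    if not has_host_ipc and not has_shm_size:
        return "missing: add --ipc=host or --shm-size=16g (64MB default is too small)"
    return "ok"


if __name__ == "__main__":
    print(check_shm_flags(["--ipc=host", "--network=host"]))
    print(check_shm_flags(["--ipc=host", "--shm-size=16g", "--network=host"]))
```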

**RAM is the ceiling, not VRAM**
CPU serving keeps weights + KV cache in system RAM. `estimate_memory.py` checks
12 changes: 7 additions & 5 deletions skills/serving-llms-on-epyc/scripts/detect.py
@@ -49,11 +49,13 @@ def _lscpu_field(lscpu_out, label):
def _epyc_generation(model):
"""Map an AMD EPYC model name to (generation, zen_arch).

EPYC numbering encodes the generation: 7xx1=Naples (Zen1), 7xx2=Rome (Zen2),
7xx3=Milan (Zen3), 8xx4=Siena (Zen4c), 97x4=Bergamo (Zen4c), 9xx4=Genoa (Zen4),
9xx5=Turin (Zen5). The agent should carry this through every phase (e.g. AVX-512
+ bf16 land on Zen4+, Turin has up to 128 cores per socket -> thread binding)."""
m = re.search(r"EPYC\s+(\d{4})", model.upper())
EPYC numbering encodes the generation by its first and last digit: 7xx1=Naples
(Zen1), 7xx2=Rome (Zen2), 7xx3=Milan (Zen3), 8xx4=Siena (Zen4c), 97x4=Bergamo
(Zen4c), 9xx4=Genoa (Zen4), 9xx5=Turin (Zen5). Some SKUs carry a letter in the
middle (e.g. 9B45 -> 9__5 -> Turin), so we match 4 alphanumerics whose first and
last chars are digits and key off those. The agent carries this through every
phase (e.g. AVX-512 + bf16 land on Zen4+, Turin has up to 128 cores/socket)."""
m = re.search(r"EPYC\s+(\d[0-9A-Z]{2}\d)", model.upper())
if not m:
return "unknown", "unknown"
num = m.group(1)
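
The rest of the function body is not shown in this hunk. As a standalone illustration of the rule the docstring describes (key off the first and last character, with `97x4` special-cased as Bergamo), a mapping could look like the sketch below. It reuses the new regex from the diff, but the table name `_BY_FIRST_LAST` and the public name `epyc_generation` are assumptions, and this is not necessarily how `detect.py` implements the lookup.

```python
import re

# (first digit, last digit) -> (generation, zen_arch), per the docstring above.
_BY_FIRST_LAST = {
    ("7", "1"): ("Naples", "Zen1"),
    ("7", "2"): ("Rome", "Zen2"),
    ("7", "3"): ("Milan", "Zen3"),
    ("8", "4"): ("Siena", "Zen4c"),
    ("9", "4"): ("Genoa", "Zen4"),
    ("9", "5"): ("Turin", "Zen5"),
}


def epyc_generation(model):
    """Map a CPU model string to (generation, zen_arch), or unknown/unknown."""
    m = re.search(r"EPYC\s+(\d[0-9A-Z]{2}\d)", model.upper())
    if not m:
        return "unknown", "unknown"
    num = m.group(1)
    # 97x4 is Bergamo; check it before the generic 9xx4 (Genoa) rule.
    if num[:2] == "97" and num[3] == "4":
        return "Bergamo", "Zen4c"
    return _BY_FIRST_LAST.get((num[0], num[3]), ("unknown", "unknown"))


if __name__ == "__main__":
    for name in ("AMD EPYC 9654 96-Core Processor", "AMD EPYC 9B45", "AMD EPYC 7763"):
        print(name, "->", epyc_generation(name))
```

Checking the Bergamo pattern first matters because a plain first/last lookup would classify `9754` as Genoa.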
3 changes: 2 additions & 1 deletion walkthroughs/README.md
@@ -9,4 +9,5 @@ Participants using other are still encouraged to participate. Just please note t
Please choose a skill to get started.

* [local-ai-use](./local-ai-use.md): Teach your agent how to run image generation locally.
* [local-ai-app-integration](./local-ai-app-integration.md): Add a local AI mode to a cloud-only app.
* [serving-llms-on-epyc](./serving-llms-on-epyc.md): Bring up a vLLM + zentorch LLM endpoint on an AMD EPYC CPU.
84 changes: 84 additions & 0 deletions walkthroughs/serving-llms-on-epyc.md
@@ -0,0 +1,84 @@
# AMD Skills Walkthroughs: `serving-llms-on-epyc`

The goal of this skill is to teach your AI agent to bring up a vLLM OpenAI-compatible
endpoint on an **AMD EPYC CPU** host using the zentorch backend — detecting the CPU,
validating the environment, checking the model fits, sizing the runtime to the
hardware, launching, and verifying the endpoint responds.

**What you'll end up with:** a running `vllm serve` endpoint on your EPYC box (in a
Docker/Podman container, or a conda env), sized to a single socket and ready to answer
OpenAI `/v1/chat/completions` requests.

## Prerequisites

- An **AMD EPYC CPU with AVX-512 support** — i.e. **Zen4+ (Genoa / Bergamo / Siena / Turin) or newer**. This is CPU serving (no GPU required); AVX-512 is required for the zentorch CPU path, and `detect.py` reports it (`avx512`).
- A container runtime — **Docker** or **Podman** — *or* a conda env with `vllm` + `zentorch` installed.
- Enough host RAM for the model (weights + KV cache both live in RAM on CPU).
- A HuggingFace token in `HF_TOKEN` **only** for gated models (Llama, Gemma). The default model (Qwen3) needs none.
- **Node.js ≥ 18** — required by the `skills` CLI used in Step 2 (`npx skills ...`). Check with `node -v`; on older hosts install a newer Node (e.g. `conda create -n node20 -c conda-forge 'nodejs>=20'`).

## Step 1 - Understanding which skills are available

* Run `claude "Which skills can you see?" --model sonnet`. You should see a list of skills that does **not** include anything about serving LLMs on EPYC / CPU.
* Make sure there is no `AGENTS.md` file in your local folder.

## Step 2 - Enabling Claude to see `serving-llms-on-epyc`

* Install the skill with the [`skills` CLI](https://github.com/vercel-labs/skills):

```bash
npx skills add amd/skills --skill serving-llms-on-epyc --agent claude-code
```

* Run `claude "Which skills can you see?" --model sonnet`. You should see a list of skills that now includes `serving-llms-on-epyc`.

## Step 3 - Running the skill

Run `claude --model sonnet` on your EPYC host with this prompt:

```
Serve Qwen/Qwen3-0.6B on this AMD EPYC box with vLLM and zentorch.
```

Claude should:

1. **Detect the CPU** — confirm it is AMD EPYC and read the generation (Genoa/Turin/…), AVX-512, physical cores, NUMA layout, and RAM.
2. **Validate the environment** — find an accessible runtime (Docker or Podman, else the conda path), check the image, `HF_TOKEN`, and RAM; report any perf-library advisories.
3. **Check vLLM supports the model** — verify the architecture against vLLM's model registry (it does not blanket-block multimodal; it rejects non-chat models like embeddings/rerankers).
4. **Check it fits host RAM** — weights + KV cache + headroom vs available RAM.
5. **Size the runtime to the hardware** — bind to one socket's physical cores, size the KV cache from that socket's local RAM, and bind memory to that socket (this is **single-socket serving**; vLLM scales poorly across sockets).
6. **Confirm the plan with you** — present a sized summary (model, path, precision, fit, CPU sizing, port) and wait for you to approve before launching.
7. **Launch and verify** — pull the public `amdih/zendnn_zentorch` image, run `vllm serve`, poll `/health`, and prove `/v1/chat/completions` works.

On any failure it reports the cause + logs and **stops** — it does not retry or start a debugging loop.
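
To make the "poll `/health`" part of item 7 concrete, here is a minimal polling sketch. It is not the skill's implementation: the function names, the 600-second timeout, and the 5-second interval are all assumptions chosen for illustration. Only the `/health` path and the default port `8000` come from this walkthrough.

```python
import time
import urllib.request


def wait_until_healthy(probe, timeout_s=600.0, interval_s=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s elapses.

    Returns True once healthy, False on timeout. It only waits for the
    server to come up; it never relaunches anything.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # e.g. connection refused while the model is still loading
        if clock() >= deadline:
            return False
        sleep(interval_s)


def http_health_probe(port=8000):
    """Build a probe that is True when GET /health answers with HTTP 200."""
    def probe():
        url = f"http://localhost:{port}/health"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    return probe


if __name__ == "__main__":
    ok = wait_until_healthy(http_health_probe(8000), timeout_s=30)
    print("healthy" if ok else "timed out: report the logs and stop")
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting. On timeout the caller reports and stops, matching the no-retry behaviour described above.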

## Step 4 - Talk to the endpoint

Once Claude reports the endpoint is healthy, call it — use the **port from Claude's
connection table** (it uses `8000` by default):

```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Hello"}]}'
```
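
The same request can be sent from Python using only the standard library. This is a sketch, not part of the skill: the helper names are hypothetical, and the response parsing assumes the usual OpenAI chat-completions shape (`choices[0].message.content`), which a vLLM OpenAI-compatible server is expected to return.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B", port=8000):
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Hello"))
```

If Claude reported a different port or model name in its connection table, pass them as `port=` and `model=`.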

## Step 5 - (Optional) Going beyond

* **A real workload:** ask for a larger model once the flow is proven, e.g. *"Serve Qwen/Qwen3-8B ..."*. Claude re-checks the RAM fit and re-sizes.
* **Gated models:** `export HF_TOKEN=...` (and accept the model license on HuggingFace), then ask for `meta-llama/Llama-3.1-8B-Instruct`.
* **Pick a socket:** on a dual-socket box Claude picks a free socket by load; you can steer it (*"serve it on socket 1"*).

## Step 6 - (Optional) Try to get things done without AMD Skills

Remove the added skill and rerun the experiment above. The `skills` CLI installs a
copy under **both** `.claude/skills/serving-llms-on-epyc` **and**
`.agents/skills/serving-llms-on-epyc`, so delete both (otherwise the leftover copy
keeps the skill active and the comparison isn't clean). Without the skill, common
issues include:

* Passing `--device cpu` to `vllm serve` (removed in vLLM ≥ 0.20 with the zentorch plugin) — the server errors out on launch.
* Guessing at a container image or using a GPU/CUDA image instead of the public CPU `amdih/zendnn_zentorch` one.
* No hardware-aware sizing — spreading threads across both sockets and sizing the KV cache from whole-system RAM, so the KV pool spills cross-socket and throughput tanks.
* Launching a model that does not fit host RAM (or an embedding/reranker model that has no chat endpoint) and then looping on the failure.
* Providing a knowledge article instead of actually bringing up a working endpoint.