diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index f00ba1e..2c723f1 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -24,6 +24,11 @@ "source": "./skills/magpie-kernel-evaluator", "description": "Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces." }, + { + "name": "serving-llms-on-epyc", + "source": "./skills/serving-llms-on-epyc", + "description": "Serve LLMs on AMD EPYC CPUs (AVX-512 / Zen4+) with vLLM + zentorch, in a container (Docker/Podman) or conda. Detects the CPU, validates runtime/env, checks vLLM model support and RAM fit, sizes threads/KV and pins one socket + its memory, launches, and verifies. Single instance; reports and stops on failure." + }, { "name": "serving-llms-on-instinct", "source": "./skills/serving-llms-on-instinct", diff --git a/.cursor-plugin/marketplace.json b/.cursor-plugin/marketplace.json index f00ba1e..2c723f1 100644 --- a/.cursor-plugin/marketplace.json +++ b/.cursor-plugin/marketplace.json @@ -24,6 +24,11 @@ "source": "./skills/magpie-kernel-evaluator", "description": "Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces." }, + { + "name": "serving-llms-on-epyc", + "source": "./skills/serving-llms-on-epyc", + "description": "Serve LLMs on AMD EPYC CPUs (AVX-512 / Zen4+) with vLLM + zentorch, in a container (Docker/Podman) or conda. Detects the CPU, validates runtime/env, checks vLLM model support and RAM fit, sizes threads/KV and pins one socket + its memory, launches, and verifies. 
Single instance; reports and stops on failure." + }, { "name": "serving-llms-on-instinct", "source": "./skills/serving-llms-on-instinct", diff --git a/skills/serving-llms-on-epyc/SKILL.md b/skills/serving-llms-on-epyc/SKILL.md index 14b97e3..ad08765 100644 --- a/skills/serving-llms-on-epyc/SKILL.md +++ b/skills/serving-llms-on-epyc/SKILL.md @@ -5,15 +5,15 @@ description: >- backend, in a container (Docker or Podman) or a conda env. Use whenever the user wants to run, serve, deploy, start, host, or launch an LLM on AMD EPYC, Zen CPU, "vLLM on CPU", "zentorch serving", or "serve a model without a GPU". - Use for "serve Qwen on EPYC", "start a CPU vLLM endpoint", "run an OpenAI - server on my EPYC box", or similar. Handles the full single-instance flow: + Handles the full single-instance flow: detect the CPU (incl. EPYC generation), validate the runtime/env, check vLLM supports the model (via vLLM's registry, not a modality blocklist), check it fits host RAM, size CPU threads/KV/NUMA from the hardware, confirm the plan with the user, launch, and poll until the endpoint is responsive. Single instance, single socket (pinned to one socket + its memory; vLLM scales poorly across sockets). Does NOT debug failures and does NOT retry -- it reports and stops. Do - not use for GPU/Instinct (use serving-llms-on-instinct) or multi-node. + not use for GPU/Instinct (use serving-llms-on-instinct), multi-node, or pre-Zen4 + EPYC without AVX-512 (Naples/Rome/Milan). allowed-tools: Bash, Read --- @@ -52,10 +52,16 @@ python3 scripts/detect.py # add --host user@box for a remote host Returns `cpu_model`, `is_amd_epyc`, `epyc_generation` (Naples/Rome/Milan/Genoa/Bergamo/Siena/Turin), `zen_arch`, `avx512`, -`logical_cores`, `physical_cores`, `sockets`, `numa_nodes`, `memory_gb`. If -`is_amd_epyc` is `false`, stop: this skill targets AMD EPYC. (Other x86 may work -but is unsupported here.) Carry `epyc_generation` / `avx512` through the later -phases -- e.g. 
AVX-512 + bf16 land on Zen4+ (Genoa/Turin), and Turin packs up to +`logical_cores`, `physical_cores`, `sockets`, `numa_nodes`, `memory_gb`. + +Two hard gates -- stop if either fails: +- `is_amd_epyc` is `false` -> stop: this skill targets AMD EPYC. (Other x86 may work + but is unsupported here.) +- `avx512` is `false` -> stop: the zentorch CPU path **requires AVX-512**, i.e. Zen4+ + (Genoa / Bergamo / Siena / Turin) or newer. Pre-Zen4 EPYC (Naples / Rome / Milan) + is not supported -- say so and stop rather than launching into a load-time failure. + +Carry `epyc_generation` / `avx512` through the later phases -- e.g. Turin packs up to 128 cores/socket, which the thread-binding in Step 5 sizes from. ## Step 2: Validate the runtime and environment @@ -166,9 +172,10 @@ auto-selects the CPU platform and `vllm serve` rejects the flag. Only add it if including the pull. `RT` is the resolved runtime verbatim: ```bash RT="" +$RT rm -f vllm-epyc 2>/dev/null # clear any leftover container from a prior run (name collision otherwise) $RT pull # agent pulls; do not ask the user to $RT run -d --name vllm-epyc \ - # --ipc=host --shm-size=16g --network=host + # --ipc=host --network=host (NO --shm-size: it conflicts with --ipc=host on podman) \ # --cpuset-cpus= --cpuset-mems= --env VLLM_CPU_OMP_THREADS_BIND="$VLLM_CPU_OMP_THREADS_BIND" \ @@ -244,10 +251,26 @@ See [reference.md](reference.md) for the full list. The load-bearing ones: zentorch 2.11 (`AssertionError: expected OutputCode, got function`). It only works with `VLLM_USE_AOT_COMPILE=0` set alongside it. Never set one without the other. -- **`--shm-size`**: vLLM needs a large `/dev/shm`; the container default (64MB) - is too small. Use `--shm-size=16g` (in `data/epyc.json`). +- **`/dev/shm` — use `--ipc=host`, not `--shm-size`.** vLLM needs a large + `/dev/shm` (the 64MB container default is too small). The base recipe uses + `--ipc=host`, which shares the host's large shared memory. 
**Do not also pass + `--shm-size`**: podman errors with *"cannot set shmsize when running in the host + IPC Namespace"*, and it is redundant on docker. If you instead isolate IPC (drop + `--ipc=host`), then add `--shm-size=16g` — one or the other, never both. - **NUMA / socket**: one instance is pinned to **one socket plus its memory** -- CPU bind + `--cpuset-mems` (container) / `numactl --membind` (conda), with KV sized from that socket's local RAM. On a dual-socket host `cpu_tune.py` picks a free socket by load and `warning`s if both are busy. NPS2/NPS4 (multi-node socket) gets an `nps_note` that finer per-node binding could add more. +- **Rootless podman + `--cpuset-cpus`/`--cpuset-mems`**: these are cgroup limits and + may be **ignored or rejected** on rootless podman without cpuset cgroup delegation + (cgroup v1, or v2 without the controller delegated). This is **not fatal**: CPU + thread binding still applies via `VLLM_CPU_OMP_THREADS_BIND` inside the container; + only the container-level memory pin is lost (reduced NUMA locality). If the run + errors specifically on the cpuset flags, drop them and proceed -- do not treat it + as a launch failure. +- **HF cache mount**: the default mounts `~/.cache/huggingface`. If `HF_HOME` points + elsewhere (common on shared hosts, e.g. `/proj/.../vllm`), mount **that** path to + `/root/.cache/huggingface` instead, or the model re-downloads inside the container. +- **Container name reuse**: a leftover `vllm-epyc` from a prior run makes `run` fail + with "name already in use" -- Step 6 clears it first with `$RT rm -f vllm-epyc`. diff --git a/skills/serving-llms-on-epyc/data/epyc.json b/skills/serving-llms-on-epyc/data/epyc.json index deb67f4..96ce5fd 100644 --- a/skills/serving-llms-on-epyc/data/epyc.json +++ b/skills/serving-llms-on-epyc/data/epyc.json @@ -6,13 +6,12 @@ "comment": "Public vLLM + zentorch CPU image on Docker Hub (amdih/zendnn_zentorch) -- no internal-registry access needed. 
Tags are vllm_v_zentorch_v__; prefer the newest ubuntu22.04 stable. Both docker and podman are supported; the skill prefers docker and falls back to podman.", "run_flags": [ "--ipc=host", - "--shm-size=16g", "--network=host" ], "hf_cache_mount": "-v ~/.cache/huggingface:/root/.cache/huggingface", "flag_notes": { - "--ipc=host": "vLLM workers use host IPC/shared memory.", - "--shm-size=16g": "vLLM needs a large /dev/shm; default 64MB is not enough.", + "--ipc=host": "vLLM workers need a large /dev/shm; --ipc=host shares the host's (large) shared memory, which covers it. Do NOT also pass --shm-size: podman rejects '--shm-size' together with '--ipc=host' (cannot set shmsize in host IPC namespace), and it is redundant on docker too.", + "shm_alternative": "If you must isolate IPC (drop --ipc=host), then add --shm-size=16g instead (the 64MB container default is too small for vLLM). Use one or the other, never both.", "--network=host": "Expose the served port directly. Alternative: -p :.", "numa": "A single instance is pinned to ONE socket plus its memory. cpu_tune.py picks a free socket by CPU load on dual-socket hosts (warns if both busy; --socket N forces), sizes KV from that socket's local RAM, and emits --cpuset-cpus + --cpuset-mems (container) or numactl --cpunodebind/--membind (conda). True multi-socket scaling = multiple instances (one per socket), out of scope here." } diff --git a/skills/serving-llms-on-epyc/reference.md b/skills/serving-llms-on-epyc/reference.md index 4a12ee1..5d70d36 100644 --- a/skills/serving-llms-on-epyc/reference.md +++ b/skills/serving-llms-on-epyc/reference.md @@ -41,8 +41,8 @@ From `data/epyc.json`. Unlike the Instinct (GPU) skill there are **no** | Flag | Why | |---|---| -| `--ipc=host` | vLLM workers use host IPC / shared memory | -| `--shm-size=16g` | vLLM needs a large `/dev/shm`; the 64MB default is too small | +| `--ipc=host` | vLLM workers need a large `/dev/shm`; sharing the host IPC namespace provides it. 
**Do not also pass `--shm-size`** -- podman rejects the combination, and it is redundant on docker | +| `--shm-size=16g` | **only if you drop `--ipc=host`** (isolated IPC). The 64MB container default is too small for vLLM. Use one or the other, never both | | `--network=host` | expose the served port directly (or use `-p :`) | | `--cpuset-cpus` / `--cpuset-mems` | pin the container to the chosen socket's physical cores and its NUMA node(s); from `cpu_tune.py` | | `-v ~/.cache/huggingface:/root/.cache/huggingface` | reuse the host model cache | @@ -107,8 +107,11 @@ between the failing and passing runs was `VLLM_USE_AOT_COMPILE`. Never set `FREEZING=1` without `VLLM_USE_AOT_COMPILE=0`. The base recipe leaves both unset. **`/dev/shm` too small** -Without `--shm-size=16g` (or `--ipc=host`), vLLM workers fail to allocate shared -memory at startup. +vLLM workers need a large `/dev/shm` or they fail to allocate shared memory at +startup. The base recipe uses `--ipc=host` (shares the host's large shared memory). +**Do not combine `--ipc=host` with `--shm-size`** -- podman errors *"cannot set +shmsize when running in the host IPC Namespace"*, and it is redundant on docker. If +you drop `--ipc=host`, use `--shm-size=16g` instead -- one or the other, never both. **RAM is the ceiling, not VRAM** CPU serving keeps weights + KV cache in system RAM. `estimate_memory.py` checks diff --git a/skills/serving-llms-on-epyc/scripts/detect.py b/skills/serving-llms-on-epyc/scripts/detect.py index c0c3340..b49d0d7 100644 --- a/skills/serving-llms-on-epyc/scripts/detect.py +++ b/skills/serving-llms-on-epyc/scripts/detect.py @@ -49,11 +49,13 @@ def _lscpu_field(lscpu_out, label): def _epyc_generation(model): """Map an AMD EPYC model name to (generation, zen_arch). - EPYC numbering encodes the generation: 7xx1=Naples (Zen1), 7xx2=Rome (Zen2), - 7xx3=Milan (Zen3), 8xx4=Siena (Zen4c), 97x4=Bergamo (Zen4c), 9xx4=Genoa (Zen4), - 9xx5=Turin (Zen5). 
The agent should carry this through every phase (e.g. AVX-512 - + bf16 land on Zen4+, Turin has up to 128 cores per socket -> thread binding).""" - m = re.search(r"EPYC\s+(\d{4})", model.upper()) + EPYC numbering encodes the generation by its first and last digit: 7xx1=Naples + (Zen1), 7xx2=Rome (Zen2), 7xx3=Milan (Zen3), 8xx4=Siena (Zen4c), 97x4=Bergamo + (Zen4c), 9xx4=Genoa (Zen4), 9xx5=Turin (Zen5). Some SKUs carry a letter in the + middle (e.g. 9B45 -> 9__5 -> Turin), so we match 4 alphanumerics whose first and + last chars are digits and key off those. The agent carries this through every + phase (e.g. AVX-512 + bf16 land on Zen4+, Turin has up to 128 cores/socket).""" + m = re.search(r"EPYC\s+(\d[0-9A-Z]{2}\d)", model.upper()) if not m: return "unknown", "unknown" num = m.group(1) diff --git a/walkthroughs/README.md b/walkthroughs/README.md index 8ca0289..b335827 100644 --- a/walkthroughs/README.md +++ b/walkthroughs/README.md @@ -9,4 +9,5 @@ Participatns using other are still encouraged to participate. Just please note t Please choose a skill to get started. * [local-ai-use](./local-ai-use.md): Teach your agent how to run image generation locally. -* [local-ai-app-integration](./local-ai-app-integration.md): Add a local AI mode to a cloud-only app. \ No newline at end of file +* [local-ai-app-integration](./local-ai-app-integration.md): Add a local AI mode to a cloud-only app. +* [serving-llms-on-epyc](./serving-llms-on-epyc.md): Bring up a vLLM + zentorch LLM endpoint on an AMD EPYC CPU. 
\ No newline at end of file diff --git a/walkthroughs/serving-llms-on-epyc.md b/walkthroughs/serving-llms-on-epyc.md new file mode 100644 index 0000000..5b263cb --- /dev/null +++ b/walkthroughs/serving-llms-on-epyc.md @@ -0,0 +1,84 @@ +# AMD Skills Walkthroughs: `serving-llms-on-epyc` + +The goal of this skill is to teach your AI agent to bring up a vLLM OpenAI-compatible +endpoint on an **AMD EPYC CPU** host using the zentorch backend — detecting the CPU, +validating the environment, checking the model fits, sizing the runtime to the +hardware, launching, and verifying the endpoint responds. + +**What you'll end up with:** a running `vllm serve` endpoint on your EPYC box (in a +Docker/Podman container, or a conda env), sized to a single socket and ready to answer +OpenAI `/v1/chat/completions` requests. + +## Prerequisites + +- An **AMD EPYC CPU with AVX-512 support** — i.e. **Zen4+ (Genoa / Bergamo / Siena / Turin) or newer**. This is CPU serving (no GPU required); AVX-512 is required for the zentorch CPU path, and `detect.py` reports it (`avx512`). +- A container runtime — **Docker** or **Podman** — *or* a conda env with `vllm` + `zentorch` installed. +- Enough host RAM for the model (weights + KV cache both live in RAM on CPU). +- A HuggingFace token in `HF_TOKEN` **only** for gated models (Llama, Gemma). The default model (Qwen3) needs none. +- **Node.js ≥ 18** — required by the `skills` CLI used in Step 2 (`npx skills ...`). Check with `node -v`; on older hosts install a newer Node (e.g. `conda create -n node20 -c conda-forge 'nodejs>=20'`). + +## Step 1 - Understanding which skills are available + +* Run `claude "Which skills can you see?" --model sonnet`. You should see a list of skills that does **not** include anything about serving LLMs on EPYC / CPU. +* Make sure there is no `AGENTS.md` file in your local folder. 
+ +## Step 2 - Enabling claude to see `serving-llms-on-epyc` + +* Install the skill with the [`skills` CLI](https://github.com/vercel-labs/skills): + +```bash +npx skills add amd/skills --skill serving-llms-on-epyc --agent claude-code +``` + +* Run `claude "Which skills can you see?" --model sonnet`. You should see a list of skills that now includes `serving-llms-on-epyc`. + +## Step 3 - Running the skill + +Run `claude --model sonnet` on your EPYC host with this prompt: + +``` +Serve Qwen/Qwen3-0.6B on this AMD EPYC box with vLLM and zentorch. +``` + +Claude should: + +1. **Detect the CPU** — confirm it is AMD EPYC and read the generation (Genoa/Turin/…), AVX-512, physical cores, NUMA layout, and RAM. +2. **Validate the environment** — find an accessible runtime (Docker or Podman, else the conda path), check the image, `HF_TOKEN`, and RAM; report any perf-library advisories. +3. **Check vLLM supports the model** — verify the architecture against vLLM's model registry (it does not blanket-block multimodal; it rejects non-chat models like embeddings/rerankers). +4. **Check it fits host RAM** — weights + KV cache + headroom vs available RAM. +5. **Size the runtime to the hardware** — bind to one socket's physical cores, size the KV cache from that socket's local RAM, and bind memory to that socket (this is **single-socket serving**; vLLM scales poorly across sockets). +6. **Confirm the plan with you** — present a sized summary (model, path, precision, fit, CPU sizing, port) and wait for you to approve before launching. +7. **Launch and verify** — pull the public `amdih/zendnn_zentorch` image, run `vllm serve`, poll `/health`, and prove `/v1/chat/completions` works. + +On any failure it reports the cause + logs and **stops** — it does not retry or start a debugging loop. 
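
The CPU detection and the two hard gates in item 1 can be sketched as below. This is a minimal illustration rather than the actual `detect.py`: the regex and the number-to-generation table come from the docstring in the `detect.py` change, while the function names `epyc_generation` and `gate`, the `_SEVEN_SERIES` table, and the stop messages are hypothetical.

```python
import re
from typing import Optional, Tuple

# Last digit -> (generation, zen_arch) for 7xxN parts, per the detect.py docstring.
# The table name is an assumption made for this sketch.
_SEVEN_SERIES = {"1": ("Naples", "Zen1"), "2": ("Rome", "Zen2"), "3": ("Milan", "Zen3")}
_UNKNOWN = ("unknown", "unknown")


def epyc_generation(model: str) -> Tuple[str, str]:
    """Map a CPU model string to (generation, zen_arch)."""
    # Four alphanumerics whose first and last characters are digits (e.g. 9B45).
    m = re.search(r"EPYC\s+(\d[0-9A-Z]{2}\d)", model.upper())
    if not m:
        return _UNKNOWN
    num = m.group(1)
    first, last = num[0], num[-1]
    if first == "7":
        return _SEVEN_SERIES.get(last, _UNKNOWN)
    if first == "8" and last == "4":
        return ("Siena", "Zen4c")
    if first == "9" and last == "4":
        # 97x4 is Bergamo (Zen4c); any other 9xx4 is treated as Genoa (Zen4).
        return ("Bergamo", "Zen4c") if num[1] == "7" else ("Genoa", "Zen4")
    if first == "9" and last == "5":
        return ("Turin", "Zen5")
    return _UNKNOWN


def gate(is_amd_epyc: bool, avx512: bool) -> Optional[str]:
    """Return a stop reason if either hard gate fails, else None (hypothetical wording)."""
    if not is_amd_epyc:
        return "not an AMD EPYC CPU"
    if not avx512:
        return "AVX-512 missing (pre-Zen4 EPYC is not supported)"
    return None
```

Keying the gate on the reported `avx512` flag rather than on the generation name keeps the check valid for model strings the table does not recognise.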
+ +## Step 4 - Talk to the endpoint + +Once Claude reports the endpoint is healthy, call it — use the **port from Claude's +connection table** (it uses `8000` by default): + +```bash +curl -s http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Hello"}]}' +``` + +## Step 5 - (Optional) Going beyond + +* **A real workload:** ask for a larger model once the flow is proven, e.g. *"Serve Qwen/Qwen3-8B ..."*. Claude re-checks the RAM fit and re-sizes. +* **Gated models:** `export HF_TOKEN=...` (and accept the model license on HuggingFace), then ask for `meta-llama/Llama-3.1-8B-Instruct`. +* **Pick a socket:** on a dual-socket box Claude picks a free socket by load; you can steer it (*"serve it on socket 1"*). + +## Step 6 - (Optional) Try to get things done without AMD Skills + +Remove the added skill and rerun the experiment above. The `skills` CLI installs a +copy under **both** `.claude/skills/serving-llms-on-epyc` **and** +`.agents/skills/serving-llms-on-epyc`, so delete both (otherwise the leftover copy +keeps the skill active and the comparison isn't clean). Without the skill, common +issues include: + +* Passing `--device cpu` to `vllm serve` (removed in vLLM ≥ 0.20 with the zentorch plugin) — the server errors out on launch. +* Guessing at a container image or using a GPU/CUDA image instead of the public CPU `amdih/zendnn_zentorch` one. +* No hardware-aware sizing — spreading threads across both sockets and sizing the KV cache from whole-system RAM, so the KV pool spills cross-socket and throughput tanks. +* Launching a model that does not fit host RAM (or an embedding/reranker model that has no chat endpoint) and then looping on the failure. +* Providing a knowledge article instead of actually bringing up a working endpoint.
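
The RAM-fit check from Step 3 (weights + KV cache + headroom against the RAM local to one socket) can be sketched roughly as follows. This is not `estimate_memory.py`: the function names, the 10% headroom fraction, and the example sizes are assumptions chosen for illustration. The only fixed fact used is that bf16 stores 2 bytes per parameter.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GB (bf16 = 2 bytes per parameter)."""
    return params_billion * bytes_per_param


def fits_socket_ram(
    weights: float,
    kv_cache: float,
    socket_ram: float,
    headroom_frac: float = 0.10,  # assumed safety margin, not a value from the skill
) -> bool:
    """True if weights + KV cache + headroom fit in one socket's local RAM (all in GB)."""
    headroom = headroom_frac * socket_ram
    return weights + kv_cache + headroom <= socket_ram


# Example: an 8B-parameter bf16 model with a hypothetical 8 GB KV cache.
w = weights_gb(8)                      # 16.0 GB of weights
ok_large = fits_socket_ram(w, 8.0, socket_ram=64.0)
ok_small = fits_socket_ram(w, 8.0, socket_ram=24.0)
```

Sizing against socket-local RAM rather than whole-system RAM matches the single-socket pinning described above; a check against total RAM could pass while the KV pool still spills to the other socket.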