Add serving-llms-on-epyc walkthrough and EPYC serving fixes

amd-lalithnc · amd-lalithnc · commit 40e8397bd842 · 2026-07-01T04:45:39.000-06:00
- Add walkthroughs/serving-llms-on-epyc.md (+ README link) for issue #82.
- Re-register the skill in the marketplace (needs a walkthrough to be listed).
- Require AVX-512 (Zen4+): hard gate in Step 1; scope in the description.
- Fix launch: drop --shm-size (conflicts with --ipc=host on podman).
- detect.py: recognize lettered EPYC SKUs (e.g. 9B45 -> Turin/Zen5).
- Note re-run name collision, rootless-podman cpuset, and HF_HOME cache mount.

Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Change-Id: Ia2bf2b8f40c2c709f8ad3b3d394a7946d4949b26
diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
@@ -24,6 +24,11 @@
       "source": "./skills/magpie-kernel-evaluator",
       "description": "Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces."
     },
+    {
+      "name": "serving-llms-on-epyc",
+      "source": "./skills/serving-llms-on-epyc",
+      "description": "Serve LLMs on AMD EPYC CPUs (AVX-512 / Zen4+) with vLLM + zentorch, in a container (Docker/Podman) or conda. Detects the CPU, validates runtime/env, checks vLLM model support and RAM fit, sizes threads/KV and pins one socket + its memory, launches, and verifies. Single instance; reports and stops on failure."
+    },
     {
       "name": "serving-llms-on-instinct",
       "source": "./skills/serving-llms-on-instinct",
diff --git a/.cursor-plugin/marketplace.json b/.cursor-plugin/marketplace.json
@@ -24,6 +24,11 @@
       "source": "./skills/magpie-kernel-evaluator",
       "description": "Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces."
     },
+    {
+      "name": "serving-llms-on-epyc",
+      "source": "./skills/serving-llms-on-epyc",
+      "description": "Serve LLMs on AMD EPYC CPUs (AVX-512 / Zen4+) with vLLM + zentorch, in a container (Docker/Podman) or conda. Detects the CPU, validates runtime/env, checks vLLM model support and RAM fit, sizes threads/KV and pins one socket + its memory, launches, and verifies. Single instance; reports and stops on failure."
+    },
     {
       "name": "serving-llms-on-instinct",
       "source": "./skills/serving-llms-on-instinct",
diff --git a/skills/serving-llms-on-epyc/SKILL.md b/skills/serving-llms-on-epyc/SKILL.md
@@ -5,15 +5,15 @@ description: >-
   backend, in a container (Docker or Podman) or a conda env. Use whenever the
   user wants to run, serve, deploy, start, host, or launch an LLM on AMD EPYC,
   Zen CPU, "vLLM on CPU", "zentorch serving", or "serve a model without a GPU".
-  Use for "serve Qwen on EPYC", "start a CPU vLLM endpoint", "run an OpenAI
-  server on my EPYC box", or similar. Handles the full single-instance flow:
+  Handles the full single-instance flow:
   detect the CPU (incl. EPYC generation), validate the runtime/env, check vLLM
   supports the model (via vLLM's registry, not a modality blocklist), check it
   fits host RAM, size CPU threads/KV/NUMA from the hardware, confirm the plan with
   the user, launch, and poll until the endpoint is responsive. Single instance,
   single socket (pinned to one socket + its memory; vLLM scales poorly across
   sockets). Does NOT debug failures and does NOT retry -- it reports and stops. Do
-  not use for GPU/Instinct (use serving-llms-on-instinct) or multi-node.
+  not use for GPU/Instinct (use serving-llms-on-instinct), multi-node, or pre-Zen4
+  EPYC without AVX-512 (Naples/Rome/Milan).
 allowed-tools: Bash, Read
 ---
 
@@ -52,10 +52,16 @@ python3 scripts/detect.py            # add --host user@box for a remote host
 
 Returns `cpu_model`, `is_amd_epyc`, `epyc_generation`
 (Naples/Rome/Milan/Genoa/Bergamo/Siena/Turin), `zen_arch`, `avx512`,
-`logical_cores`, `physical_cores`, `sockets`, `numa_nodes`, `memory_gb`. If
-`is_amd_epyc` is `false`, stop: this skill targets AMD EPYC. (Other x86 may work
-but is unsupported here.) Carry `epyc_generation` / `avx512` through the later
-phases -- e.g. AVX-512 + bf16 land on Zen4+ (Genoa/Turin), and Turin packs up to
+`logical_cores`, `physical_cores`, `sockets`, `numa_nodes`, `memory_gb`.
+
+Two hard gates -- stop if either fails:
+- `is_amd_epyc` is `false` -> stop: this skill targets AMD EPYC. (Other x86 may work
+  but is unsupported here.)
+- `avx512` is `false` -> stop: the zentorch CPU path **requires AVX-512**, i.e. Zen4
+  or newer (Genoa / Bergamo / Siena / Turin). Pre-Zen4 EPYC (Naples / Rome / Milan)
+  is not supported -- say so and stop rather than launching into a load-time failure.
+
+Carry `epyc_generation` / `avx512` through the later phases -- e.g. Turin packs up to
 128 cores/socket, which the thread-binding in Step 5 sizes from.
 
 ## Step 2: Validate the runtime and environment
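The two hard gates added in this hunk can be sketched as a small check on the detection result. This is a sketch under stated assumptions, not code from the skill: the field names (`is_amd_epyc`, `avx512`, `epyc_generation`) come from the text above, while the helper name `check_hard_gates` and the idea that `detect.py` emits a JSON object are assumptions.

```python
import json


def check_hard_gates(detect_json):
    """Return None if the host passes both gates, else a stop message.

    Hypothetical helper: assumes detect.py's result is a JSON object
    carrying the fields listed in Step 1.
    """
    info = json.loads(detect_json)
    # Gate 1: the skill only targets AMD EPYC hosts.
    if not info.get("is_amd_epyc", False):
        return "stop: this skill targets AMD EPYC"
    # Gate 2: the zentorch CPU path needs AVX-512 (Zen4 or newer).
    if not info.get("avx512", False):
        gen = info.get("epyc_generation", "unknown")
        return f"stop: zentorch requires AVX-512 (Zen4+); detected {gen}"
    return None
```

A missing field is treated as a failed gate, so an incomplete detection result stops the flow instead of launching into a load-time failure.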
@@ -166,9 +172,10 @@ auto-selects the CPU platform and `vllm serve` rejects the flag. Only add it if
 including the pull. `RT` is the resolved runtime verbatim:
 ```bash
 RT="<runtime from validate.py: docker | podman>"
+$RT rm -f vllm-epyc 2>/dev/null               # clear any leftover container from a prior run (name collision otherwise)
 $RT pull <image from data/epyc.json>          # agent pulls; do not ask the user to
 $RT run -d --name vllm-epyc \
-  <run_flags from data/epyc.json>            # --ipc=host --shm-size=16g --network=host
+  <run_flags from data/epyc.json>            # --ipc=host --network=host (NO --shm-size: it conflicts with --ipc=host on podman)
   <hf_cache_mount> \
   <container_cpuset from cpu_tune>             # --cpuset-cpus=<cores> --cpuset-mems=<nodes>
   --env VLLM_CPU_OMP_THREADS_BIND="$VLLM_CPU_OMP_THREADS_BIND" \
@@ -244,10 +251,26 @@ See [reference.md](reference.md) for the full list. The load-bearing ones:
   zentorch 2.11 (`AssertionError: expected OutputCode, got function`). It only
   works with `VLLM_USE_AOT_COMPILE=0` set alongside it. Never set one without
   the other.
-- **`--shm-size`**: vLLM needs a large `/dev/shm`; the container default (64MB)
-  is too small. Use `--shm-size=16g` (in `data/epyc.json`).
+- **`/dev/shm` — use `--ipc=host`, not `--shm-size`.** vLLM needs a large
+  `/dev/shm` (the 64MB container default is too small). The base recipe uses
+  `--ipc=host`, which shares the host's large shared memory. **Do not also pass
+  `--shm-size`**: podman errors with *"cannot set shmsize when running in the host
+  IPC Namespace"*, and it is redundant on docker. If you instead isolate IPC (drop
+  `--ipc=host`), then add `--shm-size=16g` — one or the other, never both.
 - **NUMA / socket**: one instance is pinned to **one socket plus its memory** --
   CPU bind + `--cpuset-mems` (container) / `numactl --membind` (conda), with KV sized
   from that socket's local RAM. On a dual-socket host `cpu_tune.py` picks a free socket
   by load and `warning`s if both are busy. NPS2/NPS4 (multi-node socket) gets an
   `nps_note` that finer per-node binding could add more.
+- **Rootless podman + `--cpuset-cpus`/`--cpuset-mems`**: these are cgroup limits and
+  may be **ignored or rejected** on rootless podman without cpuset cgroup delegation
+  (cgroup v1, or v2 without the controller delegated). This is **not fatal**: CPU
+  thread binding still applies via `VLLM_CPU_OMP_THREADS_BIND` inside the container;
+  only the container-level memory pin is lost (reduced NUMA locality). If the run
+  errors specifically on the cpuset flags, drop them and proceed -- do not treat it
+  as a launch failure.
+- **HF cache mount**: the default mounts `~/.cache/huggingface`. If `HF_HOME` points
+  elsewhere (common on shared hosts, e.g. `/proj/.../vllm`), mount **that** path to
+  `/root/.cache/huggingface` instead, or the model re-downloads inside the container.
+- **Container name reuse**: a leftover `vllm-epyc` from a prior run makes `run` fail
+  with "name already in use" -- Step 6 clears it first with `$RT rm -f vllm-epyc`.
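The HF cache mount rule above (mount `HF_HOME` when it is set, otherwise the default cache path) can be sketched as follows. The helper name `hf_cache_mount` is hypothetical and not part of the skill; the two paths are the ones quoted in `data/epyc.json`.

```python
import os

# Path the container image expects, as used by the default hf_cache_mount.
CONTAINER_CACHE = "/root/.cache/huggingface"
# Default host-side cache from data/epyc.json (the shell expands the tilde).
DEFAULT_HOST_CACHE = "~/.cache/huggingface"


def hf_cache_mount(env=None):
    """Build the -v mount flag, preferring HF_HOME when it is set.

    Hypothetical helper illustrating the note above.
    """
    env = os.environ if env is None else env
    host_cache = env.get("HF_HOME") or DEFAULT_HOST_CACHE
    return f"-v {host_cache}:{CONTAINER_CACHE}"
```

Passing a plain dict as `env` makes the choice easy to check without touching the real environment.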
diff --git a/skills/serving-llms-on-epyc/data/epyc.json b/skills/serving-llms-on-epyc/data/epyc.json
@@ -6,13 +6,12 @@
     "comment": "Public vLLM + zentorch CPU image on Docker Hub (amdih/zendnn_zentorch) -- no internal-registry access needed. Tags are vllm_v<ver>_zentorch_v<ver>_<os>_<build>; prefer the newest ubuntu22.04 stable. Both docker and podman are supported; the skill prefers docker and falls back to podman.",
     "run_flags": [
       "--ipc=host",
-      "--shm-size=16g",
       "--network=host"
     ],
     "hf_cache_mount": "-v ~/.cache/huggingface:/root/.cache/huggingface",
     "flag_notes": {
-      "--ipc=host": "vLLM workers use host IPC/shared memory.",
-      "--shm-size=16g": "vLLM needs a large /dev/shm; default 64MB is not enough.",
+      "--ipc=host": "vLLM workers need a large /dev/shm; --ipc=host shares the host's (large) shared memory, which covers it. Do NOT also pass --shm-size: podman rejects '--shm-size' together with '--ipc=host' (cannot set shmsize in host IPC namespace), and it is redundant on docker too.",
+      "shm_alternative": "If you must isolate IPC (drop --ipc=host), then add --shm-size=16g instead (the 64MB container default is too small for vLLM). Use one or the other, never both.",
       "--network=host": "Expose the served port directly. Alternative: -p <port>:<port>.",
       "numa": "A single instance is pinned to ONE socket plus its memory. cpu_tune.py picks a free socket by CPU load on dual-socket hosts (warns if both busy; --socket N forces), sizes KV from that socket's local RAM, and emits --cpuset-cpus + --cpuset-mems (container) or numactl --cpunodebind/--membind (conda). True multi-socket scaling = multiple instances (one per socket), out of scope here."
     }
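The "one or the other, never both" rule in `flag_notes` could be enforced with a small lint over `run_flags`. This is only a sketch: `check_shm_flags` is a hypothetical helper, not something this commit adds.

```python
def check_shm_flags(run_flags):
    """Return a list of problems with the /dev/shm-related flags (empty if OK)."""
    has_host_ipc = "--ipc=host" in run_flags
    has_shm_size = any(
        f == "--shm-size" or f.startswith("--shm-size=") for f in run_flags
    )
    problems = []
    if has_host_ipc and has_shm_size:
        # podman refuses to set shmsize inside the host IPC namespace.
        problems.append("--ipc=host and --shm-size are both set")
    if not has_host_ipc and not has_shm_size:
        # The 64MB container default is too small for vLLM.
        problems.append("neither --ipc=host nor --shm-size is set")
    return problems
```

Run against the pre-change flags (`--ipc=host`, `--shm-size=16g`, `--network=host`), it reports the conflict this commit removes.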
diff --git a/skills/serving-llms-on-epyc/reference.md b/skills/serving-llms-on-epyc/reference.md
@@ -41,8 +41,8 @@ From `data/epyc.json`. Unlike the Instinct (GPU) skill there are **no**
 
 | Flag | Why |
 |---|---|
-| `--ipc=host` | vLLM workers use host IPC / shared memory |
-| `--shm-size=16g` | vLLM needs a large `/dev/shm`; the 64MB default is too small |
+| `--ipc=host` | vLLM workers need a large `/dev/shm`; sharing the host IPC namespace provides it. **Do not also pass `--shm-size`** -- podman rejects the combination, and it is redundant on docker |
+| `--shm-size=16g` | **only if you drop `--ipc=host`** (isolated IPC). The 64MB container default is too small for vLLM. Use one or the other, never both |
 | `--network=host` | expose the served port directly (or use `-p <port>:<port>`) |
 | `--cpuset-cpus` / `--cpuset-mems` | pin the container to the chosen socket's physical cores and its NUMA node(s); from `cpu_tune.py` |
 | `-v ~/.cache/huggingface:/root/.cache/huggingface` | reuse the host model cache |
@@ -107,8 +107,11 @@ between the failing and passing runs was `VLLM_USE_AOT_COMPILE`. Never set
 `FREEZING=1` without `VLLM_USE_AOT_COMPILE=0`. The base recipe leaves both unset.
 
 **`/dev/shm` too small**
-Without `--shm-size=16g` (or `--ipc=host`), vLLM workers fail to allocate shared
-memory at startup.
+vLLM workers need a large `/dev/shm` or they fail to allocate shared memory at
+startup. The base recipe uses `--ipc=host` (shares the host's large shared memory).
+**Do not combine `--ipc=host` with `--shm-size`** -- podman errors *"cannot set
+shmsize when running in the host IPC Namespace"*, and it is redundant on docker. If
+you drop `--ipc=host`, use `--shm-size=16g` instead -- one or the other, never both.
 
 **RAM is the ceiling, not VRAM**
 CPU serving keeps weights + KV cache in system RAM. `estimate_memory.py` checks
diff --git a/skills/serving-llms-on-epyc/scripts/detect.py b/skills/serving-llms-on-epyc/scripts/detect.py
@@ -49,11 +49,13 @@ def _lscpu_field(lscpu_out, label):
 def _epyc_generation(model):
     """Map an AMD EPYC model name to (generation, zen_arch).
 
-    EPYC numbering encodes the generation: 7xx1=Naples (Zen1), 7xx2=Rome (Zen2),
-    7xx3=Milan (Zen3), 8xx4=Siena (Zen4c), 97x4=Bergamo (Zen4c), 9xx4=Genoa (Zen4),
-    9xx5=Turin (Zen5). The agent should carry this through every phase (e.g. AVX-512
-    + bf16 land on Zen4+, Turin has up to 128 cores per socket -> thread binding)."""
-    m = re.search(r"EPYC\s+(\d{4})", model.upper())
+    EPYC numbering encodes the generation by its first and last digit: 7xx1=Naples
+    (Zen1), 7xx2=Rome (Zen2), 7xx3=Milan (Zen3), 8xx4=Siena (Zen4c), 97x4=Bergamo
+    (Zen4c), 9xx4=Genoa (Zen4), 9xx5=Turin (Zen5). Some SKUs carry a letter in the
+    middle (e.g. 9B45 -> 9__5 -> Turin), so we match 4 alphanumerics whose first and
+    last chars are digits and key off those. The agent carries this through every
+    phase (e.g. AVX-512 + bf16 land on Zen4+, Turin has up to 128 cores/socket)."""
+    m = re.search(r"EPYC\s+(\d[0-9A-Z]{2}\d)", model.upper())
     if not m:
         return "unknown", "unknown"
     num = m.group(1)
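The lettered-SKU match in this hunk can be tried in isolation. The following is a minimal sketch, not the actual body of `_epyc_generation`: the regex is the patched one, but the lookup after it (in the hypothetical `epyc_generation_sketch`) is an assumption reconstructed from the docstring's numbering table, since the rest of the function is outside this diff.

```python
import re

# Same pattern as the patched detect.py: 4 alphanumerics, first and last are digits.
_SKU_RE = re.compile(r"EPYC\s+(\d[0-9A-Z]{2}\d)")


def epyc_generation_sketch(model):
    """Hypothetical stand-in for _epyc_generation: returns (generation, zen_arch)."""
    m = _SKU_RE.search(model.upper())
    if not m:
        return "unknown", "unknown"
    num = m.group(1)
    first, last = num[0], num[-1]
    # Assumed lookup, taken from the docstring's table (7xx1 ... 9xx5).
    if first == "7":
        table = {"1": ("Naples", "Zen1"), "2": ("Rome", "Zen2"), "3": ("Milan", "Zen3")}
        return table.get(last, ("unknown", "unknown"))
    if first == "8" and last == "4":
        return "Siena", "Zen4c"
    if first == "9" and last == "4":
        # 97x4 is Bergamo (Zen4c); other 9xx4 parts are Genoa (Zen4).
        return ("Bergamo", "Zen4c") if num[1] == "7" else ("Genoa", "Zen4")
    if first == "9" and last == "5":
        return "Turin", "Zen5"
    return "unknown", "unknown"
```

With the old `\d{4}` pattern a name such as `AMD EPYC 9B45` would not match at all. The relaxed pattern lets the first and last digits decide the generation.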
diff --git a/walkthroughs/README.md b/walkthroughs/README.md
@@ -9,4 +9,5 @@ Participatns using other are still encouraged to participate. Just please note t
 Please choose a skill to get started.
 
 * [local-ai-use](./local-ai-use.md): Teach your agent how to run image generation locally.
-* [local-ai-app-integration](./local-ai-app-integration.md): Add a local AI mode to a cloud-only app.
+* [local-ai-app-integration](./local-ai-app-integration.md): Add a local AI mode to a cloud-only app.
+* [serving-llms-on-epyc](./serving-llms-on-epyc.md): Bring up a vLLM + zentorch LLM endpoint on an AMD EPYC CPU.
diff --git a/walkthroughs/serving-llms-on-epyc.md b/walkthroughs/serving-llms-on-epyc.md
@@ -0,0 +1,84 @@
+# AMD Skills Walkthroughs: `serving-llms-on-epyc`
+
+The goal of this skill is to teach your AI agent to bring up a vLLM OpenAI-compatible
+endpoint on an **AMD EPYC CPU** host using the zentorch backend — detecting the CPU,
+validating the environment, checking the model fits, sizing the runtime to the
+hardware, launching, and verifying the endpoint responds.
+
+**What you'll end up with:** a running `vllm serve` endpoint on your EPYC box (in a
+Docker/Podman container, or a conda env), sized to a single socket and ready to answer
+OpenAI `/v1/chat/completions` requests.
+
+## Prerequisites
+
+- An **AMD EPYC CPU with AVX-512 support**, i.e. **Zen4 or newer (Genoa / Bergamo / Siena / Turin)**. This is CPU serving (no GPU required); the zentorch CPU path requires AVX-512, and `detect.py` reports it as `avx512`.
+- A container runtime — **Docker** or **Podman** — *or* a conda env with `vllm` + `zentorch` installed.
+- Enough host RAM for the model (weights + KV cache both live in RAM on CPU).
+- A HuggingFace token in `HF_TOKEN` **only** for gated models (Llama, Gemma). The default model (Qwen3) needs none.
+- **Node.js ≥ 18** — required by the `skills` CLI used in Step 2 (`npx skills ...`). Check with `node -v`; on older hosts install a newer Node (e.g. `conda create -n node20 -c conda-forge 'nodejs>=20'`).
+
+## Step 1 - Understanding which skills are available
+
+* Run `claude "Which skills can you see?" --model sonnet`. You should see a list of skills that does **not** include anything about serving LLMs on EPYC / CPU.
+* Make sure there is no `AGENTS.md` file in your local folder.
+
+## Step 2 - Enabling Claude to see `serving-llms-on-epyc`
+
+* Install the skill with the [`skills` CLI](https://github.com/vercel-labs/skills):
+
+```bash
+npx skills add amd/skills --skill serving-llms-on-epyc --agent claude-code
+```
+
+* Run `claude "Which skills can you see?" --model sonnet`. You should see a list of skills that now includes `serving-llms-on-epyc`.
+
+## Step 3 - Running the skill
+
+Run `claude --model sonnet` on your EPYC host with this prompt:
+
+```
+Serve Qwen/Qwen3-0.6B on this AMD EPYC box with vLLM and zentorch.
+```
+
+Claude should:
+
+1. **Detect the CPU** — confirm it is AMD EPYC and read the generation (Genoa/Turin/…), AVX-512, physical cores, NUMA layout, and RAM.
+2. **Validate the environment** — find an accessible runtime (Docker or Podman, else the conda path), check the image, `HF_TOKEN`, and RAM; report any perf-library advisories.
+3. **Check vLLM supports the model** — verify the architecture against vLLM's model registry (it does not blanket-block multimodal; it rejects non-chat models like embeddings/rerankers).
+4. **Check it fits host RAM** — weights + KV cache + headroom vs available RAM.
+5. **Size the runtime to the hardware** — bind to one socket's physical cores, size the KV cache from that socket's local RAM, and bind memory to that socket (this is **single-socket serving**; vLLM scales poorly across sockets).
+6. **Confirm the plan with you** — present a sized summary (model, path, precision, fit, CPU sizing, port) and wait for you to approve before launching.
+7. **Launch and verify** — pull the public `amdih/zendnn_zentorch` image, run `vllm serve`, poll `/health`, and prove `/v1/chat/completions` works.
+
+On any failure it reports the cause + logs and **stops** — it does not retry or start a debugging loop.
+
+## Step 4 - Talk to the endpoint
+
+Once Claude reports the endpoint is healthy, call it — use the **port from Claude's
+connection table** (it uses `8000` by default):
+
+```bash
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"Hello"}]}'
+```
+
+## Step 5 - (Optional) Going beyond
+
+* **A real workload:** ask for a larger model once the flow is proven, e.g. *"Serve Qwen/Qwen3-8B ..."*. Claude re-checks the RAM fit and re-sizes.
+* **Gated models:** `export HF_TOKEN=...` (and accept the model license on HuggingFace), then ask for `meta-llama/Llama-3.1-8B-Instruct`.
+* **Pick a socket:** on a dual-socket box Claude picks a free socket by load; you can steer it (*"serve it on socket 1"*).
+
+## Step 6 - (Optional) Try to get things done without AMD Skills
+
+Remove the added skill and rerun the experiment above. The `skills` CLI installs a
+copy under **both** `.claude/skills/serving-llms-on-epyc` **and**
+`.agents/skills/serving-llms-on-epyc`, so delete both (otherwise the leftover copy
+keeps the skill active and the comparison isn't clean). Without the skill, common
+issues include:
+
+* Passing `--device cpu` to `vllm serve` (removed in vLLM ≥ 0.20 with the zentorch plugin) — the server errors out on launch.
+* Guessing at a container image or using a GPU/CUDA image instead of the public CPU `amdih/zendnn_zentorch` one.
+* No hardware-aware sizing — spreading threads across both sockets and sizing the KV cache from whole-system RAM, so the KV pool spills cross-socket and throughput tanks.
+* Launching a model that does not fit host RAM (or an embedding/reranker model that has no chat endpoint) and then looping on the failure.
+* Providing a knowledge article instead of actually bringing up a working endpoint.
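The `curl` call in Step 4 of the walkthrough has a direct standard-library equivalent in Python. This is a sketch, assuming the default port `8000` and the `Qwen/Qwen3-0.6B` model from the walkthrough; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B", port=8000):
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)
    # OpenAI chat-completions response shape: choices[0].message.content
    return reply["choices"][0]["message"]["content"]
```

Once the endpoint reports healthy, `chat("Hello")` should return the model's reply. Pass `port=` if Claude's connection table shows a different port.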