Commit 3cb2f89

Add serving-llms-on-instinct skill
1 parent 466410b commit 3cb2f89

12 files changed

Lines changed: 12521 additions & 1 deletion

.claude-plugin/marketplace.json

Lines changed: 5 additions & 0 deletions
@@ -33,6 +33,11 @@
       "name": "rocm-doctor",
       "source": "./skills/rocm-doctor",
       "description": "Diagnose why ROCm, PyTorch, or llama.cpp isn't working on an AMD GPU. Matches the symptom against a fixed list of twelve known misconfigurations and proposes the next step."
+    },
+    {
+      "name": "serving-llms-on-instinct",
+      "source": "./skills/serving-llms-on-instinct",
+      "description": "Serve LLMs on AMD Instinct GPUs (MI300X/MI325X/MI350X/MI355X) with vLLM on ROCm. Handles GPU detection, environment validation, vLLM configuration, launch, and health verification."
     }
   ]
 }

.cursor-plugin/marketplace.json

Lines changed: 5 additions & 0 deletions
@@ -33,6 +33,11 @@
       "name": "rocm-doctor",
       "source": "./skills/rocm-doctor",
       "description": "Diagnose why ROCm, PyTorch, or llama.cpp isn't working on an AMD GPU. Matches the symptom against a fixed list of twelve known misconfigurations and proposes the next step."
+    },
+    {
+      "name": "serving-llms-on-instinct",
+      "source": "./skills/serving-llms-on-instinct",
+      "description": "Serve LLMs on AMD Instinct GPUs (MI300X/MI325X/MI350X/MI355X) with vLLM on ROCm. Handles GPU detection, environment validation, vLLM configuration, launch, and health verification."
     }
   ]
 }

README.md

Lines changed: 1 addition & 1 deletion
@@ -88,7 +88,7 @@ Bring existing workloads onto AMD.
 | --- | --- | --- |
 | `cuda-to-hip` | Port CUDA kernels with `hipify` and flag anything that needs manual review. | _planned_ |
 | `vllm-rocm` | Stand up vLLM on AMD with the right environment variables and model configurations. | _planned_ |
-| `serving-llms-on-instinct` | Deploy LLM inference on AMD Instinct GPUs end-to-end: detect hardware (or onboard via AMD Developer Cloud), validate model fit, apply the right vLLM recipe, and launch a benchmarked endpoint. SGLang and engine/backend selection in later phases. | _planned_ |
+| [`serving-llms-on-instinct`](skills/serving-llms-on-instinct/SKILL.md) | Deploy LLM inference on AMD Instinct GPUs end-to-end: detect hardware (or onboard via AMD Developer Cloud), validate model fit, apply the right vLLM recipe, and launch a benchmarked endpoint. SGLang and engine/backend selection in later phases. | in-repo |
 
 ### Performance & delivery
 

skills/serving-llms-on-instinct/SKILL.md

Lines changed: 342 additions & 0 deletions
@@ -0,0 +1,342 @@
---
name: serving-llms-on-instinct
description: >-
  Serves AI models on AMD Instinct GPU hardware using vLLM. Use this skill
  whenever the user wants to run, serve, deploy, start, host, or launch a
  language model on an AMD GPU, AMD Instinct, MI300X, MI325X, MI350X, or MI355X.
  Also use when the user mentions vLLM on ROCm, vLLM on AMD, serving on HBM,
  or asks how to get a model running on AMD data center hardware. Use when the
  user asks "run Qwen3", "serve DeepSeek", "start a vLLM endpoint", "get a
  model running on my AMD machine", or any similar phrasing. Handles the full
  flow: GPU detection, environment validation, vLLM configuration, launch, and
  health verification. Do not use for NVIDIA GPUs, consumer AMD GPUs (RX
  series, Radeon), Ryzen AI, NPU, MI250X, or MI100.
allowed-tools: Bash, Read
---

# Serving LLMs on AMD Instinct

Get a vLLM endpoint running on AMD Instinct GPU hardware.

## Prerequisites

- ROCm driver and `amd-smi` installed on the GPU host
- Docker running and accessible (check with `docker ps`)
- `/dev/kfd` and `/dev/dri` present on the GPU host
- HuggingFace token in `HF_TOKEN` env var (required for gated models; not
  required for Qwen3). For gated models (Llama 3.2, Gemma, etc.),
  the HF token must belong to an account that has accepted the model's license
  at `huggingface.co/<model_id>`. A valid token without license acceptance will
  fail with an opaque "Engine core initialization failed" error.
- For remote GPU: SSH key access configured (`ssh <user>@<host>` must work
  without a password prompt). If only password access is available, set up
  keys first: `ssh-copy-id <user>@<host>`

## Data files

Read these files directly to get model and GPU configuration:

- **`data/recipes_cache.json`** -- model configs synced from
  [vllm-project/recipes](https://github.com/vllm-project/recipes). Each entry
  under `models.<HF_ID>.recipe` contains the full recipe with `model.base_args`,
  `model.base_env`, `features.tool_calling.args`, `features.reasoning.args`,
  `hardware_overrides.amd.extra_args`, `hardware_overrides.amd.extra_env`.
  The top-level `docker_image` field has the latest resolved vLLM ROCm image.

- **`data/gpu_overrides.json`** -- GPU-specific configuration. Contains
  `docker_flags` (mandatory for all AMD Instinct), `gpu_configs` keyed by
  gfx_version with `env_defaults` and `workarounds`, and `legacy_models` for
  models not yet in vLLM recipes.

- **`data/blacklist.json`** -- models in vLLM recipes that cannot be served
  as LLM endpoints. Includes diffusion/image/audio generation models, embedding
  models, rerankers, ASR models needing audio pipelines, and models requiring
  unreleased vLLM nightly builds. Check this before attempting to serve a model.
  If the user requests a blacklisted model, explain why it won't work and
  suggest an alternative.

If the user doesn't specify a model, default to **Qwen/Qwen3.5-9B**: dense
multimodal with MTP, Apache 2.0 license (no HF token needed), fits on a single
GPU, strong reasoning and tool-calling.

## Step 1: Detect the GPU

```bash
python3 scripts/detect.py
# Remote:
python3 scripts/detect.py --host user@hostname
```

Returns JSON with `gfx_version`, `vram_gb`, `gpu_count`, `rocm_version`.

| gfx_version | Hardware | VRAM |
|---|---|---|
| gfx950 | MI350X / MI355X | 288 GB HBM3E |
| gfx942 | MI300X / MI325X / MI300A | 192 GB / 256 GB / 128 GB |

If `gfx_version` is `unknown`: `amd-smi` ran but found no GPU. Check
`lsmod | grep amdgpu`.
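The table can also be applied in code. The following is a minimal sketch, assuming the detection output has already been parsed into a dict with the `gfx_version` and `vram_gb` keys listed above, and that `vram_gb` is the per-GPU capacity as a whole number of GB. `describe_gpu` and `GFX_TABLE` are hypothetical names for illustration and are not part of `scripts/detect.py`.

```python
# Hypothetical lookup that mirrors the gfx_version table above.
GFX_TABLE = {
    "gfx950": {288: "MI350X / MI355X"},
    "gfx942": {192: "MI300X", 256: "MI325X", 128: "MI300A"},
}


def describe_gpu(info: dict) -> str:
    """Map parsed detection output to a hardware name, or say why it cannot."""
    gfx = info.get("gfx_version", "unknown")
    if gfx == "unknown":
        return "no GPU found: check `lsmod | grep amdgpu`"
    models = GFX_TABLE.get(gfx)
    if models is None:
        return f"{gfx} is not a supported Instinct target"
    # Assumption: vram_gb is the per-GPU capacity. If it does not match a
    # known size exactly, fall back to listing every candidate for this gfx.
    return models.get(info.get("vram_gb"), " / ".join(models.values()))
```

If the real output reports VRAM as a fractional value, the exact-match lookup would need rounding first.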

## Step 2: Validate the environment

```bash
python3 scripts/validate.py --auto-fix
# Remote:
python3 scripts/validate.py --auto-fix --host user@hostname
```

Returns JSON with `ready` (bool), `errors`, `warnings`, `fixes_applied`.
Do not proceed if `ready` is `false`.

## Step 3: Refresh recipes (if stale)

Check `fetched_at` in `data/recipes_cache.json`. If older than 24 hours or
the file is missing, refresh:

```bash
python3 scripts/sync_recipes.py
```

This shallow-clones vllm-project/recipes from GitHub and fetches the latest
Docker tag from Docker Hub. Takes ~10 seconds. If it fails, the existing
cache still works.
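The staleness rule can be sketched as a small check. This assumes `fetched_at` is an ISO 8601 timestamp string, which the skill does not state; `cache_is_stale` is a hypothetical helper, not something shipped in `scripts/`.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path


def cache_is_stale(path="data/recipes_cache.json",
                   max_age=timedelta(hours=24), now=None):
    """Return True if the cache is missing, unreadable, or older than max_age."""
    now = now or datetime.now(timezone.utc)
    try:
        fetched_at = json.loads(Path(path).read_text())["fetched_at"]
        # Assumption: ISO 8601 string; a trailing "Z" is normalised for
        # Python versions whose fromisoformat() does not accept it.
        ts = datetime.fromisoformat(fetched_at.replace("Z", "+00:00"))
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return True
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assume UTC when no offset is given
    return now - ts > max_age
```

Treating an unreadable or malformed cache as stale errs toward re-syncing, which matches the "file is missing" case above.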

## Step 4: Construct the Docker command

Read `data/recipes_cache.json` and `data/gpu_overrides.json` directly.
Build the Docker command by combining:

1. **Docker flags** from `gpu_overrides.json > docker_flags` (mandatory for all AMD Instinct GPUs)
2. **HF cache mount**: `-v ~/.cache/huggingface:/root/.cache/huggingface`
   (if a shared model cache directory exists on the host, check whether
   `models--*` directories are at the cache root or inside a `hub/`
   subdirectory -- mount accordingly to `/root/.cache/huggingface` or
   `/root/.cache/huggingface/hub`)
3. **Port**: `-p <port>:<port>` (default 8000)
4. **Environment variables**: merge `gpu_configs.<gfx_version>.env_defaults`
   with the recipe's `model.base_env` and `hardware_overrides.amd.extra_env`.
   Always add `--env HF_TOKEN=${HF_TOKEN}`.
5. **Docker image**: use `docker_image` from `recipes_cache.json` top level
   (unless the model needs a pinned image, e.g. GLM-4.5 needs `v0.15.1`).
   If the user specifies a Docker image version, check it against the recipe's
   `model.min_vllm_version`. Warn if the image is older -- the model may crash
   on startup with an opaque "Engine core initialization failed" error.
6. **Model ID**: `--model <HF_ID>`
7. **vLLM args**: combine the recipe's `model.base_args` +
   `hardware_overrides.amd.extra_args` + `features.tool_calling.args` +
   `features.reasoning.args`. Add `--enable-auto-tool-choice` if not present.
   For multi-GPU, add `--tensor-parallel-size N` (see VRAM estimation below).
   For MoE models on multi-GPU, also add `--distributed-executor-backend mp`.
8. **Port arg**: `--port <port>`
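The eight parts above can be assembled as sketched below. The field shapes are assumptions: `docker_flags` and every `*args` field are taken to be lists of strings, every `*env` field a dict, and on key clashes later sources win. `build_docker_command` and the slug rule are hypothetical. The sketch passes `--env HF_TOKEN` by name, which Docker resolves from the calling environment, instead of interpolating `${HF_TOKEN}`; pinned images and the MoE executor flag are left out.

```python
import os
import shlex


def build_docker_command(model_id, image, docker_flags, gpu_env, recipe,
                         port=8000, tp=1, hf_cache="~/.cache/huggingface"):
    """Assemble a `docker run` argv in the order of the list above."""
    model = recipe.get("model", {})
    amd = recipe.get("hardware_overrides", {}).get("amd", {})
    features = recipe.get("features", {})

    # Merge order is an assumption: GPU defaults, then recipe, then AMD overrides.
    env = {**gpu_env, **model.get("base_env", {}), **amd.get("extra_env", {})}

    args = [
        *model.get("base_args", []),
        *amd.get("extra_args", []),
        *features.get("tool_calling", {}).get("args", []),
        *features.get("reasoning", {}).get("args", []),
    ]
    if "--enable-auto-tool-choice" not in args:
        args.append("--enable-auto-tool-choice")
    if tp > 1:
        args += ["--tensor-parallel-size", str(tp)]

    # Hypothetical slug rule: lower-cased repo name without the org prefix.
    slug = model_id.rsplit("/", 1)[-1].lower()
    cmd = ["docker", "run", "-d", "--name", f"vllm-{slug}", *docker_flags,
           "-v", f"{os.path.expanduser(hf_cache)}:/root/.cache/huggingface",
           "-p", f"{port}:{port}"]
    for key, value in env.items():
        cmd += ["--env", f"{key}={value}"]
    cmd += ["--env", "HF_TOKEN", image, "--model", model_id,
            *args, "--port", str(port)]
    return cmd


if __name__ == "__main__":
    # Placeholder inputs; real values come from the two JSON files.
    demo = build_docker_command(
        "Qwen/Qwen3.5-9B", "vllm/vllm-openai-rocm:latest",
        ["--device=/dev/kfd", "--device=/dev/dri"], {}, {"model": {}})
    print(shlex.join(demo))
```

Returning an argv list rather than a string keeps quoting out of the builder; `shlex.join` renders it for display or for a remote shell.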

If the exact model ID is not in `recipes_cache.json`, check for a base model
match by stripping date/version suffixes (e.g., `Kimi-K2-Instruct` matches
`Kimi-K2-Instruct-0905`). Use the base model's recipe if found.

If no recipe match, check `legacy_models` in `gpu_overrides.json`. If not
there either, use a generic config with
`--enable-auto-tool-choice --trust-remote-code --tool-call-parser hermes`.
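The lookup order (exact recipe, base-model recipe, legacy entry, generic config) might look like the sketch below. Several details are assumptions: what counts as a "date/version suffix" (here a trailing `-NNNN` or `-vN.N` token), that `legacy_models` is keyed by model ID, and that the generic fallback is returned in a recipe-like shape. Because the example above leaves the match direction open, suffixes are stripped from both the requested ID and the recipe keys. `resolve_recipe` is a hypothetical name.

```python
import re

GENERIC_ARGS = ["--enable-auto-tool-choice", "--trust-remote-code",
                "--tool-call-parser", "hermes"]

# Assumption: a date/version suffix is a trailing -NNNN (4 to 8 digits)
# or a trailing lowercase -vN[.N...] token.
_SUFFIX = re.compile(r"-(\d{4,8}|v\d+(\.\d+)*)$")


def resolve_recipe(model_id, recipes, legacy_models):
    """Return (source, config) following the lookup order described above.

    `recipes` is the `models` mapping from recipes_cache.json;
    `legacy_models` is the mapping of that name from gpu_overrides.json.
    """
    if model_id in recipes:
        return "recipe", recipes[model_id]["recipe"]
    base = _SUFFIX.sub("", model_id)
    for key, entry in recipes.items():
        if _SUFFIX.sub("", key) == base:
            return "base-recipe", entry["recipe"]
    if model_id in legacy_models:
        return "legacy", legacy_models[model_id]
    return "generic", {"model": {"base_args": list(GENERIC_ARGS)}}
```

Returning the source alongside the config lets the confirmation step tell the user when a fallback was used.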

**Precision variant selection:** Recipes may offer variants (default, fp8,
nvfp4). Check `gpu_configs.<gfx_version>.precision.native` in
`gpu_overrides.json` before selecting a variant. On gfx942 (MI300X), only
`bf16`, `fp16`, `fp8_fnuz`, and `int8` are hardware-native. MXFP4 and NVFP4
compute is emulated (dequant to BF16 during matmul), but weights stay
compressed in VRAM so quantized models still fit in less memory.
On gfx950 (MI350X), MXFP4 is hardware-native.

**VRAM estimation and fit check:** Before constructing the Docker command,
estimate whether the model fits the available hardware:
```bash
python3 scripts/estimate_vram.py --model-id <HF_ID> --vram-gb <per_gpu_vram> --tp <N>
```
This queries the HuggingFace Hub API (no model download) and returns JSON with:
- `weight_memory_gb` -- total weight size
- `kv_cache_bytes_per_token` -- KV cache cost per token at BF16
- `fit.weights_fit` -- whether weights fit at the given TP
- `fit.recommended_max_model_len` -- max context the GPU can serve
- `fit.context_limited` -- true if KV cache limits context below the
  model's native max
- `fit.min_tp_required` -- minimum TP needed (only if weights don't fit)

**Understanding the overhead:** The script reserves ~4 GB for vLLM's runtime
overhead (activation profiling, HIP graph capture, internal buffers). During
startup, vLLM runs a profiling forward pass to measure peak activations, then
captures HIP graphs for optimized decode. This startup peak is higher than
steady-state. The `remaining_for_kv_gb` field reflects what's left after
weights and this overhead.

Use `remaining_for_kv_gb` to decide:

1. **`remaining_for_kv_gb >= 6`**: safe to run. If `context_limited: true`,
   add `--max-model-len <recommended_max_model_len>` to the vLLM args.
   Mention the FP8 KV cache option (`--kv-cache-dtype fp8`) if the user
   needs longer context (`fit.max_seq_len_fp8_kv` shows the gain).
2. **`remaining_for_kv_gb` between 2 and 6**: tight but worth trying. Launch
   normally. If vLLM OOMs during HIP graph capture (check container logs for
   "out of memory" after "capturing CUDA/HIP graphs"), retry with
   `--enforce-eager` added to the vLLM args. This skips graph capture and
   frees 1-2 GB. The only cost is slightly higher decode latency.
3. **`remaining_for_kv_gb < 2`**: too tight. Will likely OOM during the
   activation profiling step. Do not attempt.
4. **`weights_fit: false` with multiple GPUs**: re-run with
   `--tp <min_tp_required>` and check again.
5. **`weights_fit: false`, not enough GPUs**: look for quantized
   alternatives in this order:
   a. **Recipe variants**: the recipe may have `fp8` or `mxfp4` variants
      with a different `model_id` that points to a quantized checkpoint.
   b. **Same provider**: many providers release quantized versions alongside
      the base model (e.g. `Qwen/Qwen3.5-122B-FP8` from Qwen). Search
      HuggingFace for `<provider>/<model-name>` with FP8/GPTQ/AWQ suffixes.
   c. **AMD quantized**: AMD's Quark team publishes quantized models under
      the `amd/` org on HuggingFace (e.g. `amd/Kimi-K2-Instruct-w-mxfp4-a-fp8`).
      Search for `amd/<model-name>` variants.
   Run `estimate_vram.py` on the quantized model ID to verify it fits,
   then use that model ID instead.
6. **Still doesn't fit**: tell the user the model requires more VRAM than
   available and suggest either a smaller model or multi-GPU hardware.
   Do not attempt to launch.
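The decision list can be condensed into one function, sketched here under stated assumptions: `remaining_for_kv_gb` is taken to be a top-level key of the estimator output, while `weights_fit`, `context_limited`, `recommended_max_model_len`, and `min_tp_required` sit under `fit`, as the field list suggests. `fit_decision` and its return strings are hypothetical and only summarise the next action.

```python
def fit_decision(est: dict, gpus_available: int) -> str:
    """Translate an estimate_vram.py-style result into the next action."""
    fit = est["fit"]
    if not fit["weights_fit"]:
        min_tp = fit.get("min_tp_required")
        if min_tp is not None and min_tp <= gpus_available:
            return f"retry with --tp {min_tp}"          # case 4
        return "look for a quantized alternative"       # cases 5 and 6
    remaining = est["remaining_for_kv_gb"]
    if remaining < 2:
        return "do not launch"                          # case 3
    if remaining < 6:
        # case 2: tight, so launch and keep the eager fallback ready
        return "launch; on graph-capture OOM retry with --enforce-eager"
    if fit.get("context_limited"):
        # case 1 with a context cap
        return f"launch with --max-model-len {fit['recommended_max_model_len']}"
    return "launch"                                     # case 1
```

The sketch checks `weights_fit` first because `remaining_for_kv_gb` is only meaningful once the weights fit at the chosen TP.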

Docker command template:
```
docker run -d --name vllm-<model-slug> \
  <docker_flags> \
  -v <hf_cache_mount> \
  -p <port>:<port> \
  --env <key>=<value> (for each env var) \
  --env HF_TOKEN=${HF_TOKEN} \
  <docker_image> \
  --model <model_id> \
  <vllm_args> \
  --port <port>
```

## Step 5: Confirm with the user

Before launching, present a summary and ask the user to confirm:
- **Model**: full HuggingFace ID (e.g. `Qwen/Qwen3.5-122B-Instruct`)
- **Precision**: variant being used (e.g. BF16, FP8) and why
- **Weight memory**: from estimate_vram.py
- **GPU**: detected hardware and VRAM
- **TP**: tensor parallelism degree (1, 2, 4, 8)
- **Context**: max achievable context length (and whether it's limited)
- **Port**: which port the endpoint will be on

If a quantized alternative was selected (Step 4 fit check), explain that
the original model doesn't fit and which alternative is being used.

Wait for the user's confirmation before proceeding.

## Step 6: Launch and verify

Before launching, check for port conflicts:
```bash
ss -tlnp 2>/dev/null | grep ':<port> '
```
If a Docker container is on that port, stop it with `docker rm -f <name>`.

Run the Docker command. Then poll health using this loop:

```bash
# Poll while the container is alive; stop as soon as /health answers 200.
while docker inspect --format='{{.State.Running}}' <container_name> 2>/dev/null | grep -q true; do
  curl -sf http://localhost:<port>/health && echo "READY" && exit 0
  sleep 60
done
echo "FAILED -- container exited"
exit 1
```

A 503 during loading is normal. Choose the polling strategy based on
model size (`weight_memory_gb` from `estimate_vram.py`):

- **Small models (< 100 GB weights)**: run the poll as a blocking command
  with the Bash tool's `timeout` set to 600000 (10 minutes). Most cached
  models are ready within 2-5 minutes.
- **Large models (>= 100 GB weights)**: run the poll with the Bash tool's
  `run_in_background` set to `true`. Then use `TaskOutput` with
  `block: true` and `timeout: 600000` to wait up to 10 minutes per check.
  If the task is still running after that, call `TaskOutput` again with
  the same parameters. This uses only 1 turn per 10-minute wait instead
  of burning a turn every check. The background loop runs until the
  container is healthy or dies.

After health returns 200, send a warmup request (triggers HIP kernel compilation,
~40-45 seconds on gfx942):
```bash
curl -s http://localhost:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<model_id>","messages":[{"role":"user","content":"say hi"}],"max_tokens":5}'
```
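For callers that prefer not to shell out, the same warmup request can be built with the Python standard library. This is only an equivalent of the curl command above, not something the skill ships; `warmup_request` is a hypothetical helper, and the request is constructed without being sent.

```python
import json
import urllib.request


def warmup_request(model_id: str, port: int = 8000, host: str = "localhost"):
    """Build (but do not send) the warmup POST from the curl command above."""
    body = {
        "model": model_id,
        "messages": [{"role": "user", "content": "say hi"}],
        "max_tokens": 5,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it requires a running endpoint; allow a generous timeout because
# the first request triggers kernel compilation:
# with urllib.request.urlopen(warmup_request("<model_id>"), timeout=120) as resp:
#     print(resp.status)
```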

Return to the user:
- `base_url`: `http://<host>:<port>/v1`
- `api_key`: none required for local
- `model`: the model ID used

## Remote vs. local

All scripts accept `--host user@hostname`. When given, they SSH to the target.
Set `ROCM_SSH_HOST` and `ROCM_SSH_USER` env vars to avoid passing `--host`
every time.

For remote Docker commands, run them over SSH:
```bash
ssh user@host 'docker run -d ...'
```
Use `localhost` for health/warmup curl URLs (curl runs on the remote host).
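Quoting is the usual pitfall when a command is forwarded over SSH, because the remote shell parses the string again. A minimal sketch, assuming the command is held as an argv list; `over_ssh` is a hypothetical helper, and `BatchMode=yes` makes SSH fail instead of prompting for a password.

```python
import shlex


def over_ssh(argv, target):
    """Wrap a local argv so it runs on `target` (user@host) as one remote command."""
    # shlex.join quotes each argument for a POSIX shell, so spaces and
    # braces inside arguments survive the remote shell's parsing.
    remote = shlex.join(argv)
    return ["ssh", "-o", "BatchMode=yes", target, remote]
```

For example, `over_ssh(["docker", "ps"], "user@host")` yields an argv that can be passed straight to `subprocess.run`.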

## Gotchas

**`CUDA_VISIBLE_DEVICES` set to empty string** -- ROCm maps this variable to
`HIP_VISIBLE_DEVICES`. Setting it to an empty string hides all GPUs.
`CUDA_VISIBLE_DEVICES=0,1` works fine for restricting GPUs (same as
`HIP_VISIBLE_DEVICES=0,1`). If the host has it set to empty, unset it:
`unset CUDA_VISIBLE_DEVICES`. Do not pass `--env CUDA_VISIBLE_DEVICES=` (empty)
into Docker -- that also hides all GPUs inside the container.
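A pre-launch check for this gotcha could look like the sketch below. `hidden_gpu_vars` is a hypothetical helper; it flags only variables that are set to an empty string and treats unset variables as fine.

```python
import os


def hidden_gpu_vars(env=None):
    """Return the GPU visibility variables that are set to an empty string."""
    env = os.environ if env is None else env
    return [name for name in ("HIP_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")
            if env.get(name) == ""]
```

A non-empty result means the named variables should be unset on the host, or dropped from the `--env` list, before launching.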
295+
296+
**FP4BMM crash on gfx942 (MI300X)** -- If the container exits immediately
297+
with a segfault or illegal instruction: `VLLM_ROCM_USE_AITER_FP4BMM` must be
298+
`0` on gfx942. This is set correctly in `gpu_overrides.json` for gfx942.
299+
See vLLM issue #34641.
300+
301+
**`HIP error: no kernel image`** -- The Docker image has no compiled kernel
302+
for your GPU's gfx version. Use `vllm/vllm-openai-rocm:latest`; it includes
303+
gfx942 and gfx950 kernels.
304+
305+
**MLA models need `--block-size 1`** -- DeepSeek-R1/V3, Kimi-K2.5.
306+
Without it the MLA attention backend silently falls back to a slower path.
307+
This is in the recipe args for these models.
308+
309+
**MoE models on multi-GPU need `--distributed-executor-backend mp`** --
310+
Qwen3-235B, GLM-4.5, MiniMax-M2. The default distributed executor does not
311+
work reliably with MoE on ROCm.
312+
313+
**OOM during HIP graph capture** -- If the container logs show "out of memory"
314+
after "capturing CUDA graphs" or "capturing HIP graphs", the model fits in
315+
VRAM but there isn't enough headroom for graph capture. Retry with
316+
`--enforce-eager` added to the vLLM args. This disables graph capture and
317+
frees 1-2 GB. Trade-off: slightly higher decode latency, but the model runs.
318+
319+
**"Engine core initialization failed"** -- This opaque error means the engine
320+
core subprocess died. Check early container logs: `docker logs <name> 2>&1 |
321+
head -50`. Common causes: gated model access denied (license not accepted on
322+
HF), unsupported architecture on this vLLM version, OOM during weight loading,
323+
missing `--trust-remote-code` for custom architectures, or vLLM version too old
324+
for the model (check `min_vllm_version` in the recipe).
325+
326+
**`/dev/kfd` permission denied** -- User is not in the `video` or `render`
327+
group. Fix: `sudo usermod -aG video,render $USER` (requires re-login).
328+
329+
**SSH key not configured** -- The scripts use `BatchMode=yes` SSH. If SSH
330+
fails with `Permission denied (publickey)`, configure key-based access first.
331+
332+
**Restricting GPUs on shared hosts** -- Use `--env HIP_VISIBLE_DEVICES=0,1`
333+
or `--env CUDA_VISIBLE_DEVICES=0,1` to target specific GPUs by index.
334+
`HIP_VISIBLE_DEVICES` is the canonical AMD variable; `CUDA_VISIBLE_DEVICES`
335+
also works (ROCm maps it). Never set either to an empty string.
336+
337+
---
338+
339+
## Reference
340+
341+
Precision compatibility, VRAM estimation, Docker flags, and known quirks:
342+
[reference.md](reference.md)
