|
| 1 | +--- |
| 2 | +name: serving-llms-on-instinct |
| 3 | +description: >- |
| 4 | + Serves AI models on AMD Instinct GPU hardware using vLLM. Use this skill |
| 5 | + whenever the user wants to run, serve, deploy, start, host, or launch a |
| 6 | + language model on an AMD GPU, AMD Instinct, MI300X, MI325X, MI350X, or MI355X. |
| 7 | + Also use when the user mentions vLLM on ROCm, vLLM on AMD, serving on HBM, |
| 8 | + or asks how to get a model running on AMD data center hardware. Use when the |
| 9 | + user asks "run Qwen3", "serve DeepSeek", "start a vLLM endpoint", "get a |
| 10 | + model running on my AMD machine", or any similar phrasing. Handles the full |
| 11 | + flow: GPU detection, environment validation, vLLM configuration, launch, and |
| 12 | + health verification. Do not use for NVIDIA GPUs, consumer AMD GPUs (RX |
| 13 | + series, Radeon), Ryzen AI, NPU, MI250X, or MI100. |
| 14 | +allowed-tools: Bash, Read |
| 15 | +--- |
| 16 | + |
| 17 | +# Serving LLMs on AMD Instinct |
| 18 | + |
| 19 | +Get a vLLM endpoint running on AMD Instinct GPU hardware. |
| 20 | + |
| 21 | +## Prerequisites |
| 22 | + |
| 23 | +- ROCm driver and `amd-smi` installed on the GPU host |
| 24 | +- Docker running and accessible (check with `docker ps`) |
| 25 | +- `/dev/kfd` and `/dev/dri` present on the GPU host |
| 26 | +- HuggingFace token in `HF_TOKEN` env var (required for gated models; not |
| 27 | + required for Qwen3). For gated models (Llama 3.2, Gemma, etc.), |
| 28 | + the HF token must belong to an account that has accepted the model's license |
| 29 | + at `huggingface.co/<model_id>`. A valid token without license acceptance will |
| 30 | + fail with an opaque "Engine core initialization failed" error. |
| 31 | +- For remote GPU: SSH key access configured (`ssh <user>@<host>` must work |
| 32 | + without a password prompt). If only password access is available, set up |
| 33 | + keys first: `ssh-copy-id <user>@<host>` |
| 34 | + |
| 35 | +## Data files |
| 36 | + |
| 37 | +Read these files directly to get model and GPU configuration: |
| 38 | + |
| 39 | +- **`data/recipes_cache.json`** -- model configs synced from |
| 40 | + [vllm-project/recipes](https://github.com/vllm-project/recipes). Each entry |
| 41 | + under `models.<HF_ID>.recipe` contains the full recipe with `model.base_args`, |
| 42 | + `model.base_env`, `features.tool_calling.args`, `features.reasoning.args`, |
| 43 | + `hardware_overrides.amd.extra_args`, `hardware_overrides.amd.extra_env`. |
| 44 | + The top-level `docker_image` field has the latest resolved vLLM ROCm image. |
| 45 | + |
| 46 | +- **`data/gpu_overrides.json`** -- GPU-specific configuration. Contains |
| 47 | + `docker_flags` (mandatory for all AMD Instinct), `gpu_configs` keyed by |
| 48 | + gfx_version with `env_defaults` and `workarounds`, and `legacy_models` for |
| 49 | + models not yet in vLLM recipes. |
| 50 | + |
| 51 | +- **`data/blacklist.json`** -- models in vLLM recipes that cannot be served |
| 52 | + as LLM endpoints. Includes diffusion/image/audio generation models, embedding |
| 53 | + models, rerankers, ASR models needing audio pipelines, and models requiring |
| 54 | + unreleased vLLM nightly builds. Check this before attempting to serve a model. |
| 55 | + If the user requests a blacklisted model, explain why it won't work and |
| 56 | + suggest an alternative. |
| 57 | + |
| 58 | +If the user doesn't specify a model, default to **Qwen/Qwen3.5-9B**: dense |
| 59 | +multimodal with MTP, Apache 2.0 license (no HF token needed), fits on a single |
| 60 | +GPU, strong reasoning and tool-calling. |
| 61 | + |
| 62 | +## Step 1: Detect the GPU |
| 63 | + |
| 64 | +```bash |
| 65 | +python3 scripts/detect.py |
| 66 | +# Remote: |
| 67 | +python3 scripts/detect.py --host user@hostname |
| 68 | +``` |
| 69 | + |
| 70 | +Returns JSON with `gfx_version`, `vram_gb`, `gpu_count`, `rocm_version`. |
| 71 | + |
| 72 | +| gfx_version | Hardware | VRAM | |
| 73 | +|---|---|---| |
| 74 | +| gfx950 | MI350X / MI355X | 288 GB HBM3E | |
| 75 | +| gfx942 | MI300X (192 GB) / MI325X (256 GB) / MI300A (128 GB) | varies | |
| 76 | + |
| 77 | +If `gfx_version` is `unknown`: `amd-smi` ran but found no GPU. Check |
| 78 | +`lsmod | grep amdgpu`. |
| 79 | + |
| 80 | +## Step 2: Validate the environment |
| 81 | + |
| 82 | +```bash |
| 83 | +python3 scripts/validate.py --auto-fix |
| 84 | +# Remote: |
| 85 | +python3 scripts/validate.py --auto-fix --host user@hostname |
| 86 | +``` |
| 87 | + |
| 88 | +Returns JSON with `ready` (bool), `errors`, `warnings`, `fixes_applied`. |
| 89 | +Do not proceed if `ready` is `false`. |
| 90 | + |
| 91 | +## Step 3: Refresh recipes (if stale) |
| 92 | + |
| 93 | +Check `fetched_at` in `data/recipes_cache.json`. If older than 24 hours or |
| 94 | +the file is missing, refresh: |
| 95 | + |
| 96 | +```bash |
| 97 | +python3 scripts/sync_recipes.py |
| 98 | +``` |
| 99 | + |
| 100 | +This shallow-clones vllm-project/recipes from GitHub and fetches the latest |
| 101 | +Docker tag from Docker Hub. Takes ~10 seconds. If it fails, the existing |
| 102 | +cache still works. |
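The staleness check above could be sketched roughly as follows. This assumes `fetched_at` is a top-level ISO 8601 string in `data/recipes_cache.json`; the actual format is not stated here, and the helper name `cache_is_stale` is hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

MAX_AGE = timedelta(hours=24)


def cache_is_stale(cache_path, now=None):
    """Return True if the cache is missing, unreadable, or older than 24 hours."""
    now = now or datetime.now(timezone.utc)
    path = Path(cache_path)
    if not path.exists():
        return True
    try:
        fetched_at = json.loads(path.read_text())["fetched_at"]
        # Assumption: ISO 8601 string; normalize a trailing "Z" for older Pythons.
        stamp = datetime.fromisoformat(fetched_at.replace("Z", "+00:00"))
    except (KeyError, TypeError, ValueError, AttributeError):
        # Treat anything unparseable as stale so a refresh is attempted.
        return True
    if stamp.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)
    return now - stamp > MAX_AGE
```

If this returns `True`, run `python3 scripts/sync_recipes.py`; a failed refresh still leaves any existing cache usable.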
| 103 | + |
| 104 | +## Step 4: Construct the Docker command |
| 105 | + |
| 106 | +Read `data/recipes_cache.json` and `data/gpu_overrides.json` directly. |
| 107 | +Build the Docker command by combining: |
| 108 | + |
| 109 | +1. **Docker flags** from `gpu_overrides.json > docker_flags` (mandatory for all AMD Instinct GPUs) |
| 110 | +2. **HF cache mount**: `-v ~/.cache/huggingface:/root/.cache/huggingface` |
| 111 | + (if a shared model cache directory exists on the host, check whether |
| 112 | + `models--*` directories are at the cache root or inside a `hub/` |
| 113 | + subdirectory -- mount accordingly to `/root/.cache/huggingface` or |
| 114 | + `/root/.cache/huggingface/hub`) |
| 115 | +3. **Port**: `-p <port>:<port>` (default 8000) |
| 116 | +4. **Environment variables**: merge `gpu_configs.<gfx_version>.env_defaults` |
| 117 | + with the recipe's `model.base_env` and `hardware_overrides.amd.extra_env`. |
| 118 | + Always add `--env HF_TOKEN=${HF_TOKEN}`. |
| 119 | +5. **Docker image**: use `docker_image` from `recipes_cache.json` top level |
| 120 | + (unless the model needs a pinned image, e.g. GLM-4.5 needs `v0.15.1`). |
| 121 | + If the user specifies a Docker image version, check it against the recipe's |
| 122 | + `model.min_vllm_version`. Warn if the image is older -- the model may crash |
| 123 | + on startup with an opaque "Engine core initialization failed" error. |
| 124 | +6. **Model ID**: `--model <HF_ID>` |
| 125 | +7. **vLLM args**: combine the recipe's `model.base_args` + |
| 126 | + `hardware_overrides.amd.extra_args` + `features.tool_calling.args` + |
| 127 | + `features.reasoning.args`. Add `--enable-auto-tool-choice` if not present. |
| 128 | + For multi-GPU, add `--tensor-parallel-size N` (see VRAM estimation below). |
| 129 | + For MoE models on multi-GPU, also add `--distributed-executor-backend mp`. |
| 130 | +8. **Port arg**: `--port <port>` |
| 131 | + |
| 132 | +If the exact model ID is not in `recipes_cache.json`, check for a base model |
| 133 | +match by stripping date/version suffixes (e.g., `Kimi-K2-Instruct` matches |
| 134 | +`Kimi-K2-Instruct-0905`). Use the base model's recipe if found. |
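One way to sketch the base-model fallback is shown below. The suffix pattern is an assumption (a trailing 4- or 8-digit date, or a `-vN` version tag); the text above does not define which suffixes count, and `find_recipe_id` is a hypothetical helper name.

```python
import re

# Assumption: a "date/version suffix" is a trailing -DDDD, -DDDDDDDD, or -vN[.N] segment.
_SUFFIX = re.compile(r"-(?:\d{8}|\d{4}|v\d+(?:\.\d+)*)$", re.IGNORECASE)


def find_recipe_id(requested_id, recipe_ids):
    """Return the recipe key to use for requested_id, or None if nothing matches."""
    if requested_id in recipe_ids:
        return requested_id  # exact match wins
    base = _SUFFIX.sub("", requested_id)
    if base != requested_id and base in recipe_ids:
        return base  # fall back to the base model's recipe
    return None  # caller continues with legacy_models, then the generic config
```

For example, with `recipe_ids = set(cache["models"])`, a request for `moonshotai/Kimi-K2-Instruct-0905` would resolve to `moonshotai/Kimi-K2-Instruct` when only the latter has a recipe.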
| 135 | + |
| 136 | +If no recipe match, check `legacy_models` in `gpu_overrides.json`. If not |
| 137 | +there either, use a generic config with |
| 138 | +`--enable-auto-tool-choice --trust-remote-code --tool-call-parser hermes`. |
| 139 | + |
| 140 | +**Precision variant selection:** Recipes may offer variants (default, fp8, |
| 141 | +nvfp4). Check `gpu_configs.<gfx_version>.precision.native` in |
| 142 | +`gpu_overrides.json` before selecting a variant. On gfx942 (MI300X), only |
| 143 | +`bf16`, `fp16`, `fp8_fnuz`, and `int8` are hardware-native. MXFP4 and NVFP4 |
| 144 | +compute is emulated (dequant to BF16 during matmul), but weights stay |
| 145 | +compressed in VRAM so quantized models still fit in less memory. |
| 146 | +On gfx950 (MI350X), MXFP4 is hardware-native. |
| 147 | + |
| 148 | +**VRAM estimation and fit check:** Before constructing the Docker command, |
| 149 | +estimate whether the model fits the available hardware: |
| 150 | +```bash |
| 151 | +python3 scripts/estimate_vram.py --model-id <HF_ID> --vram-gb <per_gpu_vram> --tp <N> |
| 152 | +``` |
| 153 | +This queries the HuggingFace Hub API (no model download) and returns JSON with: |
| 154 | +- `weight_memory_gb` -- total weight size |
| 155 | +- `kv_cache_bytes_per_token` -- KV cache cost per token at BF16 |
| 156 | +- `fit.weights_fit` -- whether weights fit at the given TP |
| 157 | +- `fit.recommended_max_model_len` -- max context the GPU can serve |
| 158 | +- `fit.context_limited` -- true if KV cache limits context below the |
| 159 | + model's native max |
| 160 | +- `fit.min_tp_required` -- minimum TP needed (only if weights don't fit) |
| 161 | + |
| 162 | +**Understanding the overhead:** The script reserves ~4 GB for vLLM's runtime |
| 163 | +overhead (activation profiling, HIP graph capture, internal buffers). During |
| 164 | +startup, vLLM runs a profiling forward pass to measure peak activations, then |
| 165 | +captures HIP graphs for optimized decode. This startup peak is higher than |
| 166 | +steady-state. The `remaining_for_kv_gb` field reflects what's left after |
| 167 | +weights and this overhead. |
| 168 | + |
| 169 | +Use `remaining_for_kv_gb` to decide: |
| 170 | + |
| 171 | +1. **`remaining_for_kv_gb >= 6`**: safe to run. If `context_limited: true`, |
| 172 | + add `--max-model-len <recommended_max_model_len>` to the vLLM args. |
| 173 | + Mention the FP8 KV cache option (`--kv-cache-dtype fp8`) if the user |
| 174 | + needs longer context (`fit.max_seq_len_fp8_kv` shows the gain). |
| 175 | +2. **`remaining_for_kv_gb` between 2 and 6**: tight but worth trying. Launch |
| 176 | + normally. If vLLM OOMs during HIP graph capture (check container logs for |
| 177 | + "out of memory" after "capturing CUDA/HIP graphs"), retry with |
| 178 | + `--enforce-eager` added to the vLLM args. This skips graph capture and |
| 179 | + frees 1-2 GB. The only cost is slightly higher decode latency. |
| 180 | +3. **`remaining_for_kv_gb < 2`**: too tight. Will likely OOM during the |
| 181 | + activation profiling step. Do not attempt. |
| 182 | +4. **`weights_fit: false` with multiple GPUs**: re-run with |
| 183 | + `--tp <min_tp_required>` and check again. |
| 184 | +5. **`weights_fit: false`, not enough GPUs**: look for quantized |
| 185 | + alternatives in this order: |
| 186 | + a. **Recipe variants**: the recipe may have `fp8` or `mxfp4` variants |
| 187 | + with a different `model_id` that points to a quantized checkpoint. |
| 188 | + b. **Same provider**: many providers release quantized versions alongside |
| 189 | + the base model (e.g. `Qwen/Qwen3.5-122B-FP8` from Qwen). Search |
| 190 | + HuggingFace for `<provider>/<model-name>` with FP8/GPTQ/AWQ suffixes. |
| 191 | + c. **AMD quantized**: AMD's Quark team publishes quantized models under |
| 192 | + the `amd/` org on HuggingFace (e.g. `amd/Kimi-K2-Instruct-w-mxfp4-a-fp8`). |
| 193 | + Search for `amd/<model-name>` variants. |
| 194 | + Run `estimate_vram.py` on the quantized model ID to verify it fits, |
| 195 | + then use that model ID instead. |
| 196 | +6. **Still doesn't fit**: tell the user the model requires more VRAM than |
| 197 | + available and suggest either a smaller model or multi-GPU hardware. |
| 198 | + Do not attempt to launch. |
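The decision list above can be condensed into a small function. This is a sketch under assumptions: `remaining_for_kv_gb` is read from the top level of the `estimate_vram.py` output and the other fields from `fit`, the action labels are invented for illustration, and applying `--max-model-len` in the 2-6 GB band is an interpretation of "launch normally".

```python
def plan_launch(est, gpu_count, tp=1):
    """Map an estimate_vram.py result to an action and any extra vLLM args."""
    fit = est["fit"]
    if not fit["weights_fit"]:
        min_tp = fit.get("min_tp_required")
        if min_tp is not None and min_tp <= gpu_count:
            # Enough GPUs: re-run the estimate at the higher TP and check again.
            return {"action": "rerun_estimate", "tp": min_tp, "extra_args": []}
        # Not enough GPUs: look for a quantized checkpoint instead.
        return {"action": "find_quantized", "tp": tp, "extra_args": []}

    remaining = est["remaining_for_kv_gb"]
    if remaining < 2:
        # Likely to OOM during activation profiling.
        return {"action": "do_not_launch", "tp": tp, "extra_args": []}

    extra_args = []
    if fit.get("context_limited"):
        extra_args += ["--max-model-len", str(fit["recommended_max_model_len"])]
    # Between 2 and 6 GB: launch, and retry with --enforce-eager if graph capture OOMs.
    action = "launch" if remaining >= 6 else "launch_retry_eager_on_oom"
    return {"action": action, "tp": tp, "extra_args": extra_args}
```

The function only classifies the result; searching for quantized alternatives and re-running the estimate remain separate steps.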
| 199 | + |
| 200 | +Docker command template: |
| 201 | +``` |
| 202 | +docker run -d --name vllm-<model-slug> \ |
| 203 | + <docker_flags> \ |
| 204 | + -v <hf_cache_mount> \ |
| 205 | + -p <port>:<port> \ |
| 206 | + --env <key>=<value> (for each env var) \ |
| 207 | + --env HF_TOKEN=${HF_TOKEN} \ |
| 208 | + <docker_image> \ |
| 209 | + --model <model_id> \ |
| 210 | + <vllm_args> \ |
| 211 | + --port <port> |
| 212 | +``` |
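As an illustration of how the pieces of the template fit together, the sketch below assembles the command as an argv list from the two parsed JSON files. The value types are assumptions (`docker_flags` and the `*_args` fields as lists of strings, the `*_env` fields as mappings), as are the merge precedence, the slug derivation, and the `is_moe` and `hf_cache` parameters. It also swaps `--env HF_TOKEN=${HF_TOKEN}` for the bare `--env HF_TOKEN` form, which has Docker copy the value from the calling environment, because no shell is involved to expand the variable.

```python
import os


def build_docker_command(cache, overrides, model_id, gfx_version,
                         port=8000, tp=1, is_moe=False, hf_cache=None):
    """Assemble the docker run argv in the order of the template above."""
    recipe = cache["models"][model_id]["recipe"]
    model = recipe.get("model", {})
    amd = recipe.get("hardware_overrides", {}).get("amd", {})
    features = recipe.get("features", {})

    # Merge env vars; later sources override earlier ones (an assumed precedence).
    env = dict(overrides["gpu_configs"][gfx_version].get("env_defaults", {}))
    env.update(model.get("base_env", {}))
    env.update(amd.get("extra_env", {}))

    # Combine vLLM args: base + AMD overrides + tool calling + reasoning.
    args = list(model.get("base_args", []))
    args += amd.get("extra_args", [])
    args += features.get("tool_calling", {}).get("args", [])
    args += features.get("reasoning", {}).get("args", [])
    if "--enable-auto-tool-choice" not in args:
        args.append("--enable-auto-tool-choice")
    if tp > 1:
        args += ["--tensor-parallel-size", str(tp)]
        if is_moe:
            args += ["--distributed-executor-backend", "mp"]

    hf_cache = hf_cache or os.path.expanduser("~/.cache/huggingface")
    slug = model_id.split("/")[-1].lower()  # hypothetical slug rule

    cmd = ["docker", "run", "-d", "--name", f"vllm-{slug}"]
    cmd += list(overrides["docker_flags"])
    cmd += ["-v", f"{hf_cache}:/root/.cache/huggingface"]
    cmd += ["-p", f"{port}:{port}"]
    for key, value in env.items():
        cmd += ["--env", f"{key}={value}"]
    cmd += ["--env", "HF_TOKEN"]  # no "=value": taken from the caller's environment
    cmd += [cache["docker_image"], "--model", model_id]
    cmd += args
    cmd += ["--port", str(port)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` locally, or joined with `shlex.join(cmd)` for use over SSH. A pinned image or a `--max-model-len` cap would be applied by the caller before launch.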
| 213 | + |
| 214 | +## Step 5: Confirm with the user |
| 215 | + |
| 216 | +Before launching, present a summary and ask the user to confirm: |
| 217 | +- **Model**: full HuggingFace ID (e.g. `Qwen/Qwen3.5-122B-Instruct`) |
| 218 | +- **Precision**: variant being used (e.g. BF16, FP8) and why |
| 219 | +- **Weight memory**: from estimate_vram.py |
| 220 | +- **GPU**: detected hardware and VRAM |
| 221 | +- **TP**: tensor parallelism degree (1, 2, 4, 8) |
| 222 | +- **Context**: max achievable context length (and whether it's limited) |
| 223 | +- **Port**: which port the endpoint will be on |
| 224 | + |
| 225 | +If a quantized alternative was selected (Step 4 fit check), explain that |
| 226 | +the original model doesn't fit and which alternative is being used. |
| 227 | + |
| 228 | +Wait for the user's confirmation before proceeding. |
| 229 | + |
| 230 | +## Step 6: Launch and verify |
| 231 | + |
| 232 | +Before launching, check for port conflicts: |
| 233 | +```bash |
| 234 | +ss -tlnp 2>/dev/null | grep ':<port> ' |
| 235 | +``` |
| 236 | +If a Docker container is on that port, stop it with `docker rm -f <name>`. |
| 237 | + |
| 238 | +Run the Docker command. Then poll health using this loop: |
| 239 | + |
| 240 | +```bash |
| 241 | +while docker inspect --format='{{.State.Running}}' <container_name> 2>/dev/null | grep -q true; do |
| 242 | + curl -sf http://localhost:<port>/health && echo "READY" && exit 0 |
| 243 | + sleep 60 |
| 244 | +done |
| 245 | +echo "FAILED -- container exited"; exit 1 |
| 246 | +``` |
| 247 | + |
| 248 | +A 503 during loading is normal. Choose the polling strategy based on |
| 249 | +model size (`weight_memory_gb` from `estimate_vram.py`): |
| 250 | + |
| 251 | +- **Small models (< 100 GB weights)**: run the poll as a blocking command |
| 252 | + with the Bash tool's `timeout` set to 600000 (10 minutes). Most cached |
| 253 | + models are ready within 2-5 minutes. |
| 254 | +- **Large models (>= 100 GB weights)**: run the poll with the Bash tool's |
| 255 | + `run_in_background` set to `true`. Then use `TaskOutput` with |
| 256 | + `block: true` and `timeout: 600000` to wait up to 10 minutes per check. |
| 257 | + If the task is still running after that, call `TaskOutput` again with |
| 258 | + the same parameters. This uses only 1 turn per 10-minute wait instead |
| 259 | + of burning a turn every check. The background loop runs until the |
| 260 | + container is healthy or dies. |
| 261 | + |
| 262 | +After health returns 200, send a warmup request (triggers HIP kernel compilation, |
| 263 | +~40-45 seconds on gfx942): |
| 264 | +```bash |
| 265 | +curl -s http://localhost:<port>/v1/chat/completions \ |
| 266 | + -H "Content-Type: application/json" \ |
| 267 | + -d '{"model":"<model_id>","messages":[{"role":"user","content":"say hi"}],"max_tokens":5}' |
| 268 | +``` |
| 269 | + |
| 270 | +Return to the user: |
| 271 | +- `base_url`: `http://<host>:<port>/v1` |
| 272 | +- `api_key`: none required for local |
| 273 | +- `model`: the model ID used |
| 274 | + |
| 275 | +## Remote vs. local |
| 276 | + |
| 277 | +All scripts accept `--host user@hostname`. When given, they SSH to the target. |
| 278 | +Set `ROCM_SSH_HOST` and `ROCM_SSH_USER` env vars to avoid passing `--host` |
| 279 | +every time. |
| 280 | + |
| 281 | +For remote Docker commands, run them over SSH: |
| 282 | +```bash |
| 283 | +ssh user@host 'docker run -d ...' |
| 284 | +``` |
| 285 | +Use `localhost` for health/warmup curl URLs (curl runs on the remote host). |
| 286 | + |
| 287 | +## Gotchas |
| 288 | + |
| 289 | +**`CUDA_VISIBLE_DEVICES` set to empty string** -- ROCm maps this variable to |
| 290 | +`HIP_VISIBLE_DEVICES`. Setting it to an empty string hides all GPUs. |
| 291 | +`CUDA_VISIBLE_DEVICES=0,1` works fine for restricting GPUs (same as |
| 292 | +`HIP_VISIBLE_DEVICES=0,1`). If the host has it set to empty, unset it: |
| 293 | +`unset CUDA_VISIBLE_DEVICES`. Do not pass `--env CUDA_VISIBLE_DEVICES=` (empty) |
| 294 | +into Docker -- that also hides all GPUs inside the container. |
| 295 | + |
| 296 | +**FP4BMM crash on gfx942 (MI300X)** -- If the container exits immediately |
| 297 | +with a segfault or illegal instruction: `VLLM_ROCM_USE_AITER_FP4BMM` must be |
| 298 | +`0` on gfx942. This is set correctly in `gpu_overrides.json` for gfx942. |
| 299 | +See vLLM issue #34641. |
| 300 | + |
| 301 | +**`HIP error: no kernel image`** -- The Docker image has no compiled kernel |
| 302 | +for your GPU's gfx version. Use `vllm/vllm-openai-rocm:latest`; it includes |
| 303 | +gfx942 and gfx950 kernels. |
| 304 | + |
| 305 | +**MLA models need `--block-size 1`** -- DeepSeek-R1/V3, Kimi-K2.5. |
| 306 | +Without it the MLA attention backend silently falls back to a slower path. |
| 307 | +This is in the recipe args for these models. |
| 308 | + |
| 309 | +**MoE models on multi-GPU need `--distributed-executor-backend mp`** -- |
| 310 | +Qwen3-235B, GLM-4.5, MiniMax-M2. The default distributed executor does not |
| 311 | +work reliably with MoE on ROCm. |
| 312 | + |
| 313 | +**OOM during HIP graph capture** -- If the container logs show "out of memory" |
| 314 | +after "capturing CUDA graphs" or "capturing HIP graphs", the model fits in |
| 315 | +VRAM but there isn't enough headroom for graph capture. Retry with |
| 316 | +`--enforce-eager` added to the vLLM args. This disables graph capture and |
| 317 | +frees 1-2 GB. Trade-off: slightly higher decode latency, but the model runs. |
| 318 | + |
| 319 | +**"Engine core initialization failed"** -- This opaque error means the engine |
| 320 | +core subprocess died. Check early container logs: `docker logs <name> 2>&1 | |
| 321 | +head -50`. Common causes: gated model access denied (license not accepted on |
| 322 | +HF), unsupported architecture on this vLLM version, OOM during weight loading, |
| 323 | +missing `--trust-remote-code` for custom architectures, or vLLM version too old |
| 324 | +for the model (check `min_vllm_version` in the recipe). |
| 325 | + |
| 326 | +**`/dev/kfd` permission denied** -- User is not in the `video` or `render` |
| 327 | +group. Fix: `sudo usermod -aG video,render $USER` (requires re-login). |
| 328 | + |
| 329 | +**SSH key not configured** -- The scripts use `BatchMode=yes` SSH. If SSH |
| 330 | +fails with `Permission denied (publickey)`, configure key-based access first. |
| 331 | + |
| 332 | +**Restricting GPUs on shared hosts** -- Use `--env HIP_VISIBLE_DEVICES=0,1` |
| 333 | +or `--env CUDA_VISIBLE_DEVICES=0,1` to target specific GPUs by index. |
| 334 | +`HIP_VISIBLE_DEVICES` is the canonical AMD variable; `CUDA_VISIBLE_DEVICES` |
| 335 | +also works (ROCm maps it). Never set either to an empty string. |
| 336 | + |
| 337 | +--- |
| 338 | + |
| 339 | +## Reference |
| 340 | + |
| 341 | +Precision compatibility, VRAM estimation, Docker flags, and known quirks: |
| 342 | +[reference.md](reference.md) |