Commit beaad18

Edwardf0t1 and claude committed
[skills] address PR #1595 second-pass review (cjluo-nv)
- Blackwell image: simplify to 'B300/GB300 -> append -cu130 to the (multi-arch) image tag' (e.g. v0.19.1-cu130); keep a one-line nightly fallback for archs a pinned release predates (qwen3_5). Applied in eval + deployment skills.
- gres: defer to NEL's internal/slurm/<cluster> execution configs (PR #1599) when present (they pre-fill gres/hostname/partition); keep the manual fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent 2fa127e commit beaad18

3 files changed

Lines changed: 24 additions & 26 deletions


.claude/skills/deployment/SKILL.md

Lines changed: 10 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -125,17 +125,18 @@ python -m vllm.entrypoints.openai.api_server \
125125

126126
For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
127127

128-
> **NVFP4 on Blackwell needs the CUDA-13 vLLM build.** On B200/B300/GB200/GB300
129-
> (compute capability sm_100/sm_103), use `vllm/vllm-openai:cu130-nightly-<arch>`
130-
> (`-x86_64`, or `-aarch64` on Grace). The common `v0.19.1` / any `cu129`
131-
> (CUDA 12.9) build has **no sm_103 FP4 kernels** — vLLM loads the checkpoint
128+
> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag**
129+
> (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The
130+
> default cu12 build has **no sm_103 FP4 kernel**, so vLLM loads the checkpoint
132131
> then dies at engine init with `CUDA error: no kernel image is available for
133132
> execution on the device` (affects the `flashinfer` and `cutlass` NVFP4
134-
> backends; `marlin` separately fails on non-64-divisible layer dims). Verify the
135-
> image via `recipes.vllm.ai/<org>/<model>?hardware=b300` (JS-rendered — fetch the
136-
> raw markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`).
137-
> For multimodal models on sm_103, also pass `--mm-encoder-attn-backend
138-
> TRITON_ATTN` (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x").
133+
> backends; `marlin` separately fails on non-64-divisible layer dims). If a
134+
> pinned release predates the model's arch, use `cu130-nightly-<arch>` instead
135+
> (Qwen3.5-9B's `qwen3_5` needed it). Cross-check via
136+
> `recipes.vllm.ai/<org>/<model>?hardware=b300` (JS-rendered — fetch the raw
137+
> markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`). For
138+
> multimodal models on sm_103, also pass `--mm-encoder-attn-backend TRITON_ATTN`
139+
> (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x").
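The tag rule in the note above can be sketched as a small helper. This is an illustration only: `cu130_image` and `pick_image` are hypothetical names that are not part of vLLM or this repository, and passing the compute capability as a string such as `"sm_103"` is an assumption made for the example.

```python
def cu130_image(image: str) -> str:
    """Append -cu130 to the tag of an image reference (hypothetical helper)."""
    repo, sep, tag = image.rpartition(":")
    # No ':' at all, or the last ':' belongs to a registry port (tag part has '/').
    if not sep or "/" in tag:
        raise ValueError(f"image reference has no tag: {image!r}")
    if tag.endswith("-cu130"):
        return image  # already a CUDA-13 tag, leave it alone
    return f"{repo}:{tag}-cu130"


def pick_image(image: str, compute_capability: str) -> str:
    """Switch to the -cu130 tag only for sm_103 (B300/GB300), as the note describes."""
    if compute_capability == "sm_103":
        return cu130_image(image)
    return image


print(pick_image("vllm/vllm-openai:v0.19.1", "sm_103"))
```

The helper does not cover the nightly fallback: when a pinned release predates the model's arch, the note says to choose `cu130-nightly-<arch>` by hand.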
139140
140141
#### SGLang
141142

.claude/skills/evaluation/SKILL.md

Lines changed: 2 additions & 2 deletions
@@ -144,7 +144,7 @@ For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipelin
144144

145145
**Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
146146

147-
> **NVFP4 on Blackwell needs a CUDA-13 vLLM build.** On B200/B300/GB200/GB300 (sm_100/sm_103) the pinned `v0.19.1` and all `cu129` builds lack sm_103 FP4 kernels — engine init dies with `CUDA error: no kernel image is available for execution on the device`. Use `vllm/vllm-openai:v0.19.1-cu130` (pinned, matches the default image), and bump to `cu130-nightly-<arch>` only if it lacks the model's arch (Qwen3.5-9B's `qwen3_5` needed the nightly). Multimodal on sm_103 also needs `--mm-encoder-attn-backend TRITON_ATTN` (ViT flash-attn workaround). Full note + the `recipes.vllm.ai ?hardware=b300` lookup are in `recipes/examples/example_eval.yaml`.
147+
> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag** (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The default cu12 build has no sm_103 FP4 kernel, so engine init dies with `CUDA error: no kernel image is available`. If a pinned release predates the model's arch, use `cu130-nightly-<arch>` (Qwen3.5-9B's `qwen3_5` needed it, vLLM 0.19.2rc1.dev134). Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`. Full note in `recipes/examples/example_eval.yaml`.
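The version rule in the **Image / vLLM version** paragraph above (stay on the pinned tag unless the recipe states a higher minimum) can be sketched as follows. `parse_version` and `image_tag_for` are hypothetical names, and the sketch assumes plain `MAJOR.MINOR.PATCH` versions with an optional leading `v`; it does not handle release candidates or nightlies.

```python
import re
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'v0.19.1' or '0.20.0' into a comparable tuple (hypothetical helper)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def image_tag_for(pinned: str, recipe_minimum: Optional[str]) -> str:
    """Keep the pinned tag unless the recipe requires a newer vLLM version."""
    if recipe_minimum is not None:
        required = parse_version(recipe_minimum)
        if required > parse_version(pinned):
            return "v{}.{}.{}".format(*required)
    return pinned


print(image_tag_for("v0.19.1", "0.20.0"))
```

Returning an explicit tag rather than `:latest` keeps the result reproducible across re-runs, which is the reason the paragraph gives for pinning.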
148148

149149
#### vLLM-backend defaults — always include unless the recipe *contradicts*
150150

@@ -200,7 +200,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
200200
- Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
201201
- **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
202202
- Ask about other defaults they may want to change (partition, walltime, MLflow tags).
203-
- **`execution.gres`** — NEL defaults to `gpu:8`. Set it to the cluster's per-node GPU count (and what the QOS permits), and match `--data-parallel-size`/`--tensor-parallel-size` to it. A mismatch makes `sbatch` reject the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → set `gres: gpu:4`). Confirm the node GPU count with `sinfo -o '%P %G'` on the target cluster.
203+
- **`execution.gres`** — if your NEL install ships an `internal/slurm/<cluster>` execution config, prefer it (it pre-fills `gres`/hostname/partition/node-exclusivity). Otherwise NEL defaults to `gpu:8`; set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`), or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → `gres: gpu:4`; check with `sinfo -o '%P %G'`).
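For the manual fallback in the `execution.gres` bullet, the GPU count can be read from the GRES column that `sinfo -o '%P %G'` prints. The sketch below is a minimal illustration: `gpus_per_node` and `gres_value` are hypothetical names, and it assumes GRES strings of the form `gpu:<count>` or `gpu:<type>:<count>`, optionally followed by a parenthesised socket hint. Check real `sinfo` output on the target cluster before relying on it.

```python
import re


def gpus_per_node(gres: str) -> int:
    """Extract the GPU count from a Slurm GRES string (hypothetical helper)."""
    for entry in gres.split(","):
        # Accepts 'gpu:4', 'gpu:h100:8' and 'gpu:h100:8(S:0-1)'.
        match = re.fullmatch(r"gpu(?::[^:()]+)?:(\d+)(?:\(.*\))?", entry.strip())
        if match:
            return int(match.group(1))
    raise ValueError(f"no GPU entry found in GRES string: {gres!r}")


def gres_value(gres: str) -> str:
    """Build the value for execution.gres from a node's GRES string."""
    return f"gpu:{gpus_per_node(gres)}"


print(gres_value("gpu:gb300:4"))
```

When the NEL install ships an `internal/slurm/<cluster>` execution config, the bullet above says to prefer it, so a helper like this only matters for the manual path.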
204204

205205
**Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
206206

.claude/skills/evaluation/recipes/examples/example_eval.yaml

Lines changed: 12 additions & 15 deletions
@@ -50,9 +50,10 @@ execution:
5050
account: ???
5151
output_dir: ???
5252
walltime: "04:00:00"
53-
# gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what
54-
# the QOS allows) or sbatch fails "Requested node configuration is not available".
55-
# gres: gpu:4 # e.g. GB300 (4 GPUs); match --tensor-parallel-size/--data-parallel-size to it (see references/parallelism.md)
53+
# gres defaults to gpu:8. If your NEL install ships an internal/slurm/<cluster>
54+
# execution config, prefer it (it pre-fills gres/hostname/partition). Otherwise
55+
# set gres to the node's GPU count or sbatch fails "Requested node configuration
56+
# is not available"; match --tensor/--data-parallel-size to it (references/parallelism.md).
5657
mounts:
5758
mount_home: false
5859
auto_export: # REQUIRED trigger for auto-export. Without this, the
@@ -65,18 +66,14 @@ deployment:
6566
hf_model_handle:
6667
served_model_name: ???
6768
image: vllm/vllm-openai:v0.19.1
68-
# NVFP4 on Blackwell (B200/B300/GB200/GB300, sm_100/sm_103): the v0.19.1 and
69-
# any cu129 builds have NO sm_103 FP4 kernels (deploy dies with CUDA
70-
# "no kernel image is available"). Use a CUDA-13 build — prefer the pinned
71-
# release matching this image, bump to nightly only if it lacks the arch:
72-
# image: vllm/vllm-openai:v0.19.1-cu130 # reproducible; verify arch support
73-
# image: vllm/vllm-openai:cu130-nightly-x86_64 # newest (-aarch64 on Grace)
74-
# (Qwen3.5-9B's qwen3_5 arch needed the nightly = vLLM 0.19.2rc1.dev134; the
75-
# pinned v0.19.1-cu130 is untested for it.) Confirm via recipes.vllm.ai
76-
# ?hardware=b300 (JS-rendered; fetch raw markdown at
77-
# github.com/vllm-project/recipes/blob/main/<org>/<model>.md).
78-
# Multimodal on sm_103: the ViT CuTe flash-attn asserts "Only SM 10.x/11.x";
79-
# workaround `--mm-encoder-attn-backend TRITON_ATTN` (may be unneeded on newer builds).
69+
# NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the (multi-arch)
70+
# image tag — the default cu12 build has no sm_103 FP4 kernel, so the deploy
71+
# dies at engine init with CUDA "no kernel image is available". e.g.:
72+
# image: vllm/vllm-openai:v0.19.1-cu130
73+
# If a pinned release predates your model's arch, use the nightly instead
74+
# (Qwen3.5-9B's qwen3_5 needed vllm/vllm-openai:cu130-nightly-x86_64, 0.19.2rc1.dev134).
75+
# Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`
76+
# (ViT flash-attn workaround; drop if the encoder loads without it).
8077
#
8178
# `--served-model-name ${deployment.served_model_name}` (in command below) is
8279
# REQUIRED: without it vLLM registers the model as `/checkpoint` and every eval
