[skills] address PR #1595 second-pass review (cjluo-nv)

Edwardf0t1 · claude · Edwardf0t1 · commit beaad18745b8 · 2026-06-02T23:00:05.000-07:00
- Blackwell image: simplify to 'B300/GB300 -> append -cu130 to the
  (multi-arch) image tag' (e.g. v0.19.1-cu130); keep a one-line nightly
  fallback for archs a pinned release predates (qwen3_5). Applied in eval +
  deployment skills.
- gres: defer to NEL's internal/slurm/<cluster> execution configs (PR #1599)
  when present (they pre-fill gres/hostname/partition); keep the manual
  fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
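
Reviewer note: the two rules this commit describes (image tag and gres) can be written as a small decision sketch. This is only an illustration of the wording, not code from NEL or vLLM. The helper names `pick_vllm_image` and `pick_gres`, their parameters, and the choice to return the unmodified tag for every GPU other than B300/GB300 are assumptions made for the example.

```python
# Hypothetical helpers illustrating the rules in this commit message.
# Nothing here is part of NEL or vLLM; names and parameters are assumptions.
from typing import Optional

REPO = "vllm/vllm-openai"
SM103_GPUS = {"B300", "GB300"}  # the GPUs the commit names for the -cu130 suffix


def pick_vllm_image(
    gpu: str,
    tag: str = "v0.19.1",
    release_has_arch: bool = True,
    host_arch: str = "x86_64",
) -> str:
    """Return an image reference for an NVFP4 deployment.

    release_has_arch: False when the pinned release predates the model's
    architecture (the qwen3_5 case), which triggers the nightly fallback.
    host_arch: "x86_64", or "aarch64" on Grace hosts.
    """
    if gpu.upper() not in SM103_GPUS:
        # Assumption: other GPUs keep the default pinned tag unchanged.
        return f"{REPO}:{tag}"
    if not release_has_arch:
        # Nightly tags are per-architecture, unlike the multi-arch releases.
        return f"{REPO}:cu130-nightly-{host_arch}"
    return f"{REPO}:{tag}-cu130"


def pick_gres(gpus_per_node: int, cluster_config_present: bool = False) -> Optional[str]:
    """Return a manual gres value, or None to defer to a cluster config."""
    if cluster_config_present:
        # An internal/slurm/<cluster> execution config pre-fills gres.
        return None
    return f"gpu:{gpus_per_node}"


if __name__ == "__main__":
    print(pick_vllm_image("GB300"))
    print(pick_vllm_image("B300", release_has_arch=False, host_arch="aarch64"))
    print(pick_gres(4))
```

In practice `gpus_per_node` would come from `sinfo -o '%P %G'` on the target cluster, as the evaluation skill text suggests.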
diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md
@@ -125,17 +125,18 @@ python -m vllm.entrypoints.openai.api_server \
 
 For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
 
-> **NVFP4 on Blackwell needs the CUDA-13 vLLM build.** On B200/B300/GB200/GB300
-> (compute capability sm_100/sm_103), use `vllm/vllm-openai:cu130-nightly-<arch>`
-> (`-x86_64`, or `-aarch64` on Grace). The common `v0.19.1` / any `cu129`
-> (CUDA 12.9) build has **no sm_103 FP4 kernels** — vLLM loads the checkpoint
+> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag**
+> (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The
+> default cu12 build has **no sm_103 FP4 kernel**, so vLLM loads the checkpoint
 > then dies at engine init with `CUDA error: no kernel image is available for
 > execution on the device` (affects the `flashinfer` and `cutlass` NVFP4
-> backends; `marlin` separately fails on non-64-divisible layer dims). Verify the
-> image via `recipes.vllm.ai/<org>/<model>?hardware=b300` (JS-rendered — fetch the
-> raw markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`).
-> For multimodal models on sm_103, also pass `--mm-encoder-attn-backend
-> TRITON_ATTN` (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x").
+> backends; `marlin` separately fails on non-64-divisible layer dims). If a
+> pinned release predates the model's arch, use `cu130-nightly-<arch>` instead
+> (Qwen3.5-9B's `qwen3_5` needed it). Cross-check via
+> `recipes.vllm.ai/<org>/<model>?hardware=b300` (JS-rendered — fetch the raw
+> markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`). For
+> multimodal models on sm_103, also pass `--mm-encoder-attn-backend TRITON_ATTN`
+> (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x").
 
 #### SGLang
 
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
@@ -144,7 +144,7 @@ For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipelin
 
 **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
 
-> **NVFP4 on Blackwell needs a CUDA-13 vLLM build.** On B200/B300/GB200/GB300 (sm_100/sm_103) the pinned `v0.19.1` and all `cu129` builds lack sm_103 FP4 kernels — engine init dies with `CUDA error: no kernel image is available for execution on the device`. Use `vllm/vllm-openai:v0.19.1-cu130` (pinned, matches the default image), and bump to `cu130-nightly-<arch>` only if it lacks the model's arch (Qwen3.5-9B's `qwen3_5` needed the nightly). Multimodal on sm_103 also needs `--mm-encoder-attn-backend TRITON_ATTN` (ViT flash-attn workaround). Full note + the `recipes.vllm.ai ?hardware=b300` lookup are in `recipes/examples/example_eval.yaml`.
+> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag** (e.g. `vllm/vllm-openai:v0.19.1-cu130`; release tags are multi-arch). The default cu12 build has no sm_103 FP4 kernel, so engine init dies with `CUDA error: no kernel image is available for execution on the device`. If a pinned release predates the model's arch, use `cu130-nightly-<arch>` (Qwen3.5-9B's `qwen3_5` needed it, vLLM 0.19.2rc1.dev134). Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`. Full note in `recipes/examples/example_eval.yaml`.
 
 #### vLLM-backend defaults — always include unless the recipe *contradicts*
 
@@ -200,7 +200,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
 - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
 - **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
 - Ask about other defaults they may want to change (partition, walltime, MLflow tags).
-- **`execution.gres`** — NEL defaults to `gpu:8`. Set it to the cluster's per-node GPU count (and what the QOS permits), and match `--data-parallel-size`/`--tensor-parallel-size` to it. A mismatch makes `sbatch` reject the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → set `gres: gpu:4`). Confirm the node GPU count with `sinfo -o '%P %G'` on the target cluster.
+- **`execution.gres`** — if your NEL install ships an `internal/slurm/<cluster>` execution config, prefer it (it pre-fills `gres`/hostname/partition/node-exclusivity). Otherwise NEL defaults to `gpu:8`; set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`), or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → `gres: gpu:4`; check with `sinfo -o '%P %G'`).
 
 **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
 
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -50,9 +50,10 @@ execution:
   account: ???
   output_dir: ???
   walltime: "04:00:00"
-  # gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what
-  # the QOS allows) or sbatch fails "Requested node configuration is not available".
-  # gres: gpu:4   # e.g. GB300 (4 GPUs); match --tensor-parallel-size/--data-parallel-size to it (see references/parallelism.md)
+  # gres defaults to gpu:8. If your NEL install ships an internal/slurm/<cluster>
+  # execution config, prefer it (it pre-fills gres/hostname/partition). Otherwise
+  # set gres to the node's GPU count or sbatch fails "Requested node configuration
+  # is not available"; match --tensor-parallel-size/--data-parallel-size to it (references/parallelism.md).
   mounts:
     mount_home: false
   auto_export:          # REQUIRED trigger for auto-export. Without this, the
@@ -65,18 +66,14 @@ deployment:
   hf_model_handle:
   served_model_name: ???
   image: vllm/vllm-openai:v0.19.1
-  # NVFP4 on Blackwell (B200/B300/GB200/GB300, sm_100/sm_103): the v0.19.1 and
-  # any cu129 builds have NO sm_103 FP4 kernels (deploy dies with CUDA
-  # "no kernel image is available"). Use a CUDA-13 build — prefer the pinned
-  # release matching this image, bump to nightly only if it lacks the arch:
-  #   image: vllm/vllm-openai:v0.19.1-cu130          # reproducible; verify arch support
-  #   image: vllm/vllm-openai:cu130-nightly-x86_64   # newest (-aarch64 on Grace)
-  # (Qwen3.5-9B's qwen3_5 arch needed the nightly = vLLM 0.19.2rc1.dev134; the
-  # pinned v0.19.1-cu130 is untested for it.) Confirm via recipes.vllm.ai
-  # ?hardware=b300 (JS-rendered; fetch raw markdown at
-  # github.com/vllm-project/recipes/blob/main/<org>/<model>.md).
-  # Multimodal on sm_103: the ViT CuTe flash-attn asserts "Only SM 10.x/11.x";
-  # workaround `--mm-encoder-attn-backend TRITON_ATTN` (may be unneeded on newer builds).
+  # NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the (multi-arch)
+  # image tag — the default cu12 build has no sm_103 FP4 kernel, so the deploy
+  # dies at engine init with CUDA "no kernel image is available". e.g.:
+  #   image: vllm/vllm-openai:v0.19.1-cu130
+  # If a pinned release predates your model's arch, use the nightly instead
+  # (Qwen3.5-9B's qwen3_5 needed vllm/vllm-openai:cu130-nightly-x86_64, 0.19.2rc1.dev134).
+  # Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`
+  # (ViT flash-attn workaround; drop if the encoder loads without it).
   #
   # `--served-model-name ${deployment.served_model_name}` (in command below) is
   # REQUIRED: without it vLLM registers the model as `/checkpoint` and every eval