Commit 9ccf032

Edwardf0t1 and claude committed
[skills] gres: defer to #1599's internal/slurm/<cluster>, keep one-line fallback
Reduce the gres guidance to a single fallback note (the slurm/default case), deferring the predefined per-cluster config path to PR #1599 (which pre-fills gres/hostname/partition). Fills the gap #1599's fallback branch leaves (it does not mention gres for the no-internal-package case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent beaad18 commit 9ccf032

2 files changed

Lines changed: 4 additions & 5 deletions

.claude/skills/evaluation/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -200,7 +200,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
 - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
 - **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
 - Ask about other defaults they may want to change (partition, walltime, MLflow tags).
-- **`execution.gres`** — if your NEL install ships an `internal/slurm/<cluster>` execution config, prefer it (it pre-fills `gres`/hostname/partition/node-exclusivity). Otherwise NEL defaults to `gpu:8`; set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`), or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → `gres: gpu:4`; check with `sinfo -o '%P %G'`).
+- **`execution.gres`** — auto-set if you used a predefined `internal/slurm/<cluster>` config (above). On the `slurm/default` fallback it's `gpu:8`, so set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`) or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. 4-GPU GB300 → `gres: gpu:4`; check with `sinfo -o '%P %G'`).
 
 **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
 

.claude/skills/evaluation/recipes/examples/example_eval.yaml

Lines changed: 3 additions & 4 deletions
@@ -50,10 +50,9 @@ execution:
   account: ???
   output_dir: ???
   walltime: "04:00:00"
-  # gres defaults to gpu:8. If your NEL install ships an internal/slurm/<cluster>
-  # execution config, prefer it (it pre-fills gres/hostname/partition). Otherwise
-  # set gres to the node's GPU count or sbatch fails "Requested node configuration
-  # is not available"; match --tensor/--data-parallel-size to it (references/parallelism.md).
+  # gres: a predefined internal/slurm/<cluster> config (see SKILL Step 4) sets this.
+  # On the slurm/default fallback it's gpu:8 — set to the node's GPU count or sbatch
+  # fails "Requested node configuration is not available" (4-GPU GB300 -> gpu:4).
   mounts:
     mount_home: false
   auto_export: # REQUIRED trigger for auto-export. Without this, the
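
Both changed files point at `sinfo -o '%P %G'` as the way to find the node's GPU count on the `slurm/default` fallback. As an illustration only, the sketch below shows one way that output could be turned into a `gres` value. The helper names (`gpus_per_partition`, `gres_value`) and the sample text are hypothetical and not part of NEL or this commit; the parsing assumes GRES entries shaped like `gpu:4` or `gpu:<type>:4`, and taking the smallest count per partition is a design choice made here so the request fits any node.

```python
import re

# Sample text in the shape `sinfo -o '%P %G'` prints (values are made up).
SAMPLE_SINFO = """\
PARTITION GRES
batch* gpu:4
interactive gpu:gb300:4
cpu (null)
"""

# Matches gpu:4, gpu:gb300:4, and gpu:a100:8(S:0-1); captures the count.
GPU_GRES = re.compile(r"gpu(?::[A-Za-z][\w.-]*)?:(\d+)")


def gpus_per_partition(sinfo_output: str) -> dict[str, int]:
    """Map each partition to the smallest per-node GPU count it reports."""
    counts: dict[str, int] = {}
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] == "PARTITION":
            continue  # skip the header row and malformed lines
        partition = fields[0].rstrip("*")  # '*' marks the default partition
        for match in GPU_GRES.finditer(fields[1]):
            n = int(match.group(1))
            # Keep the smallest count so the request fits every node type.
            counts[partition] = min(counts.get(partition, n), n)
    return counts


def gres_value(gpu_count: int) -> str:
    """Format a GPU count as the string used for execution.gres."""
    return f"gpu:{gpu_count}"


counts = gpus_per_partition(SAMPLE_SINFO)
print(counts)
print(gres_value(counts["batch"]))
```

On a real cluster the input would come from running `sinfo -o '%P %G'` and the result would replace the `gpu:8` fallback by hand; partitions without GPUs (GRES `(null)`) are simply left out of the mapping.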
