Commit 16113f8

fix(sglang/dsv4/8k1k recipes): set cpus-per-task=144 and mem=0
slurm assigns 1 CPU/task by default; `scontrol show job 613` from a running CI job confirmed `NumCPUs=4 NumTasks=4 CPUs/Task=1` with 4 nodes, i.e. one core per worker. The dynamo `hash:` cold source install rebuilds ~500 rust crates (kube-client, tonic, hf-hub, image codecs ravif/exr, the pyo3 stack) and at one core takes 30+ min just for the cold build, which dominates total CI time even with the new `/configs/dynamo-wheels` cache (the cache only helps after the first cold run).

Match yangminl's working manual setup on the same gb300-cw cluster (`/mnt/home/yangminl/srt-slurm/recipes/dsv4-pro/sglang/gb300-fp4/all-dynamo.yaml`), which sets:

    sbatch_directives:
      cpus-per-task: "144"
      mem: "0"

cargo then gets the full 144-core GB300 host and finishes maturin in a few minutes; mem=0 hands the worker the entire node's RAM so the dynamo build + DSV4-Pro 671B FP4 weight load fit without OOM.
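The `scontrol show job` check described above can be scripted so a CI step fails fast when a job is still running with one core per task. The following is a minimal sketch, not part of this commit: the function names and the `SAMPLE` string are hypothetical, and the sample contains only the three fields quoted in the commit message rather than full `scontrol` output.

```python
import re

# Fields quoted in the commit message; real `scontrol show job` output
# contains many more KEY=VALUE tokens.
SAMPLE = "NumCPUs=4 NumTasks=4 CPUs/Task=1"


def parse_scontrol_fields(text: str) -> dict[str, str]:
    """Split whitespace-separated KEY=VALUE tokens into a dict.

    Keys may contain a slash (for example CPUs/Task). The key stops at
    the first '=', so a value may itself contain '='.
    """
    return dict(re.findall(r"(\S+?)=(\S*)", text))


def cpus_per_task(text: str) -> int:
    """Return the CPUs/Task value reported for the job."""
    return int(parse_scontrol_fields(text)["CPUs/Task"])


def has_enough_cpus(text: str, minimum: int) -> bool:
    """True when the job was granted at least `minimum` CPUs per task."""
    return cpus_per_task(text) >= minimum
```

With the sample above, `has_enough_cpus(SAMPLE, 144)` is false, which is the condition this commit fixes. In a real job the text would presumably come from running `scontrol show job` against the job's own ID.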
1 parent 28d03e8 commit 16113f8

2 files changed

Lines changed: 18 additions & 6 deletions

benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-2p1d-dep4-dep8.yaml

Lines changed: 9 additions & 3 deletions
@@ -34,11 +34,17 @@ dynamo:
 slurm:
   time_limit: "8:00:00"
 
-# Without cpus-per-task slurm gives 1 CPU/task; the dynamo cold source
-# build (~500 rust crates including ravif/exr/zip) is otherwise serial
-# and takes 30+ min. Match yangminl's all-dynamo.yaml which uses 144.
+# Match yangminl's working all-dynamo.yaml on the same gb300-cw cluster:
+# cpus-per-task=144 — without this slurm hands out 1 CPU/task, which
+# turns the dynamo `hash:` cold source build (~500 rust crates,
+# ravif/exr/zip/pyo3 stack) into a 30+ min serial compile. With 144
+# cargo finishes in ~5 min.
+# mem=0 — slurm's "give the whole node's memory"; needed
+# for sglang loading 671B FP4 weights + dynamo build at the same
+# time without OOM.
 sbatch_directives:
   cpus-per-task: "144"
+  mem: "0"
 
 health_check:
   max_attempts: 1440

benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-7p1d-dep4-dep8.yaml

Lines changed: 9 additions & 3 deletions
@@ -34,11 +34,17 @@ dynamo:
 slurm:
   time_limit: "8:00:00"
 
-# Without cpus-per-task slurm gives 1 CPU/task; the dynamo cold source
-# build (~500 rust crates including ravif/exr/zip) is otherwise serial
-# and takes 30+ min. Match yangminl's all-dynamo.yaml which uses 144.
+# Match yangminl's working all-dynamo.yaml on the same gb300-cw cluster:
+# cpus-per-task=144 — without this slurm hands out 1 CPU/task, which
+# turns the dynamo `hash:` cold source build (~500 rust crates,
+# ravif/exr/zip/pyo3 stack) into a 30+ min serial compile. With 144
+# cargo finishes in ~5 min.
+# mem=0 — slurm's "give the whole node's memory"; needed
+# for sglang loading 671B FP4 weights + dynamo build at the same
+# time without OOM.
 sbatch_directives:
   cpus-per-task: "144"
+  mem: "0"
 
 health_check:
   max_attempts: 1440
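
For readers unfamiliar with how a `sbatch_directives` mapping reaches Slurm, the sketch below shows one plausible translation into `#SBATCH` header lines. This is an assumption for illustration: the commit does not show how srt-slurm renders the mapping, and `render_sbatch_directives` is a hypothetical helper. The resulting option spellings (`--cpus-per-task=144`, `--mem=0`) are standard `sbatch` options.

```python
def render_sbatch_directives(directives: dict[str, str]) -> list[str]:
    """Render a {option: value} mapping as '#SBATCH --option=value' lines.

    Hypothetical helper; insertion order of the mapping is preserved.
    """
    return [f"#SBATCH --{name}={value}" for name, value in directives.items()]


# The two directives set by this commit, as they appear in the recipes.
DIRECTIVES = {"cpus-per-task": "144", "mem": "0"}
HEADER_LINES = render_sbatch_directives(DIRECTIVES)
```

Here `HEADER_LINES` holds `#SBATCH --cpus-per-task=144` and `#SBATCH --mem=0`; in Slurm, a memory request of zero asks for all of the memory on each allocated node.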
