Commit 16113f8
fix(sglang/dsv4/8k1k recipes): set cpus-per-task=144 and mem=0
Slurm assigns 1 CPU per task by default; `scontrol show job 613` from a
running CI job confirmed `NumCPUs=4 NumTasks=4 CPUs/Task=1` with 4
nodes, i.e. one core per worker. The dynamo `hash:` cold source install
rebuilds ~500 Rust crates (kube-client, tonic, hf-hub, image codecs
ravif/exr, the pyo3 stack) and at one core takes 30+ min just for the
cold build, which dominates total CI time even with the new
`/configs/dynamo-wheels` cache (the cache only helps after the first
cold run).
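
The `scontrol` check above can be scripted. This is a minimal sketch, assuming the job record is available as text; the helper names `parse_scontrol_fields` and `cpus_per_task` are hypothetical, and the sample string contains only the three fields quoted in this message.

```python
import re


def parse_scontrol_fields(text: str) -> dict[str, str]:
    """Split `scontrol show job` text into KEY=VALUE pairs.

    Values containing spaces are truncated at the first space, which is
    acceptable for the numeric fields this sketch looks at.
    """
    return dict(re.findall(r"(\S+?)=(\S*)", text))


def cpus_per_task(text: str) -> int:
    """Return the CPUs/Task value reported for the job."""
    return int(parse_scontrol_fields(text)["CPUs/Task"])


# Fields as quoted in the commit message (not a full scontrol record).
sample = "NumCPUs=4 NumTasks=4 CPUs/Task=1"
fields = parse_scontrol_fields(sample)

if cpus_per_task(sample) == 1:
    print("each task has a single core; the cargo build will run serially")
```

In CI, the text would typically come from running `scontrol show job <jobid>` and capturing its stdout.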
Match yangminl's working manual setup on the same gb300-cw cluster
(`/mnt/home/yangminl/srt-slurm/recipes/dsv4-pro/sglang/gb300-fp4/all-dynamo.yaml`)
which sets:
    sbatch_directives:
      cpus-per-task: "144"
      mem: "0"
cargo then gets the full 144-core GB300 host and finishes maturin in a
few minutes; mem=0 hands the worker the entire node's RAM so the
dynamo build + DSV4-Pro 671B FP4 weight load fit without OOM.
1 parent 28d03e8 commit 16113f8
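
For reference, the two `sbatch_directives` keys correspond to the standard `sbatch` options `--cpus-per-task` and `--mem` (where `--mem=0` requests all memory on the node). The sketch below shows one plausible way such a mapping could be rendered into `#SBATCH` header lines; how srt-slurm actually does this is not shown in this commit, so `render_sbatch_directives` is a hypothetical helper.

```python
def render_sbatch_directives(directives: dict[str, str]) -> list[str]:
    """Turn a key/value mapping into `#SBATCH --key=value` lines.

    Assumption: every key is a long-form sbatch option name.
    """
    return [f"#SBATCH --{key}={value}" for key, value in directives.items()]


# Values taken from the recipe change described in this commit.
directives = {"cpus-per-task": "144", "mem": "0"}
lines = render_sbatch_directives(directives)

for line in lines:
    print(line)
```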
2 files changed
Lines changed: 18 additions & 6 deletions
File tree
- benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k
Lines changed: 9 additions & 3 deletions
Diff content not captured. Hunk summary: original lines 37-39 removed; new lines 37-44 and 47 added.
Lines changed: 9 additions & 3 deletions
Diff content not captured. Hunk summary: original lines 37-39 removed; new lines 37-44 and 47 added.