together_runner/slurm-disagg: portable 2-node SGLang PD-disagg benchmark harness #2
Open
Johnsonms wants to merge 2 commits into
…ark harness

Productizes the ClusterMAX disagg phase-0 bring-up into a harness that runs unmodified on any Slurm + enroot/pyxis cluster: no node names, IB lists, or peermem flags hardcoded.

Allocation model:
- One 2-node 'salloc --no-shell' allocation; prefill/decode/router/bench run as overlap steps into it, so Slurm picks the nodes. Partition auto-selected (>=2 idle GPU nodes); node names/IPs resolved from the allocation and persisted to disagg_nodes.env. GPUs/node and TP pinned to 8 (B200, on purpose). Teardown = scancel the one allocation.

RDMA preflight (01_preflight.sh), all auto-detected, with overrides:
- IB_DEVICES from 'nvidia-smi topo -m' + /sys link_layer (GPU-ordered, majority-fabric filter: IB drops Ethernet storage NICs, RoCE keeps them).
- WITH_NVIDIA_PEERMEM decision: peermem present -> default path; absent + driver>=535 -> 0 (Mooncake dmabuf). Verifies IB ports ACTIVE on both nodes + /dev/infiniband bind-mount + dmabuf export inside the container; optional PROBE_MOONCAKE register_memory probe.
- enroot import temp auto-detected (first node-local non-overlay fs); ENROOT_DIR/MODELS_ROOT default to $HOME (must be cross-node-shared).

Hardening:
- enroot nvidia-hook patch is idempotent + post-patch verified + clear error w/o sudo. User-level hook override and ENROOT_SYSCONF_PATH redirect both proven unworkable on enroot 4.0.1 + pyxis (see PORTABILITY-ANALYSIS.md); patching the system hook is the only option on this stack.

Validated end-to-end on slinky (Qwen3-32B 1P1D, allocation model): conc 16/64/256 match the 06-29 baseline (2156/4614/4892 tok/s), 0 failed requests across the sweep. conc=128 is a reproducible 1P1D dynamics dip (~3.3k); documented in BENCHMARK-RECORD as a known artifact.

Docs: README, PORTABILITY-ANALYSIS.md (root-cause table + decisions), BENCHMARK-RECORD.
- INVESTIGATE-conc128.md: documents the reproducible conc=128 throughput dip for later follow-up. Records the decisive probe that EXONERATES the refactor (old separate-job and new allocation-overlap-step models both grant 2 CPUs/node, i.e. identical), the likely cause (1P1D dynamics variance), a separate CPU-starvation lever (--exclusive/--cpus-per-task), and the exact OLD-model launch commands so the 06-29 baseline stays reproducible.
- CLAUDE.md: quick debug/ramp-up reference covering architecture, auto-vs-pinned, resolved-env files, hard-won gotchas, debug entry points, and current state (PR #2, conc=128 open).
together_runner/slurm-disagg: portable 2-node SGLang PD-disagg benchmark harness
What
Adds `together_runner/slurm-disagg/`, a reproducible harness that deploys a 2-node prefill/decode-disaggregated SGLang endpoint on a Slurm + enroot/pyxis cluster, transfers
KV cross-node over RDMA, and benchmarks it. This is the productized form of the ClusterMAX
inference-disagg phase-0 readiness proof.
Designed to run unmodified on a new cluster — no node names, IB device lists, or
peermem flags hardcoded.
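As one illustration of deciding a flag at run time instead of hardcoding it, the peermem rule stated in the commit message (`nvidia_peermem` present -> default path; absent with driver >= 535 -> `0`, the Mooncake dmabuf path) can be sketched as below. This is a minimal sketch, not the code in `01_preflight.sh`: the function name `decide_peermem` and its arguments are hypothetical, and the failure branch for older drivers is an assumption, since the PR text does not say what happens in that case.

```shell
# Sketch of the WITH_NVIDIA_PEERMEM decision rule (hypothetical helper).
#   $1 = 1 if the nvidia_peermem kernel module is loaded, else 0
#   $2 = NVIDIA driver version string, e.g. "570.124.06"
# Prints "default" (leave the flag at its default) or "0" (dmabuf path).
decide_peermem() {
  local peermem_loaded=$1 driver_version=$2
  local driver_major=${driver_version%%.*}   # "570.124.06" -> "570"

  if [ "$peermem_loaded" = 1 ]; then
    echo default
  elif [ "$driver_major" -ge 535 ] 2>/dev/null; then
    echo 0
  else
    # Assumption: the PR does not describe this case, so fail loudly.
    echo "error: no nvidia_peermem and driver $driver_version < 535" >&2
    return 1
  fi
}

# On a live node the inputs could be gathered like this (not executed here):
#   loaded=0; grep -q '^nvidia_peermem ' /proc/modules && loaded=1
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   decide_peermem "$loaded" "$driver"
```

Keeping the rule in a pure function of two inputs makes it easy to override (skip the call when the variable is already set) and to check without GPU hardware.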
How it works
- One 2-node allocation (`salloc --no-shell`); prefill/decode/router/bench run as `--jobid=$ALLOC --overlap` steps into it, so Slurm picks the nodes. Partition auto-selected (first with ≥2 idle GPU nodes); nodes/IPs persisted to `disagg_nodes.env`. GPUs/node and TP are pinned to 8 (B200, deliberate). Teardown = `scancel` the one allocation.
- `01_preflight.sh` auto-detects + verifies the RDMA/KV path (seconds, before any multi-minute server start), all overridable:
  - `IB_DEVICES` from `nvidia-smi topo -m` + `/sys` link_layer: GPU-ordered, majority-fabric filter (InfiniBand drops the Ethernet storage NICs; RoCE keeps Ethernet).
  - `WITH_NVIDIA_PEERMEM` decision: `nvidia_peermem` present → default path; absent + driver ≥535 → `0` (Mooncake dmabuf).
  - Verifies IB ports ACTIVE on both nodes, the `/dev/infiniband` bind-mount (`libibverbs` device count == `/sys`), and dmabuf export; optional `PROBE_MOONCAKE=1` register_memory probe.
  - enroot import temp auto-detected (first node-local non-overlay fs); `ENROOT_DIR`/`MODELS_ROOT` default to `$HOME` (must be cross-node-shared; preflight-checked).
Run it
Validation (slinky, Qwen3-32B 1P1D, 1k/1k, allocation model)
0 failed requests across the whole sweep. Auto-detected IB list matched the
hand-derived one exactly.
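The majority-fabric filter behind the auto-detected IB list can be illustrated with a small sketch. It assumes the device/link-layer pairs have already been collected in GPU order (that ordering step, derived from `nvidia-smi topo -m`, is not shown), and the function name `majority_fabric_filter` is hypothetical rather than taken from the harness. Ties between fabrics are not handled specially here.

```shell
# Reads "<device> <link_layer>" pairs on stdin (assumed already GPU-ordered)
# and prints a comma-separated list of the devices on the most common
# link layer, preserving input order. On a tie, which layer wins is
# unspecified (awk array iteration order).
majority_fabric_filter() {
  awk '
    NF >= 2 { dev[++n] = $1; layer[n] = $2; count[$2]++ }
    END {
      best = ""
      for (l in count)
        if (best == "" || count[l] > count[best]) best = l
      sep = ""
      for (i = 1; i <= n; i++)
        if (layer[i] == best) { printf "%s%s", sep, dev[i]; sep = "," }
      if (n) print ""
    }'
}

# On a live node the pairs could come from sysfs (port 1 is an assumption):
#   for d in /sys/class/infiniband/*; do
#     echo "$(basename "$d") $(cat "$d/ports/1/link_layer")"
#   done | majority_fabric_filter
```

For example, feeding it `mlx5_0 InfiniBand`, `mlx5_1 InfiniBand`, `mlx5_2 Ethernet`, `mlx5_3 InfiniBand` prints `mlx5_0,mlx5_1,mlx5_3`: the lone Ethernet NIC is dropped. On a RoCE cluster where every NIC reports `Ethernet`, all of them are kept.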
conc=128 note: reproducibly ~3.3–3.4k (below conc=64) — not a code regression (serving
args/nodes/IB byte-for-byte unchanged; conc 16/64/256 match baseline) and not noise (reproduces).
Read as a 1P1D prefill/decode interleave artifact at this concurrency; documented in
`BENCHMARK-RECORD-qwen3-32b-disagg.md`. A candidate to revisit with NP1D scaling / chunked-prefill tuning.
Notes for reviewers
- The enroot nvidia-hook patch (`00_setup.sh`, needs sudo) is unavoidable on this stack: on enroot 4.0.1 both system+user `hooks.d` run with no basename dedup (a user hook can't override the system one), and pyxis ignores a per-job `ENROOT_SYSCONF_PATH` redirect; both verified experimentally. The patch is idempotent + post-patch verified. See `PORTABILITY-ANALYSIS.md`.
- … (`run-sweep.yml`, `benchmark_lib.sh`).
- … `$ENROOT_DIR`; nothing machine-specific is committed.
Related: ClusterMAX inference-disagg phase-0 readiness.
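The "idempotent + post-patch verified" property claimed for the hook patch can be illustrated generically. The sketch below is not the patch in `00_setup.sh`, whose actual change to the enroot hook is not shown in this PR text. It assumes a hypothetical helper `patch_once` that appends a line carrying a marker string, skips the write when the marker is already present, and confirms the marker afterwards.

```shell
# Hypothetical helper: append LINE to FILE unless MARKER is already present,
# then verify the marker is there. LINE is expected to contain MARKER.
patch_once() {
  local file=$1 marker=$2 line=$3

  # Idempotence: a second run finds the marker and changes nothing.
  if grep -qF -- "$marker" "$file" 2>/dev/null; then
    return 0
  fi

  # Clear error instead of a half-applied patch when we lack write access.
  if [ ! -w "$file" ]; then
    echo "error: cannot write $file (missing, or needs sudo)" >&2
    return 1
  fi

  printf '%s\n' "$line" >> "$file"

  # Post-patch verification: succeed only if the marker is now present.
  grep -qF -- "$marker" "$file"
}
```

Checking for the marker before checking write access means an already-patched system file passes without sudo, so repeated setup runs stay quiet.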