together_runner/slurm-disagg: portable 2-node SGLang PD-disagg benchmark harness by Johnsonms · Pull Request #2 · togethercomputer/InferenceX

Johnsonms · 2026-06-30T06:20:07Z

together_runner/slurm-disagg: portable 2-node SGLang PD-disagg benchmark harness

What

Adds together_runner/slurm-disagg/, a reproducible harness that deploys a 2-node
prefill/decode-disaggregated SGLang endpoint on a Slurm + enroot/pyxis cluster, transfers
the KV cache across nodes over RDMA, and benchmarks it. This is the productized form of the
ClusterMAX inference-disagg phase-0 readiness proof.

Designed to run unmodified on a new cluster: no node names, IB device lists, or
peermem flags are hardcoded.

How it works

The harness uses one 2-node allocation (salloc --no-shell). Prefill, decode, router, and
bench run as --jobid=$ALLOC --overlap steps inside it, so Slurm picks the nodes. The
partition is auto-selected (the first with ≥2 idle GPU nodes), and the node names and IPs
are persisted to disagg_nodes.env. GPUs per node and TP are deliberately pinned to 8 (B200).
Teardown is a scancel of that one allocation.
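
The allocation model above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual scripts: the function names pick_partition and parse_alloc_id and the RUN_DEMO guard are hypothetical, and the sinfo format string is one assumed way to list idle GPU nodes. Only salloc --no-shell, --jobid/--overlap steps, and scancel come from the description.

```shell
#!/usr/bin/env bash
# Sketch of the single-allocation model. Function names are hypothetical;
# the real scripts may select partitions and parse job IDs differently.

# Pick the first partition reporting >= 2 idle nodes with a GPU gres.
# Expects "partition node_count gres" lines on stdin, e.g. from:
#   sinfo -h -t idle -o '%P %D %G'
pick_partition() {
  awk '$2 >= 2 && $3 ~ /gpu/ { sub(/\*$/, "", $1); print $1; exit }'
}

# Pull the job ID out of salloc's "Granted job allocation <id>" message.
parse_alloc_id() {
  sed -n 's/.*Granted job allocation \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Only touch the cluster when explicitly asked to.
if [ "${RUN_DEMO:-0}" = 1 ]; then
  PARTITION=$(sinfo -h -t idle -o '%P %D %G' | pick_partition)
  # salloc reports the granted job on stderr, hence 2>&1.
  ALLOC=$(salloc --no-shell -N 2 -p "$PARTITION" 2>&1 | parse_alloc_id)
  # Each role (prefill, decode, router, bench) becomes a step in the same allocation.
  srun --jobid="$ALLOC" --overlap -N 2 hostname
  # Teardown: cancel the one allocation.
  scancel "$ALLOC"
fi
```

Because every role is a step inside one job, a single scancel releases both nodes and no per-role job IDs need to be tracked.
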
01_preflight.sh auto-detects and verifies the RDMA/KV path in seconds, before any
multi-minute server start. Every detected value can be overridden:
- IB_DEVICES from nvidia-smi topo -m + /sys link_layer — GPU-ordered, majority-fabric
  filter (InfiniBand drops the Ethernet storage NICs; RoCE keeps Ethernet).
- WITH_NVIDIA_PEERMEM decision: nvidia_peermem present → default path; absent +
  driver ≥535 → 0 (Mooncake dmabuf).
- Verifies IB ports ACTIVE/LinkUp on both nodes, /dev/infiniband bind-mount (libibverbs
  device count == /sys), and dmabuf export; optional PROBE_MOONCAKE=1 register_memory probe.
- enroot import temp directory auto-detected (first node-local non-overlay filesystem).
  ENROOT_DIR and MODELS_ROOT default to $HOME (which must be shared across nodes; preflight checks this).
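
Two of the detection rules listed above lend themselves to a short sketch. The code below is an assumption-laden illustration, not the contents of 01_preflight.sh: the function names are hypothetical, it reads only port 1 of each device, it does not reproduce the GPU ordering derived from nvidia-smi topo -m, and the "unsupported" branch for drivers older than 535 is a guess, since the description does not say what happens there.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the majority-fabric filter and the
# peermem decision; names and edge-case behavior are assumptions.

# Emit "device link_layer" for each RDMA device, read from sysfs (port 1 only).
list_link_layers() {
  local d
  for d in /sys/class/infiniband/*; do
    [ -r "$d/ports/1/link_layer" ] || continue
    printf '%s %s\n' "$(basename "$d")" "$(cat "$d/ports/1/link_layer")"
  done
}

# Keep only devices on the most common link layer, in input order,
# joined with commas. Ties are resolved arbitrarily in this sketch.
majority_fabric_filter() {
  awk '
    { dev[NR] = $1; layer[NR] = $2; count[$2]++ }
    END {
      best = ""
      for (l in count) if (best == "" || count[l] > count[best]) best = l
      out = ""
      for (i = 1; i <= NR; i++)
        if (layer[i] == best) out = out (out == "" ? "" : ",") dev[i]
      print out
    }'
}

# decide_peermem LOADED DRIVER_MAJOR
#   LOADED is 1 if nvidia_peermem is loaded (e.g. lsmod | grep -q '^nvidia_peermem').
#   DRIVER_MAJOR is the major driver version (e.g. from
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader).
# Prints "default" (leave the default path), "0" (Mooncake dmabuf), or
# "unsupported" with a non-zero status (assumed behavior).
decide_peermem() {
  local loaded="$1" driver_major="$2"
  if [ "$loaded" = 1 ]; then
    echo default
  elif [ "$driver_major" -ge 535 ]; then
    echo 0
  else
    echo unsupported
    return 1
  fi
}
```

On a node, `list_link_layers | majority_fabric_filter` would yield a comma-separated candidate for IB_DEVICES; on an InfiniBand fabric the Ethernet storage NICs fall out because they are the minority link layer.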

Run it

cd together_runner/slurm-disagg
bash run_all.sh        # 00_setup -> 01_preflight -> 10_launch -> 20_benchmark
bash teardown.sh

Validation (slinky, Qwen3-32B 1P1D, 1k/1k, allocation model)

0 failed requests across the whole sweep. Auto-detected IB list matched the
hand-derived one exactly.

conc	total tok/s	vs prior baseline
16	2,156	+3%
64	4,614	+0.5%
128	3,298	−27% — see note
256	4,892	−0.1%

conc=128 note: throughput is reproducibly ~3.3–3.4k tok/s, below the conc=64 result. This is
not a code regression (serving args, nodes, and IB list are byte-for-byte unchanged, and
conc 16/64/256 match baseline) and it is not noise (it reproduces). The current reading is a
1P1D prefill/decode interleave artifact at this concurrency; it is documented in
BENCHMARK-RECORD-qwen3-32b-disagg.md and is a candidate to revisit with NP1D scaling or chunked-prefill tuning.
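
As a small aid for spotting this kind of dip in future sweeps, the check "throughput fell although concurrency rose" can be expressed in a few lines. The helper name find_dips is hypothetical and is not part of the harness; the input rows are the four results from the table above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each concurrency level whose total tok/s is
# lower than the previous row's. Expects "conc tok_per_s" lines, sorted by conc.
find_dips() {
  awk 'NR > 1 && $2 + 0 < prev { print $1 } { prev = $2 + 0 }'
}

# The sweep reported in this PR.
printf '%s\n' '16 2156' '64 4614' '128 3298' '256 4892' | find_dips
```

For this sweep the only level flagged is 128.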

Notes for reviewers

- The enroot nvidia-hook sed-patch (00_setup.sh, needs sudo) is unavoidable on this stack:
  on enroot 4.0.1 the system and user hooks.d both run with no basename dedup (so a user hook
  can't override the system one), and pyxis ignores a per-job ENROOT_SYSCONF_PATH redirect. Both
  were verified experimentally. The patch is idempotent and post-patch verified. See PORTABILITY-ANALYSIS.md.
- Standalone harness: it does not touch the CI sweep (run-sweep.yml, benchmark_lib.sh).
- Per-run outputs (results/logs) live outside the repo under $ENROOT_DIR; nothing machine-specific is committed.
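
The "idempotent and post-patch verified" property of the hook patch can be illustrated with a generic pattern. This is only a sketch: the actual sed expression and the hook file patched by 00_setup.sh are not shown in this PR description, so patch_once, its arguments, and its messages are all hypothetical. It assumes GNU sed and strings free of `|` and regex metacharacters.

```shell
#!/usr/bin/env bash
# Hypothetical pattern for an idempotent, verified in-place patch.
# patch_once FILE OLD NEW
patch_once() {
  local file="$1" old="$2" new="$3"
  # Idempotence: if the replacement text is already there, do nothing.
  if grep -qF -- "$new" "$file"; then
    echo "already patched"
    return 0
  fi
  # Fail clearly when the file cannot be written (e.g. a system hook without sudo).
  [ -w "$file" ] || { echo "cannot write $file (re-run with sudo?)" >&2; return 1; }
  # Refuse to continue if the expected original text is missing.
  grep -qF -- "$old" "$file" || { echo "expected text not found in $file" >&2; return 1; }
  sed -i "s|$old|$new|" "$file"
  # Post-patch verification: the new text must now be present.
  grep -qF -- "$new" "$file" || { echo "verification failed for $file" >&2; return 1; }
  echo "patched"
}
```

Running it twice on the same file reports "patched" and then "already patched", which is the behavior that makes re-running a setup script safe.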

Related: ClusterMAX inference-disagg phase-0 readiness.

…ark harness

Productizes the ClusterMAX disagg phase-0 bring-up into a harness that runs unmodified on any Slurm + enroot/pyxis cluster: no node names, IB lists, or peermem flags hardcoded.

Allocation model:
- One 2-node 'salloc --no-shell' allocation; prefill/decode/router/bench run as overlap steps into it, so Slurm picks the nodes. Partition auto-selected (>=2 idle GPU nodes); node names/IPs resolved from the allocation and persisted to disagg_nodes.env. GPUs/node and TP pinned to 8 (B200, on purpose). teardown = scancel the one allocation.

RDMA preflight (01_preflight.sh), all auto-detected, with overrides:
- IB_DEVICES from 'nvidia-smi topo -m' + /sys link_layer (GPU-ordered, majority-fabric filter: IB drops Ethernet storage NICs, RoCE keeps them).
- WITH_NVIDIA_PEERMEM decision: peermem present -> default path; absent + driver>=535 -> 0 (Mooncake dmabuf). Verifies IB ports ACTIVE on both nodes + /dev/infiniband bind-mount + dmabuf export inside the container; optional PROBE_MOONCAKE register_memory probe.
- enroot import temp auto-detected (first node-local non-overlay fs); ENROOT_DIR/MODELS_ROOT default to $HOME (must be cross-node-shared).

Hardening:
- enroot nvidia-hook patch is idempotent + post-patch verified + clear error w/o sudo. User-level hook override and ENROOT_SYSCONF_PATH redirect both proven unworkable on enroot 4.0.1 + pyxis (see PORTABILITY-ANALYSIS.md), so patching the system hook is the only option on this stack.

Validated end-to-end on slinky (Qwen3-32B 1P1D, allocation model): conc 16/64/256 match the 06-29 baseline (2156/4614/4892 tok/s), 0 failed requests across the sweep. conc=128 is a reproducible 1P1D dynamics dip (~3.3k); documented in BENCHMARK-RECORD as a known artifact.

Docs: README, PORTABILITY-ANALYSIS.md (root-cause table + decisions), BENCHMARK-RECORD.

- INVESTIGATE-conc128.md: documents the reproducible conc=128 throughput dip for later follow-up. Records the decisive probe that EXONERATES the refactor (old separate-job and new allocation-overlap-step models both grant 2 CPUs/node, i.e. identical), the likely cause (1P1D dynamics variance), a separate CPU-starvation lever (--exclusive/--cpus-per-task), and the exact OLD-model launch commands so the 06-29 baseline stays reproducible.
- CLAUDE.md: quick debug/ramp-up reference covering architecture, auto-vs-pinned, resolved-env files, hard-won gotchas, debug entry points, current state (PR #2, conc=128 open).

Johnsonms added 2 commits June 30, 2026 06:15

Johnsonms wants to merge 2 commits into main from together-runner-slurm-disagg
