together_runner/slurm-disagg: portable 2-node SGLang PD-disagg benchmark harness#2

Open: Johnsonms wants to merge 2 commits into main from together-runner-slurm-disagg

@Johnsonms

Collaborator

together_runner/slurm-disagg: portable 2-node SGLang PD-disagg benchmark harness

What

Adds together_runner/slurm-disagg/ — a reproducible harness that deploys a 2-node
prefill/decode-disaggregated SGLang endpoint on a Slurm + enroot/pyxis cluster, transfers
KV cross-node over RDMA, and benchmarks it. This is the productized form of the ClusterMAX
inference-disagg phase-0 readiness proof.

Designed to run unmodified on a new cluster — no node names, IB device lists, or
peermem flags hardcoded.

How it works

  • One 2-node allocation (salloc --no-shell); prefill/decode/router/bench run as
    srun --jobid=$ALLOC --overlap steps inside it, so Slurm picks the nodes. The partition is
    auto-selected (first with ≥2 idle GPU nodes); nodes/IPs are persisted to disagg_nodes.env.
    GPUs/node and TP are deliberately pinned to 8 (B200). Teardown = scancel the one allocation.
  • 01_preflight.sh auto-detects + verifies the RDMA/KV path (seconds, before any
    multi-minute server start), all overridable:
    • IB_DEVICES from nvidia-smi topo -m + /sys link_layer — GPU-ordered, majority-fabric
      filter (InfiniBand drops the Ethernet storage NICs; RoCE keeps Ethernet).
    • WITH_NVIDIA_PEERMEM decision: nvidia_peermem present → default path; absent +
      driver ≥535 → 0 (Mooncake dmabuf).
    • Verifies IB ports ACTIVE/LinkUp on both nodes, /dev/infiniband bind-mount (libibverbs
      device count == /sys), and dmabuf export; optional PROBE_MOONCAKE=1 register_memory probe.
    • enroot import temp auto-detected (first node-local non-overlay fs). ENROOT_DIR/MODELS_ROOT
      default to $HOME (must be cross-node-shared; preflight-checked).
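
A minimal sketch of the majority-fabric filter described above, not the actual 01_preflight.sh code. It assumes port 1 of each HCA and reads the link layer from sysfs; the function names `majority_fabric_filter` and `list_hcas` are hypothetical. The GPU ordering that the harness derives from nvidia-smi topo -m is omitted here, so devices come out in input order.

```shell
#!/usr/bin/env bash
# Keep only the HCAs on the majority link layer, preserving input order.
# Input: one "<device> <link_layer>" pair per line on stdin.
majority_fabric_filter() {
  awk '
    NF >= 2 { dev[++n] = $1; layer[n] = $2; count[$2]++ }
    END {
      best = ""
      # Scan in input order so a tie goes to the first layer seen (an assumption).
      for (i = 1; i <= n; i++)
        if (best == "" || count[layer[i]] > count[best]) best = layer[i]
      out = ""
      for (i = 1; i <= n; i++)
        if (layer[i] == best) out = out (out == "" ? "" : ",") dev[i]
      print out
    }'
}

# Hypothetical collector: prints "<device> <link_layer>" for port 1 of every HCA.
list_hcas() {
  local d
  for d in /sys/class/infiniband/*; do
    [ -r "$d/ports/1/link_layer" ] || continue
    printf '%s %s\n' "$(basename "$d")" "$(cat "$d/ports/1/link_layer")"
  done
}

# Usage on a node:
#   IB_DEVICES=$(list_hcas | majority_fabric_filter)
```

On an InfiniBand cluster the Ethernet storage NICs are the minority and drop out; on a RoCE cluster every HCA reports Ethernet, so all are kept.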

Run it

cd together_runner/slurm-disagg
bash run_all.sh        # 00_setup -> 01_preflight -> 10_launch -> 20_benchmark
bash teardown.sh

Validation (slinky, Qwen3-32B 1P1D, 1k/1k, allocation model)

0 failed requests across the whole sweep. Auto-detected IB list matched the
hand-derived one exactly.

| conc | total tok/s | vs prior baseline |
|------|-------------|-------------------|
| 16   | 2,156       | +3%               |
| 64   | 4,614       | +0.5%             |
| 128  | 3,298       | −27% (see note)   |
| 256  | 4,892       | −0.1%             |

conc=128 note: throughput is reproducibly ~3.3–3.4k tok/s (below conc=64). This is not a code
regression (serving args/nodes/IB are byte-for-byte unchanged, and conc 16/64/256 match baseline)
and not noise (it reproduces). Interpreted as a 1P1D prefill/decode interleave artifact at this
concurrency; documented in BENCHMARK-RECORD-qwen3-32b-disagg.md. A candidate to revisit with
NP1D scaling / chunked-prefill tuning.

Notes for reviewers

  • The enroot nvidia-hook sed-patch (00_setup.sh, needs sudo) is unavoidable on this stack:
    on enroot 4.0.1 both system+user hooks.d run with no basename dedup (a user hook can't
    override the system one), and pyxis ignores a per-job ENROOT_SYSCONF_PATH redirect — both
    verified experimentally. The patch is idempotent + post-patch verified. See PORTABILITY-ANALYSIS.md.
  • Standalone harness — does not touch the CI sweep (run-sweep.yml, benchmark_lib.sh).
  • Per-run outputs (results/logs) live outside the repo under $ENROOT_DIR; nothing machine-specific is committed.

Related: ClusterMAX inference-disagg phase-0 readiness.

…ark harness

Productizes the ClusterMAX disagg phase-0 bring-up into a harness that runs unmodified
on any Slurm + enroot/pyxis cluster — no node names, IB lists, or peermem flags hardcoded.

Allocation model:
- One 2-node 'salloc --no-shell' allocation; prefill/decode/router/bench run as overlap
  steps into it, so Slurm picks the nodes. Partition auto-selected (>=2 idle GPU nodes);
  node names/IPs resolved from the allocation and persisted to disagg_nodes.env. GPUs/node
  and TP pinned to 8 (B200, on purpose). teardown = scancel the one allocation.
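
The allocation model above can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness code: `pick_partition` is a hypothetical helper, GPU nodes are assumed to advertise a GRES containing "gpu", and the commented commands assume GPUs are requested via --gres.

```shell
#!/usr/bin/env bash
# Print the first partition (in listing order) with at least two idle GPU nodes.
# Input: "<partition> <node> <gres>" lines, e.g. from
#   sinfo -h -N -t idle -o '%P %N %G'
pick_partition() {
  awk '
    $3 ~ /gpu/ {
      p = $1; sub(/\*$/, "", p)   # sinfo marks the default partition with "*"
      if (!(p in seen)) { seen[p] = 1; order[++n] = p }
      nodes[p]++
    }
    END {
      for (i = 1; i <= n; i++)
        if (nodes[order[i]] >= 2) { print order[i]; exit 0 }
      exit 1                      # no partition qualifies
    }'
}

# Usage sketch (needs a live Slurm cluster; values are illustrative):
#   PARTITION=$(sinfo -h -N -t idle -o '%P %N %G' | pick_partition) || exit 1
#   ALLOC=$(salloc --no-shell -N 2 -p "$PARTITION" --gres=gpu:8 2>&1 \
#           | awk '/Granted job allocation/ {print $NF}')
#   scontrol show hostnames "$(squeue -h -j "$ALLOC" -o %N)"   # nodes Slurm picked
#   srun --jobid="$ALLOC" --overlap -N 1 hostname              # one step inside it
#   scancel "$ALLOC"                                           # teardown
```

Because every component is a step inside one allocation, a single scancel releases both nodes and stops all steps together.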

RDMA preflight (01_preflight.sh), all auto-detected, with overrides:
- IB_DEVICES from 'nvidia-smi topo -m' + /sys link_layer (GPU-ordered, majority-fabric
  filter: IB drops Ethernet storage NICs, RoCE keeps them).
- WITH_NVIDIA_PEERMEM decision: peermem present -> default path; absent + driver>=535 ->
  0 (Mooncake dmabuf). Verifies IB ports ACTIVE on both nodes + /dev/infiniband bind-mount
  + dmabuf export inside the container; optional PROBE_MOONCAKE register_memory probe.
- enroot import temp auto-detected (first node-local non-overlay fs); ENROOT_DIR/MODELS_ROOT
  default to $HOME (must be cross-node-shared).
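
As a sketch of the WITH_NVIDIA_PEERMEM decision above: `peermem_mode` is a hypothetical helper, the label "default" stands in for whatever the harness treats as the default path, and failing when the module is absent on a driver older than 535 is an assumption (the description does not say what happens in that case).

```shell
#!/usr/bin/env bash
# peermem_mode <module_present: 0|1> <driver_version>
# Prints "default" when nvidia_peermem is loaded, "0" when it is absent but the
# driver major version is >= 535 (dmabuf path); otherwise fails.
peermem_mode() {
  local module_present=$1 driver_version=$2
  local major=${driver_version%%.*}   # "570.86.10" -> "570"
  if [ "$module_present" = 1 ]; then
    echo default
  elif [ "$major" -ge 535 ] 2>/dev/null; then
    echo 0
  else
    echo "no nvidia_peermem and driver $driver_version is older than 535" >&2
    return 1
  fi
}

# Usage on a GPU node:
#   present=0; [ -d /sys/module/nvidia_peermem ] && present=1
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   mode=$(peermem_mode "$present" "$driver") || exit 1
```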

Hardening:
- enroot nvidia-hook patch is idempotent + post-patch verified + clear error w/o sudo.
  User-level hook override and ENROOT_SYSCONF_PATH redirect both proven unworkable on
  enroot 4.0.1 + pyxis (see PORTABILITY-ANALYSIS.md) — patching the system hook is the
  only option on this stack.
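
The "idempotent + post-patch verified" pattern can be sketched generically as below. The real hook path and the strings that 00_setup.sh rewrites are not shown here, so `patch_hook` and its arguments are hypothetical placeholders; the sketch also assumes the old string contains no sed metacharacters or "|".

```shell
#!/usr/bin/env bash
# patch_hook <file> <old> <new>
# Replaces <old> with <new> once per line; re-running is a no-op.
patch_hook() {
  local file=$1 old=$2 new=$3 tmp
  if grep -qF -- "$new" "$file"; then
    echo "already patched: $file"      # idempotent: nothing to do
    return 0
  fi
  grep -qF -- "$old" "$file" || { echo "pattern not found in $file" >&2; return 1; }
  tmp=$(mktemp) || return 1
  # Write back via cat so the file keeps its owner and permissions.
  sed "s|$old|$new|" "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
  # Post-patch verification: fail loudly if the new text is not in place.
  grep -qF -- "$new" "$file" || { echo "post-patch verification failed: $file" >&2; return 1; }
  echo "patched: $file"
}
```

For a system-owned hook the function would have to run with elevated privileges, which is where the clear error without sudo applies.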

Validated end-to-end on slinky (Qwen3-32B 1P1D, allocation model): conc 16/64/256 match the
06-29 baseline (2156/4614/4892 tok/s), 0 failed requests across the sweep. conc=128 is a
reproducible 1P1D dynamics dip (~3.3k); documented in BENCHMARK-RECORD as a known artifact.

Docs: README, PORTABILITY-ANALYSIS.md (root-cause table + decisions), BENCHMARK-RECORD.
- INVESTIGATE-conc128.md: documents the reproducible conc=128 throughput dip for later
  follow-up. Records the decisive probe that EXONERATES the refactor (old separate-job and
  new allocation-overlap-step models both grant 2 CPUs/node — identical), the likely cause
  (1P1D dynamics variance), a separate CPU-starvation lever (--exclusive/--cpus-per-task),
  and the exact OLD-model launch commands so the 06-29 baseline stays reproducible.
- CLAUDE.md: quick debug/ramp-up reference — architecture, auto-vs-pinned, resolved-env
  files, hard-won gotchas, debug entry points, current state (PR #2, conc=128 open).