[NOT RATE MATCHED]Add NVFP4 WideEP disaggregated DEP8/DEP16/DEP32 recipes for Qwen3.5-397B-A17B by xiaoweiw-nv · Pull Request #167 · NVIDIA/srt-slurm

xiaoweiw-nv · 2026-05-20T02:42:01Z

Summary

Adds reproducible recipes for Qwen3.5-397B-A17B-NVFP4 disaggregated benchmarks on GB200 (aws-dfw). Covers three decode scale points (DEP8/DEP16/DEP32) with DEP4 prefill workers,
mooncake KV transfer, DeepEP low-latency MoE all-to-all, and dp-attention. One recipe per decode scale, full concurrency sweep in each, so anyone can reproduce our perf numbers with
srtctl run <recipe.yaml>.
  
Files
  
recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/
  dep4-dep8.yaml      # 2P1D, 4 nodes (2 prefill + 2 decode), cc 1..2048
  dep4-dep16.yaml     # 4P1D, 8 nodes (4 prefill + 4 decode), cc 1..4096
  dep4-dep32.yaml     # 8P1D, 16 nodes (8 prefill + 8 decode), cc 1..8192
  configs/qwen35-nvfp4-wideep-setup.sh   # setup script

Topology (per recipe)

┌────────────┬─────────────────────────────┬────────────────┬───────┬──────┐
│   Recipe   │ Prefill workers (DEP4 each) │ Decode worker  │ Nodes │ GPUs │
├────────────┼─────────────────────────────┼────────────────┼───────┼──────┤
│ dep4-dep8  │ 2 × 4 GPU                   │ DEP8 / 2 node  │ 4     │ 16   │
├────────────┼─────────────────────────────┼────────────────┼───────┼──────┤
│ dep4-dep16 │ 4 × 4 GPU                   │ DEP16 / 4 node │ 8     │ 32   │
├────────────┼─────────────────────────────┼────────────────┼───────┼──────┤
│ dep4-dep32 │ 8 × 4 GPU                   │ DEP32 / 8 node │ 16    │ 64   │
└────────────┴─────────────────────────────┴────────────────┴───────┴──────┘
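
The node and GPU columns follow from the worker counts. A minimal sketch of that arithmetic, assuming 4 GPUs per node (inferred from "DEP8 / 2 node" in the table) and that every worker occupies whole nodes; `Topology` and `RECIPES` are illustrative names, not part of srtctl:

```python
from dataclasses import dataclass

GPUS_PER_NODE = 4  # assumption: inferred from "DEP8 / 2 node" in the table


@dataclass(frozen=True)
class Topology:
    """Illustrative model of one recipe: N prefill workers plus one decode worker."""
    prefill_workers: int
    decode_dep: int       # GPUs in the single decode worker (DEP8/16/32)
    prefill_dep: int = 4  # GPUs per prefill worker (DEP4)

    @property
    def gpus(self) -> int:
        return self.prefill_workers * self.prefill_dep + self.decode_dep

    @property
    def nodes(self) -> int:
        # Assumes DEP sizes are multiples of GPUS_PER_NODE, as in the table.
        prefill_nodes = self.prefill_workers * (self.prefill_dep // GPUS_PER_NODE)
        decode_nodes = self.decode_dep // GPUS_PER_NODE
        return prefill_nodes + decode_nodes


RECIPES = {
    "dep4-dep8": Topology(prefill_workers=2, decode_dep=8),
    "dep4-dep16": Topology(prefill_workers=4, decode_dep=16),
    "dep4-dep32": Topology(prefill_workers=8, decode_dep=32),
}

if __name__ == "__main__":
    for name, topo in RECIPES.items():
        print(f"{name}: {topo.nodes} nodes, {topo.gpus} GPUs")
```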

Backend (common)
- Frontend: dynamo (multi-prefill orchestration)
- Prefill: attention-backend=trtllm_mha, moe-runner-backend=flashinfer_trtllm, no DeepEP, disable-cuda-graph=true, chunked-prefill-size=65536, context-length=2020
- Decode: attention-backend=trtllm_mha, moe-runner-backend=flashinfer_cutedsl, moe-a2a-backend=deepep (low_latency), cuda-graph-max-bs=512, enable-dp-attention=true,
prefill-round-robin-balance=true
- KV transfer: mooncake (default), SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True
- Model: qwen3.5-nvfp4 alias; container: dev alias (docker://lmsysorg/sglang:dev-cu13)

Notable knobs / known issues

1. prefill.disable-flashinfer-autotune: true — speeds prefill worker boot when cache is primed (autotune redundant on subsequent runs).
2. max-running-requests / max-mamba-cache-size: dep8 capped at 2048 (top of sweep); dep16/dep32 at 8192 to handle cc=4096/8192 with headroom. SGLang divides these by
data-parallel-size for per-DP slot counts, so per-rank mamba budget stays bounded.
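
As a rough illustration of the per-rank budget in item 2, the sketch below divides each cap by an assumed data-parallel-size. It assumes the decode data-parallel-size equals the DEP degree (8/16/32) and that the division floors; neither is confirmed here, and `per_dp_slots` is a hypothetical helper, not an SGLang function:

```python
def per_dp_slots(max_running_requests: int, data_parallel_size: int) -> int:
    """Per-DP-rank slot count, assuming a plain floor division."""
    if data_parallel_size <= 0:
        raise ValueError("data_parallel_size must be positive")
    return max_running_requests // data_parallel_size


# (cap from item 2, assumed data-parallel-size = decode DEP degree)
CAPS = {
    "dep8": (2048, 8),
    "dep16": (8192, 16),
    "dep32": (8192, 32),
}

if __name__ == "__main__":
    for name, (cap, dp_size) in CAPS.items():
        print(f"{name}: {per_dp_slots(cap, dp_size)} slots per DP rank")
```

Under these assumptions the per-rank budget is 256 slots for dep8 and dep32, and 512 for dep16.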

Scaling observations (motivating future work, not blocking this PR)

DEP8 and DEP16 scale near-linearly to peak efficiency (~92-93%) at cc=2048-4096. DEP32 saturates at ~63% efficiency at cc=4096 and drops to ~40% at cc=8192 despite 2× the GPUs of
DEP16. Investigation still ongoing, likely unrelated to prefill or decode compute. This is mentioned for context; the recipes faithfully reproduce the current state of the stack.

Test plan

- srtctl run recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/dep4-dep8.yaml completes full sweep on GB200
- srtctl run recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/dep4-dep16.yaml completes full sweep on GB200
- srtctl run recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/dep4-dep32.yaml completes full sweep on GB200
- gen_throughput.csv and results_concurrency_*.json produced under logs/
- Numbers within ±5% of the baseline table above
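
The ±5% criterion could be checked mechanically. A minimal sketch, assuming throughput has already been loaded into `{concurrency: tok/s}` dicts; the layout of gen_throughput.csv is not shown in this PR, so parsing is left out, and both helper names are hypothetical:

```python
def within_tolerance(measured: float, baseline: float, rel_tol: float = 0.05) -> bool:
    """True if measured is within rel_tol (default ±5%) of baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return abs(measured - baseline) <= rel_tol * abs(baseline)


def failing_points(measured: dict, baseline: dict, rel_tol: float = 0.05) -> list:
    """Return the concurrencies that are missing or outside the tolerance."""
    bad = []
    for cc, base in sorted(baseline.items()):
        value = measured.get(cc)
        if value is None or not within_tolerance(value, base, rel_tol):
            bad.append(cc)
    return bad
```

For example, with made-up numbers, `failing_points({1: 104.0, 2: 120.0}, {1: 100.0, 2: 100.0, 4: 100.0})` flags concurrency 2 (out of tolerance) and 4 (missing).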

Drops misleading pr-main prefix; name now reflects workload scope (Qwen3.5 NVFP4 WideEP). Updates setup_script reference in the three dep4-dep{8,16,32} disagg recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

One DEP4 prefill worker sustains ~89K tok/s input at ISL=OSL=1000; only cc>=2048 in DEP16/DEP32 exceeds that and needs a second prefill worker. Split DEP16/DEP32 into lowcc (1pw, cc=1..1024) and highcc (2pw, cc=2048..N) variants. DEP8 stays single-config with 1 prefill worker (top out ~50K tok/s, never saturates one worker).

Saves nodes on low-cc sweeps:
- DEP8: 4 -> 3 nodes
- DEP16-lowcc: 8 -> 5 nodes
- DEP16-highcc: 8 -> 6 nodes
- DEP32-lowcc: 16 -> 9 nodes
- DEP32-highcc: 16 -> 10 nodes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
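
The sizing rule in this commit message can be sketched as follows. The ~89K tok/s capacity comes from the message itself; the one-node-per-prefill-worker and 4-GPUs-per-node figures are assumptions inferred from the node counts, and both function names are hypothetical:

```python
import math

PREFILL_WORKER_INPUT_TOK_S = 89_000  # ~89K tok/s per DEP4 worker at ISL=OSL=1000


def prefill_workers_needed(input_tok_s: float) -> int:
    """Smallest number of DEP4 prefill workers that covers the input rate."""
    return max(1, math.ceil(input_tok_s / PREFILL_WORKER_INPUT_TOK_S))


def total_nodes(prefill_workers: int, decode_dep: int, gpus_per_node: int = 4) -> int:
    """Nodes used, assuming each DEP4 prefill worker fills exactly one node."""
    return prefill_workers + decode_dep // gpus_per_node


if __name__ == "__main__":
    # DEP8 tops out around 50K tok/s, so one prefill worker is enough.
    print(prefill_workers_needed(50_000))
    for label, pw, dep in [("DEP8", 1, 8), ("DEP16-lowcc", 1, 16),
                           ("DEP16-highcc", 2, 16), ("DEP32-lowcc", 1, 32),
                           ("DEP32-highcc", 2, 32)]:
        print(f"{label}: {total_nodes(pw, dep)} nodes")
```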

samuellees · 2026-05-21T03:03:31Z

            f.write(f'jetstream {{ store_dir: "{nats_store_dir}" }}\n')
        logger.info("Starting NATS server (max_payload: %dMB)...", max_payload_mb)
-        cmd = [binary_path, "-c", nats_config_path]
+        cmd = ["taskset", "-c", "140-143", binary_path, "-c", nats_config_path]  # OMC_CPU_PIN_PATCH_APPLIED


Why do we need to change this, please?

This pins ETCD to dedicated CPU cores; otherwise it easily runs into timeouts. ETCD has a heartbeat mechanism: if the CPU is too busy (JIT/warmup) and ETCD misses the heartbeat window, it exits with an error. Pinning ETCD to specific CPUs prevents that CPU starvation.

cc @ishandhanani does this change look good to you? Can merge if you are OK with the change in src/srtctl/cli/setup_head.py

Another workaround, if you don't want the code change here, is to set a longer ETCD heartbeat interval.

Can we possibly make this tunable? We should not hardcode it.

Added an etcd_cpu_affinity field; users can specify the CPU affinity with:

infra:
  etcd_cpu_affinity: "140-143"
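
One way such a field could be applied is to prepend `taskset -c <cpus>` to the launch command only when an affinity is configured, mirroring the prefix shown in the diff hunk above. This is a sketch under that assumption, not the code in src/srtctl/cli/setup_head.py; `with_cpu_affinity` is a hypothetical helper:

```python
from typing import Optional, Sequence


def with_cpu_affinity(cmd: Sequence[str], cpu_affinity: Optional[str]) -> list:
    """Prefix cmd with `taskset -c <cpu_affinity>` when an affinity is set.

    cpu_affinity uses taskset's CPU-list syntax, e.g. "140-143" or "0,2,4".
    With None or an empty string the command is returned unchanged.
    """
    if not cpu_affinity:
        return list(cmd)
    return ["taskset", "-c", cpu_affinity, *cmd]


if __name__ == "__main__":
    # Placeholder binary and config path, for illustration only.
    base_cmd = ["/path/to/binary", "-c", "/path/to/config"]
    print(with_cpu_affinity(base_cmd, "140-143"))
    print(with_cpu_affinity(base_cmd, None))
```

Leaving the field unset keeps the original, unpinned command, so clusters with a different core layout are unaffected.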

Recent runs show prefill warmup completes in <60s with prebuilt-v3 container + persisted flashinfer/deepgemm caches, well below the upstream 1800s default. The 7200s bump never triggered. Remove configs/qwen35-nvfp4-wideep-setup.sh and setup_script: references from the five disagg recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-05-22T17:01:16Z

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@3df8ed5). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/srtctl/cli/setup_head.py	0.00%	4 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #167   +/-   ##
=======================================
  Coverage        ?   65.10%           
=======================================
  Files           ?       67           
  Lines           ?     8217           
  Branches        ?        0           
=======================================
  Hits            ?     5350           
  Misses          ?     2867           
  Partials        ?        0

xiaoweiw-nv and others added 2 commits May 19, 2026 03:16

Add Qwen3.5 NVFP4 WideEP DeepEP+CuteDSL_v1 recipe

8ce6a17

xiaoweiw-nv requested review from alec-flowers, csahithi, hjjq, ishandhanani, kedarpotdar-nv, kyleliang-nv, nlevin-ui and qiching as code owners May 20, 2026 02:42

samuellees reviewed May 21, 2026

YAMY1234 changed the title ~~Add NVFP4 WideEP disaggregated DEP8/DEP16/DEP32 recipes for Qwen3.5-397B-A17B~~ [NOT RATE MATCHED]Add NVFP4 WideEP disaggregated DEP8/DEP16/DEP32 recipes for Qwen3.5-397B-A17B May 22, 2026

Add field to pin etcd cpus

43d5292

xiaoweiw-nv wants to merge 5 commits into NVIDIA:main from xiaoweiw-nv:qwen3.5-nvfp4-wideep

5 participants