[NOT RATE MATCHED] Add NVFP4 WideEP disaggregated DEP8/DEP16/DEP32 recipes for Qwen3.5-397B-A17B #167

Open
xiaoweiw-nv wants to merge 5 commits into NVIDIA:main from xiaoweiw-nv:qwen3.5-nvfp4-wideep

@xiaoweiw-nv
Summary

Adds reproducible recipes for Qwen3.5-397B-A17B-NVFP4 disaggregated benchmarks on GB200 (aws-dfw). Covers three decode scale points (DEP8/DEP16/DEP32) with DEP4 prefill workers,
mooncake KV transfer, DeepEP low-latency MoE all-to-all, and dp-attention. One recipe per decode scale, full concurrency sweep in each, so anyone can reproduce our perf numbers with
srtctl run <recipe.yaml>.
  
Files
  
recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/
  dep4-dep8.yaml      # 2P1D, 4 nodes (2 prefill + 2 decode), cc 1..2048
  dep4-dep16.yaml     # 4P1D, 8 nodes (4 prefill + 4 decode), cc 1..4096
  dep4-dep32.yaml     # 8P1D, 16 nodes (8 prefill + 8 decode), cc 1..8192
  configs/qwen35-nvfp4-wideep-setup.sh   # setup script

Topology (per recipe)

┌────────────┬─────────────────────────────┬────────────────┬───────┬──────┐
│   Recipe   │ Prefill workers (DEP4 each) │ Decode worker  │ Nodes │ GPUs │
├────────────┼─────────────────────────────┼────────────────┼───────┼──────┤
│ dep4-dep8  │ 2 × 4 GPU                   │ DEP8 / 2 node  │ 4     │ 16   │
├────────────┼─────────────────────────────┼────────────────┼───────┼──────┤
│ dep4-dep16 │ 4 × 4 GPU                   │ DEP16 / 4 node │ 8     │ 32   │
├────────────┼─────────────────────────────┼────────────────┼───────┼──────┤
│ dep4-dep32 │ 8 × 4 GPU                   │ DEP32 / 8 node │ 16    │ 64   │
└────────────┴─────────────────────────────┴────────────────┴───────┴──────┘
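
The node and GPU counts in the table follow from simple arithmetic. A minimal sketch, assuming 4 GPUs per node (inferred from the table: 16 GPUs on 4 nodes); `GPUS_PER_NODE` and `topology` are illustrative names and not part of srtctl:

```python
GPUS_PER_NODE = 4  # assumption: inferred from the table (16 GPUs / 4 nodes)
PREFILL_DEP = 4    # each prefill worker is DEP4, per the table header


def topology(prefill_workers: int, decode_dep: int) -> dict:
    """Return node and GPU counts for one recipe."""
    prefill_gpus = prefill_workers * PREFILL_DEP
    total_gpus = prefill_gpus + decode_dep
    return {
        "prefill_nodes": prefill_gpus // GPUS_PER_NODE,
        "decode_nodes": decode_dep // GPUS_PER_NODE,
        "nodes": total_gpus // GPUS_PER_NODE,
        "gpus": total_gpus,
    }


for name, workers, dep in [("dep4-dep8", 2, 8), ("dep4-dep16", 4, 16), ("dep4-dep32", 8, 32)]:
    print(name, topology(workers, dep))
```

This only restates the table; the real placement is whatever the recipe YAML requests.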

Backend (common)
- Frontend: dynamo (multi-prefill orchestration)
- Prefill: attention-backend=trtllm_mha, moe-runner-backend=flashinfer_trtllm, no DeepEP, disable-cuda-graph=true, chunked-prefill-size=65536, context-length=2020
- Decode: attention-backend=trtllm_mha, moe-runner-backend=flashinfer_cutedsl, moe-a2a-backend=deepep (low_latency), cuda-graph-max-bs=512, enable-dp-attention=true,
prefill-round-robin-balance=true
- KV transfer: mooncake (default), SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True
- Model: qwen3.5-nvfp4 alias; container: dev alias (docker://lmsysorg/sglang:dev-cu13)

Notable knobs / known issues

1. prefill.disable-flashinfer-autotune: true — speeds prefill worker boot when cache is primed (autotune redundant on subsequent runs).
2. max-running-requests / max-mamba-cache-size: dep8 capped at 2048 (top of sweep); dep16/dep32 at 8192 to handle cc=4096/8192 with headroom. SGLang divides these by
data-parallel-size for per-DP slot counts, so per-rank mamba budget stays bounded.
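
To make the per-rank budget concrete, here is a rough sketch of the division described in item 2. It assumes data-parallel-size equals the decode DEP degree (8/16/32) and that the split is an even integer division; neither detail is confirmed here, and `per_dp_slots` is a hypothetical helper, not an SGLang function:

```python
def per_dp_slots(global_limit: int, dp_size: int) -> int:
    """Per-DP-rank slot count, assuming an even integer split."""
    return global_limit // dp_size


# Assumption: data-parallel-size matches the decode DEP degree.
for name, limit, dp_size in [("dep8", 2048, 8), ("dep16", 8192, 16), ("dep32", 8192, 32)]:
    print(f"{name}: {per_dp_slots(limit, dp_size)} slots per DP rank")
```

Under these assumptions the per-rank count stays at 256 or 512 across all three recipes.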

Scaling observations (motivating future work, not blocking this PR)

DEP8 and DEP16 scale near-linearly to peak efficiency (~92-93%) at cc=2048-4096. DEP32 saturates at ~63% efficiency at cc=4096 and drops to ~40% at cc=8192 despite 2× the GPUs of
DEP16. The investigation is still ongoing; the cause is likely unrelated to prefill or decode compute. This is mentioned for context; the recipes faithfully reproduce the current state of the stack.

Test plan

- srtctl run recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/dep4-dep8.yaml completes full sweep on GB200
- srtctl run recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/dep4-dep16.yaml completes full sweep on GB200
- srtctl run recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/dep4-dep32.yaml completes full sweep on GB200
- gen_throughput.csv and results_concurrency_*.json produced under logs/
- Numbers within ±5% of the baseline table above
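
The ±5% check in the last item could be scripted along these lines. This is a sketch only: the column layout of gen_throughput.csv is not described here, so the helpers take plain `{concurrency: throughput}` dicts, and both function names are hypothetical:

```python
def within_tolerance(measured: float, baseline: float, tol: float = 0.05) -> bool:
    """True if measured is within +/- tol (relative) of baseline."""
    if baseline == 0:
        return measured == 0
    return abs(measured - baseline) / abs(baseline) <= tol


def out_of_tolerance(measured: dict, baseline: dict, tol: float = 0.05) -> list:
    """Concurrency levels that are missing or fall outside the tolerance band."""
    return [
        cc
        for cc, base in baseline.items()
        if cc not in measured or not within_tolerance(measured[cc], base, tol)
    ]
```

Calling `out_of_tolerance(measured, baseline)` with values loaded from a run and from the baseline yields the concurrency points that need a second look; an empty list means the run passes.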

xiaoweiw-nv and others added 2 commits May 19, 2026 03:16
Drops misleading pr-main prefix; name now reflects workload scope
(Qwen3.5 NVFP4 WideEP). Updates setup_script reference in the three
dep4-dep{8,16,32} disagg recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

One DEP4 prefill worker sustains ~89K tok/s input at ISL=OSL=1000;
only cc>=2048 in DEP16/DEP32 exceeds that and needs a second prefill
worker. Split DEP16/DEP32 into lowcc (1pw, cc=1..1024) and highcc
(2pw, cc=2048..N) variants. DEP8 stays single-config with 1 prefill
worker (tops out at ~50K tok/s, never saturates one worker). Saves nodes
on low-cc sweeps:
  DEP8:          4 ->  3 nodes
  DEP16-lowcc:   8 ->  5 nodes
  DEP16-highcc:  8 ->  6 nodes
  DEP32-lowcc:  16 ->  9 nodes
  DEP32-highcc: 16 -> 10 nodes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
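
The node counts in the commit message above can be reproduced with a short sketch. It assumes 4 GPUs per node and takes the cc>=2048 threshold and the DEP16/DEP32 condition directly from the message; the function names are illustrative and not part of srtctl:

```python
GPUS_PER_NODE = 4        # assumption: 4 GPUs per node
PREFILL_DEP = 4          # each prefill worker is DEP4
HIGHCC_THRESHOLD = 2048  # from the commit message: cc >= 2048 needs a second worker


def prefill_workers(decode_dep: int, max_cc: int) -> int:
    """One prefill worker unless a DEP16/DEP32 sweep reaches cc >= 2048."""
    if decode_dep >= 16 and max_cc >= HIGHCC_THRESHOLD:
        return 2
    return 1


def nodes_needed(decode_dep: int, max_cc: int) -> int:
    """Total nodes for the prefill workers plus the decode worker."""
    gpus = prefill_workers(decode_dep, max_cc) * PREFILL_DEP + decode_dep
    return gpus // GPUS_PER_NODE


for label, dep, cc in [
    ("DEP8", 8, 2048),
    ("DEP16-lowcc", 16, 1024),
    ("DEP16-highcc", 16, 4096),
    ("DEP32-lowcc", 32, 1024),
    ("DEP32-highcc", 32, 8192),
]:
    print(f"{label}: {nodes_needed(dep, cc)} nodes")
```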
Comment thread configs/qwen35-nvfp4-wideep-setup.sh Outdated
Comment thread recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/dep4-dep16-highcc.yaml Outdated
Comment thread src/srtctl/cli/setup_head.py Outdated
f.write(f'jetstream {{ store_dir: "{nats_store_dir}" }}\n')
logger.info("Starting NATS server (max_payload: %dMB)...", max_payload_mb)
cmd = [binary_path, "-c", nats_config_path]
cmd = ["taskset", "-c", "140-143", binary_path, "-c", nats_config_path] # OMC_CPU_PIN_PATCH_APPLIED
Contributor

Why do we need to change this, please?

Author

This is to pin ETCD to dedicated CPU cores; otherwise it easily runs into timeouts. ETCD has a heartbeat mechanism: if the CPU is too busy (JIT/warmup) and ETCD misses the heartbeat window, it exits with an error. Pinning ETCD to specific CPUs prevents CPU starvation.

Collaborator

cc @ishandhanani does this change look good to you? Can merge if you are OK with the change in src/srtctl/cli/setup_head.py

Author

> cc @ishandhanani does this change look good to you? Can merge if you are OK with the change in src/srtctl/cli/setup_head.py

Another workaround, if you don't want this code change, is to set a longer ETCD heartbeat interval.

Collaborator

Can we possibly make this tunable? We should not hardcode it.

Author

> Can we possibly make this tunable? We should not hardcode

Added an etcd_cpu_affinity field; users can specify the CPU affinity with:

infra:
  etcd_cpu_affinity: "140-143"
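
For reviewers, a minimal sketch of how an optional affinity value could gate the `taskset` prefix instead of hardcoding it. `build_server_cmd` and its parameters are hypothetical and do not reproduce the actual code in src/srtctl/cli/setup_head.py:

```python
from typing import List, Optional


def build_server_cmd(
    binary_path: str,
    config_path: str,
    cpu_affinity: Optional[str] = None,
) -> List[str]:
    """Build the launch command, pinning to CPUs only when an affinity is given."""
    cmd = [binary_path, "-c", config_path]
    if cpu_affinity:
        # `taskset -c <cpu-list> <command>` runs the command restricted to those CPUs.
        cmd = ["taskset", "-c", cpu_affinity] + cmd
    return cmd


print(build_server_cmd("server-binary", "/tmp/server.conf"))
print(build_server_cmd("server-binary", "/tmp/server.conf", cpu_affinity="140-143"))
```

Leaving the field unset keeps the original unpinned command, so clusters with a different CPU layout are unaffected.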

Recent runs show prefill warmup completes in <60s with prebuilt-v3
container + persisted flashinfer/deepgemm caches, well below the
upstream 1800s default. The 7200s bump never triggered. Remove
configs/qwen35-nvfp4-wideep-setup.sh and setup_script: references
from the five disagg recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@YAMY1234 changed the title from "Add NVFP4 WideEP disaggregated DEP8/DEP16/DEP32 recipes for Qwen3.5-397B-A17B" to "[NOT RATE MATCHED]Add NVFP4 WideEP disaggregated DEP8/DEP16/DEP32 recipes for Qwen3.5-397B-A17B" on May 22, 2026
@codecov-commenter
Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@3df8ed5). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/srtctl/cli/setup_head.py | 0.00% | 4 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #167   +/-   ##
=======================================
  Coverage        ?   65.10%           
=======================================
  Files           ?       67           
  Lines           ?     8217           
  Branches        ?        0           
=======================================
  Hits            ?     5350           
  Misses          ?     2867           
  Partials        ?        0           

5 participants