recipes(qwen3.5): refresh fp8 mtp-off wideep configs#149

Open
zhengd-nv wants to merge 2 commits into
NVIDIA:mainfrom
zhengd-nv:qwen3.5-wideep-recipes-refresh

@zhengd-nv
Contributor

Summary

Refreshes the Qwen3.5 FP8 mtp-off wideep recipes introduced by #128, driven by decode-side performance
sweeps on GB200.

Changes per recipe

  • setup_script: switched from rebuild-deepep.sh to setup-router-and-deepep.sh. Even though sglang-router is not actually used by these recipes, this setup script additionally seeds the decode SGLANG_DG_CACHE_DIR (/tmp/deepgemm-cache, node-local) from the shared /configs cache; the underlying deepep-rebuild step is the same.
  • identity: declares the sglang repo, container image, and framework version actually exercised (lmsysorg/sglang:0.5.10.post1), so jobs that apply these recipes can verify they're running the intended version.
  • name: prefixed with wideep- so the runtime job name matches the file name (e.g. qwen3.5-wideep-1p1d-dep8-dpep-ccsweep).
  • Drop deprecated decode option prefill-round-robin-balance: true.
  • DEP16+: lower decode mem-fraction-static from 0.80 to 0.75. DeepEP buffers scale linearly with ep_size; at 0.80, CUDA graph capture runs out of memory inside cuda_graph_runner.capture_one_batch_size for DEP16/DEP32.
  • Add SGLANG_HEALTH_STARTING_OK / SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION to both prefill and decode env; align decode SGLANG_DG_CACHE_DIR and SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK.
  • Extend 1P ccsweep ranges so both the prefill-unsaturated and prefill-saturated regimes are sampled (DEP16 → cc=4096, DEP32 → cc=8192).
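
The mem-fraction-static change can be read as a headroom trade: whatever is not reserved for the static pool (weights and KV cache) is what remains for DeepEP buffers and CUDA graph capture. A minimal sketch of that arithmetic follows; the function name and the 180 GiB device size are illustrative assumptions, not values taken from these recipes or measured on GB200.

```python
def dynamic_headroom_gib(total_gib: float, mem_fraction_static: float) -> float:
    """Return the memory left outside the static pool, in GiB."""
    if not 0.0 < mem_fraction_static < 1.0:
        raise ValueError("mem_fraction_static must be in (0, 1)")
    return total_gib * (1.0 - mem_fraction_static)


# Hypothetical device size, for illustration only; not a measured GB200 figure.
TOTAL_GIB = 180.0

before = dynamic_headroom_gib(TOTAL_GIB, 0.80)  # setting that hit OOM on DEP16/DEP32
after = dynamic_headroom_gib(TOTAL_GIB, 0.75)   # setting used by the refreshed recipes

print(f"headroom at 0.80: {before:.1f} GiB")
print(f"headroom at 0.75: {after:.1f} GiB")
print(f"extra headroom:   {after - before:.1f} GiB")
```

Under this simplified model the change frees 5% of device memory per GPU, regardless of the actual device size; it does not model how much the DeepEP buffers themselves need at a given ep_size.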

New recipe

  • wideep-3p1d-dep16-dpep-cc4096.yaml: mirrors wideep-3p1d-dep32-dpep-cc4096.yaml but for DEP16. At cc=4096 (per-DP=256), 1P and 2P prefill remain saturated; only 3P unlocks the decode at full per-DP capacity, and this point produces the highest realized out/gpu of all tested configurations.
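
The per-DP figure quoted above is total concurrency divided by the number of decode DP ranks. The sketch below reproduces that arithmetic, assuming the DP size equals the DEP number (16 for DEP16, 32 for DEP32); the helper name is hypothetical, and the DEP8 point uses the top of the 8 → 2048 sweep from the test plan.

```python
def per_dp_concurrency(cc: int, dp_size: int) -> int:
    """Split a total concurrency evenly across decode DP ranks."""
    if dp_size <= 0:
        raise ValueError("dp_size must be positive")
    if cc % dp_size != 0:
        raise ValueError("cc is not evenly divisible by dp_size")
    return cc // dp_size


# (total concurrency, assumed DP size) for the upper sweep points in this PR.
points = {
    "DEP8": (2048, 8),
    "DEP16": (4096, 16),
    "DEP32": (8192, 32),
}

for name, (cc, dp_size) in points.items():
    print(f"{name}: cc={cc} -> per-DP={per_dp_concurrency(cc, dp_size)}")
```

All three upper points land on the same per-DP load, which is consistent with the extended DEP16 and DEP32 ranges targeting the same per-rank decode capacity as the DEP8 sweep.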

Caveat

6p1d-dep32-cc8192 (and any other multi-prefill recipe that fans out past ~5 workers) tends to hit zmq.error.ZMQError: Address already in use during sglang engine init when run on plain main. #134 (port jitter / odd-port allocation) resolves this; on plain main, multiple resubmissions may be required for the 6P+ configurations to start cleanly.
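
For intuition only, the failure mode and the general idea of port jitter can be sketched with plain standard-library sockets. This is not the implementation in #134 and it does not use ZMQ or sglang; the helper names, the spread of 200, and the retry count are all assumptions made for illustration.

```python
import errno
import random
import socket

PORT_MIN, PORT_MAX = 1024, 65535


def try_bind(port: int) -> socket.socket:
    """Bind a TCP socket on loopback; close it and re-raise on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
    except OSError:
        sock.close()
        raise
    return sock


def bind_with_jitter(base_port, spread=200, attempts=20, rng=None):
    """Try base_port first, then base_port plus a random offset on collisions."""
    rng = rng or random.Random()
    candidate = base_port
    for _ in range(attempts):
        try:
            return try_bind(candidate)
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise
        # Collision: pick a jittered port, wrapped into the valid range.
        offset = rng.randint(1, spread)
        span = PORT_MAX - PORT_MIN + 1
        candidate = PORT_MIN + (base_port - PORT_MIN + offset) % span
    raise RuntimeError(f"no free port found near {base_port}")


# Simulate two workers that were handed the same port.
holder = try_bind(0)  # port 0 lets the OS pick a free port
taken = holder.getsockname()[1]
worker = bind_with_jitter(taken)
chosen = worker.getsockname()[1]
print(f"port {taken} was in use; worker bound to {chosen} instead")

holder.close()
worker.close()
```

The more workers derive their ports from the same base, the more likely two of them collide, which matches the observation that the problem shows up once a recipe fans out to many prefill workers.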

Test plan

  • srtctl dry-run -f <recipe> for all 8 recipes (node counts + identity fields verified)
  • srtctl apply -f wideep-1p1d-dep8-dpep-ccsweep.yaml: completes the full cc sweep (8 → 2048); output throughput matches the prior locally-generated config to within 1%
  • srtctl apply -f wideep-6p1d-dep32-dpep-cc8192.yaml (on a branch carrying Sglang port jitter #134): initializes cleanly (no ZMQError) and completes the cc=8192 point

@codecov-commenter

codecov-commenter commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@3183b4c). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #149   +/-   ##
=======================================
  Coverage        ?   65.07%           
=======================================
  Files           ?       67           
  Lines           ?     8214           
  Branches        ?        0           
=======================================
  Hits            ?     5345           
  Misses          ?     2869           
  Partials        ?        0           


Minimal refresh of the wideep recipes from NVIDIA#128 (d571e42) driven by
recent decode-side sweeps on GB200. Each existing recipe gets exactly
the same four-line set of functional changes:

- name: drop the `-router` suffix and add a `wideep-` prefix so the
  runtime job name matches the file name.
- setup_script: rebuild-deepep.sh -> setup-router-and-deepep.sh. The
  new script additionally seeds the decode-side SGLANG_DG_CACHE_DIR
  (/tmp/deepgemm-cache, node-local) from the shared /configs cache;
  the deepep-rebuild step itself is unchanged. sglang-router is not
  actually exercised by these recipes.
- identity: declare the sglang repo + container image + framework
  version actually exercised (lmsysorg/sglang:0.5.10.post1), so jobs
  that apply these recipes can verify they are running the intended
  version.
- Drop deprecated decode option `prefill-round-robin-balance: true`.

wideep-1p1d-dep8-dpep-ccsweep additionally aligns its decode
SGLANG_DG_CACHE_DIR with the other recipes (/configs -> /tmp), since
the new setup_script seeds the node-local path.

New file:
- wideep-3p1d-dep16-dpep-cc4096.yaml — mirrors wideep-3p1d-dep32-dpep-cc4096.yaml
  but targets DEP16. At cc=4096 (per-DP=256), 1P and 2P prefill remain
  saturated; only 3P unlocks decode at full per-DP capacity, and this
  point produces the highest realized out/gpu of all tested
  configurations.

Known caveat:
- 6p1d-dep32-cc8192 (and any other recipe that fans out past ~5
  prefill workers) is prone to `zmq.error.ZMQError: Address already
  in use` during sglang engine init. NVIDIA#134 (port jitter / odd-port
  allocation) resolves this; without it multiple resubmissions may
  be required for the 6P+ configurations to start cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zhengd-nv zhengd-nv force-pushed the qwen3.5-wideep-recipes-refresh branch from 6279c0c to efb282b Compare May 15, 2026 08:43
DEP16 DeepEP buffers scale with ep_size the same way they do for
DEP32; 0.80 OOMs during cuda-graph capture on real cc=4096 runs.
Other DEP16+ recipes (2p1d-dep16, 3p1d-dep16, 1p1d-dep32, ...) are
already at 0.75; this brings the 1p1d-dep16 ccsweep recipe into the
same class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>