recipes(qwen3.5): refresh fp8 mtp-off wideep configs #149
Open
zhengd-nv wants to merge 2 commits into
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #149 +/- ##
=======================================
Coverage ? 65.07%
=======================================
Files ? 67
Lines ? 8214
Branches ? 0
=======================================
Hits ? 5345
Misses ? 2869
Partials ? 0
Minimal refresh of the wideep recipes from NVIDIA#128 (d571e42) driven by recent decode-side sweeps on GB200.

Each existing recipe gets exactly the same four-line set of functional changes:

- name: drop the `-router` suffix and add a `wideep-` prefix so the runtime job name matches the file name.
- setup_script: rebuild-deepep.sh -> setup-router-and-deepep.sh. The new script additionally seeds the decode-side SGLANG_DG_CACHE_DIR (/tmp/deepgemm-cache, node-local) from the shared /configs cache; the deepep-rebuild step itself is unchanged. sglang-router is not actually exercised by these recipes.
- identity: declare the sglang repo + container image + framework version actually exercised (lmsysorg/sglang:0.5.10.post1), so jobs that apply these recipes can verify they are running the intended version.
- Drop deprecated decode option `prefill-round-robin-balance: true`.

wideep-1p1d-dep8-dpep-ccsweep additionally aligns its decode SGLANG_DG_CACHE_DIR with the other recipes (/configs -> /tmp), since the new setup_script seeds the node-local path.

New file:

- wideep-3p1d-dep16-dpep-cc4096.yaml: mirrors wideep-3p1d-dep32-dpep-cc4096.yaml but targets DEP16. At cc=4096 (per-DP=256), 1P and 2P prefill remain saturated; only 3P unlocks decode at full per-DP capacity, and this point produces the highest realized out/gpu of all tested configurations.

Known caveat:

- 6p1d-dep32-cc8192 (and any other recipe that fans out past ~5 prefill workers) is prone to `zmq.error.ZMQError: Address already in use` during sglang engine init. NVIDIA#134 (port jitter / odd-port allocation) resolves this; without it multiple resubmissions may be required for the 6P+ configurations to start cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
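The per-DP figure quoted above follows from dividing the total concurrency by the number of decode data-parallel ranks. A minimal sketch of that arithmetic, assuming the DEP size equals the decode DP rank count and that requests are spread evenly across ranks; the function name `per_dp_concurrency` is hypothetical and is not part of srtctl or sglang:

```python
def per_dp_concurrency(total_concurrency: int, dep_size: int) -> int:
    """Requests each decode DP rank handles if load is spread evenly.

    Assumption: dep_size equals the number of decode data-parallel ranks.
    """
    if dep_size <= 0:
        raise ValueError("dep_size must be positive")
    if total_concurrency % dep_size != 0:
        raise ValueError("concurrency is not evenly divisible across ranks")
    return total_concurrency // dep_size


# cc=4096 on DEP16 reproduces the per-DP=256 figure from the commit message.
print(per_dp_concurrency(4096, 16))  # 256
# The same total concurrency on DEP32 halves the per-rank load.
print(per_dp_concurrency(4096, 32))  # 128
```

Under the same assumption, the cc=8192 point on DEP32 also works out to 256 requests per rank.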
6279c0c to efb282b
DEP16 DeepEP buffers scale with ep_size the same way they do for DEP32; a mem-fraction-static of 0.80 OOMs during CUDA graph capture on real cc=4096 runs. Other DEP16+ recipes (2p1d-dep16, 3p1d-dep16, 1p1d-dep32, ...) are already at 0.75; this brings the 1p1d-dep16 ccsweep recipe into the same class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Refresh of the Qwen3.5 FP8 mtp-off wideep recipes introduced by #128, driven by decode-side performance sweeps on GB200.
Changes per recipe
- `setup_script`: switched from `rebuild-deepep.sh` to `setup-router-and-deepep.sh`. Although sglang-router is not actually used by these recipes, the new script additionally seeds the decode `SGLANG_DG_CACHE_DIR` (`/tmp/deepgemm-cache`, node-local) from the shared `/configs` cache; the underlying deepep-rebuild step is unchanged.
- `identity`: declares the sglang repo, container image, and framework version actually exercised (`lmsysorg/sglang:0.5.10.post1`), so jobs that apply these recipes can verify they are running the intended version.
- `name`: prefixed with `wideep-` so the runtime job name matches the file name (e.g. `qwen3.5-wideep-1p1d-dep8-dpep-ccsweep`).
- Dropped the deprecated decode option `prefill-round-robin-balance: true`.
- Lowered `mem-fraction-static` from 0.80 to 0.75. DeepEP buffers scale linearly with `ep_size`; at 0.80, CUDA graph capture OOMs inside `cuda_graph_runner.capture_one_batch_size` for DEP16/DEP32.
- Added `SGLANG_HEALTH_STARTING_OK` / `SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION` to both the prefill and decode env; aligned decode `SGLANG_DG_CACHE_DIR` and `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK`.
New recipe
- `wideep-3p1d-dep16-dpep-cc4096.yaml`: mirrors `wideep-3p1d-dep32-dpep-cc4096.yaml` but for DEP16. At cc=4096 (per-DP=256), 1P and 2P prefill remain saturated; only 3P unlocks decode at full per-DP capacity, and this point produces the highest realized `out/gpu` of all tested configurations.
Caveat
- `6p1d-dep32-cc8192` (and any other multi-prefill recipe that fans out past ~5 workers) tends to hit `zmq.error.ZMQError: Address already in use` during sglang engine init when run on plain `main`. #134 (port jitter / odd-port allocation) resolves this; on plain `main`, multiple resubmissions may be required for the 6P+ configurations to start cleanly.
Test plan
- `srtctl dry-run -f <recipe>` for all 8 recipes (node counts and identity fields verified).
- `srtctl apply -f wideep-1p1d-dep8-dpep-ccsweep.yaml`: completes the full cc sweep (8 → 2048); output throughput matches the prior locally generated config to within 1%.
- `srtctl apply -f wideep-6p1d-dep32-dpep-cc8192.yaml` (on a branch carrying "Sglang port jitter" #134): initializes cleanly (no ZMQError) and completes the cc=8192 point.
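For readers unfamiliar with the `Address already in use` failure in the caveat, the general idea behind port jitter can be sketched as follows. This is an illustration only, not the #134 implementation: it uses plain TCP sockets from the Python standard library rather than ZMQ, and the helper name `bind_with_jitter` and its `span` and `attempts` parameters are hypothetical.

```python
import random
import socket


def bind_with_jitter(base_port: int, span: int = 1000, attempts: int = 20) -> socket.socket:
    """Bind a TCP socket to base_port plus a random offset, retrying on collisions."""
    last_error = None
    for _ in range(attempts):
        # Pick a random port in [base_port, base_port + span).
        port = base_port + random.randrange(span)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
            return sock
        except OSError as exc:  # e.g. EADDRINUSE when the port is taken
            sock.close()
            last_error = exc
    raise RuntimeError(f"no free port found after {attempts} attempts") from last_error


# Two workers starting from the same base port end up on different ports.
first = bind_with_jitter(40000)
second = bind_with_jitter(40000)
print(first.getsockname()[1] != second.getsockname()[1])
first.close()
second.close()
```

Spreading workers across a port range makes a collision unlikely, and retrying handles the remaining cases; without either, many workers started at once on fixed ports can race for the same address.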