[feat] Init true on policy with qwen_moe #3
Draft
maocheng23 wants to merge 6 commits into feat/true_on_policy_qwen_dense
This was referenced May 1, 2026
Summary
Adds the Qwen3-MoE true-on-policy contract on top of the dense stack. Stacked on `feat/true_on_policy_qwen_dense`; diff in this PR is only the MoE delta. This is fork-side review; upstream PR will retarget to `sgl-project:main` once dense lands.
This is one of three tightly-coupled MoE PRs that must land together: they share a single contract identifier, `qwen3_moe_true_on_policy_v1`, defined by a vendored schema in each repo.
Companion PRs (MoE, must land in lockstep):
Stacked on dense (must land first):
Target
Bit-identical (exact-zero) logprob parity between the SGLang rollout engine and the Megatron trainer for every scored response token at TP=1/EP=4/CP=2 for Qwen3-30B-A3B (MoE).
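A minimal sketch of what an exact-zero parity check over scored response tokens could look like. The function name `max_logprob_abs_diff` and the list-based inputs are illustrative assumptions, not the trainer's actual metric code; the point is only that the comparison is exact equality on masked positions, with no tolerance.

```python
from typing import Sequence


def max_logprob_abs_diff(
    rollout_logprobs: Sequence[float],
    train_logprobs: Sequence[float],
    response_mask: Sequence[bool],
) -> float:
    """Largest |rollout - train| over tokens where response_mask is True.

    Hypothetical helper: returns 0.0 when no token is scored.
    """
    if not (len(rollout_logprobs) == len(train_logprobs) == len(response_mask)):
        raise ValueError("all inputs must have the same length")
    diffs = [
        abs(r - t)
        for r, t, scored in zip(rollout_logprobs, train_logprobs, response_mask)
        if scored
    ]
    return max(diffs, default=0.0)


# Prompt token (index 0) is unscored; the three response tokens must match exactly.
rollout = [-0.25, -1.5, -0.125, -2.0]
train = [-0.75, -1.5, -0.125, -2.0]
mask = [False, True, True, True]

# Bit-identical parity means equality with 0.0, not "close to" 0.0.
assert max_logprob_abs_diff(rollout, train, mask) == 0.0
```

Comparing against exactly `0.0` rather than using a tolerance is the design choice that matters here: any tolerance would hide the small reduction-order differences the contract is meant to eliminate.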
Validated on H200 x8: `train_rollout_logprob_abs_diff = 0.0` for 3 steps with both the full deterministic decode path and the prefill-only fast-decode path.
Design
The MoE contract reuses the dense three-layer architecture (Miles -> SGLang -> Megatron), adding a new `qwen3_moe_true_on_policy_v1` contract object. SGLang remains the numerical source of truth: Megatron's training forward reproduces SGLang's MoE numerics in a differentiable surface that delegates the no-grad inference path to SGLang's fused-experts kernel.
New runtime-policy fields used to keep the rollout/train forward identical under EP:
- `deterministic_moe_routing`: fp32 router pre-softmax, deterministic top-k tie-break
- `moe_topk_tiebreak`: explicit policy hook for tie-break ordering
- `deterministic_moe_dispatch`: deterministic permute (no permute fusion under contract)
- `deterministic_moe_combine`: deterministic combine path
- `ep_invariant_moe`: engages when `ep_size > 1`
In this PR (SGLang)
- `python/sglang/srt/true_on_policy/`:
  - `schema.py`: adds `QWEN3_MOE_TRUE_ON_POLICY_V1_SCHEMA` and `qwen3_moe_sglang_math` kernel contract (byte-identical with Megatron and Miles copies)
  - `contracts.py`: adds `QWEN3_MOE_TRUE_ON_POLICY_V1` registry entry; `policy_for(server_args)` derives `ep_invariant_moe` from `ep_size > 1` and emits MoE-only fields when `model_family == 'qwen3_moe'`
  - `config.py`: adds prefill-only-deterministic helpers and per-forward-pass runtime-policy scope so decode CUDA-graph capture can run non-deterministic while logprob-recompute prefill stays under the contract
- `python/sglang/srt/tp_invariant_ops/`: exposes the deterministic K-block matmul under a stable name for MoE expert grouped GEMMs
- `python/sglang/srt/models/qwen3_moe.py`:
- `python/sglang/srt/model_executor/model_runner.py`, `forward_batch_info.py`: `patch_prefill_only_deterministic_attention_backend(...)` temporarily forces FA3 prefill `num_splits=1` during deterministic prefill and restores afterwards; decode stays graph-captured and non-deterministic
- `python/sglang/srt/server_args.py`: `--enable-prefill-only-deterministic-inference` no longer auto-upgrades to full `enable_deterministic_inference`
- `python/sglang/srt/managers/tp_worker.py`, `layers/communicator.py`, `layers/dp_attention.py`, `layers/quantization/unquant.py`, `distributed/communication_op.py`: minimal call-site changes to honour the per-forward runtime-policy scope
Validation
Remote H200 x8, container `miles-maocheng`, Qwen3-30B-A3B, TP=1/EP=4/CP=2:
Source: `recovery/qwen3_moe_clean/journal/2026-04-30-moe-onpolicy-normal-validation.md`.
CPU/unit tests:
- `test/registered/core/test_on_policy_wiring.py`: extended with MoE coverage (graph-capture policy scope, `qwen3_moe_attention_uses_dense_qk_dtype_contract`, `qwen3_moe_experts_use_weight_dtype_under_deterministic_routing`, prefill-only flag wiring)
- `test/registered/core/test_tp_invariant_ops.py`: extended for MoE grouped-GEMM under the deterministic K-block contract
Out of scope
- `MODEL_ARGS_DISABLE_MOE_PERMUTE_FUSION=1` (Miles enforces); a permute-fusion-equivalent deterministic path can land separately
Test plan
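One property a check on deterministic routing could assert is that expert selection is a pure function of the router scores, including when scores tie. The sketch below is only an illustration of that idea: it assumes ties resolve to the lowest expert index, which this PR does not state (the actual ordering is whatever the `moe_topk_tiebreak` policy hook selects), and `topk_lowest_index_first` is a hypothetical name, not a function in SGLang.

```python
from typing import List, Sequence


def topk_lowest_index_first(router_logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest logits; equal logits resolve to the lower index.

    Assumed tie-break rule for illustration only.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be in [1, num_experts]")
    # Sort by descending logit, then ascending expert index, so the result
    # is fully determined by the values and never by evaluation order.
    order = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))
    return order[:k]


# Experts 1 and 2 tie for the top score; experts 0 and 3 tie for the lowest.
logits = [0.5, 2.0, 2.0, 0.5, 1.0]
assert topk_lowest_index_first(logits, 2) == [1, 2]
assert topk_lowest_index_first(logits, 4) == [1, 2, 4, 0]
```

A total order on `(score, index)` pairs is what makes the selection reproducible across rollout and training; which index wins is a policy choice, but it has to be the same choice on both sides.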
🤖 Generated with Claude Code