[feat] Init true on policy with qwen_moe #30
Draft: maocheng23 wants to merge 4 commits into feat/true_on_policy_qwen_dense
Summary
Adds the Qwen3-MoE true-on-policy contract on top of the dense stack. Stacked on feat/true_on_policy_qwen_dense — the diff in this PR is only the MoE delta. This is one of three tightly-coupled MoE PRs that must land together — they share a single contract identifier, qwen3_moe_true_on_policy_v1, defined by a vendored schema in each repo.

Companion PRs (MoE, must land in lockstep):

Stacked on dense (must land first):
Target
Bit-identical (exact-zero) logprob parity between the SGLang rollout engine and the Megatron trainer for every scored response token at TP=1/EP=4/CP=2 for Qwen3-30B-A3B (MoE), with fully differentiable backward.
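For reference, the parity gate here is an exact-zero check, not a tolerance band. A minimal sketch of what it amounts to, with illustrative tensor names rather than this repo's actual API:

```python
import torch

def assert_true_on_policy(rollout_logprobs: torch.Tensor,
                          trainer_logprobs: torch.Tensor) -> float:
    # Bit-identical parity means the max absolute difference over all scored
    # response tokens is exactly 0.0, not merely below a tolerance.
    diff = (rollout_logprobs - trainer_logprobs).abs().max().item()
    assert diff == 0.0, f"train_rollout_logprob_abs_diff={diff}, expected exact zero"
    return diff
```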
Validated on H200 x8: train_rollout_logprob_abs_diff = 0.0 for 3 steps; gradient audit shows 435 parameters with non-zero gradients across attention, embedding, layernorm, MoE experts, MoE routers, and the output layer.

Design
The MoE contract reuses the dense three-layer architecture (Miles -> SGLang -> Megatron), adding a new qwen3_moe_true_on_policy_v1 contract object. The Megatron training forward reproduces SGLang's MoE numerics on a differentiable surface; the no-grad inference path (reference-model logprobs and on-policy logprob recompute) delegates to SGLang's fused-experts kernel for arithmetic identity with rollout.

Backward stays differentiable — SGLangGroupedMLP inherits from Megatron's GroupedMLP and only swaps to the SGLang fused path when torch.is_grad_enabled() == False.
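A minimal sketch of that dispatch pattern, assuming a stock differentiable experts module and a fused inference kernel; the class and attribute names below are illustrative stand-ins, not the PR's actual signatures:

```python
import torch
import torch.nn as nn

class GradGatedExperts(nn.Module):
    """Illustrative stand-in for the SGLangGroupedMLP pattern: keep the stock
    differentiable experts for any grad-requiring pass, and swap to a fused
    inference kernel only when autograd is off."""

    def __init__(self, stock_experts: nn.Module, fused_inference_fn):
        super().__init__()
        self.stock_experts = stock_experts            # differentiable Megatron path
        self.fused_inference_fn = fused_inference_fn  # rollout-identical fused kernel

    def forward(self, hidden_states, tokens_per_expert):
        if torch.is_grad_enabled():
            # Training forward/backward stays on the stock path, untouched.
            return self.stock_experts(hidden_states, tokens_per_expert)
        # No-grad passes (reference logprobs, on-policy recompute) take the
        # fused path for arithmetic identity with the rollout engine.
        return self.fused_inference_fn(hidden_states, tokens_per_expert)
```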
In this PR (Megatron)

New in megatron/core/true_on_policy/:

- schema.py — adds QWEN3_MOE_TRUE_ON_POLICY_V1_SCHEMA and the qwen3_moe_sglang_math kernel contract (byte-identical with the SGLang and Miles copies)
- contracts.py — adds the QWEN3_MOE_TRUE_ON_POLICY_V1 registry entry; policy_for(config) derives ep_invariant_moe from expert_model_parallel_size > 1 and emits MoE-only fields when model_family == 'qwen3_moe' (sketched after this list)
- moe.py, moe_context.py, moe_reduce.py — typed deterministic-MoE helpers (router pre-softmax dtype, deterministic top-k, EP all-reduce contract for local-masked combine)
- moe_experts.py — SGLangGroupedMLP: a GroupedMLP subclass with a no-grad SGLang fused-experts forward (forward_sglang_local_masked, _forward_sglang_fused_by_source); the grad-enabled path stays on stock Megatron
- provider.py, sglang_backend.py, norm.py — SGLangSpecProvider wires the MoE layer classes for the qwen3_moe family

if self.config.use_sglang: branches in the MoE forward paths; everything reads the typed runtime policy:

- transformer/moe/moe_layer.py — router/dispatch/combine paths gate on the runtime policy; the reference pass routes to SGLang local-masked when the policy says so
- transformer/moe/router.py — fp32 pre-softmax + deterministic top-k tie-break under contract
- transformer/moe/token_dispatcher.py — deterministic permute (no fusion) under contract; EP all-reduce path matches SGLang
- transformer/moe/moe_utils.py — typed expert-routing helpers
- transformer/moe/experts.py — minimal hook so SGLangGroupedMLP can intercept the inference forward
- models/gpt/gpt_layer_specs.py — splits dense vs MoE sharded_state_dict_keys_map (the local-spec layernorm checkpoint-mapping fix that unblocked MoE weight refresh)
- optimizer/cpu_offloading/hybrid_optimizer.py — HybridDeviceOptimizer stale-CPU-copy fix on the weight-refresh path
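A minimal sketch of what the policy_for derivation could look like; the dataclass and its fields are assumptions beyond the ones named above, not the vendored schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoeRuntimePolicy:
    contract: str
    is_moe: bool
    ep_invariant_moe: bool

def policy_for(config) -> MoeRuntimePolicy:
    # MoE-only fields are emitted only for the qwen3_moe family.
    is_moe = getattr(config, "model_family", None) == "qwen3_moe"
    return MoeRuntimePolicy(
        contract="qwen3_moe_true_on_policy_v1" if is_moe else "",
        is_moe=is_moe,
        # EP invariance is derived from the parallelism config, never set by hand.
        ep_invariant_moe=is_moe and getattr(config, "expert_model_parallel_size", 1) > 1,
    )
```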
Validation

Remote H200 x8, container miles-maocheng, Qwen3-30B-A3B, Megatron TP=1/EP=4/CP=2/PP=1, SGLang 2 engines tp=4 ep=4.

Gradient audit (step 0): 435 parameters total, 435 with grad, 435 nonzero — no missing or zero-grad entries across attention, embedding, layernorm, MoE experts, MoE routers, output layer.
Source: recovery/qwen3_moe_clean/journal/2026-04-30-moe-onpolicy-normal-validation.md.
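The audit reported above boils down to counting parameters with present, non-zero gradients after a backward pass. A minimal sketch with a hypothetical helper, not the repo's actual audit script:

```python
import torch.nn as nn

def audit_gradients(model: nn.Module) -> tuple[int, int, int]:
    # After a backward pass, every parameter should carry a grad and none
    # should be stuck at exactly zero.
    total = with_grad = nonzero = 0
    for name, param in model.named_parameters():
        total += 1
        if param.grad is None:
            print(f"missing grad: {name}")
            continue
        with_grad += 1
        if param.grad.abs().sum().item() > 0.0:
            nonzero += 1
        else:
            print(f"zero grad: {name}")
    print(f"{total} parameters total, {with_grad} with grad, {nonzero} nonzero")
    return total, with_grad, nonzero
```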
CPU/unit tests:

- tests/unit_tests/extension/test_sglang_extension.py — extended with MoE coverage (router pre-softmax dtype, top-k tie-break, dispatch/combine determinism, GroupedMLP no-grad SGLang path, EP local-masked path, runtime-policy is_moe derivation)
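For flavor, a self-contained sketch of the tie-break property such a test can pin down; deterministic_topk is a hypothetical helper, not the PR's router code:

```python
import torch

def deterministic_topk(scores: torch.Tensor, k: int):
    # Stable descending sort: tied scores always resolve to the lower
    # expert index, on every call and on every rank.
    order = torch.argsort(scores, dim=-1, descending=True, stable=True)
    top_idx = order[..., :k]
    return torch.gather(scores, -1, top_idx), top_idx

def test_topk_tie_break_is_deterministic():
    logits = torch.tensor([[0.5, 0.9, 0.9, 0.1]])
    probs = torch.softmax(logits.float(), dim=-1)  # fp32 pre-softmax
    _, idx_a = deterministic_topk(probs, k=2)
    _, idx_b = deterministic_topk(probs, k=2)
    assert torch.equal(idx_a, idx_b)
    assert idx_a.tolist() == [[1, 2]]  # experts 1 and 2 tie -> lower index first
```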
Out of scope

MODEL_ARGS_DISABLE_MOE_PERMUTE_FUSION=1)

Test plan
~0.03 to 0.05 band)

🤖 Generated with Claude Code