
[feat] Init true on policy with qwen_moe #30

Draft
maocheng23 wants to merge 4 commits into feat/true_on_policy_qwen_dense from feat/true_on_policy_qwen_moe

Conversation


maocheng23 commented May 1, 2026

Summary

Adds the Qwen3-MoE true-on-policy contract on top of the dense stack. Stacked on feat/true_on_policy_qwen_dense — diff in this PR is only the MoE delta.

This is one of three tightly-coupled MoE PRs that must land together — they share a single contract identifier qwen3_moe_true_on_policy_v1 defined by a vendored schema in each repo.

Companion PRs (MoE, must land in lockstep):

Stacked on dense (must land first):

Target

Bit-identical (exact-zero) logprob parity between the SGLang rollout engine and the Megatron trainer for every scored response token at TP=1/EP=4/CP=2 for Qwen3-30B-A3B (MoE), with fully differentiable backward.

Validated on H200 x8: train_rollout_logprob_abs_diff = 0.0 for 3 steps; gradient audit shows 435 parameters with non-zero gradients across attention, embedding, layernorm, MoE experts, MoE routers, and output layer.

Design

The MoE contract reuses the dense three-layer architecture (Miles -> SGLang -> Megatron), adding a new qwen3_moe_true_on_policy_v1 contract object. The Megatron training forward reproduces SGLang's MoE numerics in a differentiable surface; the no-grad inference path (reference-model logprobs and on-policy logprob recompute) delegates to SGLang's fused-experts kernel for arithmetic identity with rollout.

Backward stays differentiable: SGLangGroupedMLP inherits from Megatron's GroupedMLP and only swaps to the SGLang fused path when torch.is_grad_enabled() returns False.
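
A minimal sketch of that dispatch, assuming the class described in moe_experts.py below (the forward signature and import path shown here are illustrative, not the PR's exact code):

```python
import torch

# Import path is indicative; GroupedMLP lives under megatron/core/transformer/moe/.
from megatron.core.transformer.moe.experts import GroupedMLP


class SGLangGroupedMLP(GroupedMLP):
    """Swap to the SGLang fused-experts kernel only when gradients are off.

    No-grad passes (reference-model logprobs, on-policy logprob recompute)
    reproduce SGLang's MoE arithmetic for exact parity with rollout; any
    grad-enabled pass falls through to the stock Megatron GroupedMLP so the
    backward stays fully differentiable.
    """

    def forward(self, permuted_hidden_states, tokens_per_expert):
        if not torch.is_grad_enabled():
            # Inference-only path: bit-identical to the rollout engine.
            return self.forward_sglang_local_masked(
                permuted_hidden_states, tokens_per_expert
            )
        # Training path: unchanged Megatron numerics, autograd intact.
        return super().forward(permuted_hidden_states, tokens_per_expert)
```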

In this PR (Megatron)

  • megatron/core/true_on_policy/:
    • schema.py — adds QWEN3_MOE_TRUE_ON_POLICY_V1_SCHEMA and qwen3_moe_sglang_math kernel contract (byte-identical with SGLang and Miles copies)
    • contracts.py — adds the QWEN3_MOE_TRUE_ON_POLICY_V1 registry entry; policy_for(config) derives ep_invariant_moe from expert_model_parallel_size > 1 and emits MoE-only fields when model_family == 'qwen3_moe' (see the policy_for sketch after this list)
    • moe.py, moe_context.py, moe_reduce.py — typed deterministic-MoE helpers (router pre-softmax dtype, deterministic top-k, EP all-reduce contract for local-masked combine)
    • moe_experts.py — SGLangGroupedMLP: a GroupedMLP subclass whose no-grad forward uses SGLang fused experts (forward_sglang_local_masked, _forward_sglang_fused_by_source); the grad-enabled path stays on stock Megatron
    • provider.py, sglang_backend.py, norm.py — SGLangSpecProvider wires the MoE layer classes for the qwen3_moe family
  • Forward-path leak collapse — no if self.config.use_sglang: branches in MoE forward paths; everything reads typed runtime policy:
    • transformer/moe/moe_layer.py — router/dispatch/combine paths gate on runtime policy; reference-pass routes to SGLang local-masked when policy says so
    • transformer/moe/router.py — fp32 pre-softmax + deterministic top-k tie-break under contract (see the router sketch after this list)
    • transformer/moe/token_dispatcher.py — deterministic permute (no fusion) under contract; EP all-reduce path matches SGLang
    • transformer/moe/moe_utils.py — typed expert-routing helpers
    • transformer/moe/experts.py — minimal hook so SGLangGroupedMLP can intercept inference forward
  • models/gpt/gpt_layer_specs.py — splits dense vs MoE sharded_state_dict_keys_map (the local-spec layernorm checkpoint mapping fix that unblocked MoE weight refresh)
  • optimizer/cpu_offloading/hybrid_optimizer.py — HybridDeviceOptimizer stale-CPU-copy fix on the weight-refresh path
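
For the contracts.py entry above, a hedged sketch of how policy_for could derive the MoE-only fields (the dataclass fields other than ep_invariant_moe and the _dense_policy_for helper are illustrative assumptions, not the PR's actual API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimePolicy:
    # Only ep_invariant_moe is named in the PR; the other fields are placeholders.
    contract_id: str
    is_moe: bool
    ep_invariant_moe: Optional[bool] = None


def policy_for(config) -> RuntimePolicy:
    is_moe = getattr(config, "model_family", None) == "qwen3_moe"
    if not is_moe:
        # Dense families keep the behaviour from the dense stack (hypothetical helper).
        return _dense_policy_for(config)
    return RuntimePolicy(
        contract_id="qwen3_moe_true_on_policy_v1",
        is_moe=True,
        # MoE-only field: derived from expert parallelism, not from a user flag.
        ep_invariant_moe=config.expert_model_parallel_size > 1,
    )
```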
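
And for the router.py bullet, a sketch of the fp32 pre-softmax plus deterministic top-k tie-break (a standalone function for illustration; the real change lives inside Megatron's router class):

```python
import torch


def deterministic_topk(logits: torch.Tensor, k: int):
    """Route with fp32 pre-softmax and index-stable tie-breaking.

    A stable descending sort keeps equal probabilities in their original
    order, so exact ties always resolve to the lower expert index and the
    trainer picks the same experts as the rollout engine.
    """
    probs = torch.softmax(logits.float(), dim=-1)  # cast to fp32 before softmax, per contract
    order = torch.argsort(probs, dim=-1, descending=True, stable=True)
    expert_idx = order[..., :k]
    topk_probs = torch.gather(probs, -1, expert_idx)
    return topk_probs, expert_idx
```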

Validation

Remote H200 x8, container miles-maocheng, Qwen3-30B-A3B, Megatron TP=1/EP=4/CP=2/PP=1, two SGLang engines at tp=4 / ep=4:

| Mode | Step | rollout_logp | train_logp | abs_diff | grad_norm |
|---|---|---|---|---|---|
| Full deterministic | 0 | -0.2502 | -0.2502 | 0.0 | 0.0342 |
| Full deterministic | 1 | -0.2335 | -0.2335 | 0.0 | 0.0465 |
| Fast decode (no fusion) | 0 | -0.2452 | -0.2452 | 0.0 | 0.0391 |
| Fast decode (no fusion) | 1 | -0.2350 | -0.2350 | 0.0 | 0.0318 |
| Fast decode (no fusion) | 2 | -0.2467 | -0.2467 | 0.0 | 0.0459 |

Gradient audit (step 0): 435 parameters total, 435 with grad, 435 nonzero — no missing or zero-grad entries across attention, embedding, layernorm, MoE experts, MoE routers, output layer.
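
The audit itself is straightforward to reproduce; a minimal sketch (not the PR's actual audit script) that yields counts in the form quoted above:

```python
import torch


def audit_gradients(model: torch.nn.Module) -> None:
    """Count trainable parameters with a grad and with a non-zero grad after backward."""
    total = with_grad = nonzero = 0
    missing = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        total += 1
        if param.grad is None:
            missing.append(name)
            continue
        with_grad += 1
        if param.grad.abs().max().item() > 0:
            nonzero += 1
    print(f"{total} parameters total, {with_grad} with grad, {nonzero} nonzero")
    if missing:
        print("missing grads:", missing)
```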

Source: recovery/qwen3_moe_clean/journal/2026-04-30-moe-onpolicy-normal-validation.md.

CPU/unit tests:

  • tests/unit_tests/extension/test_sglang_extension.py — extended with MoE coverage (router pre-softmax dtype, top-k tie-break, dispatch/combine determinism, GroupedMLP no-grad SGLang path, EP local-masked path, runtime-policy is_moe derivation)

Out of scope

  • Qwen3-Next MoE contract — additive after this stack lands
  • Permute-fusion compatibility — contract currently requires permute fusion disabled (Miles enforces via MODEL_ARGS_DISABLE_MOE_PERMUTE_FUSION=1)
  • DeepEP / sequence-parallel TP+EP combos beyond the validated TP=1/EP=4/CP=2 layout

Test plan

  • CPU unit tests pass in CI
  • GPU exact-zero E2E gate at TP=1/EP=4/CP=2 (full deterministic) — done locally, needs CI replay
  • GPU exact-zero E2E gate at TP=1/EP=4/CP=2 (fast decode, no fusion) — done locally, needs CI replay
  • grad_norm parity check vs off-policy baseline (~0.03 to 0.05 band)
  • 100-step on/off-policy comparison run

🤖 Generated with Claude Code

maocheng23 force-pushed the feat/true_on_policy_qwen_moe branch from c63c77f to 9c390a1 on May 5, 2026 01:51