
[feat] Init true on policy with qwen_moe #30

Draft
maocheng23 wants to merge 4 commits into feat/true_on_policy_qwen_dense from feat/true_on_policy_qwen_moe

Conversation


maocheng23 commented May 1, 2026

Summary

Adds the Qwen3-MoE true-on-policy contract on top of the dense stack. Stacked on feat/true_on_policy_qwen_dense — diff in this PR is only the MoE delta.

This is one of three tightly-coupled MoE PRs that must land together — they share a single contract identifier qwen3_moe_true_on_policy_v1 defined by a vendored schema in each repo.

Companion PRs (MoE, must land in lockstep):

Stacked on dense (must land first):

Target

Bit-identical (exact-zero) logprob parity between the SGLang rollout engine and the Megatron trainer for every scored response token at TP=1/EP=4/CP=2 for Qwen3-30B-A3B (MoE), with fully differentiable backward.

Validated on H200 x8: train_rollout_logprob_abs_diff = 0.0 for 3 steps; gradient audit shows 435 parameters with non-zero gradients across attention, embedding, layernorm, MoE experts, MoE routers, and output layer.

Design

The MoE contract reuses the dense three-layer architecture (Miles -> SGLang -> Megatron), adding a new qwen3_moe_true_on_policy_v1 contract object. The Megatron training forward reproduces SGLang's MoE numerics in a differentiable surface; the no-grad inference path (reference-model logprobs and on-policy logprob recompute) delegates to SGLang's fused-experts kernel for arithmetic identity with rollout.

Backward stays differentiable: SGLangGroupedMLP inherits from Megatron's GroupedMLP and only swaps to the SGLang fused path when torch.is_grad_enabled() returns False.
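
A minimal sketch of that dispatch, assuming the class described in moe_experts.py below (the forward signature and import path shown here are illustrative, not the PR's exact code):

```python
import torch

# Import path is indicative; GroupedMLP lives under megatron/core/transformer/moe/.
from megatron.core.transformer.moe.experts import GroupedMLP


class SGLangGroupedMLP(GroupedMLP):
    """Swap to the SGLang fused-experts kernel only when gradients are off.

    No-grad passes (reference-model logprobs, on-policy logprob recompute)
    reproduce SGLang's MoE arithmetic for exact parity with rollout; any
    grad-enabled pass falls through to the stock Megatron GroupedMLP so the
    backward stays fully differentiable.
    """

    def forward(self, permuted_hidden_states, tokens_per_expert):
        if not torch.is_grad_enabled():
            # Inference-only path: bit-identical to the rollout engine.
            return self.forward_sglang_local_masked(
                permuted_hidden_states, tokens_per_expert
            )
        # Training path: unchanged Megatron numerics, autograd intact.
        return super().forward(permuted_hidden_states, tokens_per_expert)
```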

In this PR (Megatron)

  • megatron/core/true_on_policy/:
    • schema.py — adds QWEN3_MOE_TRUE_ON_POLICY_V1_SCHEMA and qwen3_moe_sglang_math kernel contract (byte-identical with SGLang and Miles copies)
    • contracts.py — adds the QWEN3_MOE_TRUE_ON_POLICY_V1 registry entry; policy_for(config) derives ep_invariant_moe from expert_model_parallel_size > 1 and emits MoE-only fields when model_family == 'qwen3_moe' (see the policy_for sketch after this list)
    • moe.py, moe_context.py, moe_reduce.py — typed deterministic-MoE helpers (router pre-softmax dtype, deterministic top-k, EP all-reduce contract for local-masked combine)
    • moe_experts.py — SGLangGroupedMLP: a GroupedMLP subclass whose no-grad forward uses SGLang fused experts (forward_sglang_local_masked, _forward_sglang_fused_by_source); the grad-enabled path stays on stock Megatron
    • provider.py, sglang_backend.py, norm.py — SGLangSpecProvider wires the MoE layer classes for the qwen3_moe family
  • Forward-path leak collapse — no if self.config.use_sglang: branches in MoE forward paths; everything reads typed runtime policy:
    • transformer/moe/moe_layer.py — router/dispatch/combine paths gate on runtime policy; reference-pass routes to SGLang local-masked when policy says so
    • transformer/moe/router.py — fp32 pre-softmax + deterministic top-k tie-break under contract (see the router sketch after this list)
    • transformer/moe/token_dispatcher.py — deterministic permute (no fusion) under contract; EP all-reduce path matches SGLang
    • transformer/moe/moe_utils.py — typed expert-routing helpers
    • transformer/moe/experts.py — minimal hook so SGLangGroupedMLP can intercept inference forward
  • models/gpt/gpt_layer_specs.py — splits dense vs MoE sharded_state_dict_keys_map (the local-spec layernorm checkpoint mapping fix that unblocked MoE weight refresh)
  • optimizer/cpu_offloading/hybrid_optimizer.py — HybridDeviceOptimizer stale-CPU-copy fix on the weight-refresh path
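
For the contracts.py entry above, a hedged sketch of how policy_for could derive the MoE-only fields (the dataclass fields other than ep_invariant_moe and the _dense_policy_for helper are illustrative assumptions, not the PR's actual API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimePolicy:
    # Only ep_invariant_moe is named in the PR; the other fields are placeholders.
    contract_id: str
    is_moe: bool
    ep_invariant_moe: Optional[bool] = None


def policy_for(config) -> RuntimePolicy:
    is_moe = getattr(config, "model_family", None) == "qwen3_moe"
    if not is_moe:
        # Dense families keep the behaviour from the dense stack (hypothetical helper).
        return _dense_policy_for(config)
    return RuntimePolicy(
        contract_id="qwen3_moe_true_on_policy_v1",
        is_moe=True,
        # MoE-only field: derived from expert parallelism, not from a user flag.
        ep_invariant_moe=config.expert_model_parallel_size > 1,
    )
```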
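
And for the router.py bullet, a sketch of the fp32 pre-softmax plus deterministic top-k tie-break (a standalone function for illustration; the real change lives inside Megatron's router class):

```python
import torch


def deterministic_topk(logits: torch.Tensor, k: int):
    """Route with fp32 pre-softmax and index-stable tie-breaking.

    A stable descending sort keeps equal probabilities in their original
    order, so exact ties always resolve to the lower expert index and the
    trainer picks the same experts as the rollout engine.
    """
    probs = torch.softmax(logits.float(), dim=-1)  # cast to fp32 before softmax, per contract
    order = torch.argsort(probs, dim=-1, descending=True, stable=True)
    expert_idx = order[..., :k]
    topk_probs = torch.gather(probs, -1, expert_idx)
    return topk_probs, expert_idx
```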

Validation

Remote H200 x8, container miles-maocheng, Qwen3-30B-A3B, Megatron TP=1/EP=4/CP=2/PP=1, two SGLang engines at tp=4 / ep=4:

| Mode | Step | rollout_logp | train_logp | abs_diff | grad_norm |
|---|---|---|---|---|---|
| Full deterministic | 0 | -0.2502 | -0.2502 | 0.0 | 0.0342 |
| Full deterministic | 1 | -0.2335 | -0.2335 | 0.0 | 0.0465 |
| Fast decode (no fusion) | 0 | -0.2452 | -0.2452 | 0.0 | 0.0391 |
| Fast decode (no fusion) | 1 | -0.2350 | -0.2350 | 0.0 | 0.0318 |
| Fast decode (no fusion) | 2 | -0.2467 | -0.2467 | 0.0 | 0.0459 |

Gradient audit (step 0): 435 parameters total, 435 with grad, 435 nonzero — no missing or zero-grad entries across attention, embedding, layernorm, MoE experts, MoE routers, output layer.
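
The audit itself is straightforward to reproduce; a minimal sketch (not the PR's actual audit script) that yields counts in the form quoted above:

```python
import torch


def audit_gradients(model: torch.nn.Module) -> None:
    """Count trainable parameters with a grad and with a non-zero grad after backward."""
    total = with_grad = nonzero = 0
    missing = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        total += 1
        if param.grad is None:
            missing.append(name)
            continue
        with_grad += 1
        if param.grad.abs().max().item() > 0:
            nonzero += 1
    print(f"{total} parameters total, {with_grad} with grad, {nonzero} nonzero")
    if missing:
        print("missing grads:", missing)
```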

Source: recovery/qwen3_moe_clean/journal/2026-04-30-moe-onpolicy-normal-validation.md.

CPU/unit tests:

  • tests/unit_tests/extension/test_sglang_extension.py — extended with MoE coverage (router pre-softmax dtype, top-k tie-break, dispatch/combine determinism, GroupedMLP no-grad SGLang path, EP local-masked path, runtime-policy is_moe derivation)

Out of scope

  • Qwen3-Next MoE contract — additive after this stack lands
  • Permute-fusion compatibility — contract currently requires permute fusion disabled (Miles enforces via MODEL_ARGS_DISABLE_MOE_PERMUTE_FUSION=1)
  • DeepEP / sequence-parallel TP+EP combos beyond the validated TP=1/EP=4/CP=2 layout

Test plan

  • CPU unit tests pass in CI
  • GPU exact-zero E2E gate at TP=1/EP=4/CP=2 (full deterministic) — done locally, needs CI replay
  • GPU exact-zero E2E gate at TP=1/EP=4/CP=2 (fast decode, no fusion) — done locally, needs CI replay
  • grad_norm parity check vs off-policy baseline (~0.03 to 0.05 band)
  • 100-step on/off-policy comparison run

🤖 Generated with Claude Code

maocheng23 force-pushed the feat/true_on_policy_qwen_moe branch from c63c77f to 9c390a1 on May 5, 2026 01:51