fix: skip w13 swap in FlashInfer CUTLASS path for non-gated MoE #17
Open
djmmoss wants to merge 1 commit into TomerBN-Nvidia:ultra-rl-v0.17
Conversation
`convert_to_unquantized_kernel_format()` unconditionally calls `swap_w13_to_w31(layer.w13_weight)` when the FLASHINFER_CUTLASS backend is selected. That helper assumes the second-to-last dim of `w13` is `[w1; w3]` and flips the two halves. For non-gated MoE (`is_act_and_mul=False`, e.g. NemotronH), `w13_weight` has shape `[num_experts, intermediate_size, hidden_size]` — there is only one logical weight and no halves to swap. Reshaping with `// 2` and flipping silently scrambles the weights, producing nonsense output (the model echoes the prompt or hits the length limit without producing real content).

Symptom on NemotronH ultra_v3: at `temperature=0`, the model's reasoning becomes pure prompt repetition; `finish_reason` is always `length` and `content` is empty.

Fix: gate the swap on `layer.moe_config.is_act_and_mul`. For non-gated layouts, leave the weight untouched.

Verified: with the fix, CUTLASS produces correct, coherent reasoning that matches Triton on math/algebra/sequence/deduction prompts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
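A minimal, self-contained sketch of the gating (the toy swap helper and the free-function form are illustrative only; the actual change checks `layer.moe_config.is_act_and_mul` inside `convert_to_unquantized_kernel_format()`):

```python
import torch

def swap_w13_to_w31(w13: torch.Tensor) -> torch.Tensor:
    # Toy stand-in for the real helper: treat the second-to-last dim as
    # [w1; w3] and flip the halves so downstream kernels see [w3; w1].
    w1, w3 = w13.chunk(2, dim=-2)
    return torch.cat([w3, w1], dim=-2)

def prepare_w13_for_flashinfer_cutlass(w13: torch.Tensor, is_act_and_mul: bool) -> torch.Tensor:
    # Hypothetical free-function form of the fix: only gated MoE layouts
    # actually stack [w1; w3], so only they get the swap.
    if is_act_and_mul:
        return swap_w13_to_w31(w13)
    # Non-gated MoE: w13 is a single projection of shape
    # [num_experts, intermediate_size, hidden_size]; leave it untouched.
    return w13

# Gated layout: [num_experts, 2 * intermediate_size, hidden_size] -> halves swapped.
gated = torch.randn(4, 2 * 16, 32)
assert torch.equal(prepare_w13_for_flashinfer_cutlass(gated, True)[:, :16], gated[:, 16:])

# Non-gated layout: returned unchanged.
plain = torch.randn(4, 16, 32)
assert torch.equal(prepare_w13_for_flashinfer_cutlass(plain, False), plain)
```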
Summary
`convert_to_unquantized_kernel_format()` in `vllm/model_executor/layers/fused_moe/oracle/unquantized.py` unconditionally calls `swap_w13_to_w31(layer.w13_weight)` when the FLASHINFER_CUTLASS backend is selected. That helper assumes the second-to-last dim of `w13` is `[w1; w3]` (the gate/up projections in a SwiGLU-style gated MoE) and flips the two halves so the kernel sees `[w3; w1]`.

For non-gated MoE (`is_act_and_mul=False`), `w13_weight` has shape `[num_experts, intermediate_size, hidden_size]` — there is only one logical weight, no halves to swap. `swap_w13_to_w31` reshapes with `// 2` and flips, silently splitting the single weight matrix down the middle and swapping the halves, which scrambles the weights.
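To make the failure mode concrete, here is a small torch-only illustration with toy shapes; it reimplements the reshape-and-flip from the description above rather than calling the real helper:

```python
import torch

# Toy non-gated w13_weight: [num_experts, intermediate_size, hidden_size].
num_experts, intermediate_size, hidden_size = 1, 4, 3
w13 = torch.arange(intermediate_size * hidden_size, dtype=torch.float32).reshape(
    num_experts, intermediate_size, hidden_size)

# The unconditional swap, per the description: view the second-to-last dim
# as two stacked "halves" (// 2) and flip them.
halves = w13.reshape(num_experts, 2, intermediate_size // 2, hidden_size)
swapped = halves.flip(dims=(1,)).reshape(num_experts, intermediate_size, hidden_size)

print(w13[0])      # rows in order:  [0 1 2], [3 4 5], [6 7 8], [9 10 11]
print(swapped[0])  # rows reordered: [6 7 8], [9 10 11], [0 1 2], [3 4 5]

# For a gated layout that reorder is exactly the intended [w1; w3] -> [w3; w1];
# for a single non-gated projection it silently permutes the rows of the only
# weight matrix, so they no longer line up with the w2 columns that consume
# them: the corruption described above.
```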
The model still runs, but produces nonsense output: at `temperature=0`, generation degenerates into prompt repetition, `finish_reason` is always `length`, and `content` is empty.

This patch gates the swap on `layer.moe_config.is_act_and_mul`. For non-gated layouts, the weight is left untouched.

Test plan
- `VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=throughput` (CUTLASS path) — 0/10 prompts produced coherent output before the fix; with the fix, CUTLASS output matches Triton on math/algebra/sequence/deduction prompts at `temperature=0`. Numerical drift causes minor wording differences, but final answers match.

🤖 Generated with Claude Code
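For reference, a rough offline-inference sketch of how such a check could be reproduced; the model path is a placeholder and the environment variables are the ones from the test plan (they need to be set before vLLM initializes):

```python
import os

# CUTLASS path, as in the test plan above (set before importing/initializing vLLM).
os.environ["VLLM_USE_FLASHINFER_MOE_FP16"] = "1"
os.environ["VLLM_FLASHINFER_MOE_BACKEND"] = "throughput"

from vllm import LLM, SamplingParams

# Placeholder model path; substitute the NemotronH ultra_v3 checkpoint used in the PR.
llm = LLM(model="path/to/nemotronh-ultra_v3")
params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(["What is 17 * 23? Show your reasoning."], params)
for out in outputs:
    print(out.outputs[0].text)           # should be coherent reasoning, not a prompt echo
    print(out.outputs[0].finish_reason)  # should not always be 'length'
```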