[NVbug 6142360] Share fused gate_up amax fallback to keep weight_scale_2 consistent (#1411)

meenchen · web-flow · commit 570920be5c21 · 2026-05-07T14:55:02.000-07:00
## Summary

- Pre-fill the source `gate_up_proj_weight_quantizers[idx]._amax` from the **fused** `gate_up[idx]` tensor when it is uncalibrated, **before** the per-projection deepcopy in `_export_fused_experts`. The gate clone and the up clone then inherit the same scalar amax, so `weight_scale_2 = amax / (6 · 448)` matches across the W1/W3 fusion that vLLM expects at load time.
- Add a unit regression test that builds a fused-experts module with intentionally mismatched gate vs. up weight magnitudes, leaves every expert uncalibrated, and asserts that the gate and up wrappers carry the same amax into the FP4 quantization step. Fails on `main`, passes with this fix.

## Why

NVbug 6142360: Qwen3.5-MoE / Qwen3-Next NVFP4 checkpoints produced by ModelOpt yielded garbled output under vLLM (`1\n1\n1\n…` style degenerate token loops). The vLLM log showed `w1_weight_scale_2 must match w3_weight_scale_2. Accuracy may be affected.`.

Root cause: in `_export_fused_experts`, when an expert receives no calibration tokens (common for low-frequency experts even with `--calib_size 512` on 128-expert MoEs), the per-projection fallback computed `amax` independently from each split slice:

```python
w_quantizer.amax = weight_slice.abs().amax().to(torch.float32)
```

`weight_slice` is the gate-only or up-only 2-D half of the fused `gate_up_proj`. Because gate and up generally have different magnitudes, the two halves end up with different scalar amax values and therefore different `weight_scale_2`. vLLM fuses W1 and W3 into a single weight at load time and expects a single shared scale; the half whose scale was discarded is then off by the gate/up magnitude ratio (~10× was typical), which catastrophically corrupts the MoE output.

The fix derives the fallback amax once from the whole `gate_up[idx]` tensor before the deepcopies, so gate and up share the same amax, which is exactly what calibration would have produced if any token had hit the expert.

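To make the scale skew concrete, here is a minimal sketch of the old and new fallbacks. It is illustrative only: plain Python lists stand in for one expert's fused `gate_up_proj` tensor, the values are made up, and the helper names `amax` and `weight_scale_2` are hypothetical rather than ModelOpt APIs. The only thing taken from this PR is the formula `weight_scale_2 = amax / (6 · 448)`.

```python
# Illustrative sketch only: plain Python lists stand in for one expert's
# fused gate_up_proj tensor. Values and helper names are made up.
FP4_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_MAX = 448.0  # largest magnitude representable in FP8 (E4M3)


def amax(rows):
    """Largest absolute value in a 2-D list (like tensor.abs().amax())."""
    return max(abs(v) for row in rows for v in row)


def weight_scale_2(amax_value):
    """Per-tensor scale as stated in the summary: amax / (6 * 448)."""
    return amax_value / (FP4_MAX * FP8_MAX)


# Gate rows are roughly 10x smaller than up rows, as in the bug report.
gate = [[0.01, -0.02], [0.015, -0.005]]
up = [[0.10, -0.20], [0.15, -0.05]]
fused = gate + up  # rows concatenated: gate half first, then up half

# Old fallback: each half computes its own amax, so the scales diverge.
old_gate_scale = weight_scale_2(amax(gate))
old_up_scale = weight_scale_2(amax(up))

# New fallback: one amax from the fused tensor, inherited by both halves.
shared_scale = weight_scale_2(amax(fused))
new_gate_scale = new_up_scale = shared_scale

print(f"old: gate={old_gate_scale:.3e} up={old_up_scale:.3e}")
print(f"new: gate={new_gate_scale:.3e} up={new_up_scale:.3e}")
```

With the old fallback the two scales differ by the gate/up magnitude ratio; with the fused amax they are identical by construction, at the cost of a coarser scale for whichever half has the smaller magnitude.
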
Calibrated experts are unaffected (the new code path is gated on `_amax` being missing or zero). `down_proj` keeps its existing per-projection fallback because it has its own quantizer with no fusion partner.

## End-to-end vLLM verification

Used `vllm/vllm-openai:nightly` Docker on RTX 6000 Ada, Marlin NVFP4 backend, with a real Qwen3-30B-A3B-Instruct NVFP4 checkpoint. The bug requires the export-time scale skew, so I A/B-tested by mutating `gate_proj.weight_scale_2` directly:

| Variant | gate vs up `weight_scale_2` | Output for "Write an article about AI." |
|---|---|---|
| Baseline (original checkpoint) | matched (calibrated) | coherent: "The Rise of Artificial Intelligence: Transforming the World One Algorithm at a Time…" |
| Bug-simulated (gate = up / 10 for all 6143 expert pairs) | mismatched ~10× | `__':\n__':\n__':\n…` (degenerate loop, same shape as user's `1\n1\n1\n…`) |
| Fix-simulated (gate restored to up) | matched | coherent: same article as baseline |

The 10× skew matches what the bug actually emits for uncalibrated experts (e.g. `gate=2.77e-5, up=2.30e-4` from a synthetic repro), and reproduces the same garbled-loop pattern the user reported.

## Test plan

- [x] New unit test `tests/unit/torch/quantization/plugins/test_fused_experts.py::TestExportFusedExperts::test_uncalibrated_expert_gate_up_share_amax` passes.
- [x] Full `tests/unit/torch/quantization/plugins/test_fused_experts.py`: 28/28 pass.
- [x] Broader `tests/unit/torch/export/` + `tests/unit/torch/quantization/`: 609 passed, 9 skipped.
- [x] vLLM end-to-end A/B (table above): bug reproduces with mismatched scales, fix produces coherent output identical to baseline.

## Summary by CodeRabbit

* **Bug Fixes**
  * Enhanced handling of uncalibrated mixture-of-experts weight quantizers during export with improved consistency checks and informative warnings.
* **Tests**
  * Added regression test for uncalibrated expert quantization behavior.

Signed-off-by: weimingc <weimingc@nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>

diff --git a/modelopt/torch/export/moe_utils.py b/modelopt/torch/export/moe_utils.py
@@ -62,6 +62,29 @@ def _export_fused_experts(module: nn.Module, dtype: torch.dtype) -> None:
     for idx in range(n):
         expert = nn.Module()
 
+        # If the gate_up source quantizer was never calibrated (rare expert
+        # that received no calibration tokens), derive its amax once from the
+        # FUSED tensor so gate and up share the same weight_scale_2 below.
+        # Why: vLLM fuses W1 (gate) and W3 (up) at load time and asserts a
+        # single per-tensor scale across the fusion. The per-projection
+        # fallback further down would otherwise compute amax independently from
+        # each half — gate's max and up's max generally differ — producing
+        # mismatched weight_scale_2 and garbled MoE output at inference.
+        gate_up_q = module.gate_up_proj_weight_quantizers[idx]
+        if getattr(gate_up_q, "is_enabled", False) and (
+            not hasattr(gate_up_q, "_amax")
+            or gate_up_q._amax is None
+            or torch.all(gate_up_q._amax == 0)
+        ):
+            gate_up_q.amax = gate_up[idx].abs().amax().to(torch.float32)
+            warnings.warn(
+                f"Expert {idx} gate_up_proj weight quantizer was not calibrated "
+                f"(amax missing or zero). Using fused-tensor amax as fallback "
+                f"(shared by gate and up so weight_scale_2 stays consistent). "
+                f"Consider increasing calibration size to activate all experts.",
+                stacklevel=2,
+            )
+
         projections = [
             ("gate_proj", gate_up[idx, :expert_dim, :], 0, fused_dim0, True),
             ("up_proj", gate_up[idx, expert_dim:, :], expert_dim, fused_dim0, True),
diff --git a/tests/unit/torch/quantization/plugins/test_fused_experts.py b/tests/unit/torch/quantization/plugins/test_fused_experts.py
@@ -300,6 +300,94 @@ def test_export_creates_per_expert_submodules(self):
         if QuantModuleRegistry.get(expert_type) is not None:
             QuantModuleRegistry.unregister(expert_type)
 
+    def test_uncalibrated_expert_gate_up_share_amax(self, monkeypatch):
+        """gate_proj and up_proj must share weight_scale_2 even when an expert
+        was never routed during calibration.
+
+        Regression for the bug where ``_export_fused_experts``'s per-projection
+        fallback computed amax independently from the gate and up halves of the
+        fused tensor — producing mismatched ``weight_scale_2`` values for any
+        uncalibrated expert. vLLM fuses W1 (gate) and W3 (up) at load time and
+        asserts a single shared scale; mismatched scales corrupted MoE output.
+        The fix derives the fallback amax once from the fused ``gate_up[idx]``
+        tensor before the deepcopies, so gate's clone and up's clone start with
+        the same amax.
+        """
+        from modelopt.torch.export.moe_utils import _export_fused_experts
+
+        # Build experts where gate and up have very different magnitudes —
+        # any per-half fallback would clearly produce different amaxes.
+        experts = _SyntheticFusedExperts()
+        gate = torch.randn(NUM_EXPERTS, INTERMEDIATE_DIM, HIDDEN_DIM) * 0.02
+        up = torch.randn(NUM_EXPERTS, INTERMEDIATE_DIM, HIDDEN_DIM) * 0.20
+        with torch.no_grad():
+            experts.gate_up_proj.copy_(torch.cat([gate, up], dim=1))
+
+        expert_type = type(experts)
+        if QuantModuleRegistry.get(expert_type) is None:
+            QuantModuleRegistry.register({expert_type: "test.SyntheticFusedExperts"})(
+                _QuantFusedExperts
+            )
+        try:
+            converted = QuantModuleRegistry.convert(experts)
+
+            # Leave every expert weight quantizer uncalibrated (no _amax).
+            # Mark them enabled to exercise the export-time fallback path.
+            for q in converted.gate_up_proj_weight_quantizers:
+                q._disabled = False
+            for q in converted.down_proj_weight_quantizers:
+                q._disabled = False
+
+            # Capture the amax each per-projection wrapper carries into the
+            # FP4 quantization step. Patching here avoids needing CUDA / FP4.
+            seen = {}  # (expert_idx, proj_name) -> amax tensor
+
+            def _spy_export(wrapper, dtype):
+                # Identify which expert/projection this wrapper belongs to by
+                # matching the weight tensor against the fused parameters.
+                w = wrapper.weight.data
+                # gate_up_proj is (N, 2*INTER, HIDDEN); split halves are
+                # contiguous .data views or .contiguous() copies — we can match
+                # by shape and value identity for this synthetic case.
+                amax = wrapper.weight_quantizer._amax.detach().clone()
+                # Identify by matching against gate vs. up slices of each expert.
+                for idx in range(NUM_EXPERTS):
+                    g_slice = converted.gate_up_proj.data[idx, :INTERMEDIATE_DIM, :]
+                    u_slice = converted.gate_up_proj.data[idx, INTERMEDIATE_DIM:, :]
+                    d_slice = converted.down_proj.data[idx]
+                    if w.shape == g_slice.shape and torch.equal(w, g_slice):
+                        seen[(idx, "gate_proj")] = amax
+                        return
+                    if w.shape == u_slice.shape and torch.equal(w, u_slice):
+                        seen[(idx, "up_proj")] = amax
+                        return
+                    if w.shape == d_slice.shape and torch.equal(w, d_slice):
+                        seen[(idx, "down_proj")] = amax
+                        return
+
+            monkeypatch.setattr(
+                "modelopt.torch.export.unified_export_hf._export_quantized_weight",
+                _spy_export,
+            )
+
+            _export_fused_experts(converted, torch.float16)
+
+            # Assert: for every expert, gate's amax matches up's amax.
+            for idx in range(NUM_EXPERTS):
+                g_amax = seen.get((idx, "gate_proj"))
+                u_amax = seen.get((idx, "up_proj"))
+                assert g_amax is not None and u_amax is not None, (
+                    f"Expert {idx}: missing recorded amax (gate={g_amax}, up={u_amax})"
+                )
+                assert torch.allclose(g_amax, u_amax), (
+                    f"Expert {idx}: gate amax {g_amax.item()} != up amax {u_amax.item()}. "
+                    f"Uncalibrated fused experts must share gate/up amax so that "
+                    f"weight_scale_2 stays consistent across the fusion."
+                )
+        finally:
+            if QuantModuleRegistry.get(expert_type) is not None:
+                QuantModuleRegistry.unregister(expert_type)
+
 
 # ---------------------------------------------------------------------------
 # Tests for force_eager_experts_impl_on_the_fly