Commit 570920b
[NVbug 6142360] Share fused gate_up amax fallback to keep weight_scale_2 consistent (#1411)
## Summary
- Pre-fill the source `gate_up_proj_weight_quantizers[idx]._amax` from
the **fused** `gate_up[idx]` tensor when it is uncalibrated, **before**
the per-projection deepcopy in `_export_fused_experts`. The gate clone
and the up clone then inherit the same scalar amax, so
`weight_scale_2 = amax / (6 · 448)` matches across the W1/W3 fusion that
vLLM expects at load time.
- Add a unit regression test that builds a fused-experts module with
intentionally mismatched gate vs. up weight magnitudes, leaves every
expert uncalibrated, and asserts the gate and up wrappers carry the same
amax into the FP4 quantization step. Fails on `main`, passes with this
fix.
## Why
NVbug 6142360: Qwen3.5-MoE / Qwen3-Next NVFP4 checkpoints produced by
ModelOpt yielded garbled output under vLLM (`1\n1\n1\n…` style
degenerate token loops). The vLLM log showed
`w1_weight_scale_2 must match w3_weight_scale_2. Accuracy may be affected.`
Root cause: in `_export_fused_experts`, when an expert receives no
calibration tokens (common for low-frequency experts even with
`--calib_size 512` on 128-expert MoEs), the per-projection fallback
computed `amax` independently from each split slice:
```python
w_quantizer.amax = weight_slice.abs().amax().to(torch.float32)
```
`weight_slice` is the gate-only or up-only 2-D half of the fused
`gate_up_proj`. Because gate and up generally differ in magnitude, the two
halves end up with different scalar amax values, and therefore different
`weight_scale_2`. vLLM fuses W1 and W3 into a single weight at load time
and expects a single shared scale; the half whose scale is discarded is
then off by the gate/up magnitude ratio (~10× was typical), which
catastrophically corrupts the MoE output.
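A minimal sketch of the mismatch described above, using plain Python lists in place of tensors. The weight values, the gate-first row layout, and the helper names `abs_max` and `weight_scale_2` are assumptions made for illustration; only the `amax / (6 · 448)` formula comes from this change.

```python
# Illustrative only: plain lists stand in for the fused gate_up tensor.
FP4_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_MAX = 448.0  # largest magnitude representable in FP8 (E4M3)


def abs_max(rows):
    """Largest absolute value in a 2-D list of floats."""
    return max(abs(v) for row in rows for v in row)


def weight_scale_2(amax):
    """Global scale as given in the summary: amax / (6 * 448)."""
    return amax / (FP4_MAX * FP8_MAX)


# Hypothetical fused weight for one expert. Assumed layout: the first half
# of the rows is gate, the second half is up.
gate_up = [
    [0.02, -0.03],  # gate rows (small magnitudes)
    [0.01, 0.025],
    [0.20, -0.30],  # up rows (about 10x larger)
    [0.10, 0.25],
]
half = len(gate_up) // 2
gate, up = gate_up[:half], gate_up[half:]

# Old fallback: each slice derives its own amax, so the scales diverge.
gate_scale = weight_scale_2(abs_max(gate))
up_scale = weight_scale_2(abs_max(up))
ratio = up_scale / gate_scale
print(f"gate={gate_scale:.3e} up={up_scale:.3e} ratio={ratio:.1f}x")
```

With these made-up values the two scales differ by the same factor as the slice magnitudes, which is the skew that the load-time check complains about.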
The fix derives the fallback amax once from the whole `gate_up[idx]`
tensor before the deepcopies, so gate and up share the same amax —
exactly what calibration would have produced if any token had hit the
expert. Calibrated experts are unaffected (the new code path is gated on
`_amax` being missing or zero). `down_proj` keeps its existing
per-projection fallback because it has its own quantizer with no fusion
partner.
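The gating and ordering described above can be sketched as follows. This is not the code from `_export_fused_experts`; `FakeQuantizer`, `prefill_fused_amax`, and `split_quantizers` are hypothetical names, and plain lists stand in for tensors. The only behavior taken from the description is: fill `_amax` from the whole fused tensor when it is missing or zero, then deepcopy.

```python
import copy


class FakeQuantizer:
    """Hypothetical stand-in for a weight quantizer; it only tracks `_amax`."""

    def __init__(self, amax=None):
        self._amax = amax


def abs_max(rows):
    """Largest absolute value in a 2-D list of floats."""
    return max(abs(v) for row in rows for v in row)


def prefill_fused_amax(quantizer, fused_weight):
    """Fill `_amax` from the whole fused tensor only if it is missing or zero."""
    if quantizer._amax is None or quantizer._amax == 0:
        quantizer._amax = abs_max(fused_weight)
    return quantizer


def split_quantizers(quantizer, fused_weight):
    """Pre-fill first, then clone, so both clones inherit the same amax."""
    prefill_fused_amax(quantizer, fused_weight)
    return copy.deepcopy(quantizer), copy.deepcopy(quantizer)


gate_up = [[0.02, -0.03], [0.01, 0.025], [0.20, -0.30], [0.10, 0.25]]

# Uncalibrated expert: both clones share the fused amax.
gate_q, up_q = split_quantizers(FakeQuantizer(), gate_up)

# Calibrated expert: the existing amax is left untouched.
cal_gate_q, cal_up_q = split_quantizers(FakeQuantizer(amax=0.5), gate_up)
```

Doing the pre-fill before the copies is the point of the ordering: once the clones exist, each would otherwise fall back to its own slice.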
## End-to-end vLLM verification
Tested with the `vllm/vllm-openai:nightly` Docker image on an RTX 6000 Ada,
Marlin NVFP4 backend, with a real Qwen3-30B-A3B-Instruct NVFP4 checkpoint.
Reproducing the bug requires the export-time scale skew, so I A/B-tested by
mutating `gate_proj.weight_scale_2` directly:
| Variant | gate vs up `weight_scale_2` | Output for "Write an article about AI." |
|---|---|---|
| Baseline (original checkpoint) | matched (calibrated) | coherent: "The Rise of Artificial Intelligence: Transforming the World One Algorithm at a Time…" |
| Bug-simulated (gate = up / 10 for all 6143 expert pairs) | mismatched ~10× | `__':\n__':\n__':\n…` (degenerate loop, same shape as user's `1\n1\n1\n…`) |
| Fix-simulated (gate restored to up) | matched | coherent: same article as baseline |
The 10× skew matches what the bug actually emits for uncalibrated
experts (e.g. `gate=2.77e-5, up=2.30e-4` from a synthetic repro), and
reproduces the same garbled-loop pattern the user reported.
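As a rough arithmetic check on the quoted repro values, the sketch below computes the gate/up skew and shows how a stored value decodes when the other half's scale is applied. Which half's scale the loader keeps is an assumption here (the up scale is used), and `stored` is an arbitrary illustrative number; the ratio is the same whether the quoted figures are amax or `weight_scale_2` values.

```python
# Values quoted from the synthetic repro above.
gate_scale = 2.77e-5
up_scale = 2.30e-4

# Skew between the two halves.
skew = up_scale / gate_scale

# Assumption: the loader keeps the up scale for the fused weight, so a gate
# value that was encoded against gate_scale is decoded with up_scale.
stored = 1.0                    # arbitrary stored value, in scaled units
intended = stored * gate_scale  # what the exporter meant
decoded = stored * up_scale     # what the runtime reconstructs
error_factor = decoded / intended

print(f"skew={skew:.2f}x, decoded gate values are {error_factor:.2f}x too large")
```

Under this assumption every gate weight of an affected expert is inflated by the same factor, which is consistent with the roughly 10× skew used in the bug-simulated row of the table.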
## Test plan
- [x] New unit test
`tests/unit/torch/quantization/plugins/test_fused_experts.py::TestExportFusedExperts::test_uncalibrated_expert_gate_up_share_amax`
passes.
- [x] Full
`tests/unit/torch/quantization/plugins/test_fused_experts.py`: 28/28
pass.
- [x] Broader `tests/unit/torch/export/` +
`tests/unit/torch/quantization/`: 609 passed, 9 skipped.
- [x] vLLM end-to-end A/B (table above): bug reproduces with mismatched
scales, fix produces coherent output identical to baseline.
## Summary by CodeRabbit
* **Bug Fixes**
  * Enhanced handling of uncalibrated mixture-of-experts weight quantizers
    during export with improved consistency checks and informative warnings.
* **Tests**
  * Added regression test for uncalibrated expert quantization behavior.
Signed-off-by: weimingc <weimingc@nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
1 parent 1f9c0bf commit 570920b
2 files changed
Lines changed: 111 additions & 0 deletions
File tree
- modelopt/torch/export
- tests/unit/torch/quantization/plugins
Per-file changes:
- `modelopt/torch/export`: 23 additions & 0 deletions (new lines 65-87)
- `tests/unit/torch/quantization/plugins`: 88 additions & 0 deletions (new lines 303-390)