Commit add79c6

yifjiang and claude committed
[Quantization] Support NVFP4 for inline-swiglu fused MoE experts (MiniMax-M3)
MiniMaxM3VLExperts is a standard transformers 5.x fused-experts container (3-D gate_up_proj/down_proj + num_experts) but applies SwiGLU inline and has no act_fn submodule, so _is_fused_experts_module returned False -> the experts were never wrapped -> nvfp4_experts_only enabled zero expert quantizers and export raised NotImplementedError("...experts type 'MiniMaxM3VLExperts'...").

Drop the act_fn requirement from the detector. _QuantFusedExperts only intercepts F.linear and never reads act_fn, and _export_fused_experts is weight-only, so no export change is needed once detection wraps the experts. Models needing custom forwards (Llama4, GptOss, DBRX, Qwen3-VL-MoE) remain excluded earlier via their explicit registrations.

Flip the now-incorrect test_module_missing_act_fn test and add an inline-SwiGLU synthetic experts detection + calibration test. Add a CHANGELOG entry and a MiniMax M3 row to the llm_ptq support matrix.

Validated end-to-end on GB200: 14,592 expert weight quantizers enabled, 260 GB NVFP4 checkpoint, wikitext-2 perplexity 5.083 -> 5.420 (+6.6%).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
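
The relative perplexity change quoted in the validation note can be re-derived from the two wikitext-2 figures. A minimal sketch, using only those two numbers (the variable names are illustrative and not part of the commit):

```python
# Perplexity figures quoted in the commit message (wikitext-2).
baseline_ppl = 5.083  # before NVFP4 expert quantization
nvfp4_ppl = 5.420     # after NVFP4 expert quantization

# Relative increase = (new - old) / old
relative_increase = (nvfp4_ppl - baseline_ppl) / baseline_ppl

print(f"{relative_increase:.1%}")  # prints 6.6%
```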
1 parent cc17f2c commit add79c6

4 files changed

Lines changed: 107 additions & 7 deletions

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ Changelog
 **New Features**

 - Add the ``day0-release`` agent skill (``.agents/skills/day0-release/``), a deterministic end-to-end driver that chains the PTQ → evaluation → comparison skills (the evaluation stage deploys the checkpoint itself) with an enforced gate after each stage and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Ships three GPU-free, unit-tested gate scripts (``gate_ptq.py``, ``gate_run.py``, ``gate_compare.py``) that validate checkpoint coverage, evaluation-run completeness, and baseline-vs-candidate accuracy threshold. v1 reports and stops on regression; the recipe-search loop is deferred.
+- Add NVFP4 quantization support for MiniMax-M3 (``minimax_m3_vl``, a ~428B MoE VLM). Its routed-experts container ``MiniMaxM3VLExperts`` follows the standard transformers 5.x fused-experts pattern (3-D ``gate_up_proj``/``down_proj`` + ``num_experts``) but applies SwiGLU inline rather than via an ``act_fn`` submodule; ``_is_fused_experts_module`` no longer requires ``act_fn`` (``_QuantFusedExperts`` never reads it), so these experts are wrapped as ``_QuantFusedExperts`` and calibrate/export through the existing fused path. Quantize with ``--qformat nvfp4_experts_only``; load via transformers >=5.12 native support (no ``trust_remote_code``).
 - Add **streaming** speculative-decoding training (EAGLE3 / DFlash): the draft trains on base-model hidden states produced on the fly by a co-located ``vllm serve`` (no disk dump), moved trainer-side over NIXL RDMA, scaling to multi-node (dedicated serve replicas + DDP trainers). New launcher examples for NVFP4 Kimi-K2.5 / K2.6 on GB200/aarch64 under ``tools/launcher/examples/moonshotai/``.

 0.45 (2026-06-xx)

examples/llm_ptq/README.md

Lines changed: 3 additions & 1 deletion
@@ -114,6 +114,7 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http
 | GLM-4.7<sup>8</sup> || - | - | - ||
 | Kimi K2 | - | - | - | - ||
 | MiniMax M2.1 | - | - | - | - ||
+| MiniMax M3<sup>11</sup> | - | - | - | - ||
 | GPT-OSS<sup>10</sup> | - | - | - | - ||
 | T5 ||||| - |
 | Whisper<sup>9</sup> ||||| - |
@@ -130,7 +131,8 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http
 > *<sup>7.</sup>[PTQ for DeepSeek](../deepseek/README.md)* \
 > *<sup>8.</sup>GLM-4.7 has MTP (Multi-Token Prediction) layers that are automatically loaded and excluded from quantization.* \
 > *<sup>9.</sup>Running Whisper model with transformers>=5.0 requires [torchcodec](https://github.com/meta-pytorch/torchcodec?tab=readme-ov-file#installing-cuda-enabled-torchcodec) and other system packages (e.g. ffmpeg).* \
-> *<sup>10.</sup>GPT-OSS ships with native MXFP4 weights; NVFP4 export is produced via the closed-form `--cast_mxfp4_to_nvfp4` cast (see [MXFP4 → NVFP4 cast](#mxfp4--nvfp4-cast-for-gpt-oss)).*
+> *<sup>10.</sup>GPT-OSS ships with native MXFP4 weights; NVFP4 export is produced via the closed-form `--cast_mxfp4_to_nvfp4` cast (see [MXFP4 → NVFP4 cast](#mxfp4--nvfp4-cast-for-gpt-oss)).* \
+> *<sup>11.</sup>MiniMax M3 (`minimax_m3_vl`) requires transformers >=5.12 (native support); load without `trust_remote_code`. Recommended recipe `nvfp4_experts_only` (routed experts to NVFP4; attention, dense layers, shared experts, vision tower, router/gate, embeddings, lm_head kept higher precision).*

 > *The accuracy loss after PTQ may vary depending on the actual model and the quantization method. Different models may have different accuracy loss and usually the accuracy loss is more significant when the base model is small. If the accuracy after PTQ is not meeting the requirement, please try either modifying [hf_ptq.py](./hf_ptq.py) and disabling the KV cache quantization or using the [QAT](./../llm_qat/README.md) instead. For NVFP4 quantization specifically, we recommend `nvfp4_mlp_only`, `nvfp4_experts_only`, or `nvfp4_omlp_only` to achieve higher accuracy by restricting quantization to the MLP/expert layers (and optionally the `o_proj` layer) while keeping the attention QKV projections unquantized.*

modelopt/torch/quantization/plugins/huggingface.py

Lines changed: 9 additions & 4 deletions
@@ -1442,17 +1442,22 @@ def _is_fused_experts_module(module):
     """Check if a module is a fused MoE expert container compatible with _QuantFusedExperts.

     Detects the standardized HuggingFace transformers 5.0+ fused expert pattern:
-    ``gate_up_proj`` (3-D parameter), ``down_proj`` (3-D parameter), ``num_experts``,
-    and ``act_fn``. Matches ``MixtralExperts``, ``Qwen2MoeExperts``,
+    ``gate_up_proj`` (3-D parameter), ``down_proj`` (3-D parameter), and
+    ``num_experts``. Matches ``MixtralExperts``, ``Qwen2MoeExperts``,
     ``Qwen3MoeExperts``, ``Qwen3_5MoeExperts``, ``DeepseekV3NaiveMoe``,
-    ``JambaExperts``, ``OlmoeExperts``, etc.
+    ``JambaExperts``, ``OlmoeExperts``, ``MiniMaxM3VLExperts``, etc.
+
+    ``act_fn`` is intentionally NOT required: some fused-expert containers (e.g.
+    ``MiniMaxM3VLExperts``) apply their gating activation inline rather than via an
+    ``act_fn`` submodule. ``_QuantFusedExperts`` never reads ``act_fn`` (it only
+    intercepts ``F.linear``), so the activation form is irrelevant to detection.

     Returns ``False`` for non-standard layouts (DBRX, GptOss, GraniteMoE,
     Llama4TextExperts) which have their own explicit registrations.
     """
     if not hasattr(module, "gate_up_proj") or not hasattr(module, "down_proj"):
         return False
-    if not hasattr(module, "num_experts") or not hasattr(module, "act_fn"):
+    if not hasattr(module, "num_experts"):
         return False
     gate_up = getattr(module, "gate_up_proj")
     down = getattr(module, "down_proj")
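
The relaxed structural check in this hunk can be illustrated without PyTorch. The following is a simplified sketch, not the actual ModelOpt function: `FakeParam` and `looks_like_fused_experts` are hypothetical names, and the 3-D requirement on both projections is assumed from the docstring because the rest of the function body is not shown in the hunk.

```python
from types import SimpleNamespace


class FakeParam:
    """Hypothetical stand-in for a tensor parameter; only ``ndim`` matters here."""

    def __init__(self, *shape):
        self.shape = shape
        self.ndim = len(shape)


def looks_like_fused_experts(module) -> bool:
    """Simplified structural check; ``act_fn`` is deliberately not consulted."""
    if not hasattr(module, "gate_up_proj") or not hasattr(module, "down_proj"):
        return False
    if not hasattr(module, "num_experts"):
        return False
    # Assumption: the real detector also requires both projections to be 3-D.
    return (
        getattr(module.gate_up_proj, "ndim", None) == 3
        and getattr(module.down_proj, "ndim", None) == 3
    )


# Inline-SwiGLU style container: 3-D projections, num_experts, no act_fn.
inline_swiglu = SimpleNamespace(
    gate_up_proj=FakeParam(4, 16, 8), down_proj=FakeParam(4, 8, 16), num_experts=4
)
# A 2-D gate_up_proj is not the fused layout.
two_d = SimpleNamespace(
    gate_up_proj=FakeParam(16, 8), down_proj=FakeParam(4, 8, 16), num_experts=4
)
# Missing num_experts is still rejected.
no_count = SimpleNamespace(
    gate_up_proj=FakeParam(4, 16, 8), down_proj=FakeParam(4, 8, 16)
)

print(looks_like_fused_experts(inline_swiglu))  # True
print(looks_like_fused_experts(two_d))          # False
print(looks_like_fused_experts(no_count))       # False
```

Dropping the `act_fn` clause widens detection only for modules that already have the 3-D fused layout, which is why the 2-D and missing-`num_experts` cases remain excluded.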

tests/unit/torch/quantization/plugins/test_fused_experts.py

Lines changed: 94 additions & 2 deletions
@@ -84,6 +84,37 @@ def forward(self, hidden_states, top_k_index, top_k_weights):
         return final_hidden_states


+class _SyntheticFusedExpertsInlineSwiglu(_SyntheticFusedExperts):
+    """Fused experts that apply SwiGLU inline (no ``act_fn`` submodule), mimicking
+    transformers' ``MiniMaxM3VLExperts``. Verifies detection/quantization do not require ``act_fn``."""
+
+    def __init__(self):
+        super().__init__()
+        del self.act_fn  # gating activation is applied inline in forward, not via a submodule
+
+    def forward(self, hidden_states, top_k_index, top_k_weights):
+        final_hidden_states = torch.zeros_like(hidden_states)
+        with torch.no_grad():
+            expert_mask = F.one_hot(top_k_index, num_classes=self.num_experts).permute(2, 1, 0)
+            expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero()
+        for expert_idx in expert_hit:
+            expert_idx = expert_idx[0]
+            if expert_idx == self.num_experts:
+                continue
+            top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
+            current_state = hidden_states[token_idx]
+            gate, up = F.linear(current_state, self.gate_up_proj[expert_idx]).chunk(2, dim=-1)
+            current_hidden_states = F.silu(gate) * up  # inline swiglu (no self.act_fn)
+            current_hidden_states = F.linear(current_hidden_states, self.down_proj[expert_idx])
+            current_hidden_states = (
+                current_hidden_states * top_k_weights[token_idx, top_k_pos, None]
+            )
+            final_hidden_states.index_add_(
+                0, token_idx, current_hidden_states.to(final_hidden_states.dtype)
+            )
+        return final_hidden_states
+
+
 class _SyntheticTopKRouter(nn.Module):
     def __init__(self):
         super().__init__()
@@ -145,12 +176,23 @@ def test_module_with_2d_gate_up_not_detected(self):
         module.act_fn = nn.SiLU()
         assert _is_fused_experts_module(module) is False

-    def test_module_missing_act_fn_not_detected(self):
+    def test_module_missing_act_fn_still_detected(self):
+        """``act_fn`` is optional: e.g. ``MiniMaxM3VLExperts`` applies swiglu inline.
+
+        ``_QuantFusedExperts`` only intercepts ``F.linear`` and never reads ``act_fn``,
+        so the structural detector must not require it.
+        """
         module = nn.Module()
         module.gate_up_proj = nn.Parameter(torch.randn(4, 16, 8))
         module.down_proj = nn.Parameter(torch.randn(4, 8, 16))
         module.num_experts = 4
-        assert _is_fused_experts_module(module) is False
+        assert _is_fused_experts_module(module) is True
+
+    def test_inline_swiglu_fused_experts_detected(self):
+        """Fused experts applying swiglu inline (no ``act_fn`` submodule) are detected."""
+        module = _SyntheticFusedExpertsInlineSwiglu()
+        assert not hasattr(module, "act_fn")
+        assert _is_fused_experts_module(module) is True

     def test_sparse_moe_block_not_detected_as_fused(self):
         block = _SyntheticSparseMoeBlock()
@@ -652,6 +694,56 @@ def forward_loop(m):

         self._cleanup_registry(expert_type)

+    def test_inline_swiglu_experts_calibrate(self):
+        """No-``act_fn`` (inline swiglu) fused experts convert and calibrate like ``act_fn`` ones.
+
+        Regression for ``MiniMaxM3VLExperts``: detection used to require ``act_fn``, so these
+        experts were never wrapped and no quantizers were inserted.
+        """
+        model = _TinyMoEModel()
+        model.moe.experts = _SyntheticFusedExpertsInlineSwiglu()
+        expert_type = type(model.moe.experts)
+        self._cleanup_registry(expert_type)
+
+        quant_cfg = {
+            "quant_cfg": [
+                {"quantizer_name": "*", "enable": False},
+                {
+                    "quantizer_name": "*gate_up_proj_input_quantizer",
+                    "cfg": {"num_bits": 8, "axis": None},
+                },
+                {
+                    "quantizer_name": "*down_proj_input_quantizer",
+                    "cfg": {"num_bits": 8, "axis": None},
+                },
+                {
+                    "quantizer_name": "*gate_up_proj_weight_quantizer",
+                    "cfg": {"num_bits": 8, "axis": 0},
+                },
+                {
+                    "quantizer_name": "*down_proj_weight_quantizer",
+                    "cfg": {"num_bits": 8, "axis": 0},
+                },
+            ],
+            "algorithm": "max",
+        }
+
+        def forward_loop(m):
+            torch.manual_seed(0)
+            for _ in range(2):
+                m(torch.randn(1, 4, HIDDEN_DIM))
+
+        mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
+
+        experts = model.moe.experts
+        assert experts.gate_up_proj_input_quantizer.amax is not None
+        assert experts.down_proj_input_quantizer.amax is not None
+        for idx in range(NUM_EXPERTS):
+            assert experts.gate_up_proj_weight_quantizers[idx].amax is not None
+            assert experts.down_proj_weight_quantizers[idx].amax is not None
+
+        self._cleanup_registry(expert_type)
+
     def test_local_hessian_refines_per_expert_weights(self):
         """local_hessian captures each expert's routed activations and refines its weight amax."""
         model = _TinyMoEModel()
