Commit 6a3b6b8
[Recipes][LLM PTQ] Add nvfp4 MSE+FP8-cast-KV recipes (experts_only / mlp_only) + --recipe in example scripts (#1407)
## Summary
- Adds two PTQ recipes that combine **experts/MLP-only NVFP4 W4A4** with
**MSE FP8 scale-sweep weight calibration** and **FP8 KV cache with
`use_constant_amax: true`** (skips KV calibration; matches the
`nvfp4_default-fp8_cast_kv` contract):
  - `modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml`:
    applies to `*mlp.experts*` / `*block_sparse_moe*` only.
  - `modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv.yaml`:
    applies to all `*mlp*` / `*block_sparse_moe*` (dense MLP + MoE).
- Threads a new `--recipe` flag through
`examples/llm_ptq/scripts/parser.sh` and `huggingface_example.sh`.
Either `--quant` or `--recipe` is required; passing **both errors out**.
Recipe names are not validated in the script — `hf_ptq.py` is the source
of truth.
- Drops the bash-side `qformat` whitelist case-statement in
`huggingface_example.sh` for the same reason.
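
The one-of / not-both rule described above can be sketched as a small bash function. This is a minimal illustration, not the actual `parser.sh` code: the function name `validate_quant_or_recipe` and the wording of the "neither given" message are assumptions, and only the "both given" message is taken from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the --quant / --recipe check; not the real parser.sh.

# validate_quant_or_recipe QUANT RECIPE
# Returns 0 only when exactly one of the two values is non-empty.
validate_quant_or_recipe() {
    local quant="$1" recipe="$2"

    # Both set: reject with the message quoted in the Behavior section.
    if [ -n "$quant" ] && [ -n "$recipe" ]; then
        echo "Cannot specify both --quant and --recipe; pick one." >&2
        return 1
    fi

    # Neither set: the real script prints usage; this wording is assumed.
    if [ -z "$quant" ] && [ -z "$recipe" ]; then
        echo "One of --quant or --recipe is required." >&2
        return 1
    fi

    # Note: the recipe name itself is deliberately not validated here,
    # since hf_ptq.py is the source of truth.
    return 0
}
```

Keeping the check limited to presence, with no name whitelist, mirrors the decision to let `hf_ptq.py` reject unknown formats or recipes.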
## Files
**New recipes (`modelopt_recipes/general/ptq/`):**
- `nvfp4_experts_only_mse-fp8_cast_kv.yaml` — same patterns as
`nvfp4_experts_only-fp8_kv.yaml`.
- `nvfp4_mlp_only_mse-fp8_cast_kv.yaml` — same patterns as
`nvfp4_mlp_only-fp8_kv.yaml`.
Both differ from their `_kv` siblings by:
- `algorithm: max` → `{ method: mse, fp8_scale_sweep: true, layerwise:
false }`
- All targeted **weight quantizers** switch `type: dynamic` → `type:
static` (otherwise `mse_calibrate` skips them: only static block-quant
weight quantizers are recognized for the FP8 sweep — see
`model_calib.py:369-374`).
- Input quantizers stay dynamic.
- KV bmm adds `use_constant_amax: true` (the `_cast_kv` flavor).
**Scripts (`examples/llm_ptq/scripts/`):**
- `parser.sh` — adds `--recipe` long-option, default `RECIPE=""`,
validates one-of-{`--quant`, `--recipe`} and not-both.
- `huggingface_example.sh` — when `RECIPE` is set, derives `MODEL_NAME`
from the recipe basename, passes `--recipe=…` to `hf_ptq.py` instead of
`--qformat=…`, and exits after export with a TRT-LLM deployment hint
(recipes can produce arbitrary configs that the script's downstream
`run_tensorrt_llm.py` path doesn't know how to handle generically).
Drops the `qformat` whitelist; defers to `hf_ptq.py`.
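
As a rough sketch of the `huggingface_example.sh` wiring described above, the two helpers below show one way to pick the flag forwarded to `hf_ptq.py` and to reduce a recipe path to its basename. The helper names `build_quant_arg` and `recipe_tag` are hypothetical, and how the real script folds the basename into `MODEL_NAME` is not shown in this PR description, so that step is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the real huggingface_example.sh may differ.

# build_quant_arg QUANT RECIPE
# Prints the single flag to forward to hf_ptq.py: --recipe=... when a
# recipe is set, otherwise --qformat=...
build_quant_arg() {
    local quant="$1" recipe="$2"
    if [ -n "$recipe" ]; then
        printf -- '--recipe=%s\n' "$recipe"
    else
        printf -- '--qformat=%s\n' "$quant"
    fi
}

# recipe_tag RECIPE
# Prints the recipe basename: directory components and an optional
# .yaml suffix are stripped with plain parameter expansion.
recipe_tag() {
    local name="${1##*/}"
    printf '%s\n' "${name%.yaml}"
}
```

For example, `recipe_tag general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv` prints `nvfp4_mlp_only_mse-fp8_cast_kv`, which a script could then use as part of an output directory name.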
## Behavior
```
# Errors with: "Cannot specify both --quant and --recipe; pick one."
bash huggingface_example.sh --model=... --quant=nvfp4 --recipe=... --tasks=quant
# Errors with usage if neither is given
bash huggingface_example.sh --model=... --tasks=quant
# Both of these are now accepted; --recipe is forwarded verbatim to hf_ptq.py
bash huggingface_example.sh --model=... --quant=nvfp4 --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv --tasks=quant
```
## Test plan
- [x] `experts_only_mse-fp8_cast_kv` loads via
`modelopt.recipe.load_recipe(...)` and produces the expected algorithm +
per-pattern `quant_cfg` (verified in a working env: `algorithm ==
{'method': 'mse', 'fp8_scale_sweep': True, 'layerwise': False}`; expert
weight quantizers `type: static`; KV bmm has `use_constant_amax: True`).
- [x] Parser sanity: 4 flag combinations (both, neither, only `--quant`,
only `--recipe`) all behave as designed.
## Note
Pre-commit hook `check-modelopt-recipes` was skipped on both commits
because the local conda env has a broken `torchvision` install
(`AttributeError: partially initialized module 'torchvision' has no
attribute 'extension'`) that prevents `from modelopt.recipe.loader
import load_recipe`. The `experts_only` recipe was validated
independently by running `tools/precommit/check_modelopt_recipes.py` in
a working environment (exits 0); the `mlp_only` one is the same shape
with a different glob.
Rebased onto `main` from #1391 (which targeted
`chenjiel/nvfp4-fp8-sweep-triton`). The diff is scoped to the recipes +
script wiring; no kernel/sweep changes are included here.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * Added recipe-based quantization as an alternative to format-based
    quantization with a new `--recipe` CLI option.
  * Added two new quantization recipes for targeted layer optimization:
    one for expert-layer-only quantization and one for MLP-layer-only
    quantization, both featuring NVFP4 and FP8 KV-cache optimization.
* **Configuration**
  * `--quant` and `--recipe` options are now mutually exclusive; specify
    one to configure quantization behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
1 parent 570920b commit 6a3b6b8
4 files changed
Lines changed: 136 additions & 29 deletions
File tree
- examples/llm_ptq/scripts
- modelopt_recipes/general/ptq
Lines changed: 48 additions & 0 deletions
Lines changed: 54 additions & 0 deletions