Commit 66b54ed
[OMNIML-5024] specdec_bench cell t0_d3 — google/gemma-4-E4B-it / MTP / vllm (#1663)
### What does this PR do?
Type of change: Bug fix + new example
Wires SPEED-bench's MTP path to support **Gemma 4** (and any future MTP
variant that uses a separate assistant / draft model), and adds the
SPEED-bench MTP/vLLM example for `google/gemma-4-E4B-it`.
**Key difference: Gemma 4 MTP vs. generic MTP.** vLLM's
`speculative_config` accepts two different shapes for MTP:
| Variant | `speculative_config` shape | Models |
|---|---|---|
| **Generic MTP** | `{"method": "mtp", "num_speculative_tokens": N}` | Models that carry their own MTP layer in-tree (e.g. Qwen 3.5 MTP variants); no separate draft / assistant model. |
| **Assistant-model MTP** | `{"model": "<assistant>", "num_speculative_tokens": N}` (no `method` key; vLLM auto-detects from the assistant) | Gemma 4 family (E2B / E4B / 26B-A4B / 31B); each target model has a paired `<target>-assistant` checkpoint that acts as the MTP draft. Landed in vllm-project/vllm#41745 (2026-05-06). |
The specdec_bench vLLM wrapper at
`examples/specdec_bench/specdec_bench/models/vllm.py` previously emitted
only the generic shape for any `--speculative_algorithm MTP` invocation,
which produced `NotImplementedError: Unsupported speculative method:
'mtp'` on Gemma 4 even with a container that has the support
(`vllm/vllm-openai:v0.22.1`+). This PR teaches the wrapper to switch
shapes based on whether `--draft_model_dir` is provided.
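
The shape switch described above can be sketched as a small pure function. This is a minimal illustration, not the wrapper's actual code: the function name `build_speculative_config` and its parameter names are hypothetical, chosen to mirror the CLI flags. Only the two dict shapes are taken from this description.

```python
from typing import Any, Optional


def build_speculative_config(
    speculative_algorithm: str,
    num_speculative_tokens: int,
    draft_model_dir: Optional[str] = None,
) -> dict[str, Any]:
    """Hypothetical helper: pick the vLLM speculative_config shape for MTP."""
    if speculative_algorithm != "MTP":
        raise ValueError(f"this sketch only covers MTP, got {speculative_algorithm!r}")
    if draft_model_dir:
        # Assistant-model shape: no "method" key, the draft checkpoint is named directly.
        return {
            "model": draft_model_dir,
            "num_speculative_tokens": num_speculative_tokens,
        }
    # Generic shape: the target model carries its own MTP layer.
    return {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}


# Gemma 4 style call (draft model given) vs. Qwen 3.5 style call (no draft model).
gemma_cfg = build_speculative_config(
    "MTP", 3, "/hf-local/google/gemma-4-E4B-it-assistant"
)
qwen_cfg = build_speculative_config("MTP", 3)
print(gemma_cfg)
print(qwen_cfg)
```

Keying the branch on the presence of `draft_model_dir` rather than on the model name means callers that never passed a draft model see no change in the emitted config.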
**Concrete changes:**
1. **`examples/specdec_bench/specdec_bench/models/vllm.py`**: when
   `speculative_algorithm == "MTP"` and `draft_model_dir` is set, emit
   `{"model": draft_model_dir, "num_speculative_tokens": N}`
   (assistant-model shape). Otherwise emit the existing `{"method": "mtp",
   ...}` (generic shape). This is backward-compatible: Qwen 3.5 MTP and
   other callers that omit `--draft_model_dir` get the same config as
   before.
2. **`examples/specdec_bench/specdec_bench/utils.py`**: `get_tokenizer`
   reads `extra_special_tokens` from the model's `tokenizer_config.json`
   and passes them through to `AutoTokenizer.from_pretrained`. Gemma 4
   tokenizers ship a list-shaped `extra_special_tokens` entry that the
   constructor would otherwise reject. This is required for any Gemma 4
   cell.
3. **`tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml`**:
   SPEED-bench parent YAML for `google/gemma-4-E4B-it`. Uses
   `vllm/vllm-openai:v0.22.1` (which includes `gemma4_mtp.py` from
   vllm-project/vllm#41745) and wires
   `--draft_model_dir /hf-local/google/gemma-4-E4B-it-assistant` on both
   task_0 (qualitative) and task_1 (throughput_32k).
4. **`tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml`**:
   runtime params for the `t0_d3` cell of OMNIML-5022 (`temperature=0`,
   `max_model_len=40960`).
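
The tokenizer change in item 2 starts by reading one key out of `tokenizer_config.json`. A minimal sketch of that read step follows, assuming the config sits directly under the model directory. The helper name `read_extra_special_tokens` is hypothetical and is not the code in `utils.py`. How the list-shaped value is normalized before it reaches `AutoTokenizer.from_pretrained` is not spelled out in this description, so the sketch deliberately stops at returning the raw entry.

```python
import json
from pathlib import Path
from typing import Any, Optional


def read_extra_special_tokens(model_dir: str) -> Optional[Any]:
    """Hypothetical helper: return the raw extra_special_tokens entry, or None.

    Returns None when the model directory has no tokenizer_config.json or the
    config has no extra_special_tokens key, so callers can skip the override.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("extra_special_tokens")
```

A caller would forward a non-`None` result as the `extra_special_tokens` keyword when loading the tokenizer, and fall back to the default load otherwise.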
### Usage
```python
# Wrapper-level: same CLI as before, just pass --draft_model_dir for
# Gemma 4 MTP. The wrapper auto-routes to the assistant-model shape.
#   python examples/specdec_bench/run.py \
#       --engine VLLM \
#       --speculative_algorithm MTP \
#       --draft_model_dir /hf-local/google/gemma-4-E4B-it-assistant \
#       --draft_length 3 \
#       --tp_size 1 \
#       ...other SPEED-bench knobs...

# Equivalent direct vLLM invocation (for reference, no wrapper):
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-E4B-it",
    # Assistant-model shape: no "method" key, the draft checkpoint is named directly.
    speculative_config={
        "model": "google/gemma-4-E4B-it-assistant",
        "num_speculative_tokens": 3,
    },
    trust_remote_code=True,
)
```
### Testing
- **Upstream existence checks**: verified that the assistant models
`google/gemma-4-{E2B,E4B,26B-A4B}-it-assistant` exist and are public and
ungated on Hugging Face; verified that
`vllm/model_executor/models/gemma4_mtp.py` is in vLLM `v0.22.0`,
`v0.22.1`, and `main`.
- **Backward compat**: `MTP` callers that don't pass `--draft_model_dir`
(e.g. the existing Qwen 3.5 MTP/vLLM cells under
`tools/launcher/examples/Qwen/Qwen3.5-4B/`) take the unchanged
`{"method": "mtp", ...}` branch. No diff for those.
- **End-to-end cluster validation**: pending. Will run via the
OMNIML-5022 cells (OMNIML-5024 / 5025 / 5026 / 5027) once the
nmm-sandbox submodule pin advances past this PR. Each cell exercises
`task_0` (SPEED-Bench qualitative, 880 samples) + `task_1`
(throughput_32k, 80 samples) on cw_dfw, single H100.
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅ — the wrapper only takes the
new branch when `--draft_model_dir` is provided alongside
`--speculative_algorithm MTP`. Existing MTP callers (Qwen 3.5 etc.) keep
the generic `method: "mtp"` config.
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A — no new
dependencies.
- Did you write any new necessary tests?: ❌ — relying on the SPEED-bench
cluster cells (OMNIML-5024 …5027) for end-to-end validation; no unit
test fixture for the vLLM wrapper exists in `tests/` for me to extend
symmetrically. Happy to add one if reviewers want it.
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
❌ — small fix + example addition. Can add if requested.
- Did you get Claude approval on this PR?: ❌ — will run `/claude review`
once the PR is marked Ready for review.
### Additional Information
- JIRA: [OMNIML-5024](https://jirasw.nvidia.com/browse/OMNIML-5024)
(cell_t0_d3); siblings OMNIML-5025/5026/5027 (cell_{t0_d7, t1_d3,
t1_d7}) of Epic OMNIML-5022 are blocked on this PR landing.
- Upstream reference: vllm-project/vllm#41745 — "[Spec Decode] Add
Gemma4 MTP speculative decoding support".
- Companion (pensieve-intern !91, internal): adds a
"Model-family-specific MTP invocation" table to the specdec_bench cell
SPEC so future agents pair `MTP` with the right `--draft_model_dir` from
SPEC-read time.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * Added a SPEED-bench pipeline for Gemma 4 using vLLM speculative
    decoding (MTP) with qualitative and throughput tasks.
* **Improvements**
  * Speculative-decoding logic updated to handle assistant-model and
    generic MTP cases distinctly.
  * Tokenizer loading now reads and normalizes extra special tokens from
    tokenizer config when available.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Pensieve Intern <chenhany@nvidia.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

1 parent 48767a0 commit 66b54ed
3 files changed
Lines changed: 118 additions & 5 deletions
File tree
- examples/specdec_bench/specdec_bench
- models
- tools/launcher/examples/gemma-4/gemma-4-E4B-it
Lines changed: 83 additions & 0 deletions