Commit 43b67a8
specdec_bench: keep method=mtp when adding model=<assistant> for Gemma 4 MTP (#1677)
### What does this PR do?
Type of change: Bug fix
Fixes the specdec_bench vLLM wrapper's MTP `speculative_config` emission
so Gemma 4 MTP no longer hits the wrong code path inside vLLM.
### Bug
vLLM's `SpeculativeConfig.__post_init__`
(`vllm/config/speculative.py:529-602`) auto-detects `method` ONLY when
it's unset. When `model` is provided and `method` is `None`, the default
branch sets `method = "draft_model"` — the generic same-architecture
draft path, NOT MTP. That path enforces equal num_heads between target
and draft and raises:
```
AssertionError: All layers in one attention group must share num_heads; got {8, 4}
```
on heterogeneous-head models. Gemma 4 has 8 target heads and 4 draft
heads by design.
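
The fallback described above can be illustrated with a toy model. This is only a sketch of the rule as stated in this description, not vLLM's actual `__post_init__` code; `resolve_method` is a hypothetical name, and the real logic covers more cases than these two.

```python
from typing import Optional


def resolve_method(speculative_config: dict) -> Optional[str]:
    """Toy model of the auto-detect rule described above (not vLLM's code)."""
    method = speculative_config.get("method")
    if method is not None:
        # An explicit method is respected as-is.
        return method
    if speculative_config.get("model") is not None:
        # No method but a model: fall back to the generic draft path.
        return "draft_model"
    # Neither key set: this toy leaves the method unresolved.
    return None


# Shape emitted after PR #1663: no "method" key.
broken = {"model": "gemma-4-E4B-it-assistant", "num_speculative_tokens": 3}
# Shape with the explicit method restored.
fixed = {"method": "mtp", **broken}

print(resolve_method(broken))  # draft_model
print(resolve_method(fixed))   # mtp
```

With the explicit `method` key present, the auto-detect branch is never reached, which is why the fix restores it.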
### Where the previous fix went wrong
PR #1663 changed the MTP branch in the wrapper to emit `{model:
<assistant>, num_speculative_tokens: N}` WITHOUT `method` when
`draft_model_dir` was provided, based on a misread of vLLM PR #41745's
test plan that only showed the `{model, num_speculative_tokens}` shape.
That test plan was the direct `LLM(...)` constructor invocation; vLLM
had already defaulted method internally. Going through specdec_bench's
`AsyncEngineArgs(speculative_config=...)` path, the explicit `method`
key is required to avoid the auto-detect → draft_model fallback.
### Reference
vLLM's own test at
[`tests/v1/e2e/spec_decode/test_spec_decode.py:818-823`](https://github.com/vllm-project/vllm/blob/main/tests/v1/e2e/spec_decode/test_spec_decode.py#L818)
does exactly this for the gemma4-e4b parametrization:
```python
speculative_config = {
    "method": method,  # "mtp"
    "num_speculative_tokens": ...,
}
if draft_model is not None:  # Gemma 4 case
    speculative_config["model"] = draft_model
```
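
A wrapper-side version of the same pattern might look like the sketch below. The function name `build_mtp_speculative_config` and its signature are hypothetical and are not taken from the specdec_bench source; only the dictionary keys follow the shapes quoted in this description.

```python
from typing import Optional


def build_mtp_speculative_config(
    num_speculative_tokens: int,
    draft_model_dir: Optional[str] = None,
) -> dict:
    """Hypothetical helper: always set method="mtp", add model only if given."""
    config = {
        "method": "mtp",  # explicit, so auto-detect never picks "draft_model"
        "num_speculative_tokens": num_speculative_tokens,
    }
    if draft_model_dir is not None:
        # Separate assistant checkpoint (the Gemma 4 case).
        config["model"] = draft_model_dir
    return config


# Inline-MTP families: no separate draft checkpoint, so no "model" key.
inline_cfg = build_mtp_speculative_config(3)
# Gemma 4: assistant checkpoint passed alongside the explicit method.
gemma_cfg = build_mtp_speculative_config(3, "gemma-4-E4B-it-assistant")
print(inline_cfg)
print(gemma_cfg)
```

Keeping `method` unconditional means the only difference between the two cases is the presence of the `model` key.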
### Fix
Restore `method="mtp"` as the unconditional MTP path. ADD `model` only
when `draft_model_dir` is set. Backward-compatible for Qwen 3.5 MTP /
DeepSeek MTP / other inline-MTP families (they keep the bare `{method:
"mtp"}` config).
### Validation
Field-tested via vLLM PR #41745's correctness test on `gemma-4-E4B-it` +
`gemma-4-E4B-it-assistant`: produced 304.7 output TPS at γ=4 vs 171.0
baseline (about 1.78x, i.e. a 78% speedup) on H100, using the same
`speculative_config` shape this fix emits.
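
For reference, the ratio between the two throughput figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative only; the two numbers are the ones reported in this section.

```python
baseline_tps = 171.0  # output tokens/s without speculative decoding
mtp_tps = 304.7       # output tokens/s with MTP at gamma=4

ratio = mtp_tps / baseline_tps
gain_pct = (ratio - 1.0) * 100.0

print(f"{ratio:.2f}x of baseline")    # 1.78x of baseline
print(f"{gain_pct:.0f}% throughput gain")  # 78% throughput gain
```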
### Surfaced on
[OMNIML-5024](https://jirasw.nvidia.com/browse/OMNIML-5024) pipeline
#54356795:
- Wrapper emitted `{model: assistant, num_speculative_tokens: 3}`
- vLLM auto-detected `method = "draft_model"`
- Loaded gemma-4-E4B-it-assistant (4 heads) as a generic draft for
gemma-4-E4B-it (8 heads)
- Attention-group num_heads check tripped → AssertionError, task_0
FAILED, task_1 CANCELLED
### Before your PR is "*Ready for review*"
- Backward compatible: ✅ (Qwen 3.5 / DeepSeek MTP unchanged; only the
MTP+`draft_model_dir` case changes).
- New tests: ❌ — the test exercising this codepath would need a GPU +
gemma-4 model checkout, which is cluster work, not unit-test scope.
JIRA-tracked validation via OMNIML-5024 dispatch after this lands.
- Changelog: ❌
### Additional Information
- vLLM PR #41745 (Gemma4 MTP support)
- Companion: NVIDIA/Model-Optimizer PR #1675 (launcher
`GlobalVariables.draft_model` schema fix)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Fixed speculative decoding configuration handling in the benchmark
example to ensure consistent method assignment and proper draft model
configuration.
* **Documentation**
* Updated configuration comments to reflect corrected behavior and
improved clarity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

1 parent 46eddab
1 file changed
Lines changed: 43 additions & 19 deletions