[OMNIML-5024] specdec_bench cell t0_d3: google/gemma-4-E4B-it / MTP / vllm (#1663)
📝 Walkthrough
Adds tokenizer handling for checkpoint extra special tokens, refines vLLM MTP speculative-decoding to distinguish assistant-model vs generic modes, and adds a Gemma-4 E4B-it SPEED benchmark job with two vLLM MTP tasks.
Changes: Gemma-4 MTP Benchmark Setup
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 5 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #1663 +/- ##
==========================================
+ Coverage 77.30% 77.32% +0.01%
==========================================
Files 509 509
Lines 55914 55914
==========================================
+ Hits 43227 43238 +11
+ Misses 12687 12676 -11
ce245cd to 10d5d24
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/specdec_bench/specdec_bench/utils.py (1)
41-42: ⚡ Quick win: Add error handling for JSON parsing.
The json.load() call on line 42 can raise JSONDecodeError if tokenizer_config.json is malformed. While a corrupt config file is unlikely in practice, adding a try-except block would make the function more robust and prevent cryptic errors downstream.
🛡️ Proposed fix to add defensive error handling
     tokenizer_config_path = os.path.join(path, "tokenizer_config.json")
     if os.path.exists(tokenizer_config_path):
    -    with open(tokenizer_config_path) as f:
    -        tokenizer_config = json.load(f)
    -    extra_special_tokens = tokenizer_config.get("extra_special_tokens")
    +    try:
    +        with open(tokenizer_config_path) as f:
    +            tokenizer_config = json.load(f)
    +        extra_special_tokens = tokenizer_config.get("extra_special_tokens")
    +    except (OSError, json.JSONDecodeError):
    +        # Fall back to default behavior if config is unreadable
    +        pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/specdec_bench/specdec_bench/utils.py` around lines 41 - 42, The json.load(tokenizer_config_path) call can raise json.JSONDecodeError; wrap the open(...) / json.load(...) in a try/except that catches json.JSONDecodeError and raises a clearer error (e.g., ValueError or a custom exception) that includes tokenizer_config_path and the original exception message, or log and rethrow to provide actionable context; update the code around tokenizer_config_path and tokenizer_config to use this defensive handling so malformed tokenizer_config.json produces a clear, descriptive error.
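The two options discussed in this nitpick (fall back silently, or raise a clearer error) can be compared with a small stdlib-only sketch. The helper name `load_extra_special_tokens` and the choice to raise `ValueError` are assumptions made for illustration; this is not the code in `utils.py`.

```python
import json
import os


def load_extra_special_tokens(path):
    """Return the extra_special_tokens entry from tokenizer_config.json, or None.

    Hypothetical helper: the name and the ValueError policy are illustrative
    assumptions, not the repository's implementation.
    """
    tokenizer_config_path = os.path.join(path, "tokenizer_config.json")
    if not os.path.exists(tokenizer_config_path):
        # No config on disk: the caller keeps the default tokenizer behavior.
        return None
    try:
        with open(tokenizer_config_path) as f:
            tokenizer_config = json.load(f)
    except json.JSONDecodeError as exc:
        # Re-raise with the file path so a malformed config is easy to locate.
        raise ValueError(
            f"Malformed tokenizer config at {tokenizer_config_path}: {exc}"
        ) from exc
    return tokenizer_config.get("extra_special_tokens")
```

Raising keeps a corrupt checkpoint visible instead of hiding it behind a later tokenizer error; the silent fallback in the proposed diff trades that visibility for robustness.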
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 46-50: The current conversion of a list into
kwargs["extra_special_tokens"] assumes tokens are wrapped like "<|...|>" and
builds names with token.strip("<|>").replace("|", "_") + "_token"; instead
validate the contract expected by AutoTokenizer.from_pretrained by ensuring
kwargs["extra_special_tokens"] is a dict of {token_name: token_string}, and
harden the key derivation in the block that handles extra_special_tokens: check
each token's format and either canonicalize wrapped tokens as before or fall
back to a safe sanitized name (e.g., remove non-alphanumerics, limit length)
plus a unique numeric suffix to prevent collisions, and raise or log a clear
error if a token cannot be safely named; update any code paths that call
AutoTokenizer.from_pretrained to pass this dict.
📒 Files selected for processing (4)
- examples/specdec_bench/specdec_bench/models/vllm.py
- examples/specdec_bench/specdec_bench/utils.py
- tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml
- tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml
    if isinstance(extra_special_tokens, list):
        kwargs["extra_special_tokens"] = {
            token.strip("<|>").replace("|", "_") + "_token": token
            for token in extra_special_tokens
        }
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the HuggingFace AutoTokenizer API for extra_special_tokens handling
# Check if transformers tokenizer accepts extra_special_tokens
python3 <<'EOF'
from transformers import AutoTokenizer
import inspect
# Inspect AutoTokenizer.from_pretrained signature
sig = inspect.signature(AutoTokenizer.from_pretrained)
params = list(sig.parameters.keys())
# Look for extra_special_tokens or similar parameters
special_token_params = [p for p in params if 'special' in p.lower() or 'token' in p.lower()]
print("Parameters related to special tokens:")
for p in special_token_params:
    print(f" - {p}")
# Check PreTrainedTokenizer.__init__ as well
from transformers import PreTrainedTokenizer
init_sig = inspect.signature(PreTrainedTokenizer.__init__)
init_params = list(init_sig.parameters.keys())
init_special = [p for p in init_params if 'special' in p.lower() or 'token' in p.lower()]
print("\nPreTrainedTokenizer.__init__ special token parameters:")
for p in init_special:
    print(f" - {p}")
EOF
Repository: NVIDIA/Model-Optimizer
Length of output: 191
🌐 Web query:
HuggingFace transformers AutoTokenizer.from_pretrained extra_special_tokens parameter format
💡 Result:
In the Hugging Face transformers library, the extra_special_tokens parameter used within AutoTokenizer.from_pretrained is passed as a dictionary [1][2]. The format for this dictionary is: extra_special_tokens={"token_name": "token_string"} [1][2]
For example: tokenizer = AutoTokenizer.from_pretrained("model_name", extra_special_tokens={"image_token": ""}) [1][3]
Key Details:
1. Usage: This parameter allows you to register additional special tokens [1][3]. While commonly used in multimodal models as placeholders for images, video, or audio, these tokens do not have to be modality-specific and can represent any token the model needs to access [2][4].
2. Functionality: If these tokens are not already present in the vocabulary, they are added to the end of the vocabulary [2][4].
3. Decoding: Adding tokens through this parameter ensures they are treated as special tokens, meaning they will be skipped when decoding if skip_special_tokens=True is set [2][4].
4. Alternative Method: Beyond passing it during instantiation, you can also manage special tokens using the add_special_tokens method on a tokenizer instance [4].
Note that attempting to pass some flags like add_special_tokens=True directly into from_pretrained can cause conflicts with existing methods in certain tokenizer classes; it is generally safer to manage such flags after initialization if they are not explicitly supported by the constructor [5].
Citations:
- 1: https://huggingface.co/docs/transformers/main/en/fast_tokenizers
- 2: https://huggingface.co/docs/transformers/main/en/main%5Fclasses/tokenizer
- 3: https://github.com/huggingface/transformers/blob/main/docs/source/en/fast_tokenizers.md
- 4: https://huggingface.co/docs/transformers/en/main_classes/tokenizer
- 5: AutoTokenizer.from_pretrained with add_special_token=True cannot be deserialized back huggingface/transformers#34557
Confirm extra_special_tokens contract; harden key-name derivation.
HuggingFace AutoTokenizer.from_pretrained expects extra_special_tokens as a dict of {token_name: token_string} (token names typically end with *_token), so converting a list into kwargs["extra_special_tokens"] in that shape matches the API contract. The main remaining risk is the heuristic deriving token_name (token.strip("<|>").replace("|", "_") + "_token"), which assumes <|...|>-wrapped tokens and can produce unexpected/invalid names or collisions when tokens don’t follow that format.
/claude review
    if isinstance(extra_special_tokens, list):
        kwargs["extra_special_tokens"] = {
            token.strip("<|>").replace("|", "_") + "_token": token
            for token in extra_special_tokens
        }
[SUGGESTION] The synthesized extra_special_tokens keys can collide with HuggingFace's built-in special-token attributes and with each other.
SpecialTokensMixin reserves attribute names like bos_token, eos_token, pad_token, cls_token, sep_token, unk_token, mask_token. If the source list happens to include something whose stripped-and-suffixed name matches one of those (e.g. "<|bos|>" → "bos_token"), this kwargs path overwrites the real built-in token mapping at construction time, silently corrupting tokenization. The naming heuristic also collapses:
- "<|foo|>" and "foo" both become "foo_token" (last write wins, one token gets dropped).
- empty/edge inputs like "<|>" collapse to a key of "_token".
For Gemma 4 today this happens to work, but it's a foot-gun for any future tokenizer with overlapping names. Safer alternative: drop the heuristic and use index-based names that can't collide with HF reserved names:
    kwargs["extra_special_tokens"] = {
        f"extra_special_token_{i}": token
        for i, token in enumerate(extra_special_tokens)
    }
The keys are only used as attribute lookups by user code; the actual tokenizer behavior depends on the token values, not the keys.
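The collisions described in this comment can be reproduced without loading a tokenizer. The sketch below compares the heuristic naming with the index-based naming on a made-up token list; the function names, the `RESERVED` set, and the sample tokens are illustrative assumptions, not values taken from a Gemma 4 checkpoint.

```python
# Attribute names the review comment lists as reserved by SpecialTokensMixin.
RESERVED = {
    "bos_token", "eos_token", "pad_token", "cls_token",
    "sep_token", "unk_token", "mask_token",
}


def heuristic_names(tokens):
    """Key derivation used in the PR: strip the <|...|> wrapper, add a suffix."""
    return {t.strip("<|>").replace("|", "_") + "_token": t for t in tokens}


def indexed_names(tokens):
    """Index-based keys suggested in the review: one unique key per token."""
    return {f"extra_special_token_{i}": t for i, t in enumerate(tokens)}


# Hypothetical input chosen to trigger every failure mode named above.
tokens = ["<|foo|>", "foo", "<|bos|>", "<|>"]

heuristic = heuristic_names(tokens)
indexed = indexed_names(tokens)

print(len(heuristic), len(indexed))      # 3 4: one token was dropped by the heuristic
print(sorted(set(heuristic) & RESERVED))  # ['bos_token']: clashes with a reserved name
print("_token" in heuristic)              # True: the empty-name edge case
```

With index-based keys the mapping keeps all four tokens, and none of the keys can match a reserved attribute name.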
    def get_tokenizer(path, trust_remote_code=False):
        return AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code)
        extra_special_tokens = None
        tokenizer_config_path = os.path.join(path, "tokenizer_config.json")
[SUGGESTION] When path is a HuggingFace Hub repo ID (e.g. "google/gemma-4-E4B-it") rather than a local directory, os.path.exists(tokenizer_config_path) returns False and the new branch is skipped — so AutoTokenizer.from_pretrained will still hit the original "list-shaped extra_special_tokens" error.
This is fine for the launcher path (which mounts checkpoints under /hf-local/... and always passes a directory), but breaks direct CLI usage that resolves through the HF cache. If you want the fix to also cover that case, fall back to huggingface_hub.try_to_load_from_cache(path, "tokenizer_config.json") (or cached_file) when os.path.exists fails. Not blocking — flagging since the PR description frames this as a Gemma-4 fix in general, not launcher-specific.
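A minimal sketch of the suggested fallback follows, assuming `huggingface_hub.try_to_load_from_cache(repo_id, filename)` returns a string path on a cache hit and a non-string value (`None` or a "known missing" sentinel) otherwise. The helper name `find_tokenizer_config` and the injectable `cache_lookup` parameter are hypothetical and exist only to keep the example testable without network access.

```python
import os


def find_tokenizer_config(path, cache_lookup=None):
    """Locate tokenizer_config.json for a local directory or a Hub repo ID.

    Hypothetical helper for illustration. Returns a file path or None.
    """
    local = os.path.join(path, "tokenizer_config.json")
    if os.path.exists(local):
        # Launcher case: the checkpoint is a mounted local directory.
        return local
    if cache_lookup is None:
        # Imported lazily so local-directory callers do not need huggingface_hub.
        from huggingface_hub import try_to_load_from_cache
        cache_lookup = try_to_load_from_cache
    # Assumption: `path` is a valid repo ID here; a string that is neither a
    # directory nor a valid repo ID may raise inside the lookup and is not handled.
    cached = cache_lookup(path, "tokenizer_config.json")
    # Only a string result is a usable file path.
    return cached if isinstance(cached, str) else None
```

A caller could pass the returned path to the existing JSON-reading branch and skip the `extra_special_tokens` handling when the result is `None`.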
    # pipeline.task_0.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/qualitative","--draft_length 3"] \
    # pipeline.task_1.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/throughput_32k","--num_requests 80","--draft_length 3"]
[SUGGESTION] The example overrides include "--draft_length 3", but --draft_length 3 is already in task_0.args (line 39) and task_1.args (line 66). With args+=[...], the override gets appended — argparse then sees --draft_length 3 --draft_length 3 and just uses the last occurrence. Functionally harmless, but it means the comment misleads reviewers/users about which knobs are cell-overridable: a user who reads this comment and tries args+=["--draft_length 7"] will be surprised when both 3 and 7 end up on the command line. Either drop --draft_length 3 from the override hint, or move it out of the base args block (matching the --temperature / --max_seq_len / --save_dir pattern, which are NOT in the base args and ARE legitimately cell-overridable).
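The "last occurrence wins" behavior can be checked with argparse directly. This sketch assumes the launcher concatenates the base args and the `args+=[...]` overrides and splits them on whitespace before parsing; that joining step and the default value are assumptions, and only the `--draft_length` flag name comes from the thread.

```python
import argparse
import shlex

parser = argparse.ArgumentParser()
# A plain "store" option: argparse keeps the value from the last occurrence.
parser.add_argument("--draft_length", type=int, default=1)

base_args = ["--draft_length 3"]      # already present in the parent YAML
override_args = ["--draft_length 7"]  # appended by a cell-level args+=[...]

# Assumed launcher behavior: join every entry, then split into argv tokens.
argv = shlex.split(" ".join(base_args + override_args))
parsed = parser.parse_args(argv)

print(argv)                 # ['--draft_length', '3', '--draft_length', '7']
print(parsed.draft_length)  # 7
```

Both values reach the command line, so the override still takes effect, but the duplicated flag is what makes the comment in the YAML misleading.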
Claude review passed — no blocking issues found. LGTM
Targeted fix that cleanly routes the two MTP speculative_config shapes (assistant-model vs. generic) based on whether --draft_model_dir is set, plus the tokenizer plumbing Gemma 4 needs and a parent YAML for the cluster cells. The branch is backward-compatible: existing MTP callers (Qwen 3.5) that don't pass --draft_model_dir get the unchanged {"method": "mtp", ...} config.
Findings: 0 CRITICAL / 0 IMPORTANT / 3 SUGGESTION (all non-blocking, posted inline):
- utils.py: extra_special_tokens key heuristic can collide with HF SpecialTokensMixin reserved names (bos_token, eos_token, etc.) and with itself; f"extra_special_token_{i}" is collision-free.
- utils.py: os.path.exists short-circuit means the fix only triggers for local checkpoint dirs; Hub-ID callers still hit the original error. Fine for the launcher's mounted-checkpoint path, just worth flagging.
- Gemma-4 YAML: the override-example comment lists --draft_length 3, but that flag is already in the base args, so cell overrides duplicate it. Same pattern as --temperature / --max_seq_len / --save_dir would be cleaner.
Risk: low. Wrapper change is a pure additive branch; YAML is a new example; the only file that runs in existing flows (utils.py) is gated on a list-shaped extra_special_tokens field that current callers don't have.
/ok to test 21e935d
Signed-off-by: Pensieve Intern <chenhany@nvidia.com>
vLLM PR vllm-project/vllm#41745 (2026-05-06) shipped Gemma 4 MTP support; the implementation expects ``speculative_config`` shaped as ``{"model": <assistant>, "num_speculative_tokens": N}`` (no ``method`` key; vLLM auto-detects Gemma 4 from the assistant). The specdec_bench wrapper at ``models/vllm.py`` unconditionally emitted ``{"method": "mtp", "num_speculative_tokens": N}`` for any ``--speculative_algorithm MTP`` invocation, which produced ``NotImplementedError: Unsupported speculative method: 'mtp'`` on Gemma 4 even with a container that has the support (``vllm/vllm-openai:v0.22.1``+).

Changes:

1. ``examples/specdec_bench/specdec_bench/models/vllm.py``: when MTP is paired with ``--draft_model_dir``, emit the assistant-model config shape; otherwise keep the generic ``method: "mtp"`` path (Qwen 3.5 etc.). Preserves backward compatibility: callers that didn't pass ``--draft_model_dir`` get the same config they got before.
2. ``tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml``: bump container to ``vllm/vllm-openai:v0.22.1`` (the qwen3_5-cu130 tag predates the gemma4_mtp PR and doesn't recognize ``model_type=gemma4``); add ``--draft_model_dir /hf-local/google/gemma-4-E4B-it-assistant`` to both task_0 and task_1 args; expose the assistant path via ``global_vars.draft_model`` for reuse; rewrite the header comment with the corrected diagnosis.

Validated upstream: ``google/gemma-4-{E2B,E4B,26B-A4B,31B}-it-assistant`` exist on HuggingFace (public, ungated); ``gemma4_mtp.py`` is in ``v0.22.0``, ``v0.22.1``, and ``main``. PR #41745's test plan documents the same config shape this change emits.

Surfaced by OMNIML-5024/5025/5026/5027 (the four cells of OMNIML-5022). Each cell agent independently misdiagnosed this as "no container supports both Gemma 4 and MTP," when the gap was actually the wrapper not being wired for the assistant-model config shape that vLLM expects.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Post-#1564, each ``(model, algorithm, engine)`` Epic owns exactly one committed YAML: the parent at ``tools/launcher/examples/<family>/<model>/specdec_bench_<algo>_<engine>.yaml``. Per-cell knobs (temperature, max_seq_len, save_dir, draft_length / block_size) come from CLI overrides at slurm-invoke time via ``pipeline.task_N.args+=[...]``. No per-cell file is committed.

This branch had originally created ``tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml`` + a ``--runtime_params common/specdec_bench/_cells/...`` reference in the parent, both pre-#1564 shapes that duplicated the parent's knobs. The cell-stage workflow SPEC template had stale Step 3 guidance still mentioning the ``_cells/<sweep_name>.yaml`` shape, which is what the agent followed; that contradiction is cleaned up in pensieve-intern !91.

Drop:
- ``tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml``
- the ``--runtime_params common/specdec_bench/_cells/...`` line on both task_0 and task_1 in the parent YAML

Update the header comment to document the canonical CLI-override invocation pattern (the same pattern used by the NVIDIA-Nemotron-3-Super-120B-A12B-BF16 parent on main). The cell-side overrides for OMNIML-5024 (t0_d3) become:

    pipeline.task_0.args+=["--temperature 0", "--max_seq_len 65536", "--save_dir /scratchspace/<sweep>/qualitative", "--draft_length 3"]
    pipeline.task_1.args+=["--temperature 0", "--max_seq_len 65536", "--save_dir /scratchspace/<sweep>/throughput_32k", "--num_requests 80", "--draft_length 3"]

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Pre-commit ruff-format collapsed the dict comprehension onto a single line. Pure formatting — no behaviour change. Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
21e935d to 8a53ae6
/ok to test 8a53ae6
…a 4 MTP (#1677)

### What does this PR do?

Type of change: Bug fix

Fixes the specdec_bench vLLM wrapper's MTP `speculative_config` emission so Gemma 4 MTP no longer hits the wrong code path inside vLLM.

### Bug

vLLM's `SpeculativeConfig.__post_init__` (`vllm/config/speculative.py:529-602`) auto-detects `method` ONLY when it's unset. When `model` is provided and `method` is `None`, the default branch sets `method = "draft_model"`, which is the generic same-architecture draft path, NOT MTP. That path enforces equal num_heads between target and draft and raises:

```
AssertionError: All layers in one attention group must share num_heads; got {8, 4}
```

on heterogeneous-head models. Gemma 4 has 8 target heads and 4 draft heads by design.

### Where the previous fix went wrong

PR #1663 changed the MTP branch in the wrapper to emit `{model: <assistant>, num_speculative_tokens: N}` WITHOUT `method` when `draft_model_dir` was provided, based on a misread of vLLM PR #41745's test plan that only showed the `{model, num_speculative_tokens}` shape. That test plan was the direct `LLM(...)` constructor invocation; vLLM had already defaulted method internally. Going through specdec_bench's `AsyncEngineArgs(speculative_config=...)` path, the explicit `method` key is required to avoid the auto-detect → draft_model fallback.

### Reference

vLLM's own test at [`tests/v1/e2e/spec_decode/test_spec_decode.py:818-823`](https://github.com/vllm-project/vllm/blob/main/tests/v1/e2e/spec_decode/test_spec_decode.py#L818) does exactly this for the gemma4-e4b parametrization:

```python
speculative_config = {
    "method": method,  # "mtp"
    "num_speculative_tokens": ...,
}
if draft_model is not None:  # Gemma 4 case
    speculative_config["model"] = draft_model
```

### Fix

Restore `method="mtp"` as the unconditional MTP path. ADD `model` only when `draft_model_dir` is set. Backward-compatible for Qwen 3.5 MTP / DeepSeek MTP / other inline-MTP families (they keep the bare `{method: "mtp"}` config).

### Validation

Field-tested via vLLM PR #41745's correctness test on `gemma-4-E4B-it` + `gemma-4-E4B-it-assistant`: produced 304.7 output TPS at γ=4 vs 171.0 baseline (about 1.78x, i.e. roughly a 78% speedup) on H100. The same `speculative_config` shape this fix emits.

### Surfaced on

[OMNIML-5024](https://jirasw.nvidia.com/browse/OMNIML-5024) pipeline #54356795:

- Wrapper emitted `{model: assistant, num_speculative_tokens: 3}`
- vLLM auto-detected `method = "draft_model"`
- Loaded gemma-4-E4B-it-assistant (4 heads) as a generic draft for gemma-4-E4B-it (8 heads)
- Attention-group num_heads check tripped → AssertionError, task_0 FAILED, task_1 CANCELLED

### Before your PR is "*Ready for review*"

- Backward compatible: ✅ (Qwen 3.5 / DeepSeek MTP unchanged; only the MTP+`draft_model_dir` case changes).
- New tests: ❌. The test exercising this codepath would need a GPU + gemma-4 model checkout, which is cluster work, not unit-test scope. JIRA-tracked validation via OMNIML-5024 dispatch after this lands.
- Changelog: ❌

### Additional Information

- vLLM PR #41745 (Gemma4 MTP support)
- Companion: NVIDIA/Model-Optimizer PR #1675 (launcher `GlobalVariables.draft_model` schema fix)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Bug Fixes**
  * Fixed speculative decoding configuration handling in the benchmark example to ensure consistent method assignment and proper draft model configuration.
* **Documentation**
  * Updated configuration comments to reflect corrected behavior and improved clarity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
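The corrected logic from the Fix section can be written as a small standalone function. This is a sketch under stated assumptions: the name `build_mtp_speculative_config` and its signature are hypothetical, and it only builds the dictionary; it does not call vLLM.

```python
def build_mtp_speculative_config(num_speculative_tokens, draft_model_dir=None):
    """Build the speculative_config dict for an MTP run.

    Hypothetical helper mirroring the Fix section: "method" is always set,
    and "model" is added only when a separate draft/assistant is given.
    """
    config = {
        "method": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    if draft_model_dir is not None:
        # Gemma 4 case: a separate assistant checkpoint acts as the MTP draft.
        config["model"] = draft_model_dir
    return config


# Inline-MTP families (no --draft_model_dir) keep the bare config.
print(build_mtp_speculative_config(3))
# Gemma 4 adds the assistant path on top of the explicit method key.
print(build_mtp_speculative_config(3, "/hf-local/google/gemma-4-E4B-it-assistant"))
```

Keeping `method` explicit in both branches is the point of the fix: it prevents the auto-detection described in the Bug section from selecting the generic draft-model path.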
What does this PR do?
Type of change: Bug fix + new example
Wires SPEED-bench's MTP path to support Gemma 4 (and any future MTP variant that uses a separate assistant / draft model), and adds the SPEED-bench MTP/vLLM example for google/gemma-4-E4B-it.

Key difference: Gemma 4 MTP vs. generic MTP. vLLM's speculative_config accepts two different shapes for MTP:

- Generic shape: {"method": "mtp", "num_speculative_tokens": N}
- Assistant-model shape: {"model": "<assistant>", "num_speculative_tokens": N} (no method key; vLLM auto-detects from the assistant). It relies on a <target>-assistant checkpoint that acts as the MTP draft. Landed in vllm-project/vllm#41745 (2026-05-06).

The specdec_bench vLLM wrapper at examples/specdec_bench/specdec_bench/models/vllm.py previously emitted only the generic shape for any --speculative_algorithm MTP invocation, which produced NotImplementedError: Unsupported speculative method: 'mtp' on Gemma 4 even with a container that has the support (vllm/vllm-openai:v0.22.1+). This PR teaches the wrapper to switch shapes based on whether --draft_model_dir is provided.

Concrete changes:

- examples/specdec_bench/specdec_bench/models/vllm.py: when speculative_algorithm == "MTP" AND draft_model_dir is set, emit {"model": draft_model_dir, "num_speculative_tokens": N} (assistant-model shape). Otherwise emit the existing {"method": "mtp", ...} (generic shape). Backward-compatible: Qwen 3.5 MTP and other callers that omit --draft_model_dir get the same config they got before.
- examples/specdec_bench/specdec_bench/utils.py: get_tokenizer reads extra_special_tokens from the model's tokenizer_config.json and passes them through to AutoTokenizer.from_pretrained. Gemma 4 tokenizers ship a list-shaped extra_special_tokens entry that the constructor would otherwise reject. Necessary for any Gemma 4 cell.
- tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml: SPEED-bench parent YAML for google/gemma-4-E4B-it. Uses vllm/vllm-openai:v0.22.1 (has gemma4_mtp.py from #41745) and wires --draft_model_dir /hf-local/google/gemma-4-E4B-it-assistant on both task_0 (qualitative) and task_1 (throughput_32k).
- tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml: runtime params for the t0_d3 cell of OMNIML-5022 (temperature=0, max_model_len=40960).

Usage
Testing
- google/gemma-4-{E2B,E4B,26B-A4B}-it-assistant exist, public, ungated on HuggingFace; verified vllm/model_executor/models/gemma4_mtp.py is in vLLM v0.22.0, v0.22.1, and main.
- MTP callers that don't pass --draft_model_dir (e.g. the existing Qwen 3.5 MTP/vLLM cells under tools/launcher/examples/Qwen/Qwen3.5-4B/) take the unchanged {"method": "mtp", ...} branch. No diff for those.
- task_0 (SPEED-Bench qualitative, 880 samples) + task_1 (throughput_32k, 80 samples) on cw_dfw, single H100.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

- --draft_model_dir is provided alongside --speculative_algorithm MTP. Existing MTP callers (Qwen 3.5 etc.) keep the generic method: "mtp" config.
- CONTRIBUTING.md: N/A, no new dependencies.
- tests/ for me to extend symmetrically. Happy to add one if reviewers want it.
- /claude review once the PR is marked Ready for review.

Additional Information

- MTP with the right --draft_model_dir from SPEC-read time.

Summary by CodeRabbit
New Features
Improvements