[OMNIML-5024] specdec_bench cell t0_d3 — google/gemma-4-E4B-it / MTP / vllm by ChenhanYu · Pull Request #1663 · NVIDIA/Model-Optimizer

ChenhanYu · 2026-06-10T00:15:07Z

What does this PR do?

Type of change: Bug fix + new example

Wires SPEED-bench's MTP path to support Gemma 4 (and any future MTP variant that uses a separate assistant / draft model), and adds the SPEED-bench MTP/vLLM example for google/gemma-4-E4B-it.

Key difference: Gemma 4 MTP vs. generic MTP. vLLM's speculative_config accepts two different shapes for MTP:

Variant	`speculative_config` shape	Models
Generic MTP	`{"method": "mtp", "num_speculative_tokens": N}`	Models that carry their own MTP layer in-tree (e.g. Qwen 3.5 MTP variants) — no separate draft / assistant model.
Assistant-model MTP	`{"model": "<assistant>", "num_speculative_tokens": N}` (no `method` key — vLLM auto-detects from the assistant)	Gemma 4 family (E2B / E4B / 26B-A4B / 31B); each target model has a paired `<target>-assistant` checkpoint that acts as the MTP draft. Landed in vllm-project/vllm#41745 (2026-05-06).

The specdec_bench vLLM wrapper at examples/specdec_bench/specdec_bench/models/vllm.py previously emitted only the generic shape for any --speculative_algorithm MTP invocation, which produced NotImplementedError: Unsupported speculative method: 'mtp' on Gemma 4 even with a container that has the support (vllm/vllm-openai:v0.22.1+). This PR teaches the wrapper to switch shapes based on whether --draft_model_dir is provided.

Concrete changes:

examples/specdec_bench/specdec_bench/models/vllm.py — when speculative_algorithm == "MTP" AND draft_model_dir is set, emit {"model": draft_model_dir, "num_speculative_tokens": N} (assistant-model shape). Otherwise emit the existing {"method": "mtp", ...} (generic shape). Backward-compatible — Qwen 3.5 MTP and other callers that omit --draft_model_dir get the same config they got before.
examples/specdec_bench/specdec_bench/utils.py — get_tokenizer reads extra_special_tokens from the model's tokenizer_config.json and passes them through to AutoTokenizer.from_pretrained. Gemma 4 tokenizers ship a list-shaped extra_special_tokens entry that the constructor would otherwise reject. Necessary for any Gemma 4 cell.
tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml — SPEED-bench parent YAML for google/gemma-4-E4B-it. Uses vllm/vllm-openai:v0.22.1 (has gemma4_mtp.py from #41745) and wires --draft_model_dir /hf-local/google/gemma-4-E4B-it-assistant on both task_0 (qualitative) and task_1 (throughput_32k).
tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml — runtime params for the t0_d3 cell of OMNIML-5022 (temperature=0, max_model_len=40960).

Usage

# Wrapper-level: same CLI as before, just pass --draft_model_dir for
# Gemma 4 MTP. The wrapper auto-routes to the assistant-model shape.

# python examples/specdec_bench/run.py \
#     --engine VLLM \
#     --speculative_algorithm MTP \
#     --draft_model_dir /hf-local/google/gemma-4-E4B-it-assistant \
#     --draft_length 3 \
#     --tp_size 1 \
#     ...other SPEED-bench knobs...

# Equivalent direct vLLM invocation (for reference, no wrapper):
from vllm import LLM, SamplingParams
llm = LLM(
    model="google/gemma-4-E4B-it",
    speculative_config={
        "model": "google/gemma-4-E4B-it-assistant",
        "num_speculative_tokens": 3,
    },
    trust_remote_code=True,
)

Testing

Upstream existence checks: verified the assistant models google/gemma-4-{E2B,E4B,26B-A4B}-it-assistant exist, public, ungated on HuggingFace; verified vllm/model_executor/models/gemma4_mtp.py is in vLLM v0.22.0, v0.22.1, and main.
Backward compat: MTP callers that don't pass --draft_model_dir (e.g. the existing Qwen 3.5 MTP/vLLM cells under tools/launcher/examples/Qwen/Qwen3.5-4B/) take the unchanged {"method": "mtp", ...} branch. No diff for those.
End-to-end cluster validation: pending. Will run via the OMNIML-5022 cells (OMNIML-5024 / 5025 / 5026 / 5027) once the nmm-sandbox submodule pin advances past this PR. Each cell exercises task_0 (SPEED-Bench qualitative, 880 samples) + task_1 (throughput_32k, 80 samples) on cw_dfw, single H100.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ — the wrapper only takes the new branch when --draft_model_dir is provided alongside --speculative_algorithm MTP. Existing MTP callers (Qwen 3.5 etc.) keep the generic method: "mtp" config.
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A — no new dependencies.
Did you write any new necessary tests?: ❌ — relying on the SPEED-bench cluster cells (OMNIML-5024 …5027) for end-to-end validation; no unit test fixture for the vLLM wrapper exists in tests/ for me to extend symmetrically. Happy to add one if reviewers want it.
Did you update Changelog?: ❌ — small fix + example addition. Can add if requested.
Did you get Claude approval on this PR?: ❌ — will run /claude review once the PR is marked Ready for review.

Additional Information

JIRA: OMNIML-5024 (cell_t0_d3); siblings OMNIML-5025/5026/5027 (cell_{t0_d7, t1_d3, t1_d7}) of Epic OMNIML-5022 are blocked on this PR landing.
Upstream reference: [Spec Decode] Add Gemma4 MTP speculative decoding support vllm-project/vllm#41745 — "[Spec Decode] Add Gemma4 MTP speculative decoding support".
Companion (pensieve-intern !91, internal): adds a "Model-family-specific MTP invocation" table to the specdec_bench cell SPEC so future agents pair MTP with the right --draft_model_dir from SPEC-read time.

Summary by CodeRabbit

New Features
- Added a SPEED-bench pipeline for Gemma 4 using vLLM speculative decoding (MTP) with qualitative and throughput tasks.
Improvements
- Speculative-decoding logic updated to handle assistant-model and generic MTP cases distinctly.
- Tokenizer loading now reads and normalizes extra special tokens from tokenizer config when available.

copy-pr-bot · 2026-06-10T00:15:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-10T00:15:15Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fd2d418d-a089-40ee-973e-6aa1ae2fb7a8

📥 Commits

Reviewing files that changed from the base of the PR and between 21e935d and 8a53ae6.

📒 Files selected for processing (3)

examples/specdec_bench/specdec_bench/models/vllm.py
examples/specdec_bench/specdec_bench/utils.py
tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml

🚧 Files skipped from review as they are similar to previous changes (3)

examples/specdec_bench/specdec_bench/utils.py
tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml
examples/specdec_bench/specdec_bench/models/vllm.py

📝 Walkthrough

Walkthrough

Adds tokenizer handling for checkpoint extra special tokens, refines vLLM MTP speculative-decoding to distinguish assistant-model vs generic modes, and adds a Gemma-4 E4B-it SPEED benchmark job with two vLLM MTP tasks.

Changes

Gemma-4 MTP Benchmark Setup

Layer / File(s)	Summary
Tokenizer extra_special_tokens support `examples/specdec_bench/specdec_bench/utils.py`	`get_tokenizer()` reads `tokenizer_config.json` from the checkpoint directory, extracts `extra_special_tokens`, converts a list into the HF token-to-token mapping format, and supplies it to `AutoTokenizer.from_pretrained()` with `trust_remote_code`.
vLLM MTP configuration strategy `examples/specdec_bench/specdec_bench/models/vllm.py`	MTP speculative decoding configuration now branches on `draft_model_dir`: when provided, uses assistant-model specdec without a `method` field; when absent, uses generic MTP with `method: "mtp"` and `num_speculative_tokens`.
Gemma-4 MTP benchmark job definition `tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml`	New SPEED benchmark job for Gemma-4 E4B-it with two pipeline tasks: qualitative dataset at concurrency 32, and throughput_32k at concurrency 8 with 80 requests; both use vLLM MTP settings, draft length 3, and single-GPU container configuration.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title references a specific model (Gemma 4 E4B-it), algorithm (MTP), and engine (vLLM), directly matching the PR's core objective to add support for this configuration.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	All Python changes comply with SECURITY.md. trust_remote_code is configurable, defaults to False, with no hardcoded True values. No unsafe deserialization, eval/exec, or security bypasses found.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch pensieve-intern/OMNIML-5022/t0_d3

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-06-10T00:43:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.32%. Comparing base (48767a0) to head (8a53ae6).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1663      +/-   ##
==========================================
+ Coverage   77.30%   77.32%   +0.01%     
==========================================
  Files         509      509              
  Lines       55914    55914              
==========================================
+ Hits        43227    43238      +11     
+ Misses      12687    12676      -11

Flag	Coverage Δ
examples	`42.31% <ø> (-0.13%)`	⬇️
regression	`14.67% <ø> (+0.05%)`	⬆️
unit	`54.46% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)

examples/specdec_bench/specdec_bench/utils.py (1)

41-42: ⚡ Quick win

Add error handling for JSON parsing.

The json.load() call on line 42 can raise JSONDecodeError if tokenizer_config.json is malformed. While a corrupt config file is unlikely in practice, adding a try-except block would make the function more robust and prevent cryptic errors downstream.

🛡️ Proposed fix to add defensive error handling

     tokenizer_config_path = os.path.join(path, "tokenizer_config.json")
     if os.path.exists(tokenizer_config_path):
-        with open(tokenizer_config_path) as f:
-            tokenizer_config = json.load(f)
-        extra_special_tokens = tokenizer_config.get("extra_special_tokens")
+        try:
+            with open(tokenizer_config_path) as f:
+                tokenizer_config = json.load(f)
+            extra_special_tokens = tokenizer_config.get("extra_special_tokens")
+        except (OSError, json.JSONDecodeError):
+            # Fall back to default behavior if config is unreadable
+            pass

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/specdec_bench/specdec_bench/utils.py` around lines 41 - 42, The
json.load(tokenizer_config_path) call can raise json.JSONDecodeError; wrap the
open(...) / json.load(...) in a try/except that catches json.JSONDecodeError and
raises a clearer error (e.g., ValueError or a custom exception) that includes
tokenizer_config_path and the original exception message, or log and rethrow to
provide actionable context; update the code around tokenizer_config_path and
tokenizer_config to use this defensive handling so malformed
tokenizer_config.json produces a clear, descriptive error.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 46-50: The current conversion of a list into
kwargs["extra_special_tokens"] assumes tokens are wrapped like "<|...|>" and
builds names with token.strip("<|>").replace("|", "_") + "_token"; instead
validate the contract expected by AutoTokenizer.from_pretrained by ensuring
kwargs["extra_special_tokens"] is a dict of {token_name: token_string}, and
harden the key derivation in the block that handles extra_special_tokens: check
each token's format and either canonicalize wrapped tokens as before or fall
back to a safe sanitized name (e.g., remove non-alphanumerics, limit length)
plus a unique numeric suffix to prevent collisions, and raise or log a clear
error if a token cannot be safely named; update any code paths that call
AutoTokenizer.from_pretrained to pass this dict.

---

Nitpick comments:
In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 41-42: The json.load(tokenizer_config_path) call can raise
json.JSONDecodeError; wrap the open(...) / json.load(...) in a try/except that
catches json.JSONDecodeError and raises a clearer error (e.g., ValueError or a
custom exception) that includes tokenizer_config_path and the original exception
message, or log and rethrow to provide actionable context; update the code
around tokenizer_config_path and tokenizer_config to use this defensive handling
so malformed tokenizer_config.json produces a clear, descriptive error.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c462dded-57d8-4edb-a743-3f8809ed0246

📥 Commits

Reviewing files that changed from the base of the PR and between bde162a and 302512e.

📒 Files selected for processing (4)

examples/specdec_bench/specdec_bench/models/vllm.py
examples/specdec_bench/specdec_bench/utils.py
tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml
tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml

coderabbitai · 2026-06-10T16:21:44Z

+    if isinstance(extra_special_tokens, list):
+        kwargs["extra_special_tokens"] = {
+            token.strip("<|>").replace("|", "_") + "_token": token
+            for token in extra_special_tokens
+        }


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify the HuggingFace AutoTokenizer API for extra_special_tokens handling # Check if transformers tokenizer accepts extra_special_tokens python3 <<'EOF' from transformers import AutoTokenizer import inspect # Inspect AutoTokenizer.from_pretrained signature sig = inspect.signature(AutoTokenizer.from_pretrained) params = list(sig.parameters.keys()) # Look for extra_special_tokens or similar parameters special_token_params = [p for p in params if 'special' in p.lower() or 'token' in p.lower()] print("Parameters related to special tokens:") for p in special_token_params: print(f" - {p}") # Check PreTrainedTokenizer.__init__ as well from transformers import PreTrainedTokenizer init_sig = inspect.signature(PreTrainedTokenizer.__init__) init_params = list(init_sig.parameters.keys()) init_special = [p for p in init_params if 'special' in p.lower() or 'token' in p.lower()] print("\nPreTrainedTokenizer.__init__ special token parameters:") for p in init_special: print(f" - {p}") EOF

Repository: NVIDIA/Model-Optimizer

Length of output: 191

🌐 Web query:

HuggingFace transformers AutoTokenizer.from_pretrained extra_special_tokens parameter format

💡 Result:

In the Hugging Face transformers library, the extra_special_tokens parameter used within AutoTokenizer.from_pretrained is passed as a dictionary [1][2]. The format for this dictionary is: extra_special_tokens={"token_name": "token_string"} [1][2] For example: tokenizer = AutoTokenizer.from_pretrained( "model_name", extra_special_tokens={"image_token": ""}) [1][3] Key Details: 1. Usage: This parameter allows you to register additional special tokens [1][3]. While commonly used in multimodal models as placeholders for images, video, or audio, these tokens do not have to be modality-specific and can represent any token the model needs to access [2][4]. 2. Functionality: If these tokens are not already present in the vocabulary, they are added to the end of the vocabulary [2][4]. 3. Decoding: Adding tokens through this parameter ensures they are treated as special tokens, meaning they will be skipped when decoding if skip_special_tokens=True is set [2][4]. 4. Alternative Method: Beyond passing it during instantiation, you can also manage special tokens using the add_special_tokens method on a tokenizer instance [4]. Note that attempting to pass some flags like add_special_tokens=True directly into from_pretrained can cause conflicts with existing methods in certain tokenizer classes; it is generally safer to manage such flags after initialization if they are not explicitly supported by the constructor [5].

Citations:

1: https://huggingface.co/docs/transformers/main/en/fast_tokenizers

2: https://huggingface.co/docs/transformers/main/en/main%5Fclasses/tokenizer

3: https://github.com/huggingface/transformers/blob/main/docs/source/en/fast_tokenizers.md

4: https://huggingface.co/docs/transformers/en/main_classes/tokenizer

5: AutoTokenizer.from_pretrained with add_special_token=True cannot be deserialized back huggingface/transformers#34557

Confirm extra_special_tokens contract; harden key-name derivation.
HuggingFace AutoTokenizer.from_pretrained expects extra_special_tokens as a dict of {token_name: token_string} (token names typically end with *_token), so converting a list into kwargs["extra_special_tokens"] in that shape matches the API contract. The main remaining risk is the heuristic deriving token_name (token.strip("<|>").replace("|", "_") + "_token"), which assumes <|...|>-wrapped tokens and can produce unexpected/invalid names or collisions when tokens don’t follow that format.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/specdec_bench/specdec_bench/utils.py` around lines 46 - 50, The current conversion of a list into kwargs["extra_special_tokens"] assumes tokens are wrapped like "<|...|>" and builds names with token.strip("<|>").replace("|", "_") + "_token"; instead validate the contract expected by AutoTokenizer.from_pretrained by ensuring kwargs["extra_special_tokens"] is a dict of {token_name: token_string}, and harden the key derivation in the block that handles extra_special_tokens: check each token's format and either canonicalize wrapped tokens as before or fall back to a safe sanitized name (e.g., remove non-alphanumerics, limit length) plus a unique numeric suffix to prevent collisions, and raise or log a clear error if a token cannot be safely named; update any code paths that call AutoTokenizer.from_pretrained to pass this dict.

ChenhanYu · 2026-06-10T16:26:32Z

/claude review

claude · 2026-06-10T16:41:47Z

+    if isinstance(extra_special_tokens, list):
+        kwargs["extra_special_tokens"] = {
+            token.strip("<|>").replace("|", "_") + "_token": token
+            for token in extra_special_tokens
+        }


[SUGGESTION] The synthesized extra_special_tokens keys can collide with HuggingFace's built-in special-token attributes and with each other.

SpecialTokensMixin reserves attribute names like bos_token, eos_token, pad_token, cls_token, sep_token, unk_token, mask_token. If the source list happens to include something whose stripped-and-suffixed name matches one of those (e.g. "<|bos|>" → "bos_token"), this kwargs path overwrites the real built-in token mapping at construction time, silently corrupting tokenization. The naming heuristic also collapses:

"<|foo|>" and "foo" both become "foo_token" (last write wins, one token gets dropped).

empty/edge inputs like "<|>" collapse to a key of "_token".

For Gemma 4 today this happens to work, but it's a foot-gun for any future tokenizer with overlapping names. Safer alternative: drop the heuristic and use index-based names that can't collide with HF reserved names:

kwargs["extra_special_tokens"] = { f"extra_special_token_{i}": token for i, token in enumerate(extra_special_tokens) }

The keys are only used as attribute lookups by user code; the actual tokenizer behavior depends on the token values, not the keys.

claude · 2026-06-10T16:41:55Z

 def get_tokenizer(path, trust_remote_code=False):
-    return AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code)
+    extra_special_tokens = None
+    tokenizer_config_path = os.path.join(path, "tokenizer_config.json")


[SUGGESTION] When path is a HuggingFace Hub repo ID (e.g. "google/gemma-4-E4B-it") rather than a local directory, os.path.exists(tokenizer_config_path) returns False and the new branch is skipped — so AutoTokenizer.from_pretrained will still hit the original "list-shaped extra_special_tokens" error.

This is fine for the launcher path (which mounts checkpoints under /hf-local/... and always passes a directory), but breaks direct CLI usage that resolves through the HF cache. If you want the fix to also cover that case, fall back to huggingface_hub.try_to_load_from_cache(path, "tokenizer_config.json") (or cached_file) when os.path.exists fails. Not blocking — flagging since the PR description frames this as a Gemma-4 fix in general, not launcher-specific.

claude · 2026-06-10T16:41:58Z

+#     pipeline.task_0.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/qualitative","--draft_length 3"] \
+#     pipeline.task_1.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/throughput_32k","--num_requests 80","--draft_length 3"]


[SUGGESTION] The example overrides include "--draft_length 3", but --draft_length 3 is already in task_0.args (line 39) and task_1.args (line 66). With args+=[...], the override gets appended — argparse then sees --draft_length 3 --draft_length 3 and just uses the last occurrence. Functionally harmless, but it means the comment misleads reviewers/users about which knobs are cell-overridable: a user who reads this comment and tries args+=["--draft_length 7"] will be surprised when both 3 and 7 end up on the command line. Either drop --draft_length 3 from the override hint, or move it out of the base args block (matching the --temperature / --max_seq_len / --save_dir pattern, which are NOT in the base args and ARE legitimately cell-overridable).

claude

Claude review passed — no blocking issues found. LGTM

Targeted fix that cleanly routes the two MTP speculative_config shapes (assistant-model vs. generic) based on whether --draft_model_dir is set, plus the tokenizer plumbing Gemma 4 needs and a parent YAML for the cluster cells. The branch is backward-compatible: existing MTP callers (Qwen 3.5) that don't pass --draft_model_dir get the unchanged {"method": "mtp", ...} config.

Findings: 0 CRITICAL / 0 IMPORTANT / 3 SUGGESTION (all non-blocking, posted inline):

utils.py: extra_special_tokens key heuristic can collide with HF SpecialTokensMixin reserved names (bos_token, eos_token, etc.) and with itself; f"extra_special_token_{i}" is collision-free.
utils.py: os.path.exists short-circuit means the fix only triggers for local checkpoint dirs — Hub-ID callers still hit the original error. Fine for the launcher's mounted-checkpoint path, just worth flagging.
Gemma-4 YAML: the override-example comment lists --draft_length 3, but that flag is already in the base args, so cell overrides duplicate it. Same pattern as --temperature / --max_seq_len / --save_dir would be cleaner.

Risk: low. Wrapper change is a pure additive branch; YAML is a new example; the only file that runs in existing flows (utils.py) is gated on a list-shaped extra_special_tokens field that current callers don't have.

ChenhanYu · 2026-06-10T16:53:35Z

/ok to test 21e935d

Signed-off-by: Pensieve Intern <chenhany@nvidia.com>

vLLM PR vllm-project/vllm#41745 (2026-05-06) shipped Gemma 4 MTP support; the implementation expects ``speculative_config`` shaped as ``{"model": <assistant>, "num_speculative_tokens": N}`` (no ``method`` key — vLLM auto-detects Gemma 4 from the assistant). The specdec_bench wrapper at ``models/vllm.py`` unconditionally emitted ``{"method": "mtp", "num_speculative_tokens": N}`` for any ``--speculative_algorithm MTP`` invocation, which produced ``NotImplementedError: Unsupported speculative method: 'mtp'`` on Gemma 4 even with a container that has the support (``vllm/vllm-openai:v0.22.1``+). Changes: 1. ``examples/specdec_bench/specdec_bench/models/vllm.py``: when MTP is paired with ``--draft_model_dir``, emit the assistant-model config shape; otherwise keep the generic ``method: "mtp"`` path (Qwen 3.5 etc.). Preserves backward compatibility — callers that didn't pass ``--draft_model_dir`` get the same config they got before. 2. ``tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml``: bump container to ``vllm/vllm-openai:v0.22.1`` (the qwen3_5-cu130 tag predates the gemma4_mtp PR and doesn't recognize ``model_type=gemma4``); add ``--draft_model_dir /hf-local/google/gemma-4-E4B-it-assistant`` to both task_0 and task_1 args; expose the assistant path via ``global_vars.draft_model`` for reuse; rewrite the header comment with the corrected diagnosis. Validated upstream: ``google/gemma-4-{E2B,E4B,26B-A4B,31B}-it-assistant`` exist on HuggingFace (public, ungated); ``gemma4_mtp.py`` is in ``v0.22.0``, ``v0.22.1``, and ``main``. PR #41745's test plan documents the same config shape this change emits. Surfaced by OMNIML-5024/5025/5026/5027 (the four cells of OMNIML-5022). Each cell agent independently misdiagnosed this as "no container supports both Gemma 4 and MTP," when the gap was actually the wrapper not being wired for the assistant-model config shape that vLLM expects. Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

Post-#1564, each ``(model, algorithm, engine)`` Epic owns exactly one committed YAML: the parent at ``tools/launcher/examples/<family>/<model>/specdec_bench_<algo>_<engine>.yaml``. Per-cell knobs (temperature, max_seq_len, save_dir, draft_length / block_size) come from CLI overrides at slurm-invoke time via ``pipeline.task_N.args+=[...]``. No per-cell file is committed. This branch had originally created ``tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml`` + a ``--runtime_params common/specdec_bench/_cells/...`` reference in the parent, both pre-#1564 shapes that duplicated the parent's knobs. The cell-stage workflow SPEC template had stale Step 3 guidance still mentioning the ``_cells/<sweep_name>.yaml`` shape, which is what the agent followed; that contradiction is cleaned up in pensieve-intern !91. Drop: - ``tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml`` - the ``--runtime_params common/specdec_bench/_cells/...`` line on both task_0 and task_1 in the parent YAML Update the header comment to document the canonical CLI-override invocation pattern (the same pattern used by the NVIDIA-Nemotron-3-Super-120B-A12B-BF16 parent on main). The cell-side overrides for OMNIML-5024 (t0_d3) become: pipeline.task_0.args+=["--temperature 0", "--max_seq_len 65536", "--save_dir /scratchspace/<sweep>/qualitative", "--draft_length 3"] pipeline.task_1.args+=["--temperature 0", "--max_seq_len 65536", "--save_dir /scratchspace/<sweep>/throughput_32k", "--num_requests 80", "--draft_length 3"] Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

Pre-commit ruff-format collapsed the dict comprehension onto a single line. Pure formatting — no behaviour change. Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu · 2026-06-10T18:45:08Z

/ok to test 8a53ae6

github-actions · 2026-06-10T19:46:44Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-10 19:46 UTC

…a 4 MTP (#1677) ### What does this PR do? Type of change: Bug fix Fixes the specdec_bench vLLM wrapper's MTP `speculative_config` emission so Gemma 4 MTP no longer hits the wrong code path inside vLLM. ### Bug vLLM's `SpeculativeConfig.__post_init__` (`vllm/config/speculative.py:529-602`) auto-detects `method` ONLY when it's unset. When `model` is provided and `method` is `None`, the default branch sets `method = "draft_model"` — the generic same-architecture draft path, NOT MTP. That path enforces equal num_heads between target and draft and raises: ``` AssertionError: All layers in one attention group must share num_heads; got {8, 4} ``` on heterogeneous-head models. Gemma 4 has 8 target heads and 4 draft heads by design. ### Where the previous fix went wrong PR #1663 changed the MTP branch in the wrapper to emit `{model: <assistant>, num_speculative_tokens: N}` WITHOUT `method` when `draft_model_dir` was provided, based on a misread of vLLM PR #41745's test plan that only showed the `{model, num_speculative_tokens}` shape. That test plan was the direct `LLM(...)` constructor invocation; vLLM had already defaulted method internally. Going through specdec_bench's `AsyncEngineArgs(speculative_config=...)` path, the explicit `method` key is required to avoid the auto-detect → draft_model fallback. ### Reference vLLM's own test at [`tests/v1/e2e/spec_decode/test_spec_decode.py:818-823`](https://github.com/vllm-project/vllm/blob/main/tests/v1/e2e/spec_decode/test_spec_decode.py#L818) does exactly this for the gemma4-e4b parametrization: ```python speculative_config = { "method": method, # "mtp" "num_speculative_tokens": ..., } if draft_model is not None: # Gemma 4 case speculative_config["model"] = draft_model ``` ### Fix Restore `method="mtp"` as the unconditional MTP path. ADD `model` only when `draft_model_dir` is set. Backward-compatible for Qwen 3.5 MTP / DeepSeek MTP / other inline-MTP families (they keep the bare `{method: "mtp"}` config). ### Validation Field-tested via vLLM PR #41745's correctness test on `gemma-4-E4B-it` + `gemma-4-E4B-it-assistant`: produced 304.7 output TPS at γ=4 vs 171.0 baseline (178% speedup) on H100. The same `speculative_config` shape this fix emits. ### Surfaced on [OMNIML-5024](https://jirasw.nvidia.com/browse/OMNIML-5024) pipeline #54356795: - Wrapper emitted `{model: assistant, num_speculative_tokens: 3}` - vLLM auto-detected `method = "draft_model"` - Loaded gemma-4-E4B-it-assistant (4 heads) as a generic draft for gemma-4-E4B-it (8 heads) - Attention-group num_heads check tripped → AssertionError, task_0 FAILED, task_1 CANCELLED ### Before your PR is "*Ready for review*" - Backward compatible: ✅ (Qwen 3.5 / DeepSeek MTP unchanged; only the MTP+`draft_model_dir` case changes). - New tests: ❌ — the test exercising this codepath would need a GPU + gemma-4 model checkout, which is cluster work, not unit-test scope. JIRA-tracked validation via OMNIML-5024 dispatch after this lands. - Changelog: ❌ ### Additional Information - vLLM PR #41745 (Gemma4 MTP support) - Companion: NVIDIA/Model-Optimizer PR #1675 (launcher `GlobalVariables.draft_model` schema fix)  ## Summary by CodeRabbit * **Bug Fixes** * Fixed speculative decoding configuration handling in the benchmark example to ensure consistent method assignment and proper draft model configuration. * **Documentation** * Updated configuration comments to reflect corrected behavior and improved clarity.  Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu force-pushed the pensieve-intern/OMNIML-5022/t0_d3 branch from ce245cd to 10d5d24 Compare June 10, 2026 00:44

ChenhanYu requested review from h-guo18 and yeyu-nvidia June 10, 2026 16:08

ChenhanYu marked this pull request as ready for review June 10, 2026 16:08

ChenhanYu requested a review from a team as a code owner June 10, 2026 16:08

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

claude Bot reviewed Jun 10, 2026

View reviewed changes

claude Bot approved these changes Jun 10, 2026

View reviewed changes

ChenhanYu added 10 commits June 10, 2026 10:51

Add gemma-4 MTP vLLM SPEED-bench t0_d3 cell

854cee6

Signed-off-by: Pensieve Intern <chenhany@nvidia.com>

Use MTP-capable vLLM image for gemma-4 cell

22688d5

Signed-off-by: Pensieve Intern <chenhany@nvidia.com>

Allow Gemma specdec bench to upgrade tokenizer deps

349a6e7

Signed-off-by: Pensieve Intern <chenhany@nvidia.com>

Pin Gemma specdec transformers below 5

5cde29c

Signed-off-by: Pensieve Intern <chenhany@nvidia.com>

Quote specdec bench extra pip dependency specs

39295b3

Signed-off-by: Pensieve Intern <chenhany@nvidia.com>

Normalize list-style tokenizer extras for specdec bench

6cddc97

Signed-off-by: Pensieve Intern <chenhany@nvidia.com>

Drop unused specdec bench dependency override helper

2756e43

Signed-off-by: Pensieve Intern <chenhany@nvidia.com>

specdec_bench/utils.py: ruff format

8a53ae6

Pre-commit ruff-format collapsed the dict comprehension onto a single line. Pure formatting — no behaviour change. Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu force-pushed the pensieve-intern/OMNIML-5022/t0_d3 branch from 21e935d to 8a53ae6 Compare June 10, 2026 17:51

yeyu-nvidia approved these changes Jun 10, 2026

View reviewed changes

h-guo18 approved these changes Jun 10, 2026

View reviewed changes

ChenhanYu merged commit 66b54ed into main Jun 10, 2026
67 of 69 checks passed

ChenhanYu deleted the pensieve-intern/OMNIML-5022/t0_d3 branch June 10, 2026 19:46

This was referenced Jun 10, 2026

[OMNIML-4962] specdec_bench parent — Qwen/Qwen3.5-4B / DFlash / vLLM #1638

Merged

launcher: add draft_model to GlobalVariables for spec-decode parents #1675

Merged

specdec_bench: keep method=mtp when adding model=<assistant> for Gemma 4 MTP #1677

Merged

coderabbitai Bot mentioned this pull request Jun 15, 2026

[OMNIML-5025] cell_t0_d7 #1738

Merged

		# pipeline.task_0.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/qualitative","--draft_length 3"] \
		# pipeline.task_1.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/throughput_32k","--num_requests 80","--draft_length 3"]

Conversation

ChenhanYu commented Jun 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jun 10, 2026

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

ChenhanYu commented Jun 10, 2026

Uh oh!

claude Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ChenhanYu commented Jun 10, 2026

Uh oh!

ChenhanYu commented Jun 10, 2026

Uh oh!

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ChenhanYu commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

codecov Bot commented Jun 10, 2026 •

edited

Loading