Skip to content

[OMNIML-5024] specdec_bench cell t0_d3 — google/gemma-4-E4B-it / MTP / vllm#1663

Merged
ChenhanYu merged 10 commits into
mainfrom
pensieve-intern/OMNIML-5022/t0_d3
Jun 10, 2026
Merged

[OMNIML-5024] specdec_bench cell t0_d3 — google/gemma-4-E4B-it / MTP / vllm#1663
ChenhanYu merged 10 commits into
mainfrom
pensieve-intern/OMNIML-5022/t0_d3

Conversation

@ChenhanYu

@ChenhanYu ChenhanYu commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

Type of change: Bug fix + new example

Wires SPEED-bench's MTP path to support Gemma 4 (and any future MTP variant that uses a separate assistant / draft model), and adds the SPEED-bench MTP/vLLM example for google/gemma-4-E4B-it.

Key difference: Gemma 4 MTP vs. generic MTP. vLLM's speculative_config accepts two different shapes for MTP:

Variant speculative_config shape Models
Generic MTP {"method": "mtp", "num_speculative_tokens": N} Models that carry their own MTP layer in-tree (e.g. Qwen 3.5 MTP variants) — no separate draft / assistant model.
Assistant-model MTP {"model": "<assistant>", "num_speculative_tokens": N} (no method key — vLLM auto-detects from the assistant) Gemma 4 family (E2B / E4B / 26B-A4B / 31B); each target model has a paired <target>-assistant checkpoint that acts as the MTP draft. Landed in vllm-project/vllm#41745 (2026-05-06).

The specdec_bench vLLM wrapper at examples/specdec_bench/specdec_bench/models/vllm.py previously emitted only the generic shape for any --speculative_algorithm MTP invocation, which produced NotImplementedError: Unsupported speculative method: 'mtp' on Gemma 4 even with a container that has the support (vllm/vllm-openai:v0.22.1+). This PR teaches the wrapper to switch shapes based on whether --draft_model_dir is provided.

Concrete changes:

  1. examples/specdec_bench/specdec_bench/models/vllm.py — when speculative_algorithm == "MTP" AND draft_model_dir is set, emit {"model": draft_model_dir, "num_speculative_tokens": N} (assistant-model shape). Otherwise emit the existing {"method": "mtp", ...} (generic shape). Backward-compatible — Qwen 3.5 MTP and other callers that omit --draft_model_dir get the same config they got before.

  2. examples/specdec_bench/specdec_bench/utils.pyget_tokenizer reads extra_special_tokens from the model's tokenizer_config.json and passes them through to AutoTokenizer.from_pretrained. Gemma 4 tokenizers ship a list-shaped extra_special_tokens entry that the constructor would otherwise reject. Necessary for any Gemma 4 cell.

  3. tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml — SPEED-bench parent YAML for google/gemma-4-E4B-it. Uses vllm/vllm-openai:v0.22.1 (has gemma4_mtp.py from #41745) and wires --draft_model_dir /hf-local/google/gemma-4-E4B-it-assistant on both task_0 (qualitative) and task_1 (throughput_32k).

  4. tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml — runtime params for the t0_d3 cell of OMNIML-5022 (temperature=0, max_model_len=40960).

Usage

# Wrapper-level: same CLI as before, just pass --draft_model_dir for
# Gemma 4 MTP. The wrapper auto-routes to the assistant-model shape.

# python examples/specdec_bench/run.py \
#     --engine VLLM \
#     --speculative_algorithm MTP \
#     --draft_model_dir /hf-local/google/gemma-4-E4B-it-assistant \
#     --draft_length 3 \
#     --tp_size 1 \
#     ...other SPEED-bench knobs...

# Equivalent direct vLLM invocation (for reference, no wrapper):
from vllm import LLM, SamplingParams
llm = LLM(
    model="google/gemma-4-E4B-it",
    speculative_config={
        "model": "google/gemma-4-E4B-it-assistant",
        "num_speculative_tokens": 3,
    },
    trust_remote_code=True,
)

Testing

  • Upstream existence checks: verified the assistant models google/gemma-4-{E2B,E4B,26B-A4B}-it-assistant exist, public, ungated on HuggingFace; verified vllm/model_executor/models/gemma4_mtp.py is in vLLM v0.22.0, v0.22.1, and main.
  • Backward compat: MTP callers that don't pass --draft_model_dir (e.g. the existing Qwen 3.5 MTP/vLLM cells under tools/launcher/examples/Qwen/Qwen3.5-4B/) take the unchanged {"method": "mtp", ...} branch. No diff for those.
  • End-to-end cluster validation: pending. Will run via the OMNIML-5022 cells (OMNIML-5024 / 5025 / 5026 / 5027) once the nmm-sandbox submodule pin advances past this PR. Each cell exercises task_0 (SPEED-Bench qualitative, 880 samples) + task_1 (throughput_32k, 80 samples) on cw_dfw, single H100.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ — the wrapper only takes the new branch when --draft_model_dir is provided alongside --speculative_algorithm MTP. Existing MTP callers (Qwen 3.5 etc.) keep the generic method: "mtp" config.
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A — no new dependencies.
  • Did you write any new necessary tests?: ❌ — relying on the SPEED-bench cluster cells (OMNIML-5024 …5027) for end-to-end validation; no unit test fixture for the vLLM wrapper exists in tests/ for me to extend symmetrically. Happy to add one if reviewers want it.
  • Did you update Changelog?: ❌ — small fix + example addition. Can add if requested.
  • Did you get Claude approval on this PR?: ❌ — will run /claude review once the PR is marked Ready for review.

Additional Information

  • JIRA: OMNIML-5024 (cell_t0_d3); siblings OMNIML-5025/5026/5027 (cell_{t0_d7, t1_d3, t1_d7}) of Epic OMNIML-5022 are blocked on this PR landing.
  • Upstream reference: [Spec Decode] Add Gemma4 MTP speculative decoding support vllm-project/vllm#41745 — "[Spec Decode] Add Gemma4 MTP speculative decoding support".
  • Companion (pensieve-intern !91, internal): adds a "Model-family-specific MTP invocation" table to the specdec_bench cell SPEC so future agents pair MTP with the right --draft_model_dir from SPEC-read time.

Summary by CodeRabbit

  • New Features

    • Added a SPEED-bench pipeline for Gemma 4 using vLLM speculative decoding (MTP) with qualitative and throughput tasks.
  • Improvements

    • Speculative-decoding logic updated to handle assistant-model and generic MTP cases distinctly.
    • Tokenizer loading now reads and normalizes extra special tokens from tokenizer config when available.

@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fd2d418d-a089-40ee-973e-6aa1ae2fb7a8

📥 Commits

Reviewing files that changed from the base of the PR and between 21e935d and 8a53ae6.

📒 Files selected for processing (3)
  • examples/specdec_bench/specdec_bench/models/vllm.py
  • examples/specdec_bench/specdec_bench/utils.py
  • tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/specdec_bench/specdec_bench/utils.py
  • tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml
  • examples/specdec_bench/specdec_bench/models/vllm.py

📝 Walkthrough

Walkthrough

Adds tokenizer handling for checkpoint extra special tokens, refines vLLM MTP speculative-decoding to distinguish assistant-model vs generic modes, and adds a Gemma-4 E4B-it SPEED benchmark job with two vLLM MTP tasks.

Changes

Gemma-4 MTP Benchmark Setup

Layer / File(s) Summary
Tokenizer extra_special_tokens support
examples/specdec_bench/specdec_bench/utils.py
get_tokenizer() reads tokenizer_config.json from the checkpoint directory, extracts extra_special_tokens, converts a list into the HF token-to-token mapping format, and supplies it to AutoTokenizer.from_pretrained() with trust_remote_code.
vLLM MTP configuration strategy
examples/specdec_bench/specdec_bench/models/vllm.py
MTP speculative decoding configuration now branches on draft_model_dir: when provided, uses assistant-model specdec without a method field; when absent, uses generic MTP with method: "mtp" and num_speculative_tokens.
Gemma-4 MTP benchmark job definition
tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml
New SPEED benchmark job for Gemma-4 E4B-it with two pipeline tasks: qualitative dataset at concurrency 32, and throughput_32k at concurrency 8 with 80 requests; both use vLLM MTP settings, draft length 3, and single-GPU container configuration.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title references a specific model (Gemma 4 E4B-it), algorithm (MTP), and engine (vLLM), directly matching the PR's core objective to add support for this configuration.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed All Python changes comply with SECURITY.md. trust_remote_code is configurable, defaults to False, with no hardcoded True values. No unsafe deserialization, eval/exec, or security bypasses found.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pensieve-intern/OMNIML-5022/t0_d3

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.32%. Comparing base (48767a0) to head (8a53ae6).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1663      +/-   ##
==========================================
+ Coverage   77.30%   77.32%   +0.01%     
==========================================
  Files         509      509              
  Lines       55914    55914              
==========================================
+ Hits        43227    43238      +11     
+ Misses      12687    12676      -11     
Flag Coverage Δ
examples 42.31% <ø> (-0.13%) ⬇️
regression 14.67% <ø> (+0.05%) ⬆️
unit 54.46% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ChenhanYu ChenhanYu force-pushed the pensieve-intern/OMNIML-5022/t0_d3 branch from ce245cd to 10d5d24 Compare June 10, 2026 00:44
@ChenhanYu ChenhanYu requested review from h-guo18 and yeyu-nvidia June 10, 2026 16:08
@ChenhanYu ChenhanYu marked this pull request as ready for review June 10, 2026 16:08
@ChenhanYu ChenhanYu requested a review from a team as a code owner June 10, 2026 16:08

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/specdec_bench/specdec_bench/utils.py (1)

41-42: ⚡ Quick win

Add error handling for JSON parsing.

The json.load() call on line 42 can raise JSONDecodeError if tokenizer_config.json is malformed. While a corrupt config file is unlikely in practice, adding a try-except block would make the function more robust and prevent cryptic errors downstream.

🛡️ Proposed fix to add defensive error handling
     tokenizer_config_path = os.path.join(path, "tokenizer_config.json")
     if os.path.exists(tokenizer_config_path):
-        with open(tokenizer_config_path) as f:
-            tokenizer_config = json.load(f)
-        extra_special_tokens = tokenizer_config.get("extra_special_tokens")
+        try:
+            with open(tokenizer_config_path) as f:
+                tokenizer_config = json.load(f)
+            extra_special_tokens = tokenizer_config.get("extra_special_tokens")
+        except (OSError, json.JSONDecodeError):
+            # Fall back to default behavior if config is unreadable
+            pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/specdec_bench/specdec_bench/utils.py` around lines 41 - 42, The
json.load(tokenizer_config_path) call can raise json.JSONDecodeError; wrap the
open(...) / json.load(...) in a try/except that catches json.JSONDecodeError and
raises a clearer error (e.g., ValueError or a custom exception) that includes
tokenizer_config_path and the original exception message, or log and rethrow to
provide actionable context; update the code around tokenizer_config_path and
tokenizer_config to use this defensive handling so malformed
tokenizer_config.json produces a clear, descriptive error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 46-50: The current conversion of a list into
kwargs["extra_special_tokens"] assumes tokens are wrapped like "<|...|>" and
builds names with token.strip("<|>").replace("|", "_") + "_token"; instead
validate the contract expected by AutoTokenizer.from_pretrained by ensuring
kwargs["extra_special_tokens"] is a dict of {token_name: token_string}, and
harden the key derivation in the block that handles extra_special_tokens: check
each token's format and either canonicalize wrapped tokens as before or fall
back to a safe sanitized name (e.g., remove non-alphanumerics, limit length)
plus a unique numeric suffix to prevent collisions, and raise or log a clear
error if a token cannot be safely named; update any code paths that call
AutoTokenizer.from_pretrained to pass this dict.

---

Nitpick comments:
In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 41-42: The json.load(tokenizer_config_path) call can raise
json.JSONDecodeError; wrap the open(...) / json.load(...) in a try/except that
catches json.JSONDecodeError and raises a clearer error (e.g., ValueError or a
custom exception) that includes tokenizer_config_path and the original exception
message, or log and rethrow to provide actionable context; update the code
around tokenizer_config_path and tokenizer_config to use this defensive handling
so malformed tokenizer_config.json produces a clear, descriptive error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c462dded-57d8-4edb-a743-3f8809ed0246

📥 Commits

Reviewing files that changed from the base of the PR and between bde162a and 302512e.

📒 Files selected for processing (4)
  • examples/specdec_bench/specdec_bench/models/vllm.py
  • examples/specdec_bench/specdec_bench/utils.py
  • tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml
  • tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml

Comment on lines +46 to +50
if isinstance(extra_special_tokens, list):
kwargs["extra_special_tokens"] = {
token.strip("<|>").replace("|", "_") + "_token": token
for token in extra_special_tokens
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the HuggingFace AutoTokenizer API for extra_special_tokens handling

# Check if transformers tokenizer accepts extra_special_tokens
python3 <<'EOF'
from transformers import AutoTokenizer
import inspect

# Inspect AutoTokenizer.from_pretrained signature
sig = inspect.signature(AutoTokenizer.from_pretrained)
params = list(sig.parameters.keys())

# Look for extra_special_tokens or similar parameters
special_token_params = [p for p in params if 'special' in p.lower() or 'token' in p.lower()]
print("Parameters related to special tokens:")
for p in special_token_params:
    print(f"  - {p}")

# Check PreTrainedTokenizer.__init__ as well
from transformers import PreTrainedTokenizer
init_sig = inspect.signature(PreTrainedTokenizer.__init__)
init_params = list(init_sig.parameters.keys())
init_special = [p for p in init_params if 'special' in p.lower() or 'token' in p.lower()]
print("\nPreTrainedTokenizer.__init__ special token parameters:")
for p in init_special:
    print(f"  - {p}")
EOF

Repository: NVIDIA/Model-Optimizer

Length of output: 191


🌐 Web query:

HuggingFace transformers AutoTokenizer.from_pretrained extra_special_tokens parameter format

💡 Result:

In the Hugging Face transformers library, the extra_special_tokens parameter used within AutoTokenizer.from_pretrained is passed as a dictionary [1][2]. The format for this dictionary is: extra_special_tokens={"token_name": "token_string"} [1][2] For example: tokenizer = AutoTokenizer.from_pretrained( "model_name", extra_special_tokens={"image_token": ""}) [1][3] Key Details: 1. Usage: This parameter allows you to register additional special tokens [1][3]. While commonly used in multimodal models as placeholders for images, video, or audio, these tokens do not have to be modality-specific and can represent any token the model needs to access [2][4]. 2. Functionality: If these tokens are not already present in the vocabulary, they are added to the end of the vocabulary [2][4]. 3. Decoding: Adding tokens through this parameter ensures they are treated as special tokens, meaning they will be skipped when decoding if skip_special_tokens=True is set [2][4]. 4. Alternative Method: Beyond passing it during instantiation, you can also manage special tokens using the add_special_tokens method on a tokenizer instance [4]. Note that attempting to pass some flags like add_special_tokens=True directly into from_pretrained can cause conflicts with existing methods in certain tokenizer classes; it is generally safer to manage such flags after initialization if they are not explicitly supported by the constructor [5].

Citations:


Confirm extra_special_tokens contract; harden key-name derivation.
HuggingFace AutoTokenizer.from_pretrained expects extra_special_tokens as a dict of {token_name: token_string} (token names typically end with *_token), so converting a list into kwargs["extra_special_tokens"] in that shape matches the API contract. The main remaining risk is the heuristic deriving token_name (token.strip("<|>").replace("|", "_") + "_token"), which assumes <|...|>-wrapped tokens and can produce unexpected/invalid names or collisions when tokens don’t follow that format.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/specdec_bench/specdec_bench/utils.py` around lines 46 - 50, The
current conversion of a list into kwargs["extra_special_tokens"] assumes tokens
are wrapped like "<|...|>" and builds names with token.strip("<|>").replace("|",
"_") + "_token"; instead validate the contract expected by
AutoTokenizer.from_pretrained by ensuring kwargs["extra_special_tokens"] is a
dict of {token_name: token_string}, and harden the key derivation in the block
that handles extra_special_tokens: check each token's format and either
canonicalize wrapped tokens as before or fall back to a safe sanitized name
(e.g., remove non-alphanumerics, limit length) plus a unique numeric suffix to
prevent collisions, and raise or log a clear error if a token cannot be safely
named; update any code paths that call AutoTokenizer.from_pretrained to pass
this dict.

@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/claude review

Comment on lines +46 to +50
if isinstance(extra_special_tokens, list):
kwargs["extra_special_tokens"] = {
token.strip("<|>").replace("|", "_") + "_token": token
for token in extra_special_tokens
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] The synthesized extra_special_tokens keys can collide with HuggingFace's built-in special-token attributes and with each other.

SpecialTokensMixin reserves attribute names like bos_token, eos_token, pad_token, cls_token, sep_token, unk_token, mask_token. If the source list happens to include something whose stripped-and-suffixed name matches one of those (e.g. "<|bos|>" → "bos_token"), this kwargs path overwrites the real built-in token mapping at construction time, silently corrupting tokenization. The naming heuristic also collapses:

  • "<|foo|>" and "foo" both become "foo_token" (last write wins, one token gets dropped).
  • empty/edge inputs like "<|>" collapse to a key of "_token".

For Gemma 4 today this happens to work, but it's a foot-gun for any future tokenizer with overlapping names. Safer alternative: drop the heuristic and use index-based names that can't collide with HF reserved names:

kwargs["extra_special_tokens"] = {
    f"extra_special_token_{i}": token
    for i, token in enumerate(extra_special_tokens)
}

The keys are only used as attribute lookups by user code; the actual tokenizer behavior depends on the token values, not the keys.

def get_tokenizer(path, trust_remote_code=False):
return AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code)
extra_special_tokens = None
tokenizer_config_path = os.path.join(path, "tokenizer_config.json")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] When path is a HuggingFace Hub repo ID (e.g. "google/gemma-4-E4B-it") rather than a local directory, os.path.exists(tokenizer_config_path) returns False and the new branch is skipped — so AutoTokenizer.from_pretrained will still hit the original "list-shaped extra_special_tokens" error.

This is fine for the launcher path (which mounts checkpoints under /hf-local/... and always passes a directory), but breaks direct CLI usage that resolves through the HF cache. If you want the fix to also cover that case, fall back to huggingface_hub.try_to_load_from_cache(path, "tokenizer_config.json") (or cached_file) when os.path.exists fails. Not blocking — flagging since the PR description frames this as a Gemma-4 fix in general, not launcher-specific.

Comment on lines +20 to +21
# pipeline.task_0.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/qualitative","--draft_length 3"] \
# pipeline.task_1.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/throughput_32k","--num_requests 80","--draft_length 3"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] The example overrides include "--draft_length 3", but --draft_length 3 is already in task_0.args (line 39) and task_1.args (line 66). With args+=[...], the override gets appended — argparse then sees --draft_length 3 --draft_length 3 and just uses the last occurrence. Functionally harmless, but it means the comment misleads reviewers/users about which knobs are cell-overridable: a user who reads this comment and tries args+=["--draft_length 7"] will be surprised when both 3 and 7 end up on the command line. Either drop --draft_length 3 from the override hint, or move it out of the base args block (matching the --temperature / --max_seq_len / --save_dir pattern, which are NOT in the base args and ARE legitimately cell-overridable).

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found. LGTM

Targeted fix that cleanly routes the two MTP speculative_config shapes (assistant-model vs. generic) based on whether --draft_model_dir is set, plus the tokenizer plumbing Gemma 4 needs and a parent YAML for the cluster cells. The branch is backward-compatible: existing MTP callers (Qwen 3.5) that don't pass --draft_model_dir get the unchanged {"method": "mtp", ...} config.

Findings: 0 CRITICAL / 0 IMPORTANT / 3 SUGGESTION (all non-blocking, posted inline):

  • utils.py: extra_special_tokens key heuristic can collide with HF SpecialTokensMixin reserved names (bos_token, eos_token, etc.) and with itself; f"extra_special_token_{i}" is collision-free.
  • utils.py: os.path.exists short-circuit means the fix only triggers for local checkpoint dirs — Hub-ID callers still hit the original error. Fine for the launcher's mounted-checkpoint path, just worth flagging.
  • Gemma-4 YAML: the override-example comment lists --draft_length 3, but that flag is already in the base args, so cell overrides duplicate it. Same pattern as --temperature / --max_seq_len / --save_dir would be cleaner.

Risk: low. Wrapper change is a pure additive branch; YAML is a new example; the only file that runs in existing flows (utils.py) is gated on a list-shaped extra_special_tokens field that current callers don't have.

@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test 21e935d

ChenhanYu added 10 commits June 10, 2026 10:51
Signed-off-by: Pensieve Intern <chenhany@nvidia.com>
Signed-off-by: Pensieve Intern <chenhany@nvidia.com>
Signed-off-by: Pensieve Intern <chenhany@nvidia.com>
Signed-off-by: Pensieve Intern <chenhany@nvidia.com>
Signed-off-by: Pensieve Intern <chenhany@nvidia.com>
Signed-off-by: Pensieve Intern <chenhany@nvidia.com>
Signed-off-by: Pensieve Intern <chenhany@nvidia.com>
vLLM PR vllm-project/vllm#41745 (2026-05-06) shipped Gemma 4 MTP
support; the implementation expects ``speculative_config`` shaped as
``{"model": <assistant>, "num_speculative_tokens": N}`` (no ``method``
key — vLLM auto-detects Gemma 4 from the assistant). The specdec_bench
wrapper at ``models/vllm.py`` unconditionally emitted
``{"method": "mtp", "num_speculative_tokens": N}`` for any
``--speculative_algorithm MTP`` invocation, which produced
``NotImplementedError: Unsupported speculative method: 'mtp'`` on
Gemma 4 even with a container that has the support
(``vllm/vllm-openai:v0.22.1``+).

Changes:

1. ``examples/specdec_bench/specdec_bench/models/vllm.py``: when MTP is
   paired with ``--draft_model_dir``, emit the assistant-model config
   shape; otherwise keep the generic ``method: "mtp"`` path (Qwen 3.5
   etc.). Preserves backward compatibility — callers that didn't pass
   ``--draft_model_dir`` get the same config they got before.

2. ``tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml``:
   bump container to ``vllm/vllm-openai:v0.22.1`` (the qwen3_5-cu130 tag
   predates the gemma4_mtp PR and doesn't recognize ``model_type=gemma4``);
   add ``--draft_model_dir /hf-local/google/gemma-4-E4B-it-assistant`` to
   both task_0 and task_1 args; expose the assistant path via
   ``global_vars.draft_model`` for reuse; rewrite the header comment with
   the corrected diagnosis.

Validated upstream: ``google/gemma-4-{E2B,E4B,26B-A4B,31B}-it-assistant``
exist on HuggingFace (public, ungated); ``gemma4_mtp.py`` is in
``v0.22.0``, ``v0.22.1``, and ``main``. PR #41745's test plan documents
the same config shape this change emits.

Surfaced by OMNIML-5024/5025/5026/5027 (the four cells of OMNIML-5022).
Each cell agent independently misdiagnosed this as "no container
supports both Gemma 4 and MTP," when the gap was actually the wrapper
not being wired for the assistant-model config shape that vLLM expects.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Post-#1564, each ``(model, algorithm, engine)`` Epic owns exactly one
committed YAML: the parent at
``tools/launcher/examples/<family>/<model>/specdec_bench_<algo>_<engine>.yaml``.
Per-cell knobs (temperature, max_seq_len, save_dir, draft_length /
block_size) come from CLI overrides at slurm-invoke time via
``pipeline.task_N.args+=[...]``. No per-cell file is committed.

This branch had originally created
``tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml``
+ a ``--runtime_params common/specdec_bench/_cells/...`` reference in
the parent, both pre-#1564 shapes that duplicated the parent's knobs.
The cell-stage workflow SPEC template had stale Step 3 guidance still
mentioning the ``_cells/<sweep_name>.yaml`` shape, which is what the
agent followed; that contradiction is cleaned up in pensieve-intern !91.

Drop:
- ``tools/launcher/common/specdec_bench/_cells/gemma-4-E4B-it_mtp_vllm_t0_d3.yaml``
- the ``--runtime_params common/specdec_bench/_cells/...`` line on both
  task_0 and task_1 in the parent YAML

Update the header comment to document the canonical CLI-override
invocation pattern (the same pattern used by the
NVIDIA-Nemotron-3-Super-120B-A12B-BF16 parent on main).

The cell-side overrides for OMNIML-5024 (t0_d3) become:
    pipeline.task_0.args+=["--temperature 0",
                            "--max_seq_len 65536",
                            "--save_dir /scratchspace/<sweep>/qualitative",
                            "--draft_length 3"]
    pipeline.task_1.args+=["--temperature 0",
                            "--max_seq_len 65536",
                            "--save_dir /scratchspace/<sweep>/throughput_32k",
                            "--num_requests 80",
                            "--draft_length 3"]

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Pre-commit ruff-format collapsed the dict comprehension onto a single
line. Pure formatting — no behaviour change.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu ChenhanYu force-pushed the pensieve-intern/OMNIML-5022/t0_d3 branch from 21e935d to 8a53ae6 Compare June 10, 2026 17:51
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test 8a53ae6

@ChenhanYu ChenhanYu merged commit 66b54ed into main Jun 10, 2026
67 of 69 checks passed
@ChenhanYu ChenhanYu deleted the pensieve-intern/OMNIML-5022/t0_d3 branch June 10, 2026 19:46
@github-actions

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-10 19:46 UTC

ChenhanYu added a commit that referenced this pull request Jun 11, 2026
…a 4 MTP (#1677)

### What does this PR do?

Type of change: Bug fix

Fixes the specdec_bench vLLM wrapper's MTP `speculative_config` emission
so Gemma 4 MTP no longer hits the wrong code path inside vLLM.

### Bug

vLLM's `SpeculativeConfig.__post_init__`
(`vllm/config/speculative.py:529-602`) auto-detects `method` ONLY when
it's unset. When `model` is provided and `method` is `None`, the default
branch sets `method = "draft_model"` — the generic same-architecture
draft path, NOT MTP. That path enforces equal num_heads between target
and draft and raises:

```
AssertionError: All layers in one attention group must share num_heads; got {8, 4}
```

on heterogeneous-head models. Gemma 4 has 8 target heads and 4 draft
heads by design.

### Where the previous fix went wrong

PR #1663 changed the MTP branch in the wrapper to emit `{model:
<assistant>, num_speculative_tokens: N}` WITHOUT `method` when
`draft_model_dir` was provided, based on a misread of vLLM PR #41745's
test plan that only showed the `{model, num_speculative_tokens}` shape.
That test plan was the direct `LLM(...)` constructor invocation; vLLM
had already defaulted method internally. Going through specdec_bench's
`AsyncEngineArgs(speculative_config=...)` path, the explicit `method`
key is required to avoid the auto-detect → draft_model fallback.

### Reference

vLLM's own test at
[`tests/v1/e2e/spec_decode/test_spec_decode.py:818-823`](https://github.com/vllm-project/vllm/blob/main/tests/v1/e2e/spec_decode/test_spec_decode.py#L818)
does exactly this for the gemma4-e4b parametrization:

```python
speculative_config = {
    "method": method,                # "mtp"
    "num_speculative_tokens": ...,
}
if draft_model is not None:           # Gemma 4 case
    speculative_config["model"] = draft_model
```

### Fix

Restore `method="mtp"` as the unconditional MTP path. ADD `model` only
when `draft_model_dir` is set. Backward-compatible for Qwen 3.5 MTP /
DeepSeek MTP / other inline-MTP families (they keep the bare `{method:
"mtp"}` config).

### Validation

Field-tested via vLLM PR #41745's correctness test on `gemma-4-E4B-it` +
`gemma-4-E4B-it-assistant`: produced 304.7 output TPS at γ=4 vs 171.0
baseline (178% speedup) on H100. The same `speculative_config` shape
this fix emits.

### Surfaced on

[OMNIML-5024](https://jirasw.nvidia.com/browse/OMNIML-5024) pipeline
#54356795:
- Wrapper emitted `{model: assistant, num_speculative_tokens: 3}`
- vLLM auto-detected `method = "draft_model"`
- Loaded gemma-4-E4B-it-assistant (4 heads) as a generic draft for
gemma-4-E4B-it (8 heads)
- Attention-group num_heads check tripped → AssertionError, task_0
FAILED, task_1 CANCELLED

### Before your PR is "*Ready for review*"

- Backward compatible: ✅ (Qwen 3.5 / DeepSeek MTP unchanged; only the
MTP+`draft_model_dir` case changes).
- New tests: ❌ — the test exercising this codepath would need a GPU +
gemma-4 model checkout, which is cluster work, not unit-test scope.
JIRA-tracked validation via OMNIML-5024 dispatch after this lands.
- Changelog: ❌

### Additional Information

- vLLM PR #41745 (Gemma4 MTP support)
- Companion: NVIDIA/Model-Optimizer PR #1675 (launcher
`GlobalVariables.draft_model` schema fix)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Fixed speculative decoding configuration handling in the benchmark
example to ensure consistent method assignment and proper draft model
configuration.

* **Documentation**
* Updated configuration comments to reflect corrected behavior and
improved clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@coderabbitai coderabbitai Bot mentioned this pull request Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants