Merged
25 changes: 21 additions & 4 deletions examples/specdec_bench/specdec_bench/models/vllm.py
@@ -63,10 +63,27 @@ def __init__(self, model_dir, max_concurrent_requests, sampling_kwargs, **kwargs
             specdec["disable_padded_drafter_batch"] = True
             specdec["parallel_draft_block_sizes"] = kwargs.get("parallel_draft_block_sizes")
         elif kwargs.get("speculative_algorithm") == "MTP":
-            specdec = {
-                "method": "mtp",
-                "num_speculative_tokens": kwargs.get("speculative_num_steps", 3),
-            }
+            draft_model_dir = kwargs.get("draft_model_dir")
+            if draft_model_dir:
+                # Assistant-model MTP (e.g. Gemma 4): vLLM's Gemma4 MTP
+                # support (vllm-project/vllm#41745) expects
+                # ``speculative_config={"model": <assistant>, ...}`` with
+                # no ``method`` key — vLLM auto-detects Gemma4 from the
+                # assistant model. Passing ``method: "mtp"`` here triggers
+                # ``NotImplementedError: Unsupported speculative method:
+                # 'mtp'`` on Gemma4 even on a container that has the
+                # support (e.g. ``vllm/vllm-openai:v0.22.1``+).
+                specdec = {
+                    "model": draft_model_dir,
+                    "num_speculative_tokens": kwargs.get("speculative_num_steps", 3),
+                }
+            else:
+                # Generic MTP path (Qwen3.5 etc.) — model carries its
+                # own MTP layer; no separate draft / assistant model.
+                specdec = {
+                    "method": "mtp",
+                    "num_speculative_tokens": kwargs.get("speculative_num_steps", 3),
+                }
         elif kwargs.get("speculative_algorithm") == "DFLASH":
             specdec = {
                 "method": "dflash",
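The MTP routing in the hunk above can be read in isolation as a small pure function. The sketch below is only an illustration: `build_mtp_speculative_config` is a hypothetical helper name that does not exist in `vllm.py`, and it mirrors nothing beyond the two dict shapes shown in the diff.

```python
def build_mtp_speculative_config(draft_model_dir=None, num_steps=3):
    """Return a speculative_config dict for the MTP algorithm.

    Hypothetical helper: mirrors the two branches in the diff and nothing else.
    """
    if draft_model_dir:
        # Assistant-model MTP (e.g. Gemma 4): pass the assistant as "model"
        # and omit the "method" key, as the diff's comment describes.
        return {"model": draft_model_dir, "num_speculative_tokens": num_steps}
    # Generic MTP: the target model carries its own MTP layer.
    return {"method": "mtp", "num_speculative_tokens": num_steps}


assistant_cfg = build_mtp_speculative_config("/hf-local/google/gemma-4-E4B-it-assistant")
generic_cfg = build_mtp_speculative_config()
```

Keeping the branch in a pure function like this would make the "no `method` key when a draft model is given" rule easy to unit-test without starting an engine.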
15 changes: 14 additions & 1 deletion examples/specdec_bench/specdec_bench/utils.py
@@ -35,7 +35,20 @@


 def get_tokenizer(path, trust_remote_code=False):
-    return AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code)
+    extra_special_tokens = None
+    tokenizer_config_path = os.path.join(path, "tokenizer_config.json")

[SUGGESTION] When path is a HuggingFace Hub repo ID (e.g. "google/gemma-4-E4B-it") rather than a local directory, os.path.exists(tokenizer_config_path) returns False and the new branch is skipped — so AutoTokenizer.from_pretrained will still hit the original "list-shaped extra_special_tokens" error.

This is fine for the launcher path (which mounts checkpoints under /hf-local/... and always passes a directory), but breaks direct CLI usage that resolves through the HF cache. If you want the fix to also cover that case, fall back to huggingface_hub.try_to_load_from_cache(path, "tokenizer_config.json") (or cached_file) when os.path.exists fails. Not blocking — flagging since the PR description frames this as a Gemma-4 fix in general, not launcher-specific.
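A minimal sketch of the fallback suggested above, assuming `huggingface_hub` is installed and that `try_to_load_from_cache` returns a string path on a cache hit (and a non-string value otherwise). `load_tokenizer_config` is a hypothetical helper name, not something in `utils.py`; the sketch never touches the network, so an uncached Hub repo still yields `None`.

```python
import json
import os


def load_tokenizer_config(path):
    """Return the parsed tokenizer_config.json for a local dir or a cached Hub repo, else None."""
    local_file = os.path.join(path, "tokenizer_config.json")
    if os.path.exists(local_file):
        config_file = local_file
    else:
        try:
            # Imported lazily so the local-directory path has no extra dependency.
            from huggingface_hub import try_to_load_from_cache
        except ImportError:
            return None
        try:
            cached = try_to_load_from_cache(path, "tokenizer_config.json")
        except Exception:
            # For example, `path` is not a valid repo ID.
            return None
        # Only a string result is a usable file path.
        config_file = cached if isinstance(cached, str) else None
    if config_file is None:
        return None
    with open(config_file) as f:
        return json.load(f)
```

With a helper like this, `get_tokenizer` could read `extra_special_tokens` from the returned dict whether `path` is a mounted checkpoint directory or a repo ID that is already in the local cache.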

+    if os.path.exists(tokenizer_config_path):
+        with open(tokenizer_config_path) as f:
+            tokenizer_config = json.load(f)
+        extra_special_tokens = tokenizer_config.get("extra_special_tokens")
+
+    kwargs = {"trust_remote_code": trust_remote_code}
+    if isinstance(extra_special_tokens, list):
+        kwargs["extra_special_tokens"] = {
+            token.strip("<|>").replace("|", "_") + "_token": token for token in extra_special_tokens
+        }
Comment on lines +46 to +49

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the HuggingFace AutoTokenizer API for extra_special_tokens handling

# Check if transformers tokenizer accepts extra_special_tokens
python3 <<'EOF'
from transformers import AutoTokenizer
import inspect

# Inspect AutoTokenizer.from_pretrained signature
sig = inspect.signature(AutoTokenizer.from_pretrained)
params = list(sig.parameters.keys())

# Look for extra_special_tokens or similar parameters
special_token_params = [p for p in params if 'special' in p.lower() or 'token' in p.lower()]
print("Parameters related to special tokens:")
for p in special_token_params:
    print(f"  - {p}")

# Check PreTrainedTokenizer.__init__ as well
from transformers import PreTrainedTokenizer
init_sig = inspect.signature(PreTrainedTokenizer.__init__)
init_params = list(init_sig.parameters.keys())
init_special = [p for p in init_params if 'special' in p.lower() or 'token' in p.lower()]
print("\nPreTrainedTokenizer.__init__ special token parameters:")
for p in init_special:
    print(f"  - {p}")
EOF

Repository: NVIDIA/Model-Optimizer

Length of output: 191


🌐 Web query:

HuggingFace transformers AutoTokenizer.from_pretrained extra_special_tokens parameter format

💡 Result:

In the Hugging Face transformers library, the extra_special_tokens parameter used within AutoTokenizer.from_pretrained is passed as a dictionary [1][2]. The format for this dictionary is:

extra_special_tokens={"token_name": "token_string"} [1][2]

For example:

tokenizer = AutoTokenizer.from_pretrained("model_name", extra_special_tokens={"image_token": ""}) [1][3]

Key Details:

1. Usage: This parameter allows you to register additional special tokens [1][3]. While commonly used in multimodal models as placeholders for images, video, or audio, these tokens do not have to be modality-specific and can represent any token the model needs to access [2][4].
2. Functionality: If these tokens are not already present in the vocabulary, they are added to the end of the vocabulary [2][4].
3. Decoding: Adding tokens through this parameter ensures they are treated as special tokens, meaning they will be skipped when decoding if skip_special_tokens=True is set [2][4].
4. Alternative Method: Beyond passing it during instantiation, you can also manage special tokens using the add_special_tokens method on a tokenizer instance [4].

Note that attempting to pass some flags like add_special_tokens=True directly into from_pretrained can cause conflicts with existing methods in certain tokenizer classes; it is generally safer to manage such flags after initialization if they are not explicitly supported by the constructor [5].


Confirm extra_special_tokens contract; harden key-name derivation.
HuggingFace AutoTokenizer.from_pretrained expects extra_special_tokens as a dict of {token_name: token_string} (token names typically end with *_token), so converting a list into kwargs["extra_special_tokens"] in that shape matches the API contract. The main remaining risk is the heuristic deriving token_name (token.strip("<|>").replace("|", "_") + "_token"), which assumes <|...|>-wrapped tokens and can produce unexpected/invalid names or collisions when tokens don’t follow that format.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/specdec_bench/specdec_bench/utils.py` around lines 46 - 50, The
current conversion of a list into kwargs["extra_special_tokens"] assumes tokens
are wrapped like "<|...|>" and builds names with token.strip("<|>").replace("|",
"_") + "_token"; instead validate the contract expected by
AutoTokenizer.from_pretrained by ensuring kwargs["extra_special_tokens"] is a
dict of {token_name: token_string}, and harden the key derivation in the block
that handles extra_special_tokens: check each token's format and either
canonicalize wrapped tokens as before or fall back to a safe sanitized name
(e.g., remove non-alphanumerics, limit length) plus a unique numeric suffix to
prevent collisions, and raise or log a clear error if a token cannot be safely
named; update any code paths that call AutoTokenizer.from_pretrained to pass
this dict.
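One possible shape for the hardened key derivation described above, offered as a sketch rather than the fix that was merged. `name_special_tokens` and `RESERVED_NAMES` are hypothetical names; the reserved set lists the built-in attribute names mentioned in this review thread and is an assumption about what needs protecting. Any token whose derived name is empty, reserved, or already taken falls back to an index-based name.

```python
import re

# Assumption: built-in special-token attribute names that must not be overwritten.
RESERVED_NAMES = {
    "bos_token", "eos_token", "unk_token", "sep_token",
    "pad_token", "cls_token", "mask_token",
}


def name_special_tokens(tokens):
    """Map a list of token strings to a collision-free {name: token} dict."""
    named = {}
    for index, token in enumerate(tokens):
        # Collapse every run of non-alphanumerics to "_" and trim the ends.
        stem = re.sub(r"[^0-9A-Za-z]+", "_", token).strip("_").lower()
        name = f"{stem}_token" if stem else ""
        if not name or name in RESERVED_NAMES or name in named:
            # Index-based fallback: ends in a digit, so it cannot clash with
            # a derived name, which always ends in "_token".
            name = f"extra_special_token_{index}"
        named[name] = token
    return named


example = name_special_tokens(["<|image|>", "image", "<|bos|>", "<|>"])
```

Here the first token keeps the readable name `image_token`, while the duplicate, the reserved name, and the empty stem each receive an index-based key, so no token is dropped.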

Comment on lines +46 to +49

[SUGGESTION] The synthesized extra_special_tokens keys can collide with HuggingFace's built-in special-token attributes and with each other.

SpecialTokensMixin reserves attribute names like bos_token, eos_token, pad_token, cls_token, sep_token, unk_token, mask_token. If the source list happens to include something whose stripped-and-suffixed name matches one of those (e.g. "<|bos|>" → "bos_token"), this kwargs path overwrites the real built-in token mapping at construction time, silently corrupting tokenization. The naming heuristic also collapses:

  • "<|foo|>" and "foo" both become "foo_token" (last write wins, one token gets dropped).
  • empty/edge inputs like "<|>" collapse to a key of "_token".

For Gemma 4 today this happens to work, but it's a foot-gun for any future tokenizer with overlapping names. Safer alternative: drop the heuristic and use index-based names that can't collide with HF reserved names:

kwargs["extra_special_tokens"] = {
    f"extra_special_token_{i}": token
    for i, token in enumerate(extra_special_tokens)
}

The keys are only used as attribute lookups by user code; the actual tokenizer behavior depends on the token values, not the keys.
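The collisions listed in this comment can be reproduced with plain dict comprehensions, with no tokenizer involved. In the sketch below, `heuristic_names` copies the expression from the diff and `indexed_names` copies the alternative proposed above; both function names are made up for the illustration.

```python
def heuristic_names(tokens):
    # Same expression as the PR's dict comprehension.
    return {t.strip("<|>").replace("|", "_") + "_token": t for t in tokens}


def indexed_names(tokens):
    # Index-based keys, as proposed in the review comment.
    return {f"extra_special_token_{i}": t for i, t in enumerate(tokens)}


tokens = ["<|foo|>", "foo", "<|bos|>", "<|>"]
heuristic = heuristic_names(tokens)
indexed = indexed_names(tokens)

# Four tokens go in, but the heuristic keeps only three keys:
# "<|foo|>" and "foo" share "foo_token", and "<|>" becomes "_token".
print(len(tokens), len(heuristic), len(indexed))
print(sorted(heuristic))
```

The heuristic also yields `bos_token` for `"<|bos|>"`, which is the reserved-name overlap the comment warns about; the index-based mapping keeps all four tokens under distinct keys.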


+    return AutoTokenizer.from_pretrained(path, **kwargs)


 def encode_chat(tokenizer, messages, chat_template_args={}, completions=False):
@@ -0,0 +1,83 @@
# SPEED-bench MTP speculative-decoding run for gemma-4-E4B-it via vLLM.
#
# Gemma 4 MTP support landed in vLLM PR vllm-project/vllm#41745 (2026-05-06)
# and is in ``vllm/vllm-openai:v0.22.1`` (and later). Gemma 4 MTP uses a
# separate assistant model passed via ``--draft_model_dir``; vLLM
# auto-detects Gemma 4 from the assistant and does NOT take a ``method``
# key in ``speculative_config``. The wrapper at
# ``examples/specdec_bench/specdec_bench/models/vllm.py`` routes to the
# assistant-model config shape when ``--speculative_algorithm MTP`` is
# paired with ``--draft_model_dir``.
#
# Assistant model: ``google/gemma-4-E4B-it-assistant`` (public, ungated).
#
# Slurm run on cw_dfw — cells override per-cell knobs via
# pipeline.task_N.args+=[...]:
#
# uv run slurm.py \
# --yaml modules/Model-Optimizer/tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml \
# --yes detach=true \
# pipeline.task_0.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/qualitative","--draft_length 3"] \
# pipeline.task_1.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/throughput_32k","--num_requests 80","--draft_length 3"]
Comment on lines +20 to +21

[SUGGESTION] The example overrides include "--draft_length 3", but --draft_length 3 is already in task_0.args (line 39) and task_1.args (line 66). With args+=[...], the override gets appended — argparse then sees --draft_length 3 --draft_length 3 and just uses the last occurrence. Functionally harmless, but it means the comment misleads reviewers/users about which knobs are cell-overridable: a user who reads this comment and tries args+=["--draft_length 7"] will be surprised when both 3 and 7 end up on the command line. Either drop --draft_length 3 from the override hint, or move it out of the base args block (matching the --temperature / --max_seq_len / --save_dir pattern, which are NOT in the base args and ARE legitimately cell-overridable).
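The "last occurrence wins" behavior is easy to confirm with `argparse` alone. The sketch below assumes, as the comment does, that each `args` entry is word-split and handed to an argparse-based CLI; `effective_draft_length` is a hypothetical helper and only the `--draft_length` flag is modeled.

```python
import argparse
import shlex


def effective_draft_length(base_args, override_args):
    """Return (times the flag appears, value argparse ends up with)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--draft_length", type=int)
    # args+=[...] appends, so the overrides come after the base entries.
    argv = shlex.split(" ".join(base_args + override_args))
    # Flags not modeled here are returned as unknown and ignored.
    namespace, _unknown = parser.parse_known_args(argv)
    return argv.count("--draft_length"), namespace.draft_length


base = ["--dataset speed", "--draft_length 3"]
override = ["--temperature 0", "--draft_length 7"]
occurrences, value = effective_draft_length(base, override)
```

With these inputs the flag appears twice on the command line and the parsed value is the appended one, so the override does take effect; the duplicate is what makes the documented hint misleading.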


job_name: gemma-4-E4B-it_specdec_bench_mtp_vllm

pipeline:
  global_vars:
    hf_model: /hf-local/google/gemma-4-E4B-it
    draft_model: /hf-local/google/gemma-4-E4B-it-assistant

  # task_0: SPEED qualitative split
  task_0:
    script: common/specdec_bench/run.sh
    args:
      - --dataset speed
      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/qualitative
      - --engine VLLM
      - --speculative_algorithm MTP
      - --draft_model_dir <<global_vars.draft_model>>
      - --draft_length 3
      - --tp_size 1
      - --ep_size 1
      - --concurrency 32
      - --output_length 4096
      - --aa_timing
      - --show_progress
      - --save_dir /scratchspace/{sweep_name_default}/qualitative
    environment:
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
      - HF_LOCAL: /hf-local
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 1
      gpus_per_node: 1
      container: vllm/vllm-openai:v0.22.1

  # task_1: SPEED throughput_32k split
  task_1:
    script: common/specdec_bench/run.sh
    args:
      - --dataset speed
      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/throughput_32k
      - --engine VLLM
      - --speculative_algorithm MTP
      - --draft_model_dir <<global_vars.draft_model>>
      - --draft_length 3
      - --tp_size 1
      - --ep_size 1
      - --concurrency 8
      - --num_requests 80
      - --output_length 4096
      - --aa_timing
      - --show_progress
      - --save_dir /scratchspace/{sweep_name_default}/throughput_32k
    environment:
      - HF_MODEL_CKPT: <<global_vars.hf_model>>
      - HF_LOCAL: /hf-local
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 1
      gpus_per_node: 1
      container: vllm/vllm-openai:v0.22.1