[OMNIML-5024] specdec_bench cell t0_d3 — google/gemma-4-E4B-it / MTP / vllm #1663
Changes from all commits
854cee6
22688d5
349a6e7
5cde29c
39295b3
6cddc97
2756e43
ed1204f
75e09c3
8a53ae6
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -35,7 +35,20 @@ | |
|
|
||
|
|
||
| def get_tokenizer(path, trust_remote_code=False): | ||
| return AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code) | ||
| extra_special_tokens = None | ||
| tokenizer_config_path = os.path.join(path, "tokenizer_config.json") | ||
| if os.path.exists(tokenizer_config_path): | ||
| with open(tokenizer_config_path) as f: | ||
| tokenizer_config = json.load(f) | ||
| extra_special_tokens = tokenizer_config.get("extra_special_tokens") | ||
|
|
||
| kwargs = {"trust_remote_code": trust_remote_code} | ||
| if isinstance(extra_special_tokens, list): | ||
| kwargs["extra_special_tokens"] = { | ||
| token.strip("<|>").replace("|", "_") + "_token": token for token in extra_special_tokens | ||
| } | ||
|
Comment on lines +46 to +49
Contributor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the HuggingFace AutoTokenizer API for extra_special_tokens handling
# Check if transformers tokenizer accepts extra_special_tokens
python3 <<'EOF'
from transformers import AutoTokenizer
import inspect
# Inspect AutoTokenizer.from_pretrained signature
sig = inspect.signature(AutoTokenizer.from_pretrained)
params = list(sig.parameters.keys())
# Look for extra_special_tokens or similar parameters
special_token_params = [p for p in params if 'special' in p.lower() or 'token' in p.lower()]
print("Parameters related to special tokens:")
for p in special_token_params:
    print(f" - {p}")
# Check PreTrainedTokenizer.__init__ as well
from transformers import PreTrainedTokenizer
init_sig = inspect.signature(PreTrainedTokenizer.__init__)
init_params = list(init_sig.parameters.keys())
init_special = [p for p in init_params if 'special' in p.lower() or 'token' in p.lower()]
print("\nPreTrainedTokenizer.__init__ special token parameters:")
for p in init_special:
    print(f" - {p}")
EOF
Repository: NVIDIA/Model-Optimizer
Length of output: 191
🌐 Web query:
💡 Result: In the Hugging Face transformers library, the `extra_special_tokens` parameter used within `AutoTokenizer.from_pretrained` is passed as a dictionary [1][2]. The format for this dictionary is: `extra_special_tokens={"token_name": "token_string"}` [1][2]. For example: tokenizer = AutoTokenizer.from_pretrained( "model_name", extra_special_tokens={"image_token": "
Citations:
Confirm
Comment on lines +46 to +49
[SUGGESTION] The synthesized
For Gemma 4 today this happens to work, but it's a foot-gun for any future tokenizer with overlapping names. Safer alternative: drop the heuristic and use index-based names that can't collide with HF reserved names:
    kwargs["extra_special_tokens"] = {
        f"extra_special_token_{i}": token
        for i, token in enumerate(extra_special_tokens)
    }
The keys are only used as attribute lookups by user code; the actual tokenizer behavior depends on the token values, not the keys. |
||
|
|
||
| return AutoTokenizer.from_pretrained(path, **kwargs) | ||
|
|
||
|
|
||
| def encode_chat(tokenizer, messages, chat_template_args={}, completions=False): | ||
|
|
||
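The review concern about the synthesized key names in `get_tokenizer` can be illustrated with a small, tokenizer-free sketch. It applies the key derivation from the diff and the index-based alternative from the review to a hypothetical token list; the token strings and the helper names `heuristic_keys` and `indexed_keys` are assumptions chosen only to trigger the failure modes, and are not taken from any real checkpoint.

```python
def heuristic_keys(tokens):
    """Key derivation as written in the diff: strip '<', '|', '>' and append '_token'."""
    return {t.strip("<|>").replace("|", "_") + "_token": t for t in tokens}


def indexed_keys(tokens):
    """Index-based keys from the review suggestion."""
    return {f"extra_special_token_{i}": t for i, t in enumerate(tokens)}


# Attribute names that Hugging Face tokenizers already define for standard special tokens.
RESERVED = {"bos_token", "eos_token", "unk_token", "sep_token",
            "pad_token", "cls_token", "mask_token"}

# Hypothetical token list (an assumption for illustration only).
tokens = ["<|image|>", "<image>", "<pad>"]

heuristic = heuristic_keys(tokens)
indexed = indexed_keys(tokens)

# Two distinct tokens map to the same key, so one entry is silently overwritten.
print(len(tokens), len(heuristic), len(indexed))
# One derived key shadows a reserved attribute name.
print(RESERVED & heuristic.keys())
print(RESERVED & indexed.keys())
```

With index-based keys every token keeps its own entry and no key can match a reserved attribute name, at the cost of less descriptive attribute names.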
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,83 @@ | ||
| # SPEED-bench MTP speculative-decoding run for gemma-4-E4B-it via vLLM. | ||
| # | ||
| # Gemma 4 MTP support landed in vLLM PR vllm-project/vllm#41745 (2026-05-06) | ||
| # and is in ``vllm/vllm-openai:v0.22.1`` (and later). Gemma 4 MTP uses a | ||
| # separate assistant model passed via ``--draft_model_dir``; vLLM | ||
| # auto-detects Gemma 4 from the assistant and does NOT take a ``method`` | ||
| # key in ``speculative_config``. The wrapper at | ||
| # ``examples/specdec_bench/specdec_bench/models/vllm.py`` routes to the | ||
| # assistant-model config shape when ``--speculative_algorithm MTP`` is | ||
| # paired with ``--draft_model_dir``. | ||
| # | ||
| # Assistant model: ``google/gemma-4-E4B-it-assistant`` (public, ungated). | ||
| # | ||
| # Slurm run on cw_dfw — cells override per-cell knobs via | ||
| # pipeline.task_N.args+=[...]: | ||
| # | ||
| # uv run slurm.py \ | ||
| # --yaml modules/Model-Optimizer/tools/launcher/examples/gemma-4/gemma-4-E4B-it/specdec_bench_mtp_vllm.yaml \ | ||
| # --yes detach=true \ | ||
| # pipeline.task_0.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/qualitative","--draft_length 3"] \ | ||
| # pipeline.task_1.args+=["--temperature 0","--max_seq_len 65536","--save_dir /scratchspace/<sweep>/throughput_32k","--num_requests 80","--draft_length 3"] | ||
|
Comment on lines +20 to +21
[SUGGESTION] The example overrides include |
||
|
|
||
| job_name: gemma-4-E4B-it_specdec_bench_mtp_vllm | ||
|
|
||
| pipeline: | ||
| global_vars: | ||
| hf_model: /hf-local/google/gemma-4-E4B-it | ||
| draft_model: /hf-local/google/gemma-4-E4B-it-assistant | ||
|
|
||
| # task_0: SPEED qualitative split | ||
| task_0: | ||
| script: common/specdec_bench/run.sh | ||
| args: | ||
| - --dataset speed | ||
| - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/qualitative | ||
| - --engine VLLM | ||
| - --speculative_algorithm MTP | ||
| - --draft_model_dir <<global_vars.draft_model>> | ||
| - --draft_length 3 | ||
| - --tp_size 1 | ||
| - --ep_size 1 | ||
| - --concurrency 32 | ||
| - --output_length 4096 | ||
| - --aa_timing | ||
| - --show_progress | ||
| - --save_dir /scratchspace/{sweep_name_default}/qualitative | ||
| environment: | ||
| - HF_MODEL_CKPT: <<global_vars.hf_model>> | ||
| - HF_LOCAL: /hf-local | ||
| slurm_config: | ||
| _factory_: "slurm_factory" | ||
| nodes: 1 | ||
| ntasks_per_node: 1 | ||
| gpus_per_node: 1 | ||
| container: vllm/vllm-openai:v0.22.1 | ||
|
|
||
| # task_1: SPEED throughput_32k split | ||
| task_1: | ||
| script: common/specdec_bench/run.sh | ||
| args: | ||
| - --dataset speed | ||
| - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/throughput_32k | ||
| - --engine VLLM | ||
| - --speculative_algorithm MTP | ||
| - --draft_model_dir <<global_vars.draft_model>> | ||
| - --draft_length 3 | ||
| - --tp_size 1 | ||
| - --ep_size 1 | ||
| - --concurrency 8 | ||
| - --num_requests 80 | ||
| - --output_length 4096 | ||
| - --aa_timing | ||
| - --show_progress | ||
| - --save_dir /scratchspace/{sweep_name_default}/throughput_32k | ||
| environment: | ||
| - HF_MODEL_CKPT: <<global_vars.hf_model>> | ||
| - HF_LOCAL: /hf-local | ||
| slurm_config: | ||
| _factory_: "slurm_factory" | ||
| nodes: 1 | ||
| ntasks_per_node: 1 | ||
| gpus_per_node: 1 | ||
| container: vllm/vllm-openai:v0.22.1 | ||
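The header comment in this YAML says the wrapper routes to an assistant-model config shape, without a `method` key, when `--speculative_algorithm MTP` is paired with `--draft_model_dir`. A minimal sketch of that routing might look like the following. The helper name `build_speculative_config` is hypothetical, and the branch without a draft model is an assumed shape for models that ship built-in MTP heads; the actual wrapper in `examples/specdec_bench/specdec_bench/models/vllm.py` may differ.

```python
def build_speculative_config(algorithm, draft_length, draft_model_dir=None):
    """Hypothetical helper, not the wrapper's actual code.

    Returns a dict in the shape of vLLM's speculative_config.
    """
    if algorithm != "MTP":
        raise ValueError(f"unsupported algorithm in this sketch: {algorithm}")

    config = {"num_speculative_tokens": draft_length}
    if draft_model_dir:
        # Assistant-model shape: point at the draft checkpoint and omit "method",
        # as the YAML header comment describes for Gemma 4.
        config["model"] = draft_model_dir
    else:
        # Assumed shape for models with built-in MTP heads (an assumption).
        config["method"] = "mtp"
    return config


cfg = build_speculative_config(
    "MTP", 3, "/hf-local/google/gemma-4-E4B-it-assistant"
)
print(cfg)
```

The values mirror `--draft_length 3` and the `draft_model` global variable from the tasks above.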
[SUGGESTION] When `path` is a HuggingFace Hub repo ID (e.g. `"google/gemma-4-E4B-it"`) rather than a local directory, `os.path.exists(tokenizer_config_path)` returns False and the new branch is skipped, so `AutoTokenizer.from_pretrained` will still hit the original "list-shaped `extra_special_tokens`" error.
This is fine for the launcher path (which mounts checkpoints under `/hf-local/...` and always passes a directory), but breaks direct CLI usage that resolves through the HF cache. If you want the fix to also cover that case, fall back to `huggingface_hub.try_to_load_from_cache(path, "tokenizer_config.json")` (or `cached_file`) when `os.path.exists` fails. Not blocking; flagging since the PR description frames this as a Gemma-4 fix in general, not launcher-specific.
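The cache fallback described in this comment could be sketched as below. The helper name `find_tokenizer_config` and its `cache_lookup` parameter are hypothetical, added so the lookup can be exercised without network access or an installed `huggingface_hub`; by default the sketch uses `huggingface_hub.try_to_load_from_cache`, which returns a string path on a cache hit and `None` or a sentinel object otherwise.

```python
import os


def find_tokenizer_config(path, cache_lookup=None):
    """Return a local path to tokenizer_config.json, or None if it is not found.

    `path` may be a local checkpoint directory or a Hub repo ID.
    `cache_lookup` is a hypothetical injection point; when omitted, the
    function tries huggingface_hub.try_to_load_from_cache.
    """
    local = os.path.join(path, "tokenizer_config.json")
    if os.path.exists(local):
        return local

    if cache_lookup is None:
        try:
            from huggingface_hub import try_to_load_from_cache as cache_lookup
        except ImportError:
            return None

    try:
        cached = cache_lookup(path, "tokenizer_config.json")
    except ValueError:
        # Raised when `path` is not a valid repo ID (e.g. a missing local directory).
        return None

    # Only a string is a usable file path; None or a sentinel means "not cached".
    return cached if isinstance(cached, str) else None
```

`get_tokenizer` could then open the returned path, when it is not `None`, instead of relying on `os.path.exists` alone. The sketch only consults the local cache and never downloads anything.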