Diffusion path strips all special tokens at the token-id level, deleting the tool-call/channel markers its own gemma4 parser needs (tool_calls: null, 'thought' leaks into content) #1351

@dudeoverhere

Summary

In mlx-vlm 0.6.3, the diffusion generation path strips all special tokens at the token-id level during detokenization. This deletes the <|tool_call>/<tool_call|> and <|channel>/<channel|> markers that the server's own gemma4 tool parser (mlx_vlm/tool_parsers/gemma4.py, auto-selected — /health reports loaded_tool_parser: "gemma4") and thinking splitter (server/app.py _split_thinking, which natively handles <|channel>thought) require.

Net effect for mlx-community/diffusiongemma-26B-A4B-it-4bit:

  1. Tool calls arrive as inert text content: tool_calls is null and content is the raw payload (e.g. call:get_weather{city:Austin}) with the surrounding markers deleted, so process_tool_calls never matches.
  2. The literal word thought leaks into content after tool-result turns — the channel name is an ordinary string, so once the <|channel> marker around it is stripped, _split_thinking can't recognize the block and the prefix thought\n lands in content.

The server selects a tool parser whose markers its own detokenizer deletes before the parser ever sees them.
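
To make the failure mode concrete, here is a minimal sketch of marker-delimited extraction. The regex and the helper name extract_tool_payloads are hypothetical stand-ins for illustration, not the actual logic in mlx_vlm/tool_parsers/gemma4.py; the only assumption is that the parser keys on the <|tool_call>/<tool_call|> pair.

```python
import re

# Hypothetical pattern: assumes a tool call is whatever sits between the
# <|tool_call> and <tool_call|> markers. The real gemma4 parser may differ.
TOOL_CALL_RE = re.compile(r"<\|tool_call>(.*?)<tool_call\|>", re.DOTALL)


def extract_tool_payloads(text: str) -> list[str]:
    """Return the payload of every marker-delimited tool call in `text`."""
    return TOOL_CALL_RE.findall(text)


# What the model emitted, and what is left after the markers are stripped.
with_markers = "<|tool_call>call:get_weather{city:Austin}<tool_call|>"
stripped = "call:get_weather{city:Austin}"

print(extract_tool_payloads(with_markers))  # ['call:get_weather{city:Austin}']
print(extract_tool_payloads(stripped))      # []
```

Once the markers are gone, nothing distinguishes the payload from ordinary text, which matches the tool_calls: null result reported here.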

Root cause

  • server/generation.py:1424 — the skip set is built from tokenizer.all_special_ids (every special=True token, no exclusions).
  • generate/diffusion.py:1053 — that set is passed per-token to detokenizer.add_token(..., skip_special_token_ids=...), so the marker ids are dropped at the token-id level, before any decode call.
  • server/generation.py:1470 — the final decode additionally uses skip_special_tokens=True.

For DiffusionGemma the six ids the parsers need are 48/49 (<|tool_call>/<tool_call|>), 50/51 (<|tool_response>/<tool_response|>), and 100/101 (<|channel>/<channel|>) — all special=True in tokenizer.json, all in the skip set.

The autoregressive path doesn't have this problem (cf. #900, where markers leaked through to streamed content — the opposite failure; here they're deleted before process_tool_calls can fire). A decode-only patch is not sufficient — the stripping happens at the id level in detokenizer.add_token, before any decode() call.
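
The id-level behaviour can be illustrated with a toy model. ToyDetokenizer below is a made-up stand-in that only mimics the add_token(..., skip_special_token_ids=...) call shape quoted above; it is not mlx-vlm's detokenizer. The ids 48/49 come from this report, while the other two ids are invented for the example.

```python
# Toy vocabulary: 48/49 are the marker ids from the report, 1000/1001 are made up.
VOCAB = {
    48: "<|tool_call>",
    49: "<tool_call|>",
    1000: "call:get_weather",
    1001: "{city:Austin}",
}


class ToyDetokenizer:
    """Stand-in detokenizer that drops skipped ids before producing any text."""

    def __init__(self):
        self.pieces = []

    def add_token(self, token_id, skip_special_token_ids=frozenset()):
        # A skipped id never becomes text, so no later decode step can
        # bring it back, whatever skip_special_tokens value it uses.
        if token_id in skip_special_token_ids:
            return
        self.pieces.append(VOCAB[token_id])

    @property
    def text(self):
        return "".join(self.pieces)


def detokenize(token_ids, skip_ids):
    detok = ToyDetokenizer()
    for token_id in token_ids:
        detok.add_token(token_id, skip_special_token_ids=skip_ids)
    return detok.text


ids = [48, 1000, 1001, 49]
print(detokenize(ids, frozenset()))          # markers preserved
print(detokenize(ids, frozenset({48, 49})))  # markers dropped at the id level
```

In this sketch the second call yields only the bare payload, which is why patching decode() alone does not restore the markers.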

Reproduction

Serve the model:

python -m mlx_vlm server --model mlx-community/diffusiongemma-26B-A4B-it-4bit --port 8010

Send a standard tools request:

curl -s http://127.0.0.1:8010/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "mlx-community/diffusiongemma-26B-A4B-it-4bit",
  "messages": [{"role": "user", "content": "What is the weather in Austin? Use the tool."}],
  "tools": [{"type": "function", "function": {"name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}]
}'

Observed:

{"message": {"content": "call:get_weather{city:Austin}", "tool_calls": null}, "finish_reason": "stop"}

Expected (and what we get with the workaround below applied):

{"message": {"content": "", "tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"city\": \"Austin\"}"}}]}, "finish_reason": "tool_calls"}

Same defect in streaming mode (no delta.tool_calls ever emitted), and the thought\n leak reproduces by following up with a role: "tool" result turn — the chat template doesn't emit a pre-closed empty thought channel after tool turns, so the model emits <|channel>thought ... itself, and only the markers get stripped.
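
For anyone scripting the reproduction, a small check along these lines can flag the defect automatically. It is a sketch only: the helper name looks_like_stripped_tool_call is hypothetical, it assumes the abbreviated response shape shown above (a dict with message and finish_reason), and the call: prefix test is a heuristic based on the single payload observed in this report.

```python
def looks_like_stripped_tool_call(response: dict) -> bool:
    """Heuristic: no structured tool_calls, but content starts with 'call:'."""
    message = response.get("message", {})
    content = message.get("content") or ""
    has_tool_calls = bool(message.get("tool_calls"))
    return not has_tool_calls and content.lstrip().startswith("call:")


# The two abbreviated responses from this report, written as Python dicts.
observed = {
    "message": {"content": "call:get_weather{city:Austin}", "tool_calls": None},
    "finish_reason": "stop",
}
expected = {
    "message": {
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": '{"city": "Austin"}'}}
        ],
    },
    "finish_reason": "tool_calls",
}

print(looks_like_stripped_tool_call(observed))  # True
print(looks_like_stripped_tool_call(expected))  # False
```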

Suggested fix

When a tool parser is loaded, exclude its marker token ids (and the channel/thinking markers used by _split_thinking) from the skip set built at server/generation.py:1424, so the diffusion path preserves exactly the special tokens the server's downstream parsers consume. Deriving the ids at runtime via convert_tokens_to_ids (rather than hardcoding 48/49/50/51/100/101) keeps it robust to id drift across model revisions.
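
A minimal sketch of that idea follows. build_skip_ids and StubTokenizer are hypothetical names introduced for illustration; the sketch assumes only that the tokenizer exposes all_special_ids, convert_tokens_to_ids, and unk_token_id as Hugging Face tokenizers do. In the stub, ids 48-51 and 100/101 are taken from this report and ids 1-3 are made up.

```python
MARKERS = (
    "<|tool_call>", "<tool_call|>",
    "<|tool_response>", "<tool_response|>",
    "<|channel>", "<channel|>",
)


def build_skip_ids(tokenizer, keep_markers=MARKERS):
    """Return all special ids except those of markers the parsers consume."""
    unk_id = getattr(tokenizer, "unk_token_id", None)
    keep_ids = set()
    for marker in keep_markers:
        token_id = tokenizer.convert_tokens_to_ids(marker)
        # Unknown tokens may resolve to None or to the unk id; ignore both.
        if token_id is not None and token_id != unk_id:
            keep_ids.add(token_id)
    return set(tokenizer.all_special_ids) - keep_ids


class StubTokenizer:
    """Illustrative stand-in for a real tokenizer; ids 1-3 are invented."""

    unk_token_id = 3
    _vocab = {
        "<eos>": 1, "<bos>": 2, "<unk>": 3,
        "<|tool_call>": 48, "<tool_call|>": 49,
        "<|tool_response>": 50, "<tool_response|>": 51,
        "<|channel>": 100, "<channel|>": 101,
    }

    @property
    def all_special_ids(self):
        return sorted(self._vocab.values())

    def convert_tokens_to_ids(self, token):
        return self._vocab.get(token, self.unk_token_id)


print(sorted(build_skip_ids(StubTokenizer())))  # [1, 2, 3]
```

Resolving the ids per tokenizer means a model revision that renumbers the markers needs no code change, and a tokenizer that lacks a marker simply keeps its full skip set.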

Workaround (verified)

We run the server through a launcher that wraps load_model_resources and, after model load:

  1. ID-level (the one that matters): retype the tokenizer into a dynamic subclass whose all_special_ids property excludes the six marker ids — the engine's skip set then preserves them.
  2. Text-level: wrap tokenizer.decode(skip_special_tokens=True) to still strip the remaining special tokens textually while keeping the six markers.

import mlx_vlm.server.generation as _gen

# Marker tokens that the gemma4 tool parser and _split_thinking need to see.
KEEP = {"<|tool_call>", "<tool_call|>", "<|tool_response>", "<tool_response|>",
        "<|channel>", "<channel|>"}

def _patch_tokenizer(tokenizer):
    # Resolve the marker ids at runtime instead of hardcoding them.
    keep_ids = {tid for m in KEEP
                if (tid := tokenizer.convert_tokens_to_ids(m)) is not None and tid >= 0}
    cls = type(tokenizer)
    orig_prop = cls.all_special_ids
    # ID-level: swap in a dynamic subclass whose all_special_ids omits the
    # marker ids, so the engine's skip set no longer drops them.
    tokenizer.__class__ = type(f"{cls.__name__}Patched", (cls,), {
        "all_special_ids": property(
            lambda self: [i for i in orig_prop.fget(self) if i not in keep_ids])})

    # Text-level: always decode with special tokens intact, then strip every
    # special token except the markers when skip_special_tokens=True.
    strip = {t for t in tokenizer.all_special_tokens if t not in KEEP}
    orig_decode = tokenizer.decode
    def decode(token_ids, skip_special_tokens=False, **kw):
        text = orig_decode(token_ids, skip_special_tokens=False, **kw)
        if skip_special_tokens:
            for s in strip:
                text = text.replace(s, "")
        return text
    tokenizer.decode = decode

# Wrap load_model_resources so the tokenizer is patched right after model load.
_orig_load = _gen.load_model_resources
def load_model_resources(model_path, adapter_path):
    model, processor, config = _orig_load(model_path, adapter_path)
    _patch_tokenizer(getattr(processor, "tokenizer", processor))
    return model, processor, config
_gen.load_model_resources = load_model_resources

# Start the server with the patch in place.
from mlx_vlm.server import main
main()

With this applied, non-streaming and streaming tool calls both produce proper tool_calls with finish_reason: "tool_calls" (the gemma4 parser even converts the model's unquoted call: syntax into valid quoted-JSON arguments), the thought leak is gone, and plain/streaming completions without tools are unchanged.

Environment

  • mlx-vlm 0.6.3, mlx 0.31.2, mlx-lm 0.31.3, transformers 5.11.0, tokenizers 0.22.2
  • macOS 26.4.1, Apple Silicon (M5 Max, arm64), Python 3.12
  • Model: mlx-community/diffusiongemma-26B-A4B-it-4bit
