Summary
In mlx-vlm 0.6.3, the diffusion generation path strips all special tokens at the token-id level during detokenization. This deletes the <|tool_call>/<tool_call|> and <|channel>/<channel|> markers that the server's own gemma4 tool parser (mlx_vlm/tool_parsers/gemma4.py, auto-selected — /health reports loaded_tool_parser: "gemma4") and thinking splitter (server/app.py _split_thinking, which natively handles <|channel>thought) require.
Net effect for mlx-community/diffusiongemma-26B-A4B-it-4bit:
- Tool calls arrive as inert text content: tool_calls is null and content is the raw payload, e.g. call:get_weather{city:Austin}, with the surrounding markers deleted, so process_tool_calls never matches.
- The literal word thought leaks into content after tool-result turns: the channel name is an ordinary string, so once the <|channel> marker around it is stripped, _split_thinking can't recognize the block and the prefix thought\n lands in content.
The server selects a tool parser whose markers its own detokenizer deletes before the parser ever sees them.
Root cause
server/generation.py:1424 — the skip set is built from tokenizer.all_special_ids (every special=True token, no exclusions).
generate/diffusion.py:1053 — that set is passed per-token to detokenizer.add_token(..., skip_special_token_ids=...), so the marker ids are dropped at the token-id level, before any decode call.
server/generation.py:1470 — the final decode additionally uses skip_special_tokens=True.
For DiffusionGemma the six ids the parsers need are 48/49 (<|tool_call>/<tool_call|>), 50/51 (<|tool_response>/<tool_response|>), and 100/101 (<|channel>/<channel|>) — all special=True in tokenizer.json, all in the skip set.
The autoregressive path doesn't have this problem (cf. #900, where markers leaked through to streamed content — the opposite failure; here they're deleted before process_tool_calls can fire). A decode-only patch is not sufficient — the stripping happens at the id level in detokenizer.add_token, before any decode() call.
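To make the id-level failure concrete, here is a minimal toy sketch. It does not use mlx-vlm's detokenizer: the vocabulary, the ToyDetokenizer class, and the regex matcher are hypothetical stand-ins, and only the marker ids 48/49 and the skip_special_token_ids parameter name come from the description above.

```python
import re

# Toy vocabulary: ids 48/49 mirror the tool-call markers named above;
# id 1000 is a made-up id for the payload text.
VOCAB = {48: "<|tool_call>", 49: "<tool_call|>", 1000: "call:get_weather{city:Austin}"}
SPECIAL_IDS = {48, 49}


class ToyDetokenizer:
    """Hypothetical stand-in for a streaming detokenizer."""

    def __init__(self):
        self.text = ""

    def add_token(self, token_id, skip_special_token_ids=()):
        # Skipped ids never reach the text buffer, so no later
        # decode() call can bring them back.
        if token_id in skip_special_token_ids:
            return
        self.text += VOCAB[token_id]


# Hypothetical marker-based matcher, standing in for a tool parser.
TOOL_CALL_RE = re.compile(r"<\|tool_call>(.*?)<tool_call\|>", re.S)


def detokenize(ids, skip_ids):
    detok = ToyDetokenizer()
    for token_id in ids:
        detok.add_token(token_id, skip_special_token_ids=skip_ids)
    return detok.text


ids = [48, 1000, 49]
stripped = detokenize(ids, SPECIAL_IDS)   # markers dropped at the id level
preserved = detokenize(ids, set())        # markers survive

print(repr(stripped), TOOL_CALL_RE.search(stripped))
print(repr(preserved), TOOL_CALL_RE.search(preserved).group(1))
```

In the stripped case the matcher has nothing to anchor on, which is the same situation process_tool_calls is in; changing only the final decode cannot help because the buffer never contained the markers.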
Reproduction
Serve the model:
python -m mlx_vlm server --model mlx-community/diffusiongemma-26B-A4B-it-4bit --port 8010
Send a standard tools request:
curl -s http://127.0.0.1:8010/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "mlx-community/diffusiongemma-26B-A4B-it-4bit",
  "messages": [{"role": "user", "content": "What is the weather in Austin? Use the tool."}],
  "tools": [{"type": "function", "function": {"name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}]
}'
Observed:
{"message": {"content": "call:get_weather{city:Austin}", "tool_calls": null}, "finish_reason": "stop"}
Expected (and what we get with the workaround below applied):
{"message": {"content": "", "tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"city\": \"Austin\"}"}}]}, "finish_reason": "tool_calls"}
The same defect occurs in streaming mode: no delta.tool_calls is ever emitted. The thought\n leak reproduces by following up with a role: "tool" result turn. The chat template doesn't emit a pre-closed empty thought channel after tool turns, so the model emits <|channel>thought ... itself, and only the markers get stripped.
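For scripted checks, the observed and expected shapes above can be told apart with a small helper. This is a sketch only: looks_like_stripped_tool_call is a hypothetical name, it takes the abbreviated objects exactly as shown above, and the call:name{...} pattern is inferred from the single observed payload rather than from any specification of the model's syntax.

```python
import json
import re

# Pattern inferred from the observed payload, e.g. call:get_weather{city:Austin}.
BARE_CALL_RE = re.compile(r"^\s*call:\w+\{.*\}\s*$", re.S)


def looks_like_stripped_tool_call(choice):
    """True when a tool-call payload landed in content instead of tool_calls."""
    message = choice.get("message", {})
    content = message.get("content") or ""
    return not message.get("tool_calls") and bool(BARE_CALL_RE.match(content))


observed = json.loads(
    '{"message": {"content": "call:get_weather{city:Austin}", "tool_calls": null},'
    ' "finish_reason": "stop"}'
)
expected = json.loads(
    '{"message": {"content": "", "tool_calls": [{"function": {"name": "get_weather",'
    ' "arguments": "{\\"city\\": \\"Austin\\"}"}}]}, "finish_reason": "tool_calls"}'
)

print(looks_like_stripped_tool_call(observed))
print(looks_like_stripped_tool_call(expected))
```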
Suggested fix
When a tool parser is loaded, exclude its marker token ids (and the channel/thinking markers used by _split_thinking) from the skip set built at server/generation.py:1424, so the diffusion path preserves exactly the special tokens the server's downstream parsers consume. Deriving the ids at runtime via convert_tokens_to_ids (rather than hardcoding 48/49/50/51/100/101) keeps it robust to id drift across model revisions.
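A minimal sketch of that derivation, assuming only that the tokenizer exposes all_special_ids and convert_tokens_to_ids the way Hugging Face tokenizers do. The helper name build_skip_ids, the marker tuple, and the FakeTokenizer used to exercise it are hypothetical and are not mlx-vlm code; in the server, the marker list would presumably come from the loaded tool parser and _split_thinking.

```python
def build_skip_ids(tokenizer, keep_markers):
    """Return the special ids to skip, minus the markers downstream parsers need."""
    unk_id = getattr(tokenizer, "unk_token_id", None)
    keep_ids = set()
    for marker in keep_markers:
        tid = tokenizer.convert_tokens_to_ids(marker)
        # Unknown tokens may come back as None or as the unk id; ignore both.
        if tid is None or tid == unk_id:
            continue
        keep_ids.add(tid)
    return set(tokenizer.all_special_ids) - keep_ids


class FakeTokenizer:
    """Hypothetical stand-in so the sketch runs without loading a model.

    Ids 48/49/100/101 follow the report; ids 2 and 3 are made up.
    """

    unk_token_id = 3
    _vocab = {"<bos>": 2, "<unk>": 3, "<|tool_call>": 48, "<tool_call|>": 49,
              "<|channel>": 100, "<channel|>": 101}
    all_special_ids = [2, 3, 48, 49, 100, 101]

    def convert_tokens_to_ids(self, token):
        return self._vocab.get(token, self.unk_token_id)


MARKERS = ("<|tool_call>", "<tool_call|>", "<|channel>", "<channel|>", "<|not_in_vocab>")
skip_ids = build_skip_ids(FakeTokenizer(), MARKERS)
print(sorted(skip_ids))
```

Filtering out the unk id matters: without it, a marker missing from a given model's vocabulary would silently remove the unknown-token id from the skip set.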
Workaround (verified)
We run the server through a launcher that wraps load_model_resources and, after model load:
- ID-level (the one that matters): retype the tokenizer into a dynamic subclass whose all_special_ids property excludes the six marker ids; the engine's skip set then preserves them.
- Text-level: wrap tokenizer.decode(skip_special_tokens=True) to still strip the remaining special tokens textually while keeping the six markers.
import mlx_vlm.server.generation as _gen

# Markers the gemma4 tool parser and _split_thinking need to see.
KEEP = {"<|tool_call>", "<tool_call|>", "<|tool_response>", "<tool_response|>",
        "<|channel>", "<channel|>"}

def _patch_tokenizer(tokenizer):
    # Resolve marker ids at runtime instead of hardcoding them.
    keep_ids = {tid for m in KEEP
                if (tid := tokenizer.convert_tokens_to_ids(m)) is not None and tid >= 0}

    # ID-level: swap in a subclass whose all_special_ids omits the markers.
    cls = type(tokenizer)
    orig_prop = cls.all_special_ids
    tokenizer.__class__ = type(f"{cls.__name__}Patched", (cls,), {
        "all_special_ids": property(
            lambda self: [i for i in orig_prop.fget(self) if i not in keep_ids])})

    # Text-level: strip every other special token from the decoded text.
    strip = {t for t in tokenizer.all_special_tokens if t not in KEEP}
    orig_decode = tokenizer.decode

    def decode(token_ids, skip_special_tokens=False, **kw):
        text = orig_decode(token_ids, skip_special_tokens=False, **kw)
        if skip_special_tokens:
            for s in strip:
                text = text.replace(s, "")
        return text

    tokenizer.decode = decode

_orig_load = _gen.load_model_resources

def load_model_resources(model_path, adapter_path):
    model, processor, config = _orig_load(model_path, adapter_path)
    # Fall back to the processor itself when it has no .tokenizer attribute.
    _patch_tokenizer(getattr(processor, "tokenizer", processor))
    return model, processor, config

_gen.load_model_resources = load_model_resources

from mlx_vlm.server import main
main()
With this applied, non-streaming and streaming tool calls both produce proper tool_calls with finish_reason: "tool_calls" (the gemma4 parser even converts the model's unquoted call: syntax into valid quoted-JSON arguments), the thought leak is gone, and plain/streaming completions without tools are unchanged.
Environment
- mlx-vlm 0.6.3, mlx 0.31.2, mlx-lm 0.31.3, transformers 5.11.0, tokenizers 0.22.2
- macOS 26.4.1, Apple Silicon (M5 Max, arm64), Python 3.12
- Model: mlx-community/diffusiongemma-26B-A4B-it-4bit