Diffusion path strips all special tokens at the token-id level, deleting the tool-call/channel markers its own gemma4 parser needs (tool_calls: null, 'thought' leaks into content)

## Summary

In mlx-vlm **0.6.3**, the diffusion generation path strips **all** special tokens at the token-id level during detokenization. This deletes the `<|tool_call>`/`<tool_call|>` and `<|channel>`/`<channel|>` markers that the server's **own** gemma4 tool parser (`mlx_vlm/tool_parsers/gemma4.py`, auto-selected — `/health` reports `loaded_tool_parser: "gemma4"`) and thinking splitter (`server/app.py` `_split_thinking`, which natively handles `<|channel>thought`) require.

Net effect for `mlx-community/diffusiongemma-26B-A4B-it-4bit`:

1. **Tool calls arrive as inert text content** — `tool_calls: null`, content is the raw payload like `call:get_weather{city:Austin}` with the surrounding markers deleted, so `process_tool_calls` never matches.
2. **The literal word `thought` leaks into content** after tool-result turns — the channel *name* is an ordinary string, so once the `<|channel>` marker around it is stripped, `_split_thinking` can't recognize the block and the prefix `thought\n` lands in `content`.

The server selects a tool parser whose markers its own detokenizer deletes before the parser ever sees them.

## Root cause

- `server/generation.py:1424` — the skip set is built from `tokenizer.all_special_ids` (every `special=True` token, no exclusions).
- `generate/diffusion.py:1053` — that set is passed per-token to `detokenizer.add_token(..., skip_special_token_ids=...)`, so the marker ids are dropped at the **token-id level**, before any decode call.
- `server/generation.py:1470` — the final decode additionally uses `skip_special_tokens=True`.

For DiffusionGemma the six ids the parsers need are `48`/`49` (`<|tool_call>`/`<tool_call|>`), `50`/`51` (`<|tool_response>`/`<tool_response|>`), and `100`/`101` (`<|channel>`/`<channel|>`) — all `special=True` in `tokenizer.json`, all in the skip set.
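To check the same thing on another model revision, the marker ids and their `special` flags can be read straight from `tokenizer.json`. The following is a minimal sketch, assuming the markers are listed under the file's `added_tokens` array; `marker_flags` is a hypothetical helper, and the sample dict is hand-written to mirror the ids reported above rather than copied from the real file.

```python
import json

MARKERS = [
    "<|tool_call>", "<tool_call|>",
    "<|tool_response>", "<tool_response|>",
    "<|channel>", "<channel|>",
]

def marker_flags(tokenizer_json, markers=MARKERS):
    """Map each marker to (id, special), or None if it is not an added token."""
    by_content = {t["content"]: t for t in tokenizer_json.get("added_tokens", [])}
    return {
        m: (by_content[m]["id"], by_content[m]["special"]) if m in by_content else None
        for m in markers
    }

# Hand-written sample mirroring the ids reported in this issue.
sample = {"added_tokens": [
    {"id": 48, "content": "<|tool_call>", "special": True},
    {"id": 49, "content": "<tool_call|>", "special": True},
    {"id": 50, "content": "<|tool_response>", "special": True},
    {"id": 51, "content": "<tool_response|>", "special": True},
    {"id": 100, "content": "<|channel>", "special": True},
    {"id": 101, "content": "<channel|>", "special": True},
]}
flags = marker_flags(sample)

# Against a real checkout (the path is an assumption):
# with open("tokenizer.json") as f:
#     flags = marker_flags(json.load(f))
```

Any marker that comes back with `special` set to `True` ends up in `all_special_ids`, and therefore in the skip set.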

The autoregressive path doesn't have this problem (cf. #900, where markers leaked *through* to streamed content — the opposite failure; here they're deleted before `process_tool_calls` can fire). A decode-only patch is **not** sufficient — the stripping happens at the id level in `detokenizer.add_token`, before any `decode()` call.
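The mechanism can be illustrated with a toy model of id-level skipping. This is a sketch, not mlx-vlm code: the vocabulary, `detokenize`, and both regexes are made up for illustration, with the regexes standing in for the gemma4 tool parser and `_split_thinking`. Only ids 48/49/100/101 are taken from the report above.

```python
import re

# Toy vocabulary. Ids 48/49/100/101 mirror the ones reported above; every
# other id and string is invented for illustration.
VOCAB = {
    1: "<bos>",
    48: "<|tool_call>", 49: "<tool_call|>",
    100: "<|channel>", 101: "<channel|>",
    200: "call:get_weather{city:Austin}",
    201: "thought\n",
    202: "The tool returned sunny.",
    203: "It is sunny in Austin.",
}
ALL_SPECIAL_IDS = frozenset({1, 48, 49, 100, 101})
MARKER_IDS = frozenset({48, 49, 100, 101})

def detokenize(ids, skip_ids=frozenset()):
    """Join token strings, dropping skipped ids before any text exists."""
    return "".join(VOCAB[i] for i in ids if i not in skip_ids)

# Stand-ins for the server's tool parser and thinking splitter (assumed shapes).
TOOL_CALL_RE = re.compile(r"<\|tool_call>(.*?)<tool_call\|>", re.DOTALL)
THOUGHT_RE = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)

call_ids = [1, 48, 200, 49]
turn_ids = [1, 100, 201, 202, 101, 203]

# Skipping every special id removes the markers along with <bos>.
stripped_call = detokenize(call_ids, ALL_SPECIAL_IDS)
stripped_turn = detokenize(turn_ids, ALL_SPECIAL_IDS)

# Excluding the marker ids from the skip set preserves them.
kept_call = detokenize(call_ids, ALL_SPECIAL_IDS - MARKER_IDS)
kept_turn = detokenize(turn_ids, ALL_SPECIAL_IDS - MARKER_IDS)

print(TOOL_CALL_RE.search(stripped_call))       # no match: nothing left to anchor on
print(TOOL_CALL_RE.search(kept_call).group(1))  # the payload between the markers
print(repr(stripped_turn))                      # begins with the bare channel name
```

Because the markers are gone before any string is produced, no change to a later `decode()` call can bring them back.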

## Reproduction

Serve the model:

```bash
python -m mlx_vlm server --model mlx-community/diffusiongemma-26B-A4B-it-4bit --port 8010
```

Send a standard tools request:

```bash
curl -s http://127.0.0.1:8010/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "mlx-community/diffusiongemma-26B-A4B-it-4bit",
  "messages": [{"role": "user", "content": "What is the weather in Austin? Use the tool."}],
  "tools": [{"type": "function", "function": {"name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}]
}'
```

**Observed:**

```json
{"message": {"content": "call:get_weather{city:Austin}", "tool_calls": null}, "finish_reason": "stop"}
```

**Expected** (and what we get with the workaround below applied):

```json
{"message": {"content": "", "tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"city\": \"Austin\"}"}}]}, "finish_reason": "tool_calls"}
```

The same defect occurs in streaming mode (no `delta.tool_calls` is ever emitted). The `thought\n` leak reproduces by following up with a `role: "tool"` result turn: the chat template doesn't emit a pre-closed empty thought channel after `tool` turns, so the model emits `<|channel>thought ...` itself, and only the markers get stripped.
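For scripted regression checks, the observed and expected outcomes can be told apart with a small classifier over one `choices` entry of the non-streaming response. This is a sketch: `tool_call_outcome` is a hypothetical helper, and the `call:` prefix test is a heuristic based only on the payload shown above.

```python
def tool_call_outcome(choice):
    """Classify one `choices` entry as 'parsed', 'leaked', or 'none'."""
    message = choice.get("message", {})
    if message.get("tool_calls"):
        return "parsed"   # the parser fired and structured calls are present
    content = message.get("content") or ""
    if content.lstrip().startswith("call:"):
        return "leaked"   # raw call payload ended up in content
    return "none"         # ordinary completion without a tool call

observed = {"message": {"content": "call:get_weather{city:Austin}",
                        "tool_calls": None},
            "finish_reason": "stop"}
expected = {"message": {"content": "",
                        "tool_calls": [{"function": {
                            "name": "get_weather",
                            "arguments": "{\"city\": \"Austin\"}"}}]},
            "finish_reason": "tool_calls"}

print(tool_call_outcome(observed), tool_call_outcome(expected))
```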

## Suggested fix

When a tool parser is loaded, exclude its marker token ids (and the channel/thinking markers used by `_split_thinking`) from the skip set built at `server/generation.py:1424`, so the diffusion path preserves exactly the special tokens the server's downstream parsers consume. Deriving the ids at runtime via `convert_tokens_to_ids` (rather than hardcoding 48/49/50/51/100/101) keeps it robust to id drift across model revisions.
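A minimal sketch of that skip-set construction follows. `build_skip_ids` and `StubTokenizer` are hypothetical names; the stub only imitates the two tokenizer members the sketch relies on (`all_special_ids` and `convert_tokens_to_ids`), and its ids other than the six markers are invented. How the keep list would be obtained from the loaded parser is left open here.

```python
KEEP_TOKENS = [
    "<|tool_call>", "<tool_call|>",
    "<|tool_response>", "<tool_response|>",
    "<|channel>", "<channel|>",
]

def build_skip_ids(tokenizer, keep_tokens):
    """Return all special ids except those of markers that resolve in the vocab."""
    unk = getattr(tokenizer, "unk_token_id", None)
    keep_ids = set()
    for token in keep_tokens:
        tid = tokenizer.convert_tokens_to_ids(token)
        # Unknown tokens resolve to the unk id (or None); never un-skip those.
        if tid is not None and tid != unk:
            keep_ids.add(tid)
    return set(tokenizer.all_special_ids) - keep_ids

class StubTokenizer:
    """Stand-in for a real tokenizer, used only to exercise the sketch."""
    unk_token_id = 3
    all_special_ids = [1, 2, 3, 48, 49, 50, 51, 100, 101]
    _vocab = {"<|tool_call>": 48, "<tool_call|>": 49,
              "<|tool_response>": 50, "<tool_response|>": 51,
              "<|channel>": 100, "<channel|>": 101}

    def convert_tokens_to_ids(self, token):
        return self._vocab.get(token, self.unk_token_id)

# A marker missing from the vocab is ignored rather than un-skipping <unk>.
skip_ids = build_skip_ids(StubTokenizer(), KEEP_TOKENS + ["<|not_a_marker>"])
print(sorted(skip_ids))
```

Resolving ids by token string means a model revision that renumbers the markers needs no code change, and a model that lacks a marker simply keeps its full skip set.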

## Workaround (verified)

We run the server through a launcher that wraps `load_model_resources` and, after model load:

1. **ID-level** (the one that matters): retype the tokenizer into a dynamic subclass whose `all_special_ids` property excludes the six marker ids — the engine's skip set then preserves them.
2. **Text-level**: wrap `tokenizer.decode(skip_special_tokens=True)` to still strip the remaining special tokens textually while keeping the six markers.

```python
import mlx_vlm.server.generation as _gen

KEEP = {"<|tool_call>", "<tool_call|>", "<|tool_response>", "<tool_response|>",
        "<|channel>", "<channel|>"}

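def _patch_tokenizer(tokenizer):
    # convert_tokens_to_ids maps unknown tokens to the unk id (or None), so
    # filter on that too; otherwise a missing marker would un-skip <unk>.
    unk = getattr(tokenizer, "unk_token_id", None)
    keep_ids = {tid for m in KEEP
                if (tid := tokenizer.convert_tokens_to_ids(m)) is not None
                and tid >= 0 and tid != unk}
    # ID-level: swap in a subclass whose all_special_ids omits the markers.
    cls = type(tokenizer)
    orig_prop = cls.all_special_ids
    tokenizer.__class__ = type(f"{cls.__name__}Patched", (cls,), {
        "all_special_ids": property(
            lambda self: [i for i in orig_prop.fget(self) if i not in keep_ids])})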

    strip = {t for t in tokenizer.all_special_tokens if t not in KEEP}
    orig_decode = tokenizer.decode
    def decode(token_ids, skip_special_tokens=False, **kw):
        text = orig_decode(token_ids, skip_special_tokens=False, **kw)
        if skip_special_tokens:
            for s in strip:
                text = text.replace(s, "")
        return text
    tokenizer.decode = decode

_orig_load = _gen.load_model_resources
def load_model_resources(model_path, adapter_path):
    model, processor, config = _orig_load(model_path, adapter_path)
    _patch_tokenizer(getattr(processor, "tokenizer", processor))
    return model, processor, config
_gen.load_model_resources = load_model_resources

from mlx_vlm.server import main
main()
```

With this applied, non-streaming and streaming tool calls both produce proper `tool_calls` with `finish_reason: "tool_calls"` (the gemma4 parser even converts the model's unquoted `call:` syntax into valid quoted-JSON `arguments`), the `thought` leak is gone, and plain/streaming completions without tools are unchanged.

## Environment

- mlx-vlm 0.6.3, mlx 0.31.2, mlx-lm 0.31.3, transformers 5.11.0, tokenizers 0.22.2
- macOS 26.4.1, Apple Silicon (M5 Max, arm64), Python 3.12
- Model: `mlx-community/diffusiongemma-26B-A4B-it-4bit`

