MTP speculative decoding crashes on Qwen3.6-35B-A3B

Bug: MTP speculative decoding crashes on Qwen3.6-35B-A3B (`'Qwen3_5MTPDraftModel' object has no attribute 'language_model'`)

## Environment
- `mlx-vlm` (latest from PyPI, installed via `uv tool install mlx-vlm[ui]`)
- mlx-vlm version: `0.6.2` (per `INFO:     server: mlx_vlm/0.6.2` in startup log)
- Hardware: MacBook Pro M4 Max, 128GB unified memory, macOS 26.4.1
- Target model: `mlx-community/Qwen3.6-35B-A3B-8bit` (loaded from local path, MLX format)
- Draft model: `mlx-community/Qwen3.6-35B-A3B-MTP-bf16`
- Command:
  ```bash
  mlx_vlm.server --host 0.0.0.0 --port 8084 \
    --model /Users/huanglibo/.omlx/models/Qwen3.6-35B-A3B-8bit \
    --draft-model mlx-community/Qwen3.6-35B-A3B-MTP-bf16 \
    --draft-kind mtp --enable-thinking
  ```

## Symptom
The server starts cleanly (the log shows "Drafter ready — speculative decoding enabled.") and the first request usually returns 200 OK. Subsequent requests, and sometimes the first one too, hang and eventually return 502. The server log shows the same traceback on every failed request:

```
ERROR - Error in generation thread
Traceback (most recent call last):
  File ".../mlx_vlm/server/generation.py", line 1118, in _run
    self.model.language_model,
  File ".../mlx/nn/layers/base.py", line 103, in __getattr__
    super(Module, self).__getattribute__(key)
AttributeError: 'Qwen3_5MTPDraftModel' object has no attribute 'language_model'
```

At line 1118, `self.model` is the drafter (`Qwen3_5MTPDraftModel`) rather than the target model. Line 819 assigns the target with `self.model = model` (verified by inspecting `_initialize_model`), so either something reassigns `self.model` between `_initialize_model` returning and `_run` reading `self.model.language_model`, or `self.model` is the wrapped drafter from the start.
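
One way to tell these two hypotheses apart is to record every assignment to `model` together with its call site. The sketch below is a generic Python debugging pattern, not mlx-vlm code: `TracedHolder`, `FakeTarget`, `FakeDrafter`, `initialize`, and `later_reassign` are hypothetical stand-ins, and the same `__setattr__` hook would have to be patched onto whatever class owns `self.model` in `generation.py`.

```python
import traceback


class TracedHolder:
    """Hypothetical stand-in for the object that owns `self.model`."""

    def __init__(self):
        # Bypass the hook so the log itself is not traced.
        object.__setattr__(self, "assignments", [])

    def __setattr__(self, name, value):
        if name == "model":
            # With limit=2 the stack is [caller, __setattr__]; [0] is the caller.
            caller = traceback.extract_stack(limit=2)[0]
            self.assignments.append(
                (type(value).__name__, caller.name, caller.lineno)
            )
        object.__setattr__(self, name, value)


class FakeTarget:  # placeholder for the target model class
    pass


class FakeDrafter:  # placeholder for Qwen3_5MTPDraftModel
    pass


def initialize(holder):
    holder.model = FakeTarget()


def later_reassign(holder):
    holder.model = FakeDrafter()


holder = TracedHolder()
initialize(holder)
later_reassign(holder)
for cls_name, func, line in holder.assignments:
    print(f"model <- {cls_name} in {func}() at line {line}")
```

If only one assignment is logged and it already carries the drafter class, the drafter was stored from the start; two or more entries point to a later reassignment and name the function responsible.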

## What I already tried
- Both `Qwen3.6-35B-A3B-MTP-bf16` and `Qwen3.6-35B-A3B-MTP-4bit` (same `Qwen3_5MTPDraftModel` class) — same crash.
- 27B target + 27B MTP drafter works fine (`self.model.language_model` resolves correctly). The bug is specific to the 35B-A3B MoE target + `Qwen3_5MTPDraftModel` drafter combination.
- `validate_drafter_compatibility(target, draft_model, "mtp")` does NOT raise — the hidden_size check passes, so the drafter is considered valid right up until the first generation call.
- Running 35B-A3B without `--draft-model` works perfectly (~86 tok/s, no crash). The MoE model itself is fine; the bug is only in the speculative decoding path for this combination.
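
Since the existing compatibility check passes, a cheap extra guard could surface the problem at startup instead of on the first request. This is only a sketch of the idea: `require_language_model` is a hypothetical helper, not part of mlx-vlm, and the two classes are stand-ins for the real target and drafter.

```python
def require_language_model(model, role="target"):
    """Return model.language_model or raise a descriptive error."""
    # A sentinel distinguishes "attribute missing" from a falsy attribute.
    missing = object()
    inner = getattr(model, "language_model", missing)
    if inner is missing:
        raise TypeError(
            f"{role} is {type(model).__name__}, which has no "
            "'language_model' attribute; was the drafter passed instead?"
        )
    return inner


class FakeTarget:  # stand-in for the target model
    def __init__(self):
        self.language_model = "inner-lm"


class FakeDrafter:  # stand-in for Qwen3_5MTPDraftModel
    pass


print(require_language_model(FakeTarget()))
try:
    require_language_model(FakeDrafter())
except TypeError as exc:
    print(exc)
```

Calling a check like this right after model initialization would turn the hang-then-502 behaviour into an immediate, readable startup failure.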

## Workaround
Don't pass `--draft-model` to the 35B-A3B server. Speed drops from ~115 tok/s to ~86 tok/s, which is still faster than 27B+MTP thanks to MoE sparsity (3B active per token). I'd love to get the full ~115 tok/s back when this is fixed.
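
For reference, the relative cost of the workaround follows directly from the two throughput figures above. Both inputs are approximate, so the results are rough as well.

```python
with_mtp = 115.0     # tok/s with the MTP drafter (approximate, from this report)
without_mtp = 86.0   # tok/s without --draft-model (approximate, from this report)

slowdown = 1 - without_mtp / with_mtp   # fraction of throughput lost
speedup = with_mtp / without_mtp - 1    # gain a working MTP path would restore

print(f"throughput lost without MTP: {slowdown:.1%}")
print(f"speedup from MTP when working: {speedup:.1%}")
```

That is roughly a 25% loss from the workaround, or equivalently about a 34% gain once MTP works again.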

## Relevant code
- `mlx_vlm/server/generation.py:1118` — the failing access
- `mlx_vlm/server/generation.py:819` — where `self.model = model` is set
- `mlx_vlm/speculative/drafters/__init__.py:load_drafter` — returns `(load_model(path), resolved_kind)`
- `mlx_vlm/speculative/drafters/qwen3_5_mtp/qwen3_5_mtp.py:Qwen3_5MTPDraftModel` — the drafter class
- The 35B-A3B target is `Qwen3_5MoeForConditionalGeneration` (per `architectures` in the local `config.json`); its `language_model` attribute is the standard `Qwen3_5MoeModel`, accessible via `target.language_model`.
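
To confirm which class a local checkpoint maps to, the `architectures` field can be read straight from `config.json`. The following is a minimal sketch using only the standard library; `read_architectures` is a hypothetical helper, and the demo writes a throwaway config rather than touching a real model directory.

```python
import json
import tempfile
from pathlib import Path


def read_architectures(model_dir):
    """Return the `architectures` list from <model_dir>/config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("architectures", [])


# Demo against a temporary directory standing in for the local model path.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(
        json.dumps({"architectures": ["Qwen3_5MoeForConditionalGeneration"]})
    )
    archs = read_architectures(tmp)
    print(archs)
```

Pointing `read_architectures` at the real target directory and at the downloaded drafter snapshot would show whether the two checkpoints declare the classes expected here.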

Happy to provide more logs or run with extra debug instrumentation if helpful.


MTP speculative decoding crashes on Qwen3.6-35B-A3B #1317
