
MTP speculative decoding crashes on Qwen3.6-35B-A3B #1317

@huanglibo389

Bug: MTP speculative decoding crashes on Qwen3.6-35B-A3B (Qwen3_5MTPDraftModel has no attribute 'language_model')

Environment

  • mlx-vlm (latest from PyPI, installed via uv tool install mlx-vlm[ui])
  • mlx-vlm version: 0.6.2 (per INFO: server: mlx_vlm/0.6.2 in startup log)
  • Hardware: MacBook Pro M4 Max, 128GB unified memory, macOS 26.4.1
  • Target model: mlx-community/Qwen3.6-35B-A3B-8bit (loaded from local path, MLX format)
  • Draft model: mlx-community/Qwen3.6-35B-A3B-MTP-bf16
  • Command:
    mlx_vlm.server --host 0.0.0.0 --port 8084 \
      --model /Users/huanglibo/.omlx/models/Qwen3.6-35B-A3B-8bit \
      --draft-model mlx-community/Qwen3.6-35B-A3B-MTP-bf16 \
      --draft-kind mtp --enable-thinking

Symptom

The server starts cleanly ("Drafter ready — speculative decoding enabled.") and the first request returns 200 OK. Subsequent requests (and sometimes the first one too) hang and eventually return 502. The server log shows the same traceback for every failed request:

ERROR - Error in generation thread
Traceback (most recent call last):
  File ".../mlx_vlm/server/generation.py", line 1118, in _run
    self.model.language_model,
  File ".../mlx/nn/layers/base.py", line 103, in __getattr__
    super(Module, self).__getattribute__(key)
AttributeError: 'Qwen3_5MTPDraftModel' object has no attribute 'language_model'

At line 1118, self.model is the drafter (Qwen3_5MTPDraftModel) rather than the target model. Line 819 assigns the target with self.model = model (verified by inspecting _initialize_model), so either something reassigns self.model between _initialize_model returning and _run reading self.model.language_model, or the object assigned at line 819 is already the (wrapped) drafter.
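To tell those two possibilities apart, one option is to record every assignment to the attribute together with the call site. The sketch below is only an illustration of that idea using a plain Python data descriptor; AssignmentTracer, FakeGenerator, FakeTarget and FakeDrafter are hypothetical stand-ins, not mlx-vlm classes, and whether the real class in generation.py can be patched this way is an assumption I have not verified.

```python
import traceback


class AssignmentTracer:
    """Data descriptor that logs every assignment to one attribute."""

    def __init__(self, name):
        self.storage = "_traced_" + name  # where the real value is kept
        self.log = []  # (type name of assigned value, formatted call stack)

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # class-level access returns the tracer itself
        return getattr(obj, self.storage)

    def __set__(self, obj, value):
        # Drop the last frame, which is this __set__ call itself.
        stack = "".join(traceback.format_stack(limit=4)[:-1])
        self.log.append((type(value).__name__, stack))
        setattr(obj, self.storage, value)


# Hypothetical stand-ins for the target, the drafter and the generator.
class FakeTarget:
    language_model = object()


class FakeDrafter:
    pass


class FakeGenerator:
    model = AssignmentTracer("model")

    def __init__(self, model):
        self.model = model  # plays the role of the assignment at line 819


gen = FakeGenerator(FakeTarget())
gen.model = FakeDrafter()  # simulated later reassignment

tracer = FakeGenerator.__dict__["model"]
assigned_types = [type_name for type_name, _ in tracer.log]
print(assigned_types)
```

If the log shows only one entry and it is already the drafter class, the wrong object is being passed in from the start; two entries would point to a later reassignment, and the recorded stack shows where it happened.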

What I already tried

  • Both Qwen3.6-35B-A3B-MTP-bf16 and Qwen3.6-35B-A3B-MTP-4bit (same Qwen3_5MTPDraftModel class) — same crash.
  • 27B target + 27B MTP drafter works fine (self.model.language_model resolves correctly). The bug is specific to the 35B-A3B MoE target + Qwen3_5MTPDraftModel drafter combination.
  • validate_drafter_compatibility(target, draft_model, "mtp") does NOT raise — the hidden_size check passes, so the drafter is considered valid right up until the first generation call.
  • Running 35B-A3B without --draft-model works perfectly (~86 tok/s, no crash). The MoE model itself is fine; the bug is only in the speculative decoding path for this combination.
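Since the existing compatibility check passes, a stricter fail-fast check at startup might surface the problem before the first request. The following is a minimal sketch of what such a check could look like; check_speculative_roles and the two stand-in classes are hypothetical and are not part of mlx-vlm. It only tests the two conditions the traceback suggests: that the target and drafter are distinct objects and that the target exposes language_model.

```python
def check_speculative_roles(target, drafter):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if target is drafter:
        problems.append("target and drafter are the same object")
    if not hasattr(target, "language_model"):
        problems.append(
            f"target {type(target).__name__} has no attribute 'language_model'"
        )
    return problems


# Hypothetical stand-ins: one object with language_model, one without.
class StandInTarget:
    language_model = object()


class StandInDrafter:
    pass


ok = check_speculative_roles(StandInTarget(), StandInDrafter())
swapped = check_speculative_roles(StandInDrafter(), StandInTarget())
print(ok)
print(swapped)
```

Called right after the drafter is loaded, a non-empty result would turn the per-request 502 into a clear startup error.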

Workaround

Don't pass --draft-model to the 35B-A3B server. Speed drops from ~115 tok/s to ~86 tok/s, which is still faster than 27B+MTP thanks to MoE sparsity (3B active per token). I'd love to get the full ~115 tok/s back when this is fixed.
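For reference, the cost of the workaround can be worked out from the two approximate throughput figures above (the numbers are the rounded values from this report, not new measurements):

```python
with_mtp = 115.0     # approx. tok/s with the MTP drafter, as reported above
without_mtp = 86.0   # approx. tok/s without --draft-model

speedup = with_mtp / without_mtp                     # gain from MTP
drop_pct = 100 * (with_mtp - without_mtp) / with_mtp  # loss from the workaround

print(f"MTP speedup: {speedup:.2f}x")        # MTP speedup: 1.34x
print(f"Workaround cost: {drop_pct:.1f}%")   # Workaround cost: 25.2%
```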

Relevant code

  • mlx_vlm/server/generation.py:1118 — the failing access
  • mlx_vlm/server/generation.py:819 — where self.model = model is set
  • mlx_vlm/speculative/drafters/__init__.py:load_drafter — returns (load_model(path), resolved_kind)
  • mlx_vlm/speculative/drafters/qwen3_5_mtp/qwen3_5_mtp.py:Qwen3_5MTPDraftModel — the drafter class
  • The 35B-A3B target is Qwen3_5MoeForConditionalGeneration (per architectures in the local config.json); its .language_model attribute is the standard Qwen3_5MoeModel accessible via target.language_model.

Happy to provide more logs or run with extra debug instrumentation if helpful.
