Bug: MTP speculative decoding crashes on Qwen3.6-35B-A3B (Qwen3_5MTPDraftModel has no attribute 'language_model')
Environment
- mlx-vlm (latest from PyPI, installed via uv tool install mlx-vlm[ui])
- mlx-vlm version: 0.6.2 (per "INFO: server: mlx_vlm/0.6.2" in the startup log)
- Hardware: MacBook Pro M4 Max, 128GB unified memory, macOS 26.4.1
- Target model: mlx-community/Qwen3.6-35B-A3B-8bit (loaded from local path, MLX format)
- Draft model: mlx-community/Qwen3.6-35B-A3B-MTP-bf16
- Command:
  mlx_vlm.server --host 0.0.0.0 --port 8084 \
    --model /Users/huanglibo/.omlx/models/Qwen3.6-35B-A3B-8bit \
    --draft-model mlx-community/Qwen3.6-35B-A3B-MTP-bf16 \
    --draft-kind mtp --enable-thinking
Symptom
The server starts cleanly ("Drafter ready — speculative decoding enabled.") and the first request returns 200 OK. Subsequent requests (and sometimes the first one too) hang and eventually fail with a 502. The server log shows the same traceback on every failed request:
ERROR - Error in generation thread
Traceback (most recent call last):
  File ".../mlx_vlm/server/generation.py", line 1118, in _run
    self.model.language_model,
  File ".../mlx/nn/layers/base.py", line 103, in __getattr__
    super(Module, self).__getattribute__(key)
AttributeError: 'Qwen3_5MTPDraftModel' object has no attribute 'language_model'
At line 1118, self.model is the drafter (Qwen3_5MTPDraftModel) rather than the target model. Line 819 stores the target via self.model = model (verified by inspecting _initialize_model), so one of two things must be happening: either something reassigns self.model between _initialize_model returning and _run reading self.model.language_model, or the model passed in at line 819 is already the wrapped drafter.
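To tell those two cases apart, one option is to record every assignment to the attribute together with its caller. The following is only a rough sketch of that idea in plain Python, not a patch against mlx-vlm: TracedModelHolder, FakeTarget, FakeDrafter, initialize and suspicious_step are all hypothetical stand-ins that I made up for illustration.

```python
import traceback


class TracedModelHolder:
    """Hypothetical stand-in for the server's generation object (not mlx-vlm code)."""

    def __init__(self):
        self.assignments = []  # (class name of assigned value, "function:line" of caller)
        self._model = None

    @property
    def model(self):
        return self._model

    @model.setter
    def model(self, value):
        # extract_stack(limit=2) returns [caller frame, this setter frame];
        # index 0 is therefore the code that performed the assignment.
        caller = traceback.extract_stack(limit=2)[0]
        self.assignments.append(
            (type(value).__name__, f"{caller.name}:{caller.lineno}")
        )
        self._model = value


class FakeTarget:
    """Stands in for the target model, which exposes language_model."""

    language_model = object()


class FakeDrafter:
    """Stands in for Qwen3_5MTPDraftModel, which has no language_model."""


def initialize(holder):
    holder.model = FakeTarget()


def suspicious_step(holder):
    # The kind of later reassignment this instrumentation is meant to catch.
    holder.model = FakeDrafter()


holder = TracedModelHolder()
initialize(holder)
suspicious_step(holder)
for cls_name, where in holder.assignments:
    print(cls_name, "assigned in", where)
```

If the real object showed only one assignment and it was already the drafter class, that would point at the value passed to _initialize_model; two assignments would point at a later overwrite.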
What I already tried
- Both Qwen3.6-35B-A3B-MTP-bf16 and Qwen3.6-35B-A3B-MTP-4bit (same Qwen3_5MTPDraftModel class): same crash.
- 27B target + 27B MTP drafter works fine (self.model.language_model resolves correctly). The bug is specific to the 35B-A3B MoE target + Qwen3_5MTPDraftModel drafter combination.
- validate_drafter_compatibility(target, draft_model, "mtp") does NOT raise. The hidden_size check passes, so the drafter is considered valid right up until the first generation call.
- Running 35B-A3B without --draft-model works perfectly (~86 tok/s, no crash). The MoE model itself is fine; the bug is only in the speculative decoding path for this combination.
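Since the existing validation passes and the failure only shows up inside the generation thread, a fail-fast check on the attribute that _run later reads would at least surface the problem at startup instead of as a 502. A minimal sketch of such a check follows; require_language_model and the two dummy classes are hypothetical names of mine, not existing mlx-vlm functions, and this would not fix the underlying bug.

```python
def require_language_model(model, role="target"):
    """Hypothetical helper: fail early if `model` lacks `language_model`."""
    if not hasattr(model, "language_model"):
        raise TypeError(
            f"{role} model {type(model).__name__} has no 'language_model' "
            "attribute; was the drafter stored in place of the target?"
        )
    return model.language_model


class DummyTarget:
    """Stand-in for a target model that exposes language_model."""

    language_model = "inner-language-model"


class DummyDrafter:
    """Stand-in for a drafter class without language_model."""


# The target passes the check and yields its inner model.
inner = require_language_model(DummyTarget())

# The drafter fails immediately with a descriptive error.
try:
    require_language_model(DummyDrafter())
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(inner)
print(error_message)
```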
Workaround
Don't pass --draft-model to the 35B-A3B server. Speed drops from ~115 tok/s to ~86 tok/s, which is still faster than 27B+MTP thanks to MoE sparsity (3B active per token). I'd love to get the full ~115 tok/s back when this is fixed.
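For a sense of scale, the cost of the workaround can be worked out from the two approximate throughput figures above (both are rough readings from my runs, not benchmarks):

```python
with_mtp = 115.0     # tok/s, approximate, with the MTP drafter (from this report)
without_mtp = 86.0   # tok/s, approximate, without --draft-model (from this report)

slowdown = 1 - without_mtp / with_mtp   # fraction of throughput lost by the workaround
speedup = with_mtp / without_mtp - 1    # fraction gained if MTP worked again

print(f"workaround costs {slowdown:.1%} of throughput")    # 25.2%
print(f"MTP would add {speedup:.1%} over the workaround")  # 33.7%
```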
Relevant code
- mlx_vlm/server/generation.py:1118 (the failing access)
- mlx_vlm/server/generation.py:819 (where self.model = model is set)
- mlx_vlm/speculative/drafters/__init__.py:load_drafter, which returns (load_model(path), resolved_kind)
- mlx_vlm/speculative/drafters/qwen3_5_mtp/qwen3_5_mtp.py:Qwen3_5MTPDraftModel (the drafter class)
- The 35B-A3B target is Qwen3_5MoeForConditionalGeneration (per architectures in the local config.json); its language_model attribute is the standard Qwen3_5MoeModel, accessible as target.language_model.
Happy to provide more logs or run with extra debug instrumentation if helpful.