Description:
When attempting to serve mlx-community/Qwen3-8B-4bit using vllm-mlx, the server crashes during initialization with a TypeError. The internal load_model_with_fallback function appears to fail silently when loading the model and returns None, which then causes a crash when the code attempts to unpack the model and tokenizer.
Steps to Reproduce:
- Install `vllm-mlx` (installed via `uv`).
- Run the following command:

```shell
vllm-mlx serve mlx-community/Qwen3-8B-4bit
```

Expected Behavior:
The model should load successfully and start the server, or vllm-mlx should throw a descriptive error explaining why the model configuration is incompatible.
Actual Behavior:
The model downloads completely, but the server crashes with a NoneType unpacking error.
Traceback / Logs:
```
...
INFO:vllm_mlx.server:Loading model with SimpleEngine: mlx-community/Qwen3-8B-4bit
INFO:vllm_mlx.models.llm:Loading model: mlx-community/Qwen3-8B-4bit
INFO:vllm_mlx.models.llm:Qwen3 detected: setting eos_token to <|im_end|>
...
Download complete: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 4.61G/4.61G [09:46<00:00, 7.85MB/s]
ERROR:vllm_mlx.models.llm:Failed to load model: cannot unpack non-iterable NoneType object
Traceback (most recent call last):
...
  File "/Users/gavin/.local/share/uv/tools/vllm-mlx/lib/python3.12/site-packages/vllm_mlx/engine/simple.py", line 152, in start
    self._model.load()
  File "/Users/gavin/.local/share/uv/tools/vllm-mlx/lib/python3.12/site-packages/vllm_mlx/models/llm.py", line 92, in load
    self.model, self.tokenizer = load_model_with_fallback(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: cannot unpack non-iterable NoneType object
```
Isolation Testing / Additional Context:
I verified that the issue is strictly isolated to vllm-mlx. Loading the exact same model directly with the core mlx-lm library works perfectly:
```python
from mlx_lm import load

model, tokenizer = load("mlx-community/Qwen3-8B-4bit")
# Success! The core MLX library loads this model without issue.
```

Because mlx-lm works, the model files are not corrupted. The issue seems to be in how vllm-mlx handles Qwen3 architectures specifically, causing load_model_with_fallback to swallow the underlying error and return None.
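I have not traced the vllm-mlx internals, so the following is a hypothetical sketch of the kind of control flow that would produce exactly this symptom; the body of `load_model_with_fallback` and its loader list are assumptions, not the real source:

```python
from mlx_lm import load

def load_model_with_fallback(model_name, loaders=(load,)):
    # Hypothetical reconstruction, NOT the real vllm-mlx source.
    # Each loader is tried in turn; every failure is swallowed.
    for loader in loaders:
        try:
            return loader(model_name)  # (model, tokenizer) on success
        except Exception:
            continue  # the underlying Qwen3 error is discarded here
    # All loaders failed: control falls off the end, so the call
    # implicitly returns None, and the caller's
    # `self.model, self.tokenizer = load_model_with_fallback(...)`
    # fails with the opaque unpacking TypeError seen above.
```

If the real function is anywhere close to this shape, re-raising the last captured exception (or raising an explicit `RuntimeError` naming the model) when every loader fails would turn this crash into the descriptive error requested above.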
Environment:
- OS: macOS 15.7.2
- Installation Method: `uv tool`
- Python Version: 3.12.8
- vllm-mlx Version: v0.2.6