## Problem

`--continuous-batching` + `--mllm` fails at runtime. The MLLM path creates `ArraysCache` (from mlx_vlm), but `MLLMBatchGenerator._run_prefill_batch()` requires `KVCache` (from mlx_lm) for `KVCache.merge()`.

Error:

```
ERROR:vllm_mlx.mllm_scheduler:Error in MLLM process loop: MLLM continuous batching requires standard KVCache but got ArraysCache. Disable --kv-cache-quantization when using multimodal models with --continuous-batching.
```
Location: `mllm_batch_generator.py:675`

```python
from mlx_lm.models.cache import KVCache

sample_cache = per_request_caches[0][0]
if not isinstance(sample_cache, KVCache):
    raise ValueError(
        f"MLLM continuous batching requires standard KVCache but got "
        f"{type(sample_cache).__name__}. ..."
    )
```

## Context
- Model: Qwen3.5-35B-A3B-8bit (has a vision tower, requires `--mllm` to load)
- Without `--mllm`: the model fails to load (`language_model.vision_tower.*` weights rejected by mlx_lm)
- Without `--continuous-batching`: concurrent requests crash the Metal GPU (`_MTLCommandBuffer` assertion failure)
- vllm-mlx version: 0.2.5 (git main as of 2026-03-14)
## Impact

Any multimodal model (or any model carrying vision weights, like Qwen3.5) cannot use continuous batching. This means:
- Concurrent requests crash the server (Metal GPU assertion)
- Multi-user scenarios are broken for MLLM models
- Agents/tools that send parallel requests kill the server
## Suggested Fix

Support `ArraysCache` in `MLLMBatchGenerator._run_prefill_batch()`, by one of:
- Adding an `ArraysCache.merge()` method analogous to `KVCache.merge()`
- Converting `ArraysCache` → `KVCache` before the merge
- Implementing a separate batch-merge path for `ArraysCache`
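The second option (convert, then reuse the existing merge path) could look roughly like the sketch below. The stand-in classes only mimic the *assumed* shapes of mlx_lm's `KVCache` (`keys`/`values`/`offset` attributes) and mlx_vlm's `ArraysCache` (a list holding the key and value tensors); the real attribute names in both libraries may differ, and `arrays_cache_to_kv_cache` is a hypothetical helper, not an existing API.

```python
# Hypothetical sketch: convert an ArraysCache to a KVCache before merging.
# The classes below are minimal stand-ins for mlx_lm.models.cache.KVCache
# and mlx_vlm's ArraysCache; the real implementations differ in detail.

class KVCache:
    """Stand-in for mlx_lm's KVCache (assumed keys/values/offset layout)."""
    def __init__(self):
        self.keys = None
        self.values = None
        self.offset = 0


class ArraysCache:
    """Stand-in for mlx_vlm's ArraysCache; assumed to hold [keys, values]."""
    def __init__(self, keys, values, offset):
        self.cache = [keys, values]
        self.offset = offset


def arrays_cache_to_kv_cache(ac: ArraysCache) -> KVCache:
    """Copy an ArraysCache's key/value tensors into a fresh KVCache so the
    existing KVCache.merge() path in _run_prefill_batch() can be reused."""
    kv = KVCache()
    kv.keys, kv.values = ac.cache[0], ac.cache[1]
    kv.offset = ac.offset
    return kv
```

In `_run_prefill_batch()`, each per-request cache entry would be passed through such a converter before the `isinstance(sample_cache, KVCache)` check at `mllm_batch_generator.py:675`, leaving the merge logic itself untouched. Whether this is cheaper than giving `ArraysCache` its own `merge()` depends on whether the conversion can share storage rather than copy.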
## Environment

- macOS 15.5, Mac Studio M2 Ultra (128 GB)
- MLX, mlx_vlm, mlx_lm (latest)
- Python 3.12