Describe the issue
After upgrading mlx-vlm from 0.5.x to 0.6.x, the APC hit rate dropped sharply on Qwen3.6 models.
0.5.x: APC works perfectly, with a high hit rate and stable speed (prompt_tps > 23000).
0.6.0: APC seems to work only for the first stored apc_cache_block. The hit rate is reported as 1.0, but matched_tokens is only 258 (my apc_block_size is set to 256).
DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx: APC works fine in 0.6.0, 0.6.2, and the latest source from the main branch.
Only Qwen3.6 is affected.
Environment
mlx-vlm: 0.6.0, 0.6.1, 0.6.2, and the latest source from the main branch
Model: Qwen3.6 (Qwen3.6-35B-A3B-mxfp4, Qwen3.6-35B-A3B-4bit, and the other variants I have)
Device: Apple Silicon M2 Max, 64 GB memory
Key APC stats (Qwen3.6)
First query:
matched_tokens: 100
Second query (same prefix):
matched_tokens: 358
token_hit_rate: 1.0
Third query (same prefix):
matched_tokens: 616
token_hit_rate: 1.0
DeepSeek with same prompt, same params:
first round:
matched_tokens: 0
token_hit_rate: 0.0
second round:
matched_tokens: 2048
token_hit_rate: 0.33
0.5 used flexible block splitting and worked correctly.
The strict block padding in 0.6.x breaks Qwen3.6 prefix matching.
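To make the expected behavior concrete, here is a minimal sketch of block-aligned prefix matching. It is not mlx-vlm's code: `block_hashes`, `PrefixCache`, and their methods are hypothetical names I made up for illustration, and the chained-hash scheme is only my assumption about how a block-based prefix cache typically works. The point is that when the same prefix is sent again, I would expect every full block of it to match, not just the first one.

```python
import hashlib
from typing import List, Sequence, Set


def block_hashes(tokens: Sequence[int], block_size: int) -> List[bytes]:
    """Hash each full block, chaining in the previous hash so that a block's
    key depends on every token that came before it."""
    hashes: List[bytes] = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Hypothetical block-level prefix cache (illustration only)."""

    def __init__(self, block_size: int) -> None:
        self.block_size = block_size
        self.store: Set[bytes] = set()

    def insert(self, tokens: Sequence[int]) -> None:
        # Remember every full block of this prompt.
        self.store.update(block_hashes(tokens, self.block_size))

    def matched_tokens(self, tokens: Sequence[int]) -> int:
        # Count tokens covered by the longest run of cached leading blocks.
        matched = 0
        for h in block_hashes(tokens, self.block_size):
            if h not in self.store:
                break
            matched += self.block_size
        return matched


cache = PrefixCache(block_size=256)
prompt = list(range(1000))

print(cache.matched_tokens(prompt))  # 0: nothing cached yet
cache.insert(prompt)
print(cache.matched_tokens(prompt))  # 768: all three full blocks match

# A prompt that diverges after token 300 should still reuse the first block.
diverged = prompt[:300] + [9999] * 700
print(cache.matched_tokens(diverged))  # 256
```

With a 256-token block size, a repeated prefix of 1000 tokens matches 768 tokens in this sketch. What I see with Qwen3.6 on 0.6.x looks more like only about one block being reused per query.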
I was told this has been fixed, but I still see the problem. Is there a new parameter I need to set?
Please fix APC compatibility for Qwen3.6.
Thanks!