mlx-vlm 0.6.x (including the latest main branch source): APC still does not reach full prefill speed for text-only RAG with Qwen3.6-35B-A3B-mxfp4

Describe the issue
After upgrading mlx-vlm from 0.5 to 0.6, the APC cache hit rate dropped sharply on Qwen3.6 models.

0.5.x: APC works perfectly, with a high hit rate and stable speed (prompt_tps > 23000).

0.6.0: APC seems to work only for the first stored apc_cache_block. The hit rate displays 1.0, but matched_tokens is only 258 (my apc_block_size is set to 256).

DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx: APC works fine in 0.6.0, 0.6.2, and the latest main branch source.

Only Qwen3.6 is affected.
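
For readers unfamiliar with block-based prefix caching, here is a minimal sketch of the general technique, to show what I would expect matched_tokens to look like when two prompts share a long prefix. This is not mlx-vlm's actual implementation: the function names `block_hashes` and `matched_tokens`, the chained SHA-256 scheme, and the toy token IDs are all assumptions made for illustration.

```python
import hashlib
from typing import List, Set


def block_hashes(tokens: List[int], block_size: int) -> List[str]:
    """Return one chained hash per FULL block; a partial tail block is ignored."""
    hashes = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        # Chain in the previous hash so a block only matches if its whole prefix matches.
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


def matched_tokens(tokens: List[int], cached: Set[str], block_size: int) -> int:
    """Count tokens covered by leading blocks whose chained hash is already cached."""
    matched = 0
    for h in block_hashes(tokens, block_size):
        if h not in cached:
            break  # the first miss ends the reusable prefix
        matched += block_size
    return matched


BLOCK = 256  # same value as my apc_block_size
shared_prefix = list(range(2048))          # stand-in for a shared RAG context
first = shared_prefix + [9001, 9002, 9003]  # first query
second = shared_prefix + [7001, 7002]       # second query, same prefix

cache = set(block_hashes(first, BLOCK))
print(matched_tokens(second, cache, BLOCK))  # 2048
```

Under this scheme, a second query with the same 2048-token prefix reuses all 8 full blocks rather than stopping after the first one.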

Environment
mlx-vlm: 0.6.0, 0.6.1, 0.6.2, and the latest source code from the main branch
Model: Qwen3.6 (Qwen3.6-35B-A3B-mxfp4, Qwen3.6-35B-A3B-4bit, and others I have)
Device: Apple Silicon M2 Max, 64 GB memory

Key APC stats (Qwen3.6)
First query:
matched_tokens: 100

Second query (same prefix):
matched_tokens: 358
token_hit_rate: 1.0

Third query (same prefix):
matched_tokens: 616
token_hit_rate: 1.0

DeepSeek with same prompt, same params:
first round:
matched_tokens: 0
token_hit_rate: 0.0
second round:
matched_tokens: 2048
token_hit_rate: 0.33
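
A quick check on the Qwen3.6 numbers above, assuming the reported matched_tokens value is a cumulative counter across queries (this is my assumption, I have not confirmed it in the source). The helper name `per_query_deltas` is made up for this sketch.

```python
from typing import List


def per_query_deltas(cumulative: List[int]) -> List[int]:
    """Difference between consecutive readings of a cumulative counter."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


BLOCK_SIZE = 256                   # my apc_block_size
qwen_matched = [100, 358, 616]     # matched_tokens after queries 1, 2, 3

deltas = per_query_deltas(qwen_matched)
print(deltas)                              # [258, 258]
print([d // BLOCK_SIZE for d in deltas])   # [1, 1]
```

If the counter really is cumulative, each repeated query matches 258 tokens, which covers only one full 256-token block, in line with the 258 figure mentioned in the description.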

0.5 used flexible block splitting and worked correctly.
In 0.6.x, strict block padding breaks Qwen3.6 prefix matching.
 
I was told this was fixed, but I still see the problem. Are there any new parameters I need to set?
Please fix APC compatibility for Qwen3.6.
Thanks!

Issue #1320