0.6.x, even when installed from the latest main branch source: pure-text RAG with Qwen3.6 35B A3B mxfp4 still cannot use APC at full speed in prefill #1320

@NeoInBJ

Describe the issue
After upgrading mlx-vlm from 0.5 to 0.6, APC (prefix cache) reuse dropped to a very low level on Qwen3.6 models.

0.5.x: APC works perfectly, with a high hit rate and stable speed (prompt_tps > 23000).

0.6.0: APC seems to work only for the first stored apc_cache_block. The reported hit rate is 1.0, but matched_tokens is only 258 (my apc_block_size is set to 256).

DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx: APC works fine in 0.6.0, 0.6.2, and the latest main branch source.

Only Qwen3.6 is affected.

Environment
mlx-vlm: 0.6.0, 0.6.1, 0.6.2, and the latest source from the main branch
Model: Qwen3.6 (Qwen3.6-35B-A3B-mxfp4, Qwen3.6-35B-A3B-4bit, and others I have)
Device: Apple Silicon M2 Max, 64 GB memory

Key APC stats (Qwen3.6)
First query:
matched_tokens: 100

Second query (same prefix):
matched_tokens: 358
token_hit_rate: 1.0

Third query (same prefix):
matched_tokens: 616
token_hit_rate: 1.0

DeepSeek with the same prompt and the same parameters:
First round:
matched_tokens: 0
token_hit_rate: 0.0
Second round:
matched_tokens: 2048
token_hit_rate: 0.33
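
To make the pattern in these numbers easier to see, here is a small arithmetic check in Python. It only restates the figures reported above against the apc_block_size of 256 and says nothing about the cause. The variable names are made up for illustration.

```python
# Figures copied from the stats above; names are illustrative only.
APC_BLOCK_SIZE = 256

qwen_matched = [100, 358, 616]   # matched_tokens for queries 1, 2, 3
deepseek_matched_round2 = 2048   # matched_tokens for the second DeepSeek round

# How many additional tokens are matched from one Qwen3.6 query to the next.
qwen_deltas = [b - a for a, b in zip(qwen_matched, qwen_matched[1:])]

# Express the gain in units of one cache block.
qwen_delta_in_blocks = [d / APC_BLOCK_SIZE for d in qwen_deltas]

# DeepSeek's match as a whole number of blocks, plus any leftover tokens.
deepseek_blocks, deepseek_remainder = divmod(deepseek_matched_round2, APC_BLOCK_SIZE)

print("Qwen3.6 gain per query (tokens):", qwen_deltas)
print("Qwen3.6 gain per query (blocks):", qwen_delta_in_blocks)
print("DeepSeek blocks matched:", deepseek_blocks, "remainder:", deepseek_remainder)
```

In other words, Qwen3.6 gains 258 tokens per query (one block plus 2 tokens), while DeepSeek matches exactly 8 full blocks on its second round.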

0.5 used flexible block splitting and worked correctly.
0.6.x uses strict block padding, which breaks Qwen3.6 prefix matching.
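
For readers unfamiliar with block-based prefix caching, below is a minimal generic sketch of how matching in whole blocks usually works. It is not mlx-vlm's code: the function names, the SHA-256 chaining, and the set-based cache are all assumptions chosen for illustration. The point it shows is that a lookup stops at the first block whose hash is not in the cache, so matched tokens always come in multiples of the block size.

```python
import hashlib


def block_hashes(tokens, block_size):
    """Hash each full block, chaining in the previous block's hash.

    Hypothetical helper, not an mlx-vlm API. A trailing partial block
    is ignored, so only whole blocks can ever be cached or matched.
    """
    hashes = []
    parent = b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest  # a block's identity depends on everything before it
    return hashes


def count_matched_tokens(cache, tokens, block_size):
    """Count prefix tokens covered by consecutive cached blocks."""
    matched = 0
    for digest in block_hashes(tokens, block_size):
        if digest not in cache:
            break  # the first miss ends the reusable prefix
        matched += block_size
    return matched


# Usage: cache a 1000-token prompt, then query with related prompts.
BLOCK_SIZE = 256
stored_prompt = list(range(1000))
cache = set(block_hashes(stored_prompt, BLOCK_SIZE))

same_prompt = list(range(1000))
diverges_at_300 = list(range(300)) + [9999] * 700

print(count_matched_tokens(cache, same_prompt, BLOCK_SIZE))
print(count_matched_tokens(cache, diverges_at_300, BLOCK_SIZE))
```

With this kind of scheme, an identical 1000-token prompt reuses its three full blocks, and a prompt that diverges inside the second block reuses only the first one.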

I was told this has been fixed, but I still see the problem. Are there any new parameters I need to set?
Please fix APC compatibility for Qwen3.6.
Thanks!
