Commit 57cc574

[Docs] Clarify automatic prefix cache support matrix (#287)

Follow-up for #284 after #283.

Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>

1 parent: 7df0e53

1 file changed: docs/supported_models.md (20 additions, 11 deletions)

```diff
@@ -13,16 +13,25 @@ vllm-metal currently focuses on text-only language models on Apple Silicon. Mult
 
 ## Text-Only Language Models
 
+`Automatic Prefix Cache` describes the default behavior when the user does
+not pass `--enable-prefix-caching`. After
+[#283](https://github.com/vllm-project/vllm-metal/pull/283), unified paged-KV
+models on Metal can reuse shared prefixes by default. Upstream vLLM still
+keeps the default off for hybrid/Mamba models, so those rows remain ``
+unless prefix caching is explicitly forced. These values describe the
+default engine behavior, not exhaustive model-by-model benchmarking on
+Metal. Qwen3 is explicitly covered by the paged prefix-cache e2e test.
+
 | Model | Support | Attention Kernel | Automatic Prefix Cache | PRs | Notes |
 | --- | --- | --- | --- | --- | --- |
-| Qwen3 || GQA (paged) | | [#232](https://github.com/vllm-project/vllm-metal/pull/232), [#237](https://github.com/vllm-project/vllm-metal/pull/237) | |
-| Qwen3.5 || Hybrid SDPA + GDN linear | | [#210](https://github.com/vllm-project/vllm-metal/pull/210), [#226](https://github.com/vllm-project/vllm-metal/pull/226), [#230](https://github.com/vllm-project/vllm-metal/pull/230), [#235](https://github.com/vllm-project/vllm-metal/pull/235), [#239](https://github.com/vllm-project/vllm-metal/pull/239), [#243](https://github.com/vllm-project/vllm-metal/pull/243), [#259](https://github.com/vllm-project/vllm-metal/pull/259), [#265](https://github.com/vllm-project/vllm-metal/pull/265), [#194](https://github.com/vllm-project/vllm-metal/issues/194) | |
-| Qwen3.6 || Hybrid SDPA + GDN linear (MoE) | | | |
-| Qwen3-Next || Hybrid SDPA + GDN linear | | [#240](https://github.com/vllm-project/vllm-metal/pull/240) | |
-| Gemma 4 | 🔵 | GQA + per-layer sliding window + YOCO | | [#251](https://github.com/vllm-project/vllm-metal/pull/251), [#260](https://github.com/vllm-project/vllm-metal/pull/260), [#269](https://github.com/vllm-project/vllm-metal/pull/269), [#275](https://github.com/vllm-project/vllm-metal/pull/275), [#277](https://github.com/vllm-project/vllm-metal/pull/277), [#278](https://github.com/vllm-project/vllm-metal/pull/278), [#282](https://github.com/vllm-project/vllm-metal/pull/282), [#276](https://github.com/vllm-project/vllm-metal/issues/276), [#279](https://github.com/vllm-project/vllm-metal/pull/279), [#281](https://github.com/vllm-project/vllm-metal/issues/281) | |
-| Gemma 3 | 🟡 | GQA (paged) | | | |
-| Llama 3 | 🟡 | GQA (paged) | | | |
-| Mistral-Small-24B | 🔵 | GQA (paged) | | [#166](https://github.com/vllm-project/vllm-metal/pull/166), [#190](https://github.com/vllm-project/vllm-metal/pull/190) | |
-| GPT-OSS | 🔵 | Sink attention (paged) | | [#190](https://github.com/vllm-project/vllm-metal/pull/190), [#221](https://github.com/vllm-project/vllm-metal/pull/221), [#212](https://github.com/vllm-project/vllm-metal/issues/212) | |
-| GLM-4.5 | 🟡 | MLA (paged latent cache, MLX SDPA — no Metal kernel) | | [#213](https://github.com/vllm-project/vllm-metal/pull/213), [#233](https://github.com/vllm-project/vllm-metal/pull/233) | |
-| GLM-4.7-Flash | 🔵 | GQA (paged) | | [#190](https://github.com/vllm-project/vllm-metal/pull/190) | |
+| Qwen3 || GQA (paged) | | [#232](https://github.com/vllm-project/vllm-metal/pull/232), [#237](https://github.com/vllm-project/vllm-metal/pull/237), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Validated by the paged prefix-cache e2e test |
+| Qwen3.5 || Hybrid SDPA + GDN linear | | [#210](https://github.com/vllm-project/vllm-metal/pull/210), [#226](https://github.com/vllm-project/vllm-metal/pull/226), [#230](https://github.com/vllm-project/vllm-metal/pull/230), [#235](https://github.com/vllm-project/vllm-metal/pull/235), [#239](https://github.com/vllm-project/vllm-metal/pull/239), [#243](https://github.com/vllm-project/vllm-metal/pull/243), [#259](https://github.com/vllm-project/vllm-metal/pull/259), [#265](https://github.com/vllm-project/vllm-metal/pull/265), [#194](https://github.com/vllm-project/vllm-metal/issues/194) | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
+| Qwen3.6 || Hybrid SDPA + GDN linear (MoE) | | | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
+| Qwen3-Next || Hybrid SDPA + GDN linear | | [#240](https://github.com/vllm-project/vllm-metal/pull/240) | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
+| Gemma 4 | 🔵 | GQA + per-layer sliding window + YOCO | | [#251](https://github.com/vllm-project/vllm-metal/pull/251), [#260](https://github.com/vllm-project/vllm-metal/pull/260), [#269](https://github.com/vllm-project/vllm-metal/pull/269), [#275](https://github.com/vllm-project/vllm-metal/pull/275), [#277](https://github.com/vllm-project/vllm-metal/pull/277), [#278](https://github.com/vllm-project/vllm-metal/pull/278), [#282](https://github.com/vllm-project/vllm-metal/pull/282), [#276](https://github.com/vllm-project/vllm-metal/issues/276), [#279](https://github.com/vllm-project/vllm-metal/pull/279), [#281](https://github.com/vllm-project/vllm-metal/issues/281), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on for non-hybrid paged models; overall model support remains experimental |
+| Gemma 3 | 🟡 | GQA (paged) | | [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on by upstream policy; model support not separately verified on Metal |
+| Llama 3 | 🟡 | GQA (paged) | | [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on by upstream policy; model support not separately verified on Metal |
+| Mistral-Small-24B | 🔵 | GQA (paged) | | [#166](https://github.com/vllm-project/vllm-metal/pull/166), [#190](https://github.com/vllm-project/vllm-metal/pull/190), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on for non-hybrid paged models |
+| GPT-OSS | 🔵 | Sink attention (paged) | | [#190](https://github.com/vllm-project/vllm-metal/pull/190), [#221](https://github.com/vllm-project/vllm-metal/pull/221), [#212](https://github.com/vllm-project/vllm-metal/issues/212), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on for non-hybrid paged models |
+| GLM-4.5 | 🟡 | MLA (paged latent cache, MLX SDPA — no Metal kernel) | 🟡 | [#213](https://github.com/vllm-project/vllm-metal/pull/213), [#233](https://github.com/vllm-project/vllm-metal/pull/233) | Automatic prefix caching is not yet verified on the MLX MLA path |
+| GLM-4.7-Flash | 🔵 | GQA (paged) | | [#190](https://github.com/vllm-project/vllm-metal/pull/190), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on for non-hybrid paged models |
```
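The `Automatic Prefix Cache` column refers to block-level KV-cache reuse: the engine hashes fixed-size token blocks so that requests sharing a prompt prefix reuse already-computed KV blocks instead of recomputing them. The toy sketch below illustrates only that general idea; the block size, names, and chain-hashing scheme are invented for illustration and are not vllm-metal's (or upstream vLLM's) actual implementation.

```python
# Toy model of automatic prefix caching: chain-hash full token blocks so
# that identical prefixes map to identical block IDs, then count how many
# leading blocks a new request can serve from cache. Illustration only;
# BLOCK_SIZE and all names are invented for this sketch.
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (real engines typically use 16+)

def block_hashes(token_ids):
    """Hash each full block, chaining in the previous block's hash so a
    block ID encodes the entire prefix up to and including that block."""
    hashes, parent = [], b""
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        h = hashlib.sha256(parent + str(block).encode()).hexdigest()
        hashes.append(h)
        parent = h.encode()
    return hashes

cache = {}  # block hash -> (stand-in for a cached KV block)

def prefill(token_ids):
    """Return how many leading blocks were reused from the prefix cache;
    compute and store KV for the rest."""
    hits, counting = 0, True
    for h in block_hashes(token_ids):
        if counting and h in cache:
            hits += 1          # shared-prefix block: reuse cached KV
        else:
            counting = False   # prefixes diverged; stop counting hits
            cache[h] = object()  # stand-in for computing this block's KV
    return hits

# Cold cache: nothing to reuse. Second request shares an 8-token prefix
# with the first, so its two leading full blocks hit the cache.
hits_first = prefill(list(range(10)))
hits_second = prefill(list(range(8)) + [99, 100])
```

The chained hash is what makes reuse safe: a block is only shared when every token before it also matches, which is why only a *prefix* can be reused, not an arbitrary matching span in the middle of a prompt.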
