vllm-metal currently focuses on text-only language models on Apple Silicon.

## Text-Only Language Models

`Automatic Prefix Cache` describes the default behavior when the user does not
pass `--enable-prefix-caching`. After
[#283](https://github.com/vllm-project/vllm-metal/pull/283), unified paged-KV
models on Metal can reuse shared prefixes by default. Upstream vLLM still keeps
the default off for hybrid/Mamba models, so those rows remain `❌` unless prefix
caching is explicitly forced. These values describe the default engine behavior,
not exhaustive model-by-model benchmarking on Metal. Qwen3 is explicitly covered
by the paged prefix-cache e2e test.

| Model | Support | Attention Kernel | Automatic Prefix Cache | PRs | Notes |
| --- | --- | --- | --- | --- | --- |
| Qwen3 | ✅ | GQA (paged) | ✅ | [#232](https://github.com/vllm-project/vllm-metal/pull/232), [#237](https://github.com/vllm-project/vllm-metal/pull/237), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Validated by the paged prefix-cache e2e test |
| Qwen3.5 | ✅ | Hybrid SDPA + GDN linear | ❌ | [#210](https://github.com/vllm-project/vllm-metal/pull/210), [#226](https://github.com/vllm-project/vllm-metal/pull/226), [#230](https://github.com/vllm-project/vllm-metal/pull/230), [#235](https://github.com/vllm-project/vllm-metal/pull/235), [#239](https://github.com/vllm-project/vllm-metal/pull/239), [#243](https://github.com/vllm-project/vllm-metal/pull/243), [#259](https://github.com/vllm-project/vllm-metal/pull/259), [#265](https://github.com/vllm-project/vllm-metal/pull/265), [#194](https://github.com/vllm-project/vllm-metal/issues/194) | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
| Qwen3.6 | ✅ | Hybrid SDPA + GDN linear (MoE) | ❌ | | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
| Qwen3-Next | ✅ | Hybrid SDPA + GDN linear | ❌ | [#240](https://github.com/vllm-project/vllm-metal/pull/240) | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
| Gemma 4 | 🔵 | GQA + per-layer sliding window + YOCO | ✅ | [#251](https://github.com/vllm-project/vllm-metal/pull/251), [#260](https://github.com/vllm-project/vllm-metal/pull/260), [#269](https://github.com/vllm-project/vllm-metal/pull/269), [#275](https://github.com/vllm-project/vllm-metal/pull/275), [#277](https://github.com/vllm-project/vllm-metal/pull/277), [#278](https://github.com/vllm-project/vllm-metal/pull/278), [#282](https://github.com/vllm-project/vllm-metal/pull/282), [#276](https://github.com/vllm-project/vllm-metal/issues/276), [#279](https://github.com/vllm-project/vllm-metal/pull/279), [#281](https://github.com/vllm-project/vllm-metal/issues/281), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on for non-hybrid paged models; overall model support remains experimental |
| Gemma 3 | 🟡 | GQA (paged) | ✅ | [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on by upstream policy; model support not separately verified on Metal |
| Llama 3 | 🟡 | GQA (paged) | ✅ | [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on by upstream policy; model support not separately verified on Metal |
| Mistral-Small-24B | 🔵 | GQA (paged) | ✅ | [#166](https://github.com/vllm-project/vllm-metal/pull/166), [#190](https://github.com/vllm-project/vllm-metal/pull/190), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on for non-hybrid paged models |
| GPT-OSS | 🔵 | Sink attention (paged) | ✅ | [#190](https://github.com/vllm-project/vllm-metal/pull/190), [#221](https://github.com/vllm-project/vllm-metal/pull/221), [#212](https://github.com/vllm-project/vllm-metal/issues/212), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on for non-hybrid paged models |
| GLM-4.5 | 🟡 | MLA (paged latent cache, MLX SDPA — no Metal kernel) | 🟡 | [#213](https://github.com/vllm-project/vllm-metal/pull/213), [#233](https://github.com/vllm-project/vllm-metal/pull/233) | Automatic prefix caching is not yet verified on the MLX MLA path |
| GLM-4.7-Flash | 🔵 | GQA (paged) | ✅ | [#190](https://github.com/vllm-project/vllm-metal/pull/190), [#283](https://github.com/vllm-project/vllm-metal/pull/283) | Default-on for non-hybrid paged models |
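
Assuming vllm-metal exposes the standard vLLM engine flags through the usual `vllm serve` entrypoint (an assumption; the model name below is only a placeholder), overriding the defaults in the table above might look like this sketch:

```shell
# Default behavior: paged-KV models reuse shared prefixes automatically,
# hybrid/Mamba models do not. Model name is a placeholder.
vllm serve Qwen/Qwen3-8B

# Explicitly force automatic prefix caching on, e.g. to experiment with a
# hybrid/Mamba model where upstream vLLM keeps the default off:
vllm serve Qwen/Qwen3-8B --enable-prefix-caching

# Explicitly disable it on a paged-KV model:
vllm serve Qwen/Qwen3-8B --no-enable-prefix-caching
```

The `--no-enable-prefix-caching` spelling follows upstream vLLM's boolean-flag convention; check `vllm serve --help` in your installed version to confirm the exact flag names.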