vllm-metal currently focuses on text-only language models on Apple Silicon. Multi-modal (vision / audio input) models are not yet supported.
| Symbol | Meaning |
|---|---|
| ✅ | Supported model/feature |
| 🔵 | Experimental supported model/feature |
| ❌ | Not supported model/feature |
| 🟡 | Not tested or verified |
Automatic Prefix Cache describes the default behavior when the user does
not pass --enable-prefix-caching. After
#283, unified paged-KV
models on Metal can reuse shared prefixes by default. Upstream vLLM still
keeps the default off for hybrid/Mamba models, so those rows remain ❌
unless prefix caching is explicitly forced. These values describe the
default engine behavior, not exhaustive model-by-model benchmarking on
Metal. Qwen3 is explicitly covered by the paged prefix-cache e2e test.
| Model | Support | Attention Kernel | Automatic Prefix Cache | PRs | Notes |
|---|---|---|---|---|---|
| Qwen3 | ✅ | GQA (paged) | ✅ | #232, #237, #283 | Validated by the paged prefix-cache e2e test |
| Qwen3.5 | ✅ | Hybrid SDPA + GDN linear | ❌ | #210, #226, #230, #235, #239, #243, #259, #265, #194 | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
| Qwen3.6 | ✅ | Hybrid SDPA + GDN linear (MoE) | ❌ | Upstream keeps automatic prefix caching off for hybrid/Mamba models | |
| Qwen3-Next | ✅ | Hybrid SDPA + GDN linear | ❌ | #240 | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
| Gemma 4 | 🔵 | GQA + per-layer sliding window + YOCO | ✅ | #251, #260, #269, #275, #277, #278, #282, #276, #279, #281, #283 | Default-on for non-hybrid paged models; overall model support remains experimental |
| Gemma 3 | 🟡 | GQA (paged) | ✅ | #283 | Default-on by upstream policy; model support not separately verified on Metal |
| Llama 3 | ✅ | GQA (paged) | ✅ | #294 | tested on llama3.2-1B |
| Mistral-Small-24B | 🔵 | GQA (paged) | ✅ | #166, #190, #283 | Default-on for non-hybrid paged models |
| GPT-OSS | 🔵 | Sink attention (paged) | ✅ | #190, #221, #212, #283 | Default-on for non-hybrid paged models |
| GLM-4.5 | 🟡 | MLA (paged latent cache, MLX SDPA — no Metal kernel) | 🟡 | #213, #233 | Automatic prefix caching is not yet verified on the MLX MLA path |
| GLM-4.7-Flash | 🔵 | GQA (paged) | ✅ | #190, #283 | Default-on for non-hybrid paged models |