# Supported Models

vllm-metal currently focuses on text-only language models on Apple Silicon. Multi-modal models (vision or audio input) are not yet supported.

## Legend

| Symbol | Meaning |
| --- | --- |
| ✅ | Supported model/feature |
| 🔵 | Experimental supported model/feature |
| ❌ | Not supported model/feature |
| 🟡 | Not tested or verified |

## Text-Only Language Models

**Automatic Prefix Cache** describes the default behavior when the user does not pass `--enable-prefix-caching`. After #283, unified paged-KV models on Metal can reuse shared prefixes by default. Upstream vLLM still keeps the default off for hybrid/Mamba models, so those rows remain ❌ unless prefix caching is explicitly forced. These values describe the default engine behavior, not exhaustive model-by-model benchmarking on Metal. Qwen3 is explicitly covered by the paged prefix-cache e2e test.
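As background, paged prefix caching works by keying KV blocks with a hash chained over the token prefix they cover, so requests that share a prompt prefix map to the same leading block keys and can reuse already-computed KV. The stdlib-only sketch below illustrates that lookup; the block size, hashing scheme, and function names are illustrative, not vllm-metal's actual implementation:

```python
# Illustrative sketch of block-level prefix caching, NOT vllm-metal's code.
# Each KV block is keyed by a hash chained over all tokens up to the end of
# that block, so two prompts sharing a prefix produce identical leading keys.

BLOCK_SIZE = 4  # illustrative; production engines typically use larger blocks


def block_keys(tokens: list[int]) -> list[int]:
    """Return one cache key per full block, chained over the whole prefix."""
    keys: list[int] = []
    prev = None
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore the partial tail
    for start in range(0, full, BLOCK_SIZE):
        block = tuple(tokens[start:start + BLOCK_SIZE])
        prev = hash((prev, block))  # chaining makes a key cover its prefix
        keys.append(prev)
    return keys


def reusable_blocks(cache: set[int], tokens: list[int]) -> int:
    """Count blocks whose KV is already cached; register the new ones."""
    hits = 0
    for key in block_keys(tokens):
        if key in cache:
            hits += 1
        else:
            cache.add(key)
    return hits


cache: set[int] = set()
first = reusable_blocks(cache, list(range(12)))                     # cold cache
second = reusable_blocks(cache, list(range(8)) + [99, 98, 97, 96])  # shared prefix
```

On the second request, the first two blocks (the shared 8-token prefix) hit the cache, so only the divergent tail needs fresh KV computation; the chained keys guarantee that once a block differs, no later block can falsely match.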

| Model | Support | Attention Kernel | Automatic Prefix Cache | PRs | Notes |
| --- | --- | --- | --- | --- | --- |
| Qwen3 | ✅ | GQA (paged) | ✅ | #232, #237, #283 | Validated by the paged prefix-cache e2e test |
| Qwen3.5 | ✅ | Hybrid SDPA + GDN linear | ❌ | #210, #226, #230, #235, #239, #243, #259, #265, #194 | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
| Qwen3.6 | ✅ | Hybrid SDPA + GDN linear (MoE) | ❌ | | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
| Qwen3-Next | ✅ | Hybrid SDPA + GDN linear | ❌ | #240 | Upstream keeps automatic prefix caching off for hybrid/Mamba models |
| Gemma 4 | 🔵 | GQA + per-layer sliding window + YOCO | ✅ | #251, #260, #269, #275, #277, #278, #282, #276, #279, #281, #283 | Default-on for non-hybrid paged models; overall model support remains experimental |
| Gemma 3 | 🟡 | GQA (paged) | ✅ | #283 | Default-on by upstream policy; model support not separately verified on Metal |
| Llama 3 | ✅ | GQA (paged) | ✅ | #294 | Tested on llama3.2-1B |
| Mistral-Small-24B | 🔵 | GQA (paged) | ✅ | #166, #190, #283 | Default-on for non-hybrid paged models |
| GPT-OSS | 🔵 | Sink attention (paged) | ✅ | #190, #221, #212, #283 | Default-on for non-hybrid paged models |
| GLM-4.5 | 🟡 | MLA (paged latent cache, MLX SDPA; no Metal kernel) | 🟡 | #213, #233 | Automatic prefix caching is not yet verified on the MLX MLA path |
| GLM-4.7-Flash | 🔵 | GQA (paged) | ✅ | #190, #283 | Default-on for non-hybrid paged models |