Releases: vllm-project/vllm
v0.18.1
This is a patch release on top of v0.18.0 to address a few issues:
- Change default SM100 MLA prefill backend back to TRT-LLM (#38562)
- Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 (#37158)
- Disable monolithic TRTLLM MoE for Renormalize routing (#37605)
- Pre-download missing FlashInfer headers in Docker build (#38391)
- Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083)
v0.18.0
vLLM v0.18.0
Known issues
- Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618)
- If you previously ran into `CUBLAS_STATUS_INVALID_VALUE` and had to use a workaround in v0.17.0, you can reinstall torch 2.10.0. PyTorch published an updated wheel that addresses this bug.
Highlights
This release features 445 commits from 213 contributors (61 new)!
- gRPC Serving Support: vLLM now supports gRPC serving via the new `--grpc` flag (#36169), enabling high-performance RPC-based serving alongside the existing HTTP/REST interface.
- GPU-less Render Serving: The new `vllm launch render` command (#36166, #34551) enables GPU-less preprocessing and rendering, allowing separation of multimodal preprocessing from GPU inference.
- NGram GPU Speculative Decoding: NGram speculative decoding now runs on GPU and is compatible with the async scheduler (#29184), significantly reducing spec decode overhead.
- KV Cache Offloading Improvements: Smart CPU offloading that stores only frequently-reused blocks (#35342), plus FlexKV as a new offloading backend (#34328) and support for multiple KV groups in offloading spec (#36610).
- Elastic Expert Parallelism Milestone 2: NIXL-EP integration (#35627) enables dynamic GPU scaling for MoE experts, with a new `--enable-ep-weight-filter` CLI option (#37351) for faster EP model loading.
- FlashInfer 0.6.6: Updated FlashInfer dependency (#36768) with numerous performance and correctness improvements.
- Responses API Streaming Tool Calls: The OpenAI Responses API now supports tool/function calling with streaming (#29947).
- Online Beam Search for ASR: Beam search support for encoder/decoder models, covering both offline (#36153) and online (#36160) transcription.
- Ray No Longer a Default Dependency: Ray has been removed as a default dependency (#36170) — install it explicitly if needed.
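As an illustrative sketch of the first two highlights (the model name is a placeholder, and exact sub-command arguments and defaults may differ from what's shown):

```shell
# Serve a model over gRPC (--grpc, #36169) alongside or instead of the
# default HTTP/REST interface.
vllm serve <model> --grpc

# Start a GPU-less render server (#36166, #34551) that only performs
# multimodal preprocessing/rendering, keeping GPU nodes free for inference.
vllm launch render
```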
Model Support
- New architectures: Sarvam MoE (#33942), OLMo Hybrid (#32550), HyperCLOVAX-SEED-Think-32B VLM (#31471), HyperCLOVAX-SEED-Think-14B (#37107), Kimi-Audio-7B-Instruct (#36127), ColPali late-interaction retrieval (#36818), ERNIE pooling models (#36385).
- Speculative decoding: Eagle3 for Qwen3.5 (#36658), Eagle3 for Kimi K2.5 MLA (#36361), Eagle for Mistral Large 3 with dense layers (#36163).
- LoRA: Whisper LoRA (#29856), FP8 LoRA dense kernel (#35242).
- Multimodal: Online use_audio_in_video (#36319), audio extraction from MP4 for Nemotron Nano VL (#35539), audio transcription for MP4/M4A/WebM (#35109), expose media_io_kwargs at runtime (#34778), fast media preprocessing for Nano Nemotron VL (#35657).
- Compatibility: Gemma/Gemma2 inputs_embeds (#36787), SigLIP/CLIP Transformers v5 (#37200), fused expert weights in Transformers backend (#36997).
- Performance: Qwen3 Next fused GDN kernel (#35777), LFM2 tuned H100 MoE configs (#36699).
- Fixes: DeepSeek-V3.2 tokenizer space stripping (#37004), Qwen3.5 tool calling (#36774), Qwen3-VL timestamp mismatch (#36136), Qwen3-Next TP>1 weight sharding (#36242), Qwen3-ASR torch.compile (#35869), MiniCPM-V audio inference (#36751), MiniCPM-O 4.5 ViT attention (#34127), routed experts for hybrid models (#35744), Qwen2.5-Omni/Qwen3-Omni multi-video audio_in_video (#37147), DeepSeek-OCR empty images crash (#36670).
Engine Core
- Model Runner V2: Probabilistic rejection sampling for spec decode (#35461), pooling models (#36019), extensible CUDA graph dispatch (#35959), WhisperModelState (#35790), XD-RoPE (#36817), model_state CUDA graph capture (#36544).
- KV cache offloading: Reuse-frequency-gated CPU stores (#35342), FlexKV offloading backend (#34328), multiple KV groups (#36610), async scheduling fix (#33881).
- Speculative decoding: NGram GPU implementation with async scheduler (#29184), fused EAGLE step slot mapping (#33503).
- Performance: Remove busy loop from idle buffer readers (#28053), 2.7% E2E throughput for pooling via worker-side maxsim (#36159), 3.2% via batched maxsim (#36710), CUDA graph memory accounting during profiling (#30515), checkpoint prefetch to OS page cache (#36012), InstantTensor weight loader (#36139), sporadic stall fix via pin_memory removal (#37006).
- Stability: VLM concurrent throughput degradation fix (#36557), DP deadlock fix (#35194), DeepSeek V3.2 OOM during CG profiling (#36691), Ray DP startup crash (#36665), NCCL rank calculation fix (#36940), zero-init MLA output buffers for NaN prevention (#37442), CUDA OOM fix (#35594).
- Defaults: Cascade attention disabled by default (#36318).
- Extensibility: OOT linear method registration (#35981), custom collective ops registration for non-CUDA platforms (#34760).
Kernel
- FA4 for MLA prefill (#34732).
- FlashInfer Sparse MLA: FP8 KV cache support (#35891), CUDA graphs on ROCm (#35719), MTP lens > 1 on ROCm (#36681).
- TRTLLM FP8 MoE modular kernel (#36307).
- FP8 KV cache for Triton MLA decode (#34597).
- FlashInfer MoE A2A kernel (#36022).
- Remove chunking from FusedMoE for full batch processing (#34086).
- CustomOp FusedRMSNormGated for torch.compile compatibility (#35877).
- Mamba2 SSD prefill Triton kernel optimization (#35397).
- DeepSeek-V3.2: Vectorized MLA query concat kernel (#34917), optimized FP8 KV cache gather for context parallel (#35290).
- 320-dimension MLA head size support (#36161).
- Packed recurrent fast path for decode (#36596).
- EP scatter race condition fix (#34991).
Hardware & Performance
- NVIDIA: FA4 for MLA prefill (#34732), DeepSeek-V3.2 MLA kernel optimizations (#34917, #35290).
- AMD ROCm: Sparse MLA CUDA graphs (#35719), MTP lens > 1 in Sparse MLA (#36681), MLA with nhead<16 + FP8 KV for TP=8 (#35850), RoPE+KV cache fusion for AITER FA (#35786), AITER MLA CPU sync avoidance (#35765), Quark W4A8 MXFP4/FP8 (#35316), gfx1152/gfx1153 Krackan support (#36499), fused_topk_bias AITER optimization (#36253), skinny GEMM improvements (#34304), DeepEP in ROCm Dockerfile (#36086), startup OOM fix (#36720).
- Intel XPU: Model Runner V2 enabled (#36078), MLA Sparse backend for DeepSeek V3.2 (#33230), LoRA via torch.compile (#36962), block FP8 MoE fallback (#36458), deepseek_scaling_rope fused kernel (#36612).
- CPU: aarch64 int8 matmul via OneDNN upgrade (#36147), AMD Zen CPU backend via zentorch (#35970).
- RISC-V: CPU backend support (#36578).
- Performance: 5% E2E improvement for PD disaggregation scheduling (#35781), packed recurrent decode fast path (#36596), pooling model maxsim 2.7%+3.2% throughput (#36159, #36710).
- torch.compile: FakeTensors instead of real GPU tensors for single-size compilation (#36093), non-contiguous fused RMSNorm + group quant (#36551), stop lazy compiling (#35472).
Large Scale Serving
- Elastic EP Milestone 2: NIXL-EP integration (#35627), `--enable-ep-weight-filter` for faster EP loading (#37351).
- PD Disaggregation: ~5% scheduler overhead reduction (#35781), KV transfer fix with spec decode (#35158), P/D for hybrid SSM-FA models via NIXL (#36687), PP for multimodal models on Transformers backend (#37057).
- KV Connectors: HMA + NIXL connector (#35758), FlexKV offloading (#34328), worker→scheduler metadata (#31964), All-to-All DCP backend (#34883).
- LMCache: Fault tolerance mechanism (#36586), memory leak fix (#35931), race condition fix (#35831), TP size for MLA multi-reader locking (#36129).
- EP loading: Skip non-local expert weights (#37136).
Quantization
- ModelOpt MXFP8 MoE support (#35986).
- MXFP4 MoE routing simulation override for accuracy (#33595).
- FP8 LoRA dense kernel (#35242).
- ROCm: Quark W4A8 MXFP4/FP8 for LinearLayer (#35316), compressed-tensors fix for DeepSeek-R1 on MI300x (#36247).
- Fixes: MLA crash with AWQ/GPTQ quantized models (#34695), score layer quantization for reranker models (#35849), GLM-4.1V non-default quantization (#36321), FP8 k_scale/v_scale loading for Qwen3-MoE (#35656).
API & Frontend
- gRPC: New `--grpc` flag for gRPC serving (#36169).
- GPU-less serving: `vllm launch render` for preprocessing-only serving (#36166), `vllm launch` for GPU-less preprocessing (#34551).
- Responses API: Streaming tool/function calling (#29947), reasoning item fixes (#34499, #36516).
- Anthropic API: Accept redacted thinking blocks (#36992).
- ASR: Online beam search transcriptions (#36160), offline beam search (#36153), audio transcription for MP4/M4A/WebM (#35109), realtime endpoint metrics (#35500).
- Tool calling: Granite4 tool parser (#36827), Qwen3Coder anyOf double encoding fix (#36032).
- New options: `--distributed-timeout-seconds` (#36047), `--attention-backend auto` (#35738), `reasoning_effort=none` (#36238), PyTorch profiler schedule (#35240).
- Cohere Embed v2 API support (#37074).
- Azure Blob Storage support for RunAI Model Streamer (#34614).
- Graceful shutdown timeout for in-flight requests (#36666).
- Fixes: tool_choice=required exceeding max_tokens crash (#36841), negative max_tokens with long prompts (#36789), concurrent classify/token_classify race (#36614), Anthropic billing header prefix cache miss (#36829), render endpoint crash for multimodal requests (#35684), xgrammar dtype mismatch on macOS CPU (#32384), minimax_m2 tool parser with stream interval > 1 (#35895).
Security
- Respect user `trust_remote_code` setting in NemotronVL and KimiK25 (#36192).
- Upgrade xgrammar for security fix (#36168).
- Guard RLHF weight sync deserialization behind insecure serialization flag (#35928).
Dependencies
- FlashInfer 0.6.6 (#36768).
- Ray removed from default dependencies (#36170).
- `kaldi_native_fbank` made optional (#35996).
- OpenAI dependency bounded to 2.24.0 (#36471).
- Deprecated items from v0.18 removed (#36470, #36006).
- Mistral common v10 (#36971).
Breaki...
v0.17.1
This is a patch release on top of v0.17.0 to address a few issues:
- New Model: Nemotron 3 Super
- Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 (#36017)
- Fix/resupport nongated fused moe triton (#36412)
- Re-enable EP for trtllm MoE FP8 backend (#36494)
- [Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219)
- Fix TRTLLM Block FP8 MoE Monolithic (#36296)
- [DSV3.2][MTP] Optimize Indexer MTP handling (#36723)
v0.17.0
vLLM v0.17.0
Known Issue: If you are on CUDA 12.9+ and encounter a CUBLAS_STATUS_INVALID_VALUE error, this is caused by a CUDA library mismatch. To resolve, try one of the following:
- Remove the path to system CUDA shared library files (e.g. `/usr/local/cuda`) from `LD_LIBRARY_PATH`, or simply `unset LD_LIBRARY_PATH`.
- Install vLLM with `uv pip install vllm --torch-backend=auto`.
- Install vLLM with `pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129` (change the CUDA version to match your system).
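The three workarounds above as shell commands (pick one; cu129 is the example index from the text, so substitute the CUDA version matching your system):

```shell
# Option 1: stop system CUDA libraries (e.g. /usr/local/cuda) from
# shadowing the libraries bundled with the PyTorch wheel.
unset LD_LIBRARY_PATH

# Option 2: let uv pick the torch backend matching your system.
uv pip install vllm --torch-backend=auto

# Option 3: install against PyTorch's CUDA 12.9 wheel index.
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
```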
Highlights
This release features 699 commits from 272 contributors (48 new)!
- PyTorch 2.10 Upgrade: This release upgrades to PyTorch 2.10.0, which is a breaking change for environment dependencies.
- FlashAttention 4 Integration: vLLM now supports the FlashAttention 4 backend (#32974), bringing next-generation attention performance.
- Model Runner V2 Maturation: Model Runner V2 has reached a major milestone with Pipeline Parallel (#33960), Decode Context Parallel (#34179), Eagle3 speculative decoding with CUDA graphs (#35029, #35040), pooling model support (#35120), piecewise & mixed CUDA graph capture (#32771), DP+EP for spec decoding (#35294), and a new ModelState architecture. Design docs are now available (#35819).
- Qwen3.5 Model Family: Full support for the Qwen3.5 model family (#34110) featuring GDN (Gated Delta Networks), with FP8 quantization, MTP speculative decoding, and reasoning parser support.
- New `--performance-mode` Flag: A new `--performance-mode {balanced, interactivity, throughput}` flag (#34936) simplifies performance tuning for common deployment scenarios.
- Anthropic API Compatibility: Added support for Anthropic thinking blocks (#33671), `count_tokens` API (#35588), `tool_choice=none` (#35835), and streaming/image handling fixes.
- Weight Offloading V2 with Prefetching: The weight offloader now hides onloading latency via prefetching (#29941), plus selective CPU weight offloading (#34535) and CPU offloading without pinned memory doubling (#32993).
- Elastic Expert Parallelism Milestone 2: Initial support for elastic expert parallelism enabling dynamic GPU scaling for MoE models (#34861).
- Quantized LoRA Adapters: Users can now load quantized LoRA adapters (e.g. QLoRA) directly (#30286).
- Transformers v5 Compatibility: Extensive work to ensure compatibility with HuggingFace Transformers v5 across models and utilities.
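A hedged usage sketch for the new performance-mode flag (placeholder model name; see #34936 for the precise behavior of each mode):

```shell
# Optimize for aggregate throughput on batch workloads.
vllm serve <model> --performance-mode throughput

# Or favor low per-token latency for interactive chat workloads.
vllm serve <model> --performance-mode interactivity
```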
Model Support
- New architectures: Qwen3.5 (#34110), COLQwen3 (#34398), ColModernVBERT (#34558), Ring 2.5 (#35102), skt/A.X-K1 (#32407), Ovis 2.6 (#34426), nvidia/llama-nemotron-embed-vl-1b-v2 (#35297), nvidia/llama-nemotron-rerank-vl-1b-v2 (#35735), nvidia/nemotron-colembed (#34574).
- ASR models: FunASR (#33247), FireRedASR2 (#35727), Qwen3-ASR realtime streaming (#34613).
- Multimodal: OpenPangu-VL video input (#34134), audio chunking for offline LLM (#34628), Parakeet audio encoder for nemotron-nano-vl (#35100), MiniCPM-o flagos (#34126).
- LoRA: LFM2 (#34921), Llama 4 Vision tower/connector (#35147), max vocab size increased to 258048 (#34773), quantized LoRA adapters (#30286).
- Task expansion: ColBERT extended to non-standard BERT backbones (#34170), multimodal scoring for late-interaction models (#34574).
- Performance: Qwen3.5 GDN projector fusion (#34697), FlashInfer cuDNN backend for Qwen3 VL ViT (#34580), Step3.5-Flash NVFP4 (#34478), Qwen3MoE tuned configs for H200 (#35457).
- Fixes: DeepSeek-VL V2 simplified loading (#35203), Qwen3/Qwen3.5 reasoning parser (#34779), Qwen2.5-Omni/Qwen3-Omni mixed-modality (#35368), Ernie4.5-VL garbled output (#35587), Qwen-VL tokenizer (#36140), Qwen-Omni audio cache (#35994), Nemotron-3-Nano NVFP4 accuracy with TP>1 (#34476).
Engine Core
- Model Runner V2: Pipeline Parallel (#33960), Decode Context Parallel (#34179), piecewise & mixed CUDA graphs (#32771), Eagle3 with CUDA graphs (#35029, #35040), pooling models (#35120), DP+EP for spec decoding (#35294), bad_words sampling (#33433), ModelState architecture (#35350, #35383, #35564, #35621, #35774), design docs (#35819).
- Weight offloading: V2 prefetching to hide latency (#29941), selective CPU weight offloading (#34535), CPU offloading without pinned memory doubling (#32993).
- Sleep level 0 mode with enqueue/wait pattern (#33195), pause/resume moved into engine (#34125).
- Fixes: allreduce_rms_fusion disabled by default with PP > 1 (#35424), DCP + FA3 crash (#35082), prefix caching for Mamba "all" mode (#34874), num_active_loras fix (#34119), async TP reduce-scatter reduction fix (#33088).
- Repetitive token pattern detection flags (#35451).
Kernel
- FlashAttention 4 integration (#32974).
- FlashInfer Sparse MLA backend (#33451).
- Triton-based top-k and top-p sampler kernels (#33538).
- Faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680).
- Optimized grouped topk kernel (#34206).
- TRTLLM DSV3 Router GEMM kernel, 6% batch-1 speedup (#34302).
- FA3 swizzle optimization (#34043).
- 256-bit LDG/STG activation kernels (#33022).
- TMA support for fused_moe_lora kernel (#32195).
- Helion kernel framework: silu_mul_fp8 kernel (#33373), autotuning infrastructure (#34025), num_tokens autotuning (#34185), fx tracing via HOP (#34390), GPU variant canonicalization (#34928).
- FlashInfer TRTLLM fused MoE non-gated FP8 & NVFP4 (#33506).
- Optimized sample_recovered_tokens kernel (#34974).
- KV cache update ops extraction from FlashInfer forward (#35422) and MLA backends (#34627).
Hardware & Performance
- NVIDIA: SM100 FMHA FP8 prefill for MLA (#31195), SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448), SM100 Oink RMSNorm path (#31828), SM120 FP8 GEMM optimization (#34424), FlashInfer DeepGEMM swapAB on SM90 by default (#34924), DeepSeek R1 BF16 min latency QKV GEMM 0.5% E2E speedup (#34758), Cublas BF16 gate with FP32 output (#35121), FlashInfer All Reduce default to TRTLLM backend (#35793).
- AMD ROCm: AITER fused RoPE+KVCache (#33443), MXFP4 MoE weight pre-shuffling on gfx950 (#34192), bitsandbytes quantization (#34688), CK backend for MoE quantization (#34301), dynamic MXFP4 for DeepSeek V2 (#34157), GPT-OSS Quark format (#29008), GPT-OSS WMXFP4_AFP8 static scales (#30357), encoder/encoder-decoder on AITER (#35334), device capability derivation without CUDA init (#35069), `aiter` package renamed to `amd-aiter` (#35198).
- Intel XPU: CUDA graph support (#34482), GPUDirect RDMA via NIXL (#35270), TORCH_SDPA/TRITON_ATTN as ViT backend (#35010), vllm-xpu-kernels v0.1.3 (#35984).
- CPU: ARM BF16 cross-compilation (#33079), FP16 for s390x (#34116), KleidiAI INT8_W4A8 for all input dtypes (#34890), s390x vector intrinsics for attention (#34434), prefix caching for ppc64le (#35081), CPU release supports both AVX2 and AVX512 (#35466).
- Performance: Pipeline Parallel async send/recv 2.9% E2E throughput (#33368), pooling maxsim 13.9% throughput improvement (#35330), Triton ViT attention backend (#32183), Mamba1 kernel-level chunk alignment for prefix caching (#34798), detokenizer optimization (#32975), pooling model copy optimization 1.8% throughput (#35127).
Large Scale Serving
- Pipeline Parallel async send/recv, 2.9% throughput improvement (#33368).
- Elastic EP Milestone 2 (#34861).
- EPLB: Async rebalance algorithm (#30888), sync enforcement for NCCL backend (#35212).
- Native weight syncing API via IPC for RL workflows (#34171).
- Decode Context Parallel in Model Runner V2 (#34179).
- Ray env var propagation to workers (#34383).
- Breaking: KV load failure policy default changed from "recompute" to "fail" (#34896).
- Cross-node data parallelism message queue fix (#35429).
- NIXL: Token-based IPC API (#34175), version bound (#35495), NUMA core binding (#32365).
Speculative Decoding
- Nemotron-H MTP and Mamba speculative decoding (#33726).
- Eagle3 on Model Runner V2 with CUDA graphs (#35029, #35040), Eagle3 + disaggregated serving (#34529).
- Hidden states extraction system (#33736).
- `min_tokens` support with speculative decoding (#32642).
- Reduced TP communication for draft generation (#34049).
- MTP num_speculative_tokens > 1 with sparse MLA (#34552).
- Sparse MLA + MTP with full CUDA graphs (#34457).
- Spec decoding in Mamba cache align mode (#33705).
- DP+EP for spec decoding in Model Runner V2 (#35294).
MoE Refactor
- MoERunner abstraction (#32344) with modular kernel architecture.
- MXFP4 Cutlass Experts to modular kernel (#34542), MXFP4 Marlin to modular kernel format (#34588), TRTLLM Kernels MK (#32564).
- MoEActivation enum (#33843).
- Improved default Triton fused MoE configs (#34846).
- Fused MoE + LoRA shared expert dual stream, 1.07x throughput (#34933).
- DSV3 QKVAProj GEMM custom op for torch.compile (#35751).
- Fix routing for models without expert groups (MiniMax-M2.1) (#34673).
torch.compile
- AOT compile with PyTorch 2.10 (#34155).
- AR+RMSNorm fusion by default at -O2 (#34299).
- SiLU+FP4 quant fusion by default at O1+ (#34718).
- Sequence parallelism threshold compile ranges (#28672).
- Various compile fixes: recursive pre_grad_passes (#34092), FakeTensorProp elimination (#34093), time discrepancy logging (#34912), artifact load errors (#35115), atomic artifact saving (#35117), pytree slice caching (#35308), fast_moe_cold_start undo for torch>=2.11 (#35475).
Quantization
- Quantized LoRA adapters (#30286).
- Per-head KV cache scales in attention selector (#34281).
- FP8 MoE bias for GPT-OSS (#34906).
- SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448).
- Mixed precision support for ModelOpt (#35047).
- Llama-4 attention quantization (int8, fp8) (#34243).
- Sparse24 compressed tensors fix (#33446)...
v0.16.0
vLLM v0.16.0
Please note that the branch for this release was cut on Feb 8, so features added to vLLM after that date are not included.
Highlights
This release features 440 commits from 203 contributors (7 new)!
- Async scheduling + Pipeline Parallelism is now fully supported, delivering 30.8% E2E throughput improvement and 31.8% TPOT improvement (#32618).
- Realtime API: A new WebSocket-based Realtime API enables streaming audio interactions (#33187), building on the Voxtral realtime infrastructure.
- RLHF workflow improvements: Native NCCL-based weight syncing API (#31943), layerwise weight reloading for QeRL (#32133), and engine pause/resume with request preservation (#32351).
- Unified Parallel Drafting for speculative decoding (#32887), plus spec decode now works with structured outputs (#33374) and penalty application in Model Runner V2 (#33251).
- Major XPU platform overhaul: Deprecated IPEX in favor of vllm-xpu-kernels (#33379), adding MoE (#33659), MXFP4 MoE (#33679), WNA16 (#33973), scaled_mm (#34117), and FP8 MoE (#34202) support.
Model Support
- New architectures: GLM-OCR with MTP (#33005), Qwen3-ASR (#33312), DeepSeek-OCR-2 (#33165), Intern-S1-Pro (#33636), MiniCPM-o 4.5 (#33431), openPangu7B-VL (#32449), NemotronHPuzzle heterogeneous (#32549), MusicFlamingo (#32696), FunAudioChat (#2), ColBERT late interaction (#33686), voyage-4-nano (#33720), GLM-5 (#34124).
- Speculative decoding: EAGLE3 for Hunyuan/HunyuanVL (#33035), AFMoE (#33111), Mistral3 (#33939).
- LoRA expansion: Gemma3 vision components (#32764), Nemotron-H MTP models (#32265), Qwen3 output embedding (#29816). Optimized fused MoE-LoRA kernel indexing (#32770, #32774), unpermute-aware fused MoE LoRA path (#32655), reduced kernel overhead for fewer active LoRAs with multiple CUDA graphs (#32005).
- Features: Qwen3-Omni transcription (#29828), Mistral Large 3 with FlashInfer MoE (#33174), LFM2 SigLIP2 intermediate encoder layers (#33370), Qwen3-Omni/GLM-4.xV MRoPE positioning fixes (#33010, #33039), embedding input for disabled modalities (#32493).
- Performance: GLM-4.7-GPTQ decode and MTP acceptance rate regression fix (#33771), DeepSeek V3.2 fast detokenization (#33855), DeepSeek V3.2 tokenizer fix (#33832), GLM-5 MTP accuracy fix (#34385).
Engine Core
- Async scheduling + Pipeline Parallelism: Full support with 30.8% throughput improvement (#32618), optimized spec decode + async scheduling with 1.5% throughput improvement (#33612), deadlock fix for torchrun PP broadcast (#33701).
- Speculative decoding: Unified Parallel Drafting (#32887), structured output support (#33374), penalty application in MRV2 (#33251), skip softmax for all-greedy rejection sampling (#32852), correctness fix for spec tokens with prefill chunks (#33652).
- RLHF: Native NCCL weight syncing API (#31943), layerwise reloading for QeRL (#32133), engine pause/resume with request preservation (#32351).
- Helion kernel framework: ConfigManager (#32740), kernel wrapper (#32964), kernel registry (#33203).
- PluggableLayer: Applied to linear layers (#33152) and Mamba layers (#33660).
- Batch invariance: Disable Cascade Attention (#32561), enable Triton attention (#33688).
- Performance: Grammar bitmask H2D copy on separate stream (#33059), zero-copy GQA for multimodal and CPU (#33732), early-reject oversized MM requests (#33502), CPU memory leak fix from Request reference cycle in prefix caching (#34183).
Hardware & Performance
- NVIDIA: FlashInfer TRTLLM BF16 MoE integration (#32954), SM100 INT4 W4A16 kernel (#32437), SM121 (DGX Spark) CUTLASS support (#33517), MNNVL protocol for GB series (#33540), FlashInfer MLA concat optimization (#31171), GDN attention layout optimization (#33291), DeepGEMM FP8 MLA performance (#33568), wvSplitK_fp8 performance (#33527, #33493), B200 MoE configs for Nemotron Nano (#32804), Super B200 TP2 (#33510), GLM 4.6 (#32958), Mamba selective scan tuning for B200 (#32873). Fix: DeepSeek R1 CUTLASS MLA on B200 (#33637), QK Norm+RoPE fusion on B200+FP8 (#33967), CUTLASS FP8 blockwise on SM103a (#32224).
- AMD ROCm: QWEN3-NEXT FP8 tunings (#32042), AITER attention backend for Qwen3-Next (#32492), fused_add_rmsnorm_pad for GPT-OSS (#30976), Qwen3-Omni startup fix (#33077).
- Intel XPU: Platform overhaul - deprecated IPEX, switched to vllm-xpu-kernels (#33379). New: unquantized MoE (#33659), MXFP4 MoE (#33679), WNA16 kernel (#33973), scaled_mm kernel (#34117), FP8 MoE (#34202).
- ARM CPU: KleidiAI INT4 dynamic quant with BF16 activations (#33122), NEON BFMMLA BF16 paged attention (#32263), vectorization backend optimization (#30329), attention dispatch by head_dim alignment (#32161).
- IBM Z: BF16 kernel type for s390x (#33788).
- torch.compile: Stop compiling identical artifacts (#34003), MoE cold start optimization option (#33735), fix 32-bit indexing assumption (#33113), attention fusion pass fix (#33945).
- Performance: Chat completion streaming optimization (#33782), ORJSONResponse for faster API responses (#33548), MoE permute optimization for CUTLASS FP8 (#32892), shared/routed overlap for latent MoE on Nemotron-H (#32790), FlashInfer autotune control flag (#34006).
Large Scale Serving
- Disaggregated serving: Mooncake connector rework with bootstrap server (#31034), cross-layer KV cache layout at NIXL Connector V2 (#33339), delay freeing blocks for aborted async loads (#32255), async double-free fix (#33377), Ray multi-replica single-instance fix (#33604).
- EPLB: Capture logical experts with router replay (#33013), DP metadata fix for dense models (#32739).
- Metrics: KV offloading connector metrics (#27942), labeled prompt token metrics for P/D disaggregation (#33290).
Quantization
- New: FP8 block quant for CompressedTensorsW8A16Fp8 (#33280), ModelOpt MXFP8 for dense models (#33786), NVFP4/FP8 on Turing GPUs (#33076), TP > 4 for FP4 Gemm (#31099).
- Bugfixes: FP8 online quantization memory fix (#31914), asymmetric W4A16 (ConchLinear) for CT (#33200), DeepSeek V3.2 NVFP4 (#33932), LoRA FP8 (#33879), quantized Falcon-H1 model loading (#32728), quantized Mamba TP with n_groups=1 (#33257), CPU W8A8 with bias (#33582), CPU W8A8 3D input support (#33727).
- Deprecation: Removed BitBlas (#32683) and Marlin 24 (#32688).
API & Frontend
- Realtime API: WebSocket-based streaming API (#33187) with Voxtral realtime support.
- Responses API: Sampling parameters (#32609), return token IDs (#33212), return prompt token IDs (#33378), parser implementation (#32712).
- Pooling API: Request schema consensus for ScoreRequest (#33060) and final standardization (#31127).
- Tool calling: Fix multi-turn tool call ID preservation (#32768), fix indexing double-counting (#33141), GLM-4 incremental string streaming (#33218), DSV3.2 fast detokenization fix (#33964), MCP tools non-streaming fix (#32762).
- Structured outputs: Performance optimization with reasoning (#33557), guidance vocab size fix (#33509).
- CLI: `--disable-access-log-for-endpoints` option (#30011).
- UX: Nested configs in YAML files (#33193), GGUF `repo_id:quant_type` syntax (#33371), DeepSeek ReasoningParser with thinking enabled by default (#33221), remove noisy CT warning (#33273), early tokenization validation (#31366), reasoning_content backward compatibility (#33635), only include Authorization header when OPENAI_API_KEY is set (#33488).
- Features: run_batch transcription/translation support (#33934), /server_info collect_env (#33246), OTEL tracing during model loading (#31162), clear MM and encoder cache (#33452), HF Hub LoRA resolver (#20320).
- Scoring: Fix multi-document scoring returning single result (#33837).
Security
- Patch protobuf for CVE-2026-0994 (#34253).
Dependencies
- huggingface-hub updates for Transformers v5 preparation (#33473).
- Transformers v5 compatibility fixes across multiple models (#33977, #33683).
Deprecation & Breaking Changes
- Removed BitBlas quantization (#32683) and Marlin 24 (#32688).
- Removed deprecated `reasoning_content` message field (#33402).
- Removed deprecated pooling items (#33477).
- Removed deprecated `VLLM_ALL2ALL_BACKEND` environment variable (#33535).
- Deprecated IPEX for XPU, switched to vllm-xpu-kernels (#33379).
New Contributors 🎉
- @aabbccddwasd made their first contribution in #33771
- @Code4me2 made their first contribution in #33517
- @ikchifo made their first contribution in #33967
- @jiangwu300 made their first contribution in #33604
- @pjs102793 made their first contribution in #33963
- @sleepcoo made their first contribution in #33978
- @TundeAtSN made their first contribution in #33939
v0.15.1
v0.15.1 is a patch release with security fixes, RTX Blackwell GPU support fixes, and bug fixes.
Security
- CVE-2025-69223: Updated aiohttp dependency (#33621)
- CVE-2026-0994: Updated Protobuf dependency (#33619)
Highlights
Bugfix Hardware Support
- RTX Blackwell (SM120): Fixed NVFP4 MoE kernel support for RTX Blackwell workstation GPUs. Previously, NVFP4 MoE models would fail to load on these GPUs (#33417)
- FP8 kernel selection: Fixed FP8 CUTLASS group GEMM to properly fall back to Triton kernels on SM120 GPUs (#33285)
Model Support
- Step-3.5-Flash: New model support (#33523)
Bugfix Model Support
- Qwen3-VL-Reranker: Fixed model loading (#33298)
- Whisper: Fixed FlashAttention2 with full CUDA graphs (#33360)
Performance
- torch.compile cold-start: Fixed regression that increased cold-start compilation time (Llama3-70B: ~88s → ~22s) (#33441)
- MoE forward pass: Optimized by caching layer name computation (#33184)
Bug Fixes
- Fixed prefix cache hit rate of 0% with GPT-OSS style hybrid attention models (#33524)
- Enabled Triton MoE backend for FP8 per-tensor dynamic quantization (#33300)
- Disabled unsupported Renormalize routing methods for TRTLLM per-tensor FP8 MoE (#33620)
- Fixed speculative decoding metrics crash when no tokens generated (#33729)
- Disabled fast MoE cold start optimization with speculative decoding (#33624)
- Fixed ROCm skinny GEMM dispatch logic (#33366)
Dependencies
- Pinned LMCache >= v0.3.9 for API compatibility (#33440)
New Contributors 🎉
- @zaristei2 made their first contribution in #33621
Full Changelog: v0.15.0...v0.15.1
v0.15.0
Highlights
This release features 335 commits from 158 contributors (39 new)!
Model Support
- New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456).
- LoRA expansion: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763).
- Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322).
- Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings (#14526).
- Model enhancements: Voxtral streaming architecture (#32861), SharedFusedMoE for Qwen3MoE (#32082), dynamic resolution for Nemotron Nano VL (#32121), Molmo2 vision backbone quantization (#32385).
Engine Core
- Async scheduling + Pipeline Parallelism: `--async-scheduling` now works with pipeline parallelism (#32359).
- Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models with `--enable-prefix-caching --mamba-cache-mode align`. Achieves ~2x speedup by caching Mamba states directly (#30877).
- Session-based streaming input: New incremental input support for interactive workloads like ASR. Accepts async generators producing `StreamingInput` objects while maintaining KV cache alignment (#28973).
- Model Runner V2: VLM support (#32546), architecture improvements.
- LoRA: Inplace loading for memory efficiency (#31326).
- AOT compilation: torch.compile inductor artifacts support (#25205).
- Performance: KV cache offloading redundant load prevention (#29087), FlashAttn attention/cache update separation (#25954).
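For instance, the Mamba prefix-caching item above combines its two flags like this (placeholder model name; a sketch, not a verified invocation):

```shell
# Enable block-aligned prefix caching for a Mamba/hybrid model (#30877).
# "align" caches Mamba states at block-aligned boundaries, giving ~2x
# speedup on shared-prefix workloads.
vllm serve <mamba-or-hybrid-model> --enable-prefix-caching --mamba-cache-mode align
```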
Hardware & Performance
NVIDIA
- Blackwell defaults: FlashInfer MLA is now the default MLA backend on Blackwell, with TRTLLM as default prefill (#32615).
- MoE performance: 1.2-2% E2E throughput improvement via grouped topk kernel fusion (#32058), NVFP4 small-batch decoding improvement (#30885), faster cold start for MoEs with torch.compile (#32805).
- FP4 kernel optimization: Up to 65% faster FP4 quantization on Blackwell (SM100F) using 256-bit loads, ~4% E2E throughput improvement (#32520).
- Kernel improvements: topk_sigmoid kernel for MoE routing (#31246), atomics reduce counting for SplitK skinny GEMMs (#29843), fused cat+quant for FP8 KV cache in MLA (#32950).
- torch.compile: SiluAndMul and QuantFP8 CustomOp compilation (#32806), Triton prefill attention performance (#32403).
AMD ROCm
- MoRI EP: High-performance all2all backend for Expert Parallel (#28664).
- Attention improvements: Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#29887).
- FP4 support: MLA projection GEMMs with dynamic quantization (#32238).
- Consumer GPU support: Flash Attention Triton backend on RDNA3/RDNA4 (#32944).
Other Platforms
- TPU: Pipeline parallelism support (#28506), backend option (#32438).
- Intel XPU: AgRsAll2AllManager for distributed communication (#32654).
- CPU: NUMA-aware acceleration for TP/DP inference on ARM (#32792), PyTorch 2.10 (#32869).
- Whisper: torch.compile support (#30385).
- WSL: Platform compatibility fix for Windows Subsystem for Linux (#32749).
Quantization
- MXFP4: W4A16 support for compressed-tensors MoE models (#32285).
- Non-gated MoE: Quantization support with Marlin, NVFP4 CUTLASS, FP8, INT8, and compressed-tensors (#32257).
- Intel: Quantization Toolkit integration (#31716).
- FP8 KV cache: Per-tensor and per-attention-head quantization via llmcompressor (#30141).
API & Frontend
- Responses API: Partial message generation (#32100), `include_stop_str_in_output` tuning (#32383), `prompt_cache_key` support (#32824).
- OpenAI API: `skip_special_tokens` configuration (#32345).
- Score endpoint: Flexible input formats with `data_1`/`data_2` and `queries`/`documents` (#32577).
- Render endpoints: New endpoints for prompt preprocessing (#32473).
- Whisper API: `avg_logprob` and `compression_ratio` in `verbose_json` segments (#31059).
- Security: FIPS 140-3 compliant hash option for enterprise/government users (#32386), `--ssl-ciphers` CLI argument (#30937).
- UX improvements: Auto `api_server_count` based on `dp_size` (#32525), wheel variant auto-detection during install (#32948), custom profiler URI schemes (#32393).
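The flexible score-endpoint input formats mentioned above can be sketched as two request payloads; the field names (`data_1`/`data_2` and `queries`/`documents`) come from the release notes, while the model name and texts are illustrative placeholders:

```python
# Two illustrative payload shapes for the score endpoint. Only the field
# names are taken from the release notes; everything else is a placeholder.
pairwise_payload = {
    "model": "my-reranker",                              # placeholder model
    "data_1": "What is vLLM?",                           # query side
    "data_2": ["vLLM is a fast LLM inference engine."],  # candidate documents
}

retrieval_payload = {
    "model": "my-reranker",
    "queries": "What is vLLM?",
    "documents": ["vLLM is a fast LLM inference engine."],
}

# Both shapes describe the same scoring request.
assert pairwise_payload["data_1"] == retrieval_payload["queries"]
assert pairwise_payload["data_2"] == retrieval_payload["documents"]
```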
Dependencies
- FlashInfer v0.6.1 (#30993)
- Transformers 4.57.5 (#32287)
- PyTorch 2.10 for CPU backend (#32869)
- DeepGEMM newer version (#32479)
Breaking Changes & Deprecations
- Metrics: Removed deprecated `vllm:time_per_output_token_seconds` metric; use `vllm:inter_token_latency_seconds` instead (#32661).
- Environment variables: Removed deprecated environment variables (#32812).
- Quantization: DeepSpeedFp8 removed (#32679), RTN removed (#32697), HQQ deprecated (#32681).
Bug Fixes
- Speculative decoding: Eagle draft_model_config fix (#31753).
- DeepSeek: DeepSeek-V3.1 + DeepGEMM incompatible scale shapes fix (#32361).
- Distributed: DP+MoE inference fix via CpuCommunicator (#31867), P/D with non-MoE DP fix (#33037).
- EPLB: Possible deadlock fix (#32418).
- NIXL: UCX memory leak fix by exporting UCX_MEM_MMAP_HOOK_MODE=none (#32181).
- Structured output: Outlines byte fallback handling fix (#31391).
New Contributors 🎉
- @YunzhuLu made their first contribution in #32126
- @emricksini-h made their first contribution in #30784
- @dsfaccini made their first contribution in #32289
- @ofirzaf made their first contribution in #32312
- @seekskyworld made their first contribution in #32321
- @brian033 made their first contribution in #31715
- @TomerBN-Nvidia made their first contribution in #32257
- @vanshilshah97 made their first contribution in #32448
- @George-Polya made their first contribution in #32385
- @T1mn made their first contribution in #32411
- @mritunjaysharma394 made their first contribution in #31492
- @randzero made their first contribution in #32511
- @DemingCheng made their first contribution in #32556
- @iboiko-habana made their first contribution in #32471
- @honglyua-il made their first contribution in #32462
- @hyeongyun0916 made their first contribution in #32473
- @DanielMe made their first contribution in #32560
- @netanel-haber made their first contribution in #32121
- @longregen made their first contribution in #28784
- @jasonyanwenl made their first contribution in #32749
- @Wauplin made their first contribution in #32788
- @ikaadil made their first contribution in #32775
- @alexsun07 made their first contribution in #28664
- @liranschour made their first contribution in #30207
- @AuYang261 made their first contribution in #32844
- @diviramon made their first contribution in #32393
- @RishabhSaini made their first contribution in #32884
- @MatteoFari made their first contribution in #32397
- @peakcrosser7 made their first contribution in #30877
- @orionr made their first contribution in #30443
- @marksverdhei made their first contribution in #32614
- @joninco made their first contribution in #32935
- @monajafi-amd made their first contribution in #32944
- @ruizcrp made their first contribution in #32988
- @sjhddh made their first contribution in #32983
- @HirokenOvo made their first contribution in #32646
- @Chenhao-Guan made their first contribution in #32763
- @joshuadeng made their first contribution in #28973
- @ZhanqiuHu made their first contribution in #33016
Full Changelog: v0.14.1...v0.15.0
v0.14.1
v0.14.0
Highlights
This release features approximately 660 commits from 251 contributors (86 new contributors).
Breaking Changes:
- Async scheduling is now enabled by default. Users who experience issues can disable it with `--no-async-scheduling`.
  - Excludes some not-yet-supported configurations: pipeline parallel, CPU backend, non-MTP/Eagle spec decoding.
- PyTorch 2.9.1 is now required and the default wheel is compiled against cu129.
- Deprecated quantization schemes have been removed (#31688, #31285).
- When using speculative decoding, unsupported sampling parameters will fail rather than being silently ignored (#31982).
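The async-scheduling opt-out in the first breaking change above is a single flag; a sketch (the model name is a placeholder):

```shell
# Async scheduling is on by default in v0.14.0; disable it explicitly if you
# hit issues, e.g. with configurations it does not yet support.
vllm serve <model> --no-async-scheduling
```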
Key Improvements:
- Async scheduling enabled by default (#27614): Overlaps engine core scheduling with GPU execution, improving throughput without user configuration. Now also works with speculative decoding (#31998) and structured outputs (#29821).
- gRPC server entrypoint (#30190): Alternative to REST API with binary protocol, HTTP/2 multiplexing.
- `--max-model-len auto` (#29431): Automatically fits context length to available GPU memory, eliminating OOM startup failures.
- Model inspection view (#29450): View the modules, attention backends, and quantization of your model in vLLM by specifying `VLLM_LOG_MODEL_INSPECTION=1` or by simply printing the `LLM` object.
- Model Runner V2 enhancements: UVA block tables (#31965), M-RoPE (#32143), `logit_bias`/`allowed_token_ids`/`min_tokens` support (#32163).
  - Please note that Model Runner V2 is still experimental and disabled by default.
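The two serving conveniences above can be combined in one launch; a sketch (the model name is a placeholder):

```shell
# Illustrative: auto-fit the context length to available GPU memory and log a
# model inspection report at startup (substitute a real model for <model>).
VLLM_LOG_MODEL_INSPECTION=1 vllm serve <model> --max-model-len auto
```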
Model Support
New Model Architectures:
- Grok-2 with tiktoken tokenizer (#31847)
- LFM2-VL vision-language model (#31758)
- MiMo-V2-Flash (#30836)
- openPangu MoE (#28775)
- IQuestCoder (#31575)
- Nemotron Parse 1.1 (#30864)
- GLM-ASR audio (#31436)
- Isaac vision model v0.1/v0.2 (#28367, #31550)
- Kanana-1.5-v-3b-instruct (#29384)
- K-EXAONE-236B-A23B MoE (#31621)
LoRA Support Expansion:
- Multimodal tower/connector LoRA (#26674): LLaVA (#31513), BLIP2 (#31620), PaliGemma (#31656), Pixtral (#31724), DotsOCR (#31825), GLM4-V (#31652)
- DeepSeek-OCR (#31569), Qwen3-Next (#31719), NemotronH (#31539), PLaMo 2/3 (#31322)
- Vision LoRA mm_processor_cache support (#31927)
- MoE expert base_layer loading (#31104)
Model Enhancements:
- Qwen3-VL as reranker (#31890)
- DeepSeek v3.2 chat prefix completion (#31147)
- GLM-4.5/GLM-4.7 `enable_thinking: false` (#31788)
- Ernie4.5-VL video timestamps (#31274)
- Score template expansion (#31335)
- LLaMa4 vision encoder compilation (#30709)
- NemotronH quantized attention (#31898)
Engine Core
- Async scheduling default with spec decode (#27614, #31998) and structured outputs (#29821)
- Hybrid allocator + KV connector (#30166) with multiple KV cache groups (#31707)
- Triton attention: encoder-only/cross attention (#31406), cross-layer blocks (#30687)
- Mamba2 prefix cache optimization (#28047)
- Batch invariant LoRA (#30097)
- LoRA name in BlockStored for KV-cache reconstruction (#27577)
- Request ID collision prevention (#27987)
- Dense model DP without overhead (#30739)
- Async + spec decode penalties/bad_words (#30495)
Hardware & Performance
CUTLASS MoE Optimizations:
- 2.9% throughput + 10.8% TTFT via fill(0) optimization (#31754)
- 5.3% throughput + 2.2% TTFT via problem size calculation (#31830)
- Fused SiLU+Mul+Quant for NVFP4 (#31832)
- NVFP4 stride fusion (#31837)
Other Performance:
- GDN attention decode speedup (Qwen3-Next) (#31722)
- Fused RoPE + MLA KV-cache write (#25774)
- Sliding window attention optimization (#31984)
- FlashInfer DeepGEMM swapAB SM90 (#29213)
- Unpermute-aware fused MoE + small-batch fallback (#29354)
- GDN Attention blocking copy removal (#31167)
- FusedMoE LoRA small rank performance (#32019)
- EPLB numpy optimization (#29499)
- FlashInfer rotary for DeepSeek (#30729)
- Vectorized activations (#29512)
- NUMA interleaved memory (#30800)
- Async spec decode logprobs (#31336)
Hardware Configs:
- SM103 support (#30705, #31150)
- B300 Blackwell MoE configs (#30629)
- Qwen3-Next FP8 CUTLASS configs (#29553)
- Qwen3Moe B200 Triton configs (#31448)
- GLM-4.5/4.6 RTX Pro 6000 kernels (#31407)
- MiniMax-M2/M2.1 QKNorm (#31493)
- NVFP4 small batch tuning (#30897)
Platform:
- ROCm: AITER RMSNorm fusion (#26575), MTP for AITER MLA (#28624), moriio connector (#29304), xgrammar upstream (#31327)
- XPU: FP8 streaming quant (#30944), custom workers (#30935)
- CPU: Head sizes 80/112 (#31968), async disabled by default (#31525), LoRA MoE CPU pinning (#31317)
- TPU: tpu-inference path (#30808), Sophgo docs (#30949)
Large Scale Serving
- XBO (Extended Dual-Batch Overlap) (#30120)
- NIXL asymmetric TP (P > D tensor-parallel-size) (#27274)
- NIXL heterogeneous BlockSize/kv_layout (#30275)
- Cross-layers KV layout for MultiConnector (#30761)
- Mooncake protocol expansion (#30133)
- LMCache KV cache registration (#31397)
- EPLB default all2all backend (#30559)
Quantization
- Marlin for Turing (sm75) (#29901, #31000)
- Quark int4-fp8 w4a8 MoE (#30071)
- MXFP4 W4A16 dense models (#31926)
- ModelOpt FP8 variants (FP8_PER_CHANNEL_PER_TOKEN, FP8_PB_WO) (#30957)
- ModelOpt KV cache quantization update (#31895)
- NVFP4 Marlin for NVFP4A16 MoEs (#30881)
- Static quant all group shapes (#30833)
- Default MXFP4 LoRA backend: Marlin (#30598)
- compressed-tensors 0.13.0 (#30799)
API & Frontend
New Features:
- gRPC server (#30190)
- `--max-model-len auto` (#29431)
- Model inspection view (#29450)
- Offline FastAPI docs (#30184)
- `attention_config` in LLM() (#30710)
- MFU metrics (#30738)
- Iteration logging + NVTX (#31193)
- `reasoning_effort` parameter (#31956)
Tool Calling:
CLI:
- `-ep` for `--enable-expert-parallel` (#30890)
- Complete help messages (#31226)
- Bench serve auto-discovery + `--input-len` (#30816)
- Spec decode acceptance stats (#31739)
- `--enable-log-deltas` (renamed) (#32020)
- `--default-chat-template-kwargs` (#31343)
API:
- `/server_info` env info (#31899)
- MCP streaming in Responses API (#31761)
- `/embeddings` `continue_final_message` (#31497)
- Reranking score templates (#30550)
- Chat template warmup (#30700)
- Configurable handshake timeout (#27444)
- Better 500 errors (#20610)
- Worker init logging (#29493)
- Bench error reporting (#31808)
- Corrupted video recovery (#29197)
- Spec-decode param validation (#31982)
- Validation error metadata (#30134)
Security
Dependencies
- PyTorch 2.9.1 (#28495)
- compressed-tensors 0.13.0 (#30799)
- CUDA 13 LMCache/NIXL in Docker (#30913)
- Configurable NVSHMEM version (#30732)
Bug Fixes (User-Facing)
- Invalid UTF-8 tokens (#28874)
- CPU RoPE gibberish with `--enforce-eager` (#31643)
- Tool call streaming finish chunk (#31438)
- Encoder cache leak CPU scheduling stuck (#31857)
- Engine crash: tools + response_format (#32127)
- Voxtral transcription API (#31388)
- Safetensors download optimization (#30537)
Deprecations
Documentation
New Contributors 🎉
- @penfree made their first contribution in #30237
- @jiangkuaixue123 made their first contribution in #30120
- @jr-shen made their first contribution in #29663
- @grzegorz-k-karch made their first contribution in #30795
- @shanjiaz made their first contribution in #30799
- @Somoku made their first contribution in #29569
- @baoqian426 made their first contribution in #30841
- @SongDI911 made their first contribution in #30852
- @www-spam made their first contribution in #30827
- @Xunzhuo made their first contribution in #30844
- @TheCodeWrangler made their first contribution in #30700
- @SungMinCho made their first contribution in #30738
- @sarathc-cerebras made their first contribution in #30188
- @wzyrrr made their first contribution in #30949
- @navmarri14 made their first contribution in #30629
- @HaloWorld made their first contribution in #30867
- @jeffreywang-anyscale made their first contribution in #31013
- @AmeenP made their first contribution in #31093
- @westers made their first contribution in #31071
- @CedricHwong made their first contribution in #30957
- @c0de128 made their first contribution in #31114
- @Bounty-hunter made their first contribution in #30242
- @jzakrzew made their first contribution in #30550
- @1643661061leo made their first contribution in #30760
- @NickCao made their first contribution in https:/...
v0.13.0
vLLM v0.13.0 Release Notes Highlights
Highlights
This release features 442 commits from 207 contributors (61 new contributors)!
Breaking Changes: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading.
Model Support
- New models: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), latent MoE architecture support (#30203).
- Tool parsers: DeepSeek-V3.2 (#29848), Gigachat 3 (#29905), Holo2 reasoning (#30048).
- Model enhancements: Qwen3-VL embeddings support (#30037), Qwen3-VL EVS (Efficient Video Sampling) (#29752), DeepSeek V3.2 proper `drop_thinking` logic (#30490), DeepSeek V3.2 top-k fix (#27568).
- Task expansion: Automatic TokenClassification model conversion (#30666), Ultravox v0.7 transformer projector (#30089).
- Quantization: BitsAndBytes for Qwen3-Omni-MoE (#29896).
- Speculative decoding: Eagle/Eagle3 Transformers backend (#30340), Mamba `selective_state_update` spec decode (#29488).
Engine Core
- Compilation: Conditional compilation via `compile_ranges` for selective kernel compilation (#24252).
- Prefix caching: xxHash high-performance hash option (#29163).
- Attention: PrefixLM support for FlexAttention (#27938) and TritonAttention (#30386), CUDA graphs for 3D Triton attention (#28306), `TRITON_MLA` without prefix-caching (#29125).
- Batch invariance: FA2 and LoRA batch-invariant support (#30018).
- Pooling: Chunked prefill for ALL pooling tasks (#27145), multi-vector retrieval API (#26686).
- Model Runner V2: Min-p sampling (#30171), NaN detection in logits (#30187).
- Speculative decoding: Medusa GPU-CPU sync avoidance (#29723), async spec-decode improvements (#29624).
- Whisper: Major performance improvements: V1 is now faster than V0 (~3x speedup vs v0.12.0). Encoder batching (#29421), `FULL_DECODE_ONLY` CUDA graph (#30072), CPU backend support (#30062).
- Performance: Fused blockwise quant RMS norm (#27883), MoE LoRA loading reduction (#30243), encoder cache optimization (#30475), CPU KV offloading streams (#29013).
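The prefix-caching hash option above determines how cached token blocks are keyed. A minimal sketch of block-aligned prefix hashing, the idea behind such caches: each block's key chains in the previous block's hash, so a block can only be reused when its entire preceding prefix matches. This is illustrative only, uses stdlib `hashlib` rather than xxHash, and is not vLLM's actual implementation:

```python
import hashlib

BLOCK_SIZE = 4  # illustrative; real block sizes are larger


def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block of tokens, chaining the previous block's hash so
    a block's key depends on the entire prefix before it."""
    hashes: list[str] = []
    prev = b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        h = hashlib.sha256(prev + repr(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes


# Two prompts sharing a 4-token prefix reuse the first block's cache key,
# while their second blocks hash differently.
a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8])
b = block_hashes([1, 2, 3, 4, 9, 9, 9, 9])
assert a[0] == b[0] and a[1] != b[1]
```

A faster hash function (e.g. xxHash, as in #29163) reduces the per-block overhead of computing these keys without changing the scheme.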
Hardware & Performance
- NVIDIA Blackwell Ultra: SM103 (GB300) support with CUDA 13 (#30484).
- DeepSeek optimizations (benchmarked on DeepSeek-V3.1):
- DeepEP High-Throughput CUDA graph enabled by default: 5.3% throughput, 4.4% TTFT improvement (#29558)
- DeepGEMM fused layout kernel: 4.3% throughput, 10.7% TTFT improvement (#29546)
- DeepGEMM experts initialization: 3.9% TTFT improvement (#30494)
- `group_topk` kernel: 1.9% throughput, 2.1% TPOT improvement (#30159)
- Sparse prefill kernel for FP8 KV-cache in DeepSeek-V3.2 (#27532)
- MLA FP8 optimization with ReduceScatterSum (#29795), direct k_nope/k_pe copy (#29710)
- CPU: Whisper support (#30062), Arm Optimized Routines vectorized exp (#30068), x86 CPU wheel pipeline (#28848).
- AMD ROCm: Aiter quantization kernels (#25552), torch.compile layernorm/silu + FP8 quant (#25693), Triton ScaledMM fallback (#26668), MXFP4 w4a4 inference (#29775).
- Intel XPU: wNa16 compressed tensors (#29484).
- Build: CUDA 13 aarch64 wheels (#30341), Docker kernel build stage (#29452), Ascend NPU Docker (#30015).
Large Scale Serving & Disaggregated Prefill/Decode
- KV connectors: Mooncake Transfer Engine (#24718), cache reset via `/reset_prefix_cache` (#27170), KV events (#28309), failure recovery config (#26813).
- NIXL: Compatibility checking in handshake (#29503), large batch proxy support (#28782).
- EPLB: NVFP4 support (#29804), algorithm abstraction (#26471).
- Multi-node: External launcher mode (#29833).
- Hybrid allocator: Optional KV connector integration (#29805).
- Performance: silu_mul_per_token_group_quant_fp8 kernel for DP/EP (#29470).
Quantization
- New: W4A8 grouped GEMM on Hopper (#29691), online FP8 with streaming post-processing (#29196), FP8 weight reloading for RLHF (#28480).
- MoE + LoRA: AWQ Marlin (#30442) and GPTQ Marlin (#30254) support.
- GGUF: MoE + GGUF restored for Qwen3 MoE (#30116), Qwen2 MoE (#30307), HF defaults override (#30118).
- Compatibility: Transformers v5 RoPE support (#30046).
API & Frontend
- Responses API: MCP type infrastructure (#30054), Browser/Container MCP tools (#29989), full MCP Python loop (#29798), extra body parameters (#30532).
- Configuration: `AttentionConfig` replaces `VLLM_ATTENTION_BACKEND` env var (#26315).
- Chat templates: DeepSeek-V3.2 (#29837), DeepSeek-V3.2 developer tools (#30040).
- Anthropic API: Streaming fixes (#29971, #30266).
- Embeddings: Binary format with `encoding_format=bytes_only` (#30249), multiple image/audio per request (#29988), `tokenization_kwargs` override (#29794).
- Metrics: Prefill KV compute metric excluding cached tokens (#30189).
- Profiling: Layer-wise NVTX (#29990), profiling CLI config (#29912).
- UX: Better OOM errors (#28051), ModelConfig validation (#30213), distributed executor errors (#30140).
Security
- Additional protection for CVE-2025-62164 (#30649).
Dependencies
Breaking Changes & Deprecations
- PassConfig flags renamed per RFC #27995 (#29646)
- Attention env vars → CLI args: `VLLM_ATTENTION_BACKEND` replaced with `--attention-backend` (#26315)
- Removed `-O.xx` flag (#29991)
- Removed deprecated plugin/compilation fields (#30396)
- Removed deprecated task, seed, MM settings (#30397)
- Removed `embed_input_ids`/`embed_multimodal` fallbacks (#30458)
- Removed tokenizer setter (#30400)
- Deprecations: `merge_by_field_config` (#30035, #30170), `--convert reward` → `--convert embed` (#30463)
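For the attention-backend migration above, the before/after on the command line looks like this (backend and model names are placeholders):

```shell
# Before (removed): selecting the attention backend via environment variable.
VLLM_ATTENTION_BACKEND=<backend> vllm serve <model>

# After: pass it as a CLI argument instead (#26315).
vllm serve <model> --attention-backend <backend>
```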
New Contributors 🎉
- @ajpqs made their first contribution in #29905
- @amitz-nv made their first contribution in #29978
- @amrmahdi made their first contribution in #29452
- @andrewbriand made their first contribution in #29804
- @anker-c2 made their first contribution in #30344
- @AuruTus made their first contribution in #30182
- @avigny made their first contribution in #19425
- @Bhanu068 made their first contribution in #30254
- @Copilot made their first contribution in #29025
- @dbotwinick made their first contribution in #30583
- @dependabot[bot] made their first contribution in #30234
- @desertfire made their first contribution in #29919
- @dmitry-tokarev-nv made their first contribution in #30149
- @drslark made their first contribution in #30632
- @dtcccc made their first contribution in #24718
- @elizabetht made their first contribution in #28671
- @Elm8116 made their first contribution in #30068
- @gausah01 made their first contribution in #29604
- @gh-wf made their first contribution in #30285
- @hdlj-h made their first contribution in #30056
- @HF-001 made their first contribution in #30051
- @hzxuzhonghu made their first contribution in #29931
- @JaviS-Rei made their first contribution in #29882
- @johannesflommersfeld made their first contribution in #30390
- @KevinMusgrave made their first contribution in #30529
- @kitaekatt made their first contribution in #30408
- @lashahub made their first contribution in #30539
- @LuminolT made their first contribution in #29163
- @majiayu000 made their first contribution in #30615
- @MaoJianwei made their first contribution in #29797
- @Mercykid-bash made their first contribution in #26471
- @mgehre-amd made their first contribution in #30364
- @mivehk made their first contribution in #30512
- @mondaylord made their first contribution in #30671
- @noa-neria made their first contribution in #29320
- @PatrykSaffer made their first contribution in #30330
- @Peng-YM made their first contribution in #29074
- @realliujiaxu made their first contribution in #30059
- @redwrasse made their first contribution in #29261
- @Ri0S made their first contribution in #30532
- @sarathc-cerebras made their first contribution in #30188
- @scr...