
Releases: vllm-project/vllm

v0.9.0

15 May 03:38

Highlights

This release features 649 commits from 215 contributors (82 new contributors!)

  • vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependencies.
    • The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute CUDA 12.6 wheels as GitHub artifacts.
    • As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
  • Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized kernels for NVIDIA Blackwell, covering both attention and MLP.
    • You can use our docker image, or install the FlashInfer nightly wheel (pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl) and then set VLLM_ATTENTION_BACKEND=FLASHINFER for better performance (see the example after this list).
    • Upgraded support for the new FlashInfer main branch. (#15777)
    • Please check out #18153 for the full roadmap.
  • Initial DP, EP, PD support for large scale inference
    • EP:
      • Permute and unpermute kernel for moe optimization (#14568)
      • Modularize fused experts and integrate PPLX kernels (#15956)
      • Refactor pplx init logic to make it modular (prepare for deepep) (#18200)
      • Add ep group and all2all interface (#18077)
    • DP:
      • Decouple engine process management and comms (#15977)
    • PD:
      • NIXL Integration (#17751)
      • Local attention optimization for NIXL (#18170)
      • Support multiple kv connectors (#17564)
  • Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)
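
A minimal sketch of selecting the FlashInfer backend from Python, assuming the FlashInfer wheel above has already been installed (the model name is a placeholder, not part of the release notes):

```python
import os

# Select the FlashInfer attention backend, per the Blackwell notes above.
# The variable must be set before the engine is constructed.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model; use your own
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```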

Notable Changes

  • Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
  • Change top_k to be disabled with 0 (still accept -1 for now) (#17773)
  • The seed is now set to 0 by default for V1 Engine, meaning that different vLLM runs now yield the same outputs even if temperature > 0. This does not modify the random state in user code since workers are run in separate processes unless VLLM_USE_V1_MULTIPROCESSING=0. (#17929, #18741)
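
A small sketch illustrating both changes above, the top_k=0 convention and the new default seed for V1 (the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# V1 now defaults to seed=0, so identical runs produce identical samples even
# with temperature > 0; pass a different seed if you want run-to-run variation.
llm = LLM(model="facebook/opt-125m", seed=0)

# top_k is now disabled with 0 (-1 is still accepted for now).
params = SamplingParams(temperature=0.8, top_k=0, max_tokens=32)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```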

Model Enhancements

  • Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
    • Please install the development version of transformers (from source) to use Falcon-H1.
  • Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
  • Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
  • DeepSeek: perf enhancement by moving more calls into the cuda-graph region (#17484, #17668), Function Call (#17784), MTP in V1 (#18435)
  • Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
  • Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
  • InternVL-Qwen2.5 models now support video inputs (#18499)

Performance, Production and Scaling

  • Support full cuda graph in v1 (#16072)
  • Pipeline Parallelism: MultiprocExecutor support (#14219), torchrun (#17827)
  • Support sequence parallelism combined with pipeline parallelism (#18243)
  • Async tensor parallelism using compilation pass (#17882)
  • Perf: Use small max_num_batched_tokens for A100 (#17885)
  • Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
  • Multi-modality: Automatically cast multi-modal input dtype before transferring device (#18756)

Security

  • Prevent side-channel attacks via cache salting (#17045; example below)
  • Fix image hash collision in certain edge cases (#17378)
  • Add VLLM_ALLOW_INSECURE_SERIALIZATION env var (#17490)
  • Migrate to REGEX Library to prevent catastrophic backtracking (#18454, #18750)
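
A hedged sketch of using cache salting through the OpenAI-compatible server; the cache_salt request field and the model name are assumptions here, so check the prefix-caching/security docs for your version:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",          # placeholder model
    messages=[{"role": "user", "content": "Hello!"}],
    # Per-tenant salt so prefix-cache hits cannot be probed across users
    # (field name is an assumption based on the cache-salting feature).
    extra_body={"cache_salt": "tenant-1234"},
)
print(resp.choices[0].message.content)
```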

Features

  • CLI: deprecated=True (#17426)
  • Frontend: progress bar for adding requests (#17525), chat_template_kwargs in LLM.chat (#17356; example below), /classify endpoint (#17032), truncation control for embedding models (#14776), cached_tokens in response usage (#18149)
  • LoRA: default local directory LoRA resolver plugin. (#16855)
  • Metrics: kv event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
  • Quantization: nvidia/DeepSeek-R1-FP4 (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with AOPerModuleConfig (#17826), CUDA Graph support for GGUF in V1 (#18646)
  • Reasoning: deprecate --enable-reasoning (#17452)
  • Spec Decode: EAGLE share input embedding (#17326), torch.compile & cudagraph to EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
  • Structured Outputs: Thinking compatibility (#16577), Spec Decoding (#14702), Qwen3 reasoning parser (#17466), tool_choice: required for Xgrammar (#17845), Structural Tag with Guidance backend (#17333)
  • Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)
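
As an illustration of chat_template_kwargs in LLM.chat, a hedged sketch; the enable_thinking kwarg is model/template specific (e.g. Qwen3-style templates) and is an assumption here, as is the model name:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # illustrative model whose template reads enable_thinking
messages = [{"role": "user", "content": "Summarize vLLM in one sentence."}]

outputs = llm.chat(
    messages,
    SamplingParams(max_tokens=64),
    chat_template_kwargs={"enable_thinking": False},  # forwarded to the chat template
)
print(outputs[0].outputs[0].text)
```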

Hardware

  • NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
  • TPU: Multi-LoRA implementation (#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
  • Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
  • AMD: Enable FP8 KV cache on V1 (#17870), Tuned fused moe config for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Radeon GPU use Custom Paged Attention (#17004), reduce the number of environment variables in command line (#17229)
  • Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)

Documentation

  • Update quickstart and install for cu128 using --torch-backend=auto (#18505)
  • NVIDIA TensorRT Model Optimizer (#17561)
  • Usage of Qwen3 thinking (#18291)

Developer Facing

What's Changed


v0.8.5.post1

02 May 18:03

This post-release contains two bug fixes: one for a memory leak and one for model accuracy.

  • Fix Memory Leak in _cached_reqs_data (#17567)
  • Fix sliding window attention in V1 giving incorrect results (#17574)

Full Changelog: v0.8.5...v0.8.5.post1

v0.8.5

28 Apr 21:13

This release contains 310 commits from 143 contributors (55 new contributors!).

Highlights

This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling.

Model Support

  • Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).
  • Add ModernBERT (#16648)
  • Add Granite Speech Support (#16246)
  • Add PLaMo2 (#14323)
  • Add Kimi-VL model support (#16387)
  • Add Qwen2.5-Omni model support (thinker only) (#15130)
  • Snowflake Arctic Embed (Family) (#16649)
  • Accuracy fixes for Llama4 Int4 (#16801), chat template for Llama 4 models (#16428), enhanced AMD support (#16674, #16847)

V1 Engine

  • Add structural_tag support using xgrammar (#17085)
  • Disaggregated serving:
    • KV Connector API V1 (#15960)
    • Adding LMCache KV connector for v1 (#16625)
  • Clean up: Remove Sampler from Model Code (#17084)
  • MLA: Simplification to batch P/D reordering (#16673)
  • Move usage stats to worker and start logging TPU hardware (#16211)
  • Support FlashInfer Attention (#16684)
  • Faster incremental detokenization (#15137)
  • EAGLE-3 Support (#16937)

Features

  • Validate urls object for multimodal content parts (#16990)
  • Prototype support sequence parallelism using compilation pass (#16155)
  • Add sampling params to v1/audio/transcriptions endpoint (#16591)
  • Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546)
  • Add vllm bench [latency, throughput] CLI commands (#16508)

Performance

  • Attention:
    • FA3 decode perf improvement - single mma warp group support for head dim 128 (#16864)
    • Update to latest FA3 code (#13111)
    • Support Cutlass MLA for Blackwell GPUs (#16032)
  • MoE:
    • Add expert_map support to Cutlass FP8 MOE (#16861)
    • Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 (#16753)
  • Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
  • Optimize rotary_emb implementation to use Triton operator for improved performance (#16457)

Hardware

  • TPU:
    • Enable structured decoding on TPU V1 (#16499)
    • Capture multimodal encoder during model compilation (#15051)
    • Enable Top-P (#16843)
  • AMD:
    • AITER Fused MOE V1 Support (#16752)
    • Integrate Paged Attention Kernel from AITER (#15001)
    • Support AITER MLA (#15893)
    • Upstream prefix prefill speed up for vLLM V1 (#13305)
    • Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
    • Add skinny gemms for unquantized linear on ROCm (#15830)
    • Follow-ups for Skinny Gemms on ROCm. (#17011)

Documentation

  • Add open-webui example (#16747)
  • Document Matryoshka Representation Learning support (#16770)
  • Add a security guide (#17230)
  • Add example to run DeepSeek with Ray Serve LLM (#17134)
  • Benchmarks for audio models (#16505)

Security and Dependency Updates

  • Don't bind tcp zmq socket to all interfaces (#17197)
  • Use safe serialization and fix zmq setup for mooncake pipe (#17192)
  • Bump Transformers to 4.51.3 (#17116)

Build and testing

  • Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)

Breaking changes 🚨

  • --enable-chunked-prefill, --multi-step-stream-outputs, --disable-chunked-mm-input can no longer explicitly be set to False. Instead, add no- to the start of the argument (i.e. --enable-chunked-prefill and --no-enable-chunked-prefill) (#16533)

What's Changed


v0.8.4

14 Apr 06:14

This release contains 180 commits from 84 contributors (25 new contributors!).

Highlights

This release includes important accuracy fixes for Llama4 models; if you are using them, we highly recommend updating.

Model

  • Llama4 (#16113, #16509) bug fixes and enhancements:
    • qknorm should not be shared across heads (#16311)
    • Enable attention temperature tuning by default for long context (>32k) (#16439)
    • Index Error When Single Request Near Max Context (#16209)
    • Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (#16488)
    • Update to transformers==4.51.1 (#16257)
    • Added chat templates for LLaMa4 pythonic tool calling (#16463)
    • Optimized topk for topk=1 (#16512)
    • Add warning for Attention backends that do not support irope yet (#16212)
  • Support Qwen3 and Qwen3MoE (#15289), smolvlm (#16017), jinaai/jina-embeddings-v3 (#16120), InternVL3 (#16495), GLM-4-0414 (#16338)

API

  • Estimate max-model-len using available KV cache memory. The error message now hints at how to set --max-model-len (#16168)
  • Add hf_token to EngineArgs (#16093)
  • Enable regex support with xgrammar in V0 engine (#13228)
  • Support matryoshka representation / support embedding API dimensions (#16331; example below)
  • Add bucket for request_latency, time_to_first_token and time_per_output_token (#15202)
  • Support for TorchAO quantization (#14231)
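
The dimensions support for the embedding API (Matryoshka representation, noted above) can be exercised through any OpenAI-compatible client; the sketch below assumes a vLLM server is already running with jinaai/jina-embeddings-v3, and the dimension value is arbitrary:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",       # Matryoshka-capable model from this release
    input=["vLLM makes LLM serving fast."],
    dimensions=256,                          # truncate embeddings to 256 dimensions
)
print(len(resp.data[0].embedding))
```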

Hardware

  • Intel-Gaudi: Multi-step scheduling implementation for HPU (#12779)
  • TPU:
    • Make @support_torch_compile work for XLA backend (#15782)
    • Use language_model interface for getting text backbone in MM (#16410)

Performance

  • DeepSeek MLA: a new merge_attn_states CUDA kernel, 3x speedup (#16173)
  • MoE: Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366)
  • Add support to modelopt quantization of Mixtral model (#15961)
  • Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (#16537)

V1 Engine Core

  • Enable multi-input by default (#15799)
  • Scatter and gather placeholders in the model runner (#16076)
  • Set structured output backend to auto by default (#15724)
  • Zero-copy tensor/ndarray serialization/transmission (#13790)
  • Eagle Model loading (#16035)
  • KV cache slots for eagle heads (#16370)
  • Add supports_structured_output() method to Platform (#16148)

Developer Facing

What's Changed


v0.8.3

06 Apr 04:11

Highlights

This release features 260 commits from 109 contributors (38 new contributors!).

  • We are excited to announce Day 0 Support for Llama 4 Scout and Maverick (#16104). Please see our blog for a detailed user guide.
    • Please note that Llama4 is only supported in the V1 engine for now.
  • V1 engine now supports native sliding window attention (#14097) with the hybrid memory allocator.

Cluster Scale Serving

  • Single node data parallel with API server support (#13923)
  • Multi-node offline DP+EP example (#15484)
  • Expert parallelism enhancements
    • CUTLASS grouped gemm fp8 MoE kernel (#13972)
    • Fused experts refactor (#15914)
    • Fp8 Channelwise Dynamic Per Token GroupedGEMM (#15587)
    • Adding support for fp8 gemm layer input in fp8 (#14578)
    • Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. (#13932)
  • Support XpYd disaggregated prefill with MooncakeStore (#12957)

Model Support

V1 Engine

  • Collective RPC (#15444)
  • Faster top-k only implementation (#15478)
  • BitsAndBytes support (#15611)
  • Speculative Decoding: metrics (#15151), Eagle Proposer (#15729), n-gram interface update (#15750), EAGLE Architecture with Proper RMS Norms (#14990)

Features

API

  • Support Enum for xgrammar based structured output in V1. (#15594, #15757)
  • A new tags parameter for wake_up (#15500; example below)
  • V1 LoRA support CPU offload (#15843)
  • Prefix caching support: FIPS enabled machines with MD5 hashing (#15299), SHA256 as alternative hashing algorithm (#15297)
  • Addition of http service metrics (#15657)
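
A hedged sketch of the new tags parameter for wake_up; the tag names "weights" and "kv_cache" and the enable_sleep_mode flag follow the sleep-mode feature and should be treated as assumptions for your version:

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)  # placeholder model
llm.sleep(level=1)               # offload weights and discard the KV cache
llm.wake_up(tags=["weights"])    # bring back only the weights first
llm.wake_up(tags=["kv_cache"])   # then re-allocate the KV cache before serving again
```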

Performance

  • LoRA Scheduler optimization bridging V1 and V0 performance (#15422).

Hardware

  • AMD:
    • Add custom allreduce support for ROCM (#14125)
    • Quark quantization documentation (#15861)
    • AITER integration: int8 scaled gemm kernel (#15433), fused moe (#14967)
    • Paged attention for V1 (#15720)
  • CPU:
  • TPU
    • Improve Memory Usage Estimation (#15671)
    • Optimize the all-reduce performance (#15903)
    • Support sliding window and logit soft capping in the paged attention kernel. (#15732)
    • TPU-optimized top-p implementation (avoids scattering). (#15736)

Doc, Build, Ecosystem

  • V1 user guide update: fp8 kv cache support (#15585), multi-modality (#15460)
  • Recommend developing with Python 3.12 in developer guide (#15811)
  • Clean up: move dockerfiles into their own directory (#14549)
  • Add minimum version for huggingface_hub to enable Xet downloads (#15873)
  • TPU CI: Add basic perf regression test (#15414)

What's Changed


v0.8.3rc1

05 Apr 19:46
Pre-release

What's Changed


v0.8.2

23 Mar 21:05

This release contains an important bug fix for the V1 engine's memory usage. We highly recommend upgrading!

Highlights

  • Revert "Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377)
  • Remove openvino support in favor of external plugin (#15339)

V1 Engine

  • Fix V1 Engine crash while handling requests with duplicate request id (#15043)
  • Support FP8 KV Cache (#14570, #15191)
  • Add flag to disable cascade attention (#15243)
  • Scheduler Refactoring: Add Scheduler Interface (#15250)
  • Structured Output
    • Add disable-any-whitespace option support for xgrammar (#15316)
    • guidance backend for structured output + auto fallback mode (#14779)
  • Spec Decode
    • Enable spec decode for top-p & top-k sampling (#15063)
    • Use better defaults for N-gram (#15358)
    • Update target_logits in place for rejection sampling (#15427)
  • AMD
    • Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
  • TPU
    • Support V1 Sampler for ragged attention (#14227)
    • Tensor parallel MP support (#15059)
    • MHA Pallas backend (#15288)

Features

  • Integrate fastsafetensors loader for loading model weights (#10647)
  • Add guidance backend for structured output (#14589)

Others

  • Add Kubernetes deployment guide with CPUs (#14865)
  • Support reset prefix cache by specified device (#15003)
  • Support tool calling and reasoning parser (#14511)
  • Support --disable-uvicorn-access-log parameters (#14754)
  • Support Tele-FLM Model (#15023)
  • Add pipeline parallel support to TransformersModel (#12832)
  • Enable CUDA graph support for llama 3.2 vision (#14917)

What's Changed


v0.8.1

19 Mar 17:40

This release contains important bug fixes for v0.8.0. We highly recommend upgrading!

  • V1 Fixes

    • Ensure using int64 for sampled token ids (#15065)
    • Fix long dtype in topk sampling (#15049)
    • Refactor Structured Output for multiple backends (#14694)
    • Fix size calculation of processing cache (#15114)
    • Optimize Rejection Sampler with Triton Kernels (#14930)
    • Fix oracle for device checking (#15104)
  • TPU

    • Fix chunked prefill with padding (#15037)
    • Enhanced CI/CD (#15054, #14974)
  • Model

    • Re-enable Gemma3 for V1 (#14980)
    • Embedding model support LoRA (#14935)
    • Pixtral: Remove layer instantiation duplication (#15053)

What's Changed

New Contributors

Full Changelog: v0.8.0...v0.8.1

v0.8.0

18 Mar 17:52

v0.8.0 featured 523 commits from 166 total contributors (68 new contributors)!

Highlights

V1

We have now enabled the V1 engine by default (#13726) for supported use cases. Please refer to the V1 user guide for more details. We expect better performance for supported scenarios. If you'd like to disable V1 mode, please set the environment variable VLLM_USE_V1=0, and send us a GitHub issue sharing the reason!
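
For example, to opt back into the V0 engine (the model name is a placeholder):

```python
import os

# Disable the V1 engine, as described above; must be set before creating the engine.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```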

DeepSeek Improvements

We observe state-of-the-art performance running DeepSeek models on the latest version of vLLM:

  • MLA Enhancements:
  • Distributed Expert Parallelism (EP) and Data Parallelism (DP)
    • EP Support for DeepSeek Models (#12583)
    • Add enable_expert_parallel arg (#14305)
    • EP/TP MoE + DP Attention (#13931)
    • Set up data parallel communication (#13591)
  • MTP: Expand DeepSeek MTP code to support k > n_predict (#13626)
  • Pipeline Parallelism:
    • DeepSeek V2/V3/R1 only place lm_head on last pp rank (#13833)
    • Improve pipeline partitioning (#13839)
  • GEMM
    • Add streamK for block-quantized CUTLASS kernels (#12978)
    • Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917)
    • Add more tuned configs for H20 and others (#14877)

New Models

  • Gemma 3 (#14660)
    • Note: You have to install transformers from the main branch (pip install git+https://github.com/huggingface/transformers.git) to use this model. Also, there may be numerical instabilities with the float16/half dtype, so please use bfloat16 (preferred by HF) or float32 (see the sketch after this list).
  • Mistral Small 3.1 (#14957)
  • Phi-4-multimodal-instruct (#14119)
  • Grok1 (#13795)
  • QwQ-32B and tool calling (#14479, #14478)
  • Zamba2 (#13185)
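
Following the Gemma 3 dtype note above, a short sketch (the model id is illustrative):

```python
from vllm import LLM

# Gemma 3 can be numerically unstable in float16; prefer bfloat16 (or float32).
llm = LLM(model="google/gemma-3-4b-it", dtype="bfloat16")
```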

NVIDIA Blackwell

  • Support nvfp4 cutlass gemm (#13571)
  • Add cutlass support for blackwell fp8 gemm (#13798)
  • Update the flash attn tag to support Blackwell (#14244)
  • Add ModelOpt FP4 Checkpoint Support (#12520)

Breaking Changes

  • The default value of seed is now None to align with PyTorch and Hugging Face. Please explicitly set the seed for reproducibility (see the example after this list). (#14274)
  • The kv_cache and attn_metadata arguments of the model's forward method have been removed, as the attention backend has access to these values via forward_context. (#13887)
  • vLLM will now default to the model's generation_config for the chat template and sampling parameters such as temperature. (#12622)
  • Several request time metrics (vllm:time_in_queue_requests, vllm:model_forward_time_milliseconds, vllm:model_execute_time_milliseconds) have been deprecated and are subject to removal. (#14135)
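
A minimal sketch of pinning the seed explicitly after this change (the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# seed now defaults to None; set it explicitly if you need reproducible sampling.
llm = LLM(model="facebook/opt-125m", seed=42)
params = SamplingParams(temperature=0.8, seed=42, max_tokens=16)
print(llm.generate(["Once upon a time"], params)[0].outputs[0].text)
```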

Updates

  • Update to PyTorch 2.6.0 (#12721, #13860)
  • Update to Python 3.9 typing (#14492, #13971)
  • Update to CUDA 12.4 as default for release and nightly wheels (#12098)
  • Update to Ray 2.43 (#13994)
  • Upgrade aiohttp to include CVE fix (#14840)
  • Upgrade jinja2 to get 3 moderate CVE fixes (#14839)

Features

Frontend API

  • API Server
    • Support return_tokens_as_token_id as a request param (#14066)
    • Support Image Embedding as input (#13955)
    • New /load endpoint for load statistics (#13950)
    • New API endpoint /is_sleeping (#14312)
    • Enables /score endpoint for embedding models (#12846)
    • Enable streaming for Transcription API (#13301)
    • Make model param optional in request (#13568)
    • Support SSL Key Rotation in HTTP Server (#13495)
  • Reasoning
    • Support reasoning output (#12955)
    • Support outlines engine with reasoning outputs (#14114)
    • Update reasoning with stream example to use OpenAI library (#14077)
  • CLI
    • Ensure out-of-tree quantization methods are recognized by CLI args (#14328)
    • Add vllm bench CLI (#13993)
  • Make LLM API compatible for torchrun launcher (#13642)

Disaggregated Serving

  • Support KV cache offloading and disagg prefill with LMCache connector (#12953)
  • Support chunked prefill for LMCache connector (#14505)

LoRA

  • Add LoRA support for TransformersModel (#13770)
  • Make the device profiler include LoRA memory. (#14469)
  • Gemma3ForConditionalGeneration supports LoRA (#14797)
  • Retire SGMV and BGMV Kernels (#14685)

VLM

  • Generalized prompt updates for multi-modal processor (#13964)
  • Deprecate legacy input mapper for OOT multimodal models (#13979)
  • Refer code examples for common cases in dev multimodal processor (#14278)

Quantization

  • BaiChuan SupportsQuant (#13710)
  • BartModel SupportsQuant (#14699)
  • Bamba SupportsQuant (#14698)
  • Deepseek GGUF support (#13167)
  • GGUF MoE kernel (#14613)
  • Add GPTQAllSpark Quantization (#12931)
  • Better performance of gptq marlin kernel when n is small (#14138)

Structured Output

  • xgrammar: Expand list of unsupported jsonschema keywords (#13783)

Hardware Support

AMD

  • Faster Custom Paged Attention kernels (#12348)
  • Improved performance for V1 Triton (ROCm) backend (#14152)
  • Chunked prefill/paged attention in MLA on ROCm (#14316)
  • Perf improvement for DSv3 on AMD GPUs (#13718)
  • MoE fp8 block quant tuning support (#14068)

TPU

  • Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
  • Support start_profile/stop_profile in TPU worker (#13988)
  • Add TPU v1 test (#14834)
  • TPU multimodal model support for ragged attention (#14158)
  • Add tensor parallel support via Ray (#13618)
  • Enable prefix caching by default (#14773)

Neuron

  • Add Neuron device communicator for vLLM v1 (#14085)
  • Add custom_ops for neuron backend (#13246)
  • Add reshape_and_cache (#14391)
  • Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245)

CPU

  • Upgrade CPU backend to torch-2.6 (#13381)
  • Support FP8 KV cache in CPU Backend (#14741)

s390x

  • Adding cpu inference with VXE ISA for s390x architecture (#12613)
  • Add documentation for s390x cpu implementation (#14198)

Plugins

  • Remove cuda hard code in models and layers (#13658)
  • Move use allgather to platform (#14010)

Bugfix and Enhancements

  • Illegal memory access for MoE On H20 (#13693)
  • Fix FP16 overflow for DeepSeek V2 (#13232)
  • Illegal Memory Access in the blockwise cutlass fp8 GEMMs (#14396)
  • Pass all driver env vars to ray workers unless excluded (#14099)
  • Use xgrammar shared context to avoid copy overhead for offline engine (#13837)
  • Capture and log the time of loading weights (#13666)

Developer Tooling

Benchmarks

  • Consolidate performance benchmark datasets (#14036)
  • Update benchmarks README (#14646)

CI and Build

  • Add RELEASE.md (#13926)
  • Use env var to control whether to use S3 bucket in CI (#13634)

Documentation

  • Add RLHF document (#14482)
  • Add nsight guide to profiling docs (#14298)
  • Add K8s deployment guide (#14084)
  • Add developer documentation for torch.compile integration (#14437)

What's Changed

  • Update pre-commit's isort version to remove warnings by @hmellor in #13614
  • [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
  • fix neuron performance issue by @ajayvohra2005 in #13589
  • [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
  • [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
  • [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
  • Add llmaz as another integration by @kerthcet in #13643
  • [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
  • [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
  • Use pre-commit to update requirements-test.txt by @hmellor in #13617
  • [Bugfix] Add mm_processor_kwargs to chat-related protocols by @ywang96 in #13644
  • [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
  • Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
  • [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
  • [ci] Fix metrics test model path by @khluu in #13635
  • [Kernel]Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
  • [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
  • fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in https://...

v0.8.0rc2

17 Mar 17:08
Pre-release

What's Changed

New Contributors

Full Changelog: v0.8.0rc1...v0.8.0rc2