
Releases: vllm-project/vllm

v0.9.0

15 May 03:38

Highlights

This release features 649 commits from 215 contributors (82 new contributors!)

  • vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependencies.
    • The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute CUDA 12.6 wheels as GitHub artifacts.
    • As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
  • Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized kernels for NVIDIA Blackwell, covering both attention and MLP.
    • You can use our docker image, or install the FlashInfer nightly wheel (pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl) and then set VLLM_ATTENTION_BACKEND=FLASHINFER for better performance (see the example after this list).
    • Upgraded support for the new FlashInfer main branch. (#15777)
    • Please check out #18153 for the full roadmap.
  • Initial DP, EP, PD support for large scale inference
    • EP:
      • Permute and unpermute kernel for moe optimization (#14568)
      • Modularize fused experts and integrate PPLX kernels (#15956)
      • Refactor pplx init logic to make it modular (prepare for deepep) (#18200)
      • Add ep group and all2all interface (#18077)
    • DP:
      • Decouple engine process management and comms (#15977)
    • PD:
      • NIXL Integration (#17751)
      • Local attention optimization for NIXL (#18170)
      • Support multiple kv connectors (#17564)
  • Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)
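
A minimal sketch of selecting the FlashInfer backend from Python, assuming the FlashInfer wheel above has already been installed (the model name is a placeholder, not part of the release notes):

```python
import os

# Select the FlashInfer attention backend, per the Blackwell notes above.
# The variable must be set before the engine is constructed.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model; use your own
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```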

Notable Changes

  • Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
  • Change top_k to be disabled with 0 (still accept -1 for now) (#17773)
  • The seed is now set to 0 by default for V1 Engine, meaning that different vLLM runs now yield the same outputs even if temperature > 0. This does not modify the random state in user code since workers are run in separate processes unless VLLM_USE_V1_MULTIPROCESSING=0. (#17929, #18741)
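
A small sketch illustrating both changes above, the top_k=0 convention and the new default seed for V1 (the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# V1 now defaults to seed=0, so identical runs produce identical samples even
# with temperature > 0; pass a different seed if you want run-to-run variation.
llm = LLM(model="facebook/opt-125m", seed=0)

# top_k is now disabled with 0 (-1 is still accepted for now).
params = SamplingParams(temperature=0.8, top_k=0, max_tokens=32)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```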

Model Enhancements

  • Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
    • Please install the development version of transformers (from source) to use Falcon-H1.
  • Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
  • Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
  • DeepSeek: perf enhancement by moving more calls into the cuda-graph region (#17484, #17668), Function Call (#17784), MTP in V1 (#18435)
  • Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
  • Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
  • InternVL-Qwen2.5 models now support video inputs (#18499)

Performance, Production and Scaling

  • Support full cuda graph in v1 (#16072)
  • Pipeline Parallelism: MultiprocExecutor support (#14219), torchrun (#17827)
  • Support sequence parallelism combined with pipeline parallelism (#18243)
  • Async tensor parallelism using compilation pass (#17882)
  • Perf: Use small max_num_batched_tokens for A100 (#17885)
  • Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
  • Multi-modality: Automatically cast multi-modal input dtype before transferring device (#18756)

Security

  • Prevent side-channel attacks via cache salting (#17045; example below)
  • Fix image hash collision in certain edge cases (#17378)
  • Add VLLM_ALLOW_INSECURE_SERIALIZATION env var (#17490)
  • Migrate to REGEX Library to prevent catastrophic backtracking (#18454, #18750)
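
A hedged sketch of using cache salting through the OpenAI-compatible server; the cache_salt request field and the model name are assumptions here, so check the prefix-caching/security docs for your version:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",          # placeholder model
    messages=[{"role": "user", "content": "Hello!"}],
    # Per-tenant salt so prefix-cache hits cannot be probed across users
    # (field name is an assumption based on the cache-salting feature).
    extra_body={"cache_salt": "tenant-1234"},
)
print(resp.choices[0].message.content)
```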

Features

  • CLI: deprecated=True (#17426)
  • Frontend: progress bar for adding requests (#17525), chat_template_kwargs in LLM.chat (#17356; example below), /classify endpoint (#17032), truncation control for embedding models (#14776), cached_tokens in response usage (#18149)
  • LoRA: default local directory LoRA resolver plugin. (#16855)
  • Metrics: kv event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
  • Quantization: nvidia/DeepSeek-R1-FP4 (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with AOPerModuleConfig (#17826), CUDA Graph support for GGUF in V1 (#18646)
  • Reasoning: deprecate --enable-reasoning (#17452)
  • Spec Decode: EAGLE share input embedding (#17326), torch.compile & cudagraph to EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
  • Structured Outputs: Thinking compatibility (#16577), Spec Decoding (#14702), Qwen3 reasoning parser (#17466), tool_choice: required for Xgrammar (#17845), Structural Tag with Guidance backend (#17333)
  • Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)
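
As an illustration of chat_template_kwargs in LLM.chat, a hedged sketch; the enable_thinking kwarg is model/template specific (e.g. Qwen3-style templates) and is an assumption here, as is the model name:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # illustrative model whose template reads enable_thinking
messages = [{"role": "user", "content": "Summarize vLLM in one sentence."}]

outputs = llm.chat(
    messages,
    SamplingParams(max_tokens=64),
    chat_template_kwargs={"enable_thinking": False},  # forwarded to the chat template
)
print(outputs[0].outputs[0].text)
```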

Hardware

  • NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
  • TPU: Multi-LoRA implementation (#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
  • Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
  • AMD: Enable FP8 KV cache on V1 (#17870), Tuned fused moe config for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Radeon GPU use Custom Paged Attention (#17004), reduce the number of environment variables in command line (#17229)
  • Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)

Documentation

  • Update quickstart and install for cu128 using --torch-backend=auto (#18505)
  • NVIDIA TensorRT Model Optimizer (#17561)
  • Usage of Qwen3 thinking (#18291)

Developer Facing

What's Changed


v0.8.5.post1

02 May 18:03

This post-release contains two bug fixes: one for a memory leak and one for model accuracy.

  • Fix Memory Leak in _cached_reqs_data (#17567)
  • Fix sliding window attention in V1 giving incorrect results (#17574)

Full Changelog: v0.8.5...v0.8.5.post1

v0.8.5

28 Apr 21:13

This release contains 310 commits from 143 contributors (55 new contributors!).

Highlights

This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling.

Model Support

  • Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).
  • Add ModernBERT (#16648)
  • Add Granite Speech Support (#16246)
  • Add PLaMo2 (#14323)
  • Add Kimi-VL model support (#16387)
  • Add Qwen2.5-Omni model support (thinker only) (#15130)
  • Snowflake Arctic Embed (Family) (#16649)
  • Accuracy fixes for Llama4 Int4 (#16801), chat template for Llama 4 models (#16428), enhanced AMD support (#16674, #16847)

V1 Engine

  • Add structural_tag support using xgrammar (#17085)
  • Disaggregated serving:
    • KV Connector API V1 (#15960)
    • Adding LMCache KV connector for v1 (#16625)
  • Clean up: Remove Sampler from Model Code (#17084)
  • MLA: Simplification to batch P/D reordering (#16673)
  • Move usage stats to worker and start logging TPU hardware (#16211)
  • Support FlashInfer Attention (#16684)
  • Faster incremental detokenization (#15137)
  • EAGLE-3 Support (#16937)

Features

  • Validate urls object for multimodal content parts (#16990)
  • Prototype support sequence parallelism using compilation pass (#16155)
  • Add sampling params to v1/audio/transcriptions endpoint (#16591)
  • Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546)
  • Add vllm bench [latency, throughput] CLI commands (#16508)

Performance

  • Attention:
    • FA3 decode perf improvement - single mma warp group support for head dim 128 (#16864)
    • Update to latest FA3 code (#13111)
    • Support Cutlass MLA for Blackwell GPUs (#16032)
  • MoE:
    • Add expert_map support to Cutlass FP8 MOE (#16861)
    • Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 (#16753)
  • Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
  • Optimize rotary_emb implementation to use Triton operator for improved performance (#16457)

Hardware

  • TPU:
    • Enable structured decoding on TPU V1 (#16499)
    • Capture multimodal encoder during model compilation (#15051)
    • Enable Top-P (#16843)
  • AMD:
    • AITER Fused MOE V1 Support (#16752)
    • Integrate Paged Attention Kernel from AITER (#15001)
    • Support AITER MLA (#15893)
    • Upstream prefix prefill speed up for vLLM V1 (#13305)
    • Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
    • Add skinny gemms for unquantized linear on ROCm (#15830)
    • Follow-ups for Skinny Gemms on ROCm. (#17011)

Documentation

  • Add open-webui example (#16747)
  • Document Matryoshka Representation Learning support (#16770)
  • Add a security guide (#17230)
  • Add example to run DeepSeek with Ray Serve LLM (#17134)
  • Benchmarks for audio models (#16505)

Security and Dependency Updates

  • Don't bind tcp zmq socket to all interfaces (#17197)
  • Use safe serialization and fix zmq setup for mooncake pipe (#17192)
  • Bump Transformers to 4.51.3 (#17116)

Build and testing

  • Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)

Breaking changes 🚨

  • --enable-chunked-prefill, --multi-step-stream-outputs, --disable-chunked-mm-input can no longer explicitly be set to False. Instead, add no- to the start of the argument (i.e. --enable-chunked-prefill and --no-enable-chunked-prefill) (#16533)

What's Changed


v0.8.4

14 Apr 06:14

This release contains 180 commits from 84 contributors (25 new contributors!).

Highlights

This release includes important accuracy fixes for Llama4 models; if you are using them, we highly recommend updating.

Model

  • Llama4 (#16113, #16509) bug fixes and enhancements:
    • qknorm should not be shared across heads (#16311)
    • Enable attention temperature tuning by default for long context (>32k) (#16439)
    • Index Error When Single Request Near Max Context (#16209)
    • Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (#16488)
    • Update to transformers==4.51.1 (#16257)
    • Added chat templates for LLaMa4 pythonic tool calling (#16463)
    • Optimized topk for topk=1 (#16512)
    • Add warning for Attention backends that do not support irope yet (#16212)
  • Support Qwen3 and Qwen3MoE (#15289), smolvlm (#16017), jinaai/jina-embeddings-v3 (#16120), InternVL3 (#16495), GLM-4-0414 (#16338)

API

  • Estimate max-model-len using available KV cache memory. The error message now hints at how to set --max-model-len (#16168)
  • Add hf_token to EngineArgs (#16093)
  • Enable regex support with xgrammar in V0 engine (#13228)
  • Support matryoshka representation / support embedding API dimensions (#16331; example below)
  • Add bucket for request_latency, time_to_first_token and time_per_output_token (#15202)
  • Support for TorchAO quantization (#14231)
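
The dimensions support for the embedding API (Matryoshka representation, noted above) can be exercised through any OpenAI-compatible client; the sketch below assumes a vLLM server is already running with jinaai/jina-embeddings-v3, and the dimension value is arbitrary:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",       # Matryoshka-capable model from this release
    input=["vLLM makes LLM serving fast."],
    dimensions=256,                          # truncate embeddings to 256 dimensions
)
print(len(resp.data[0].embedding))
```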

Hardware

  • Intel-Gaudi: Multi-step scheduling implementation for HPU (#12779)
  • TPU:
    • Make @support_torch_compile work for XLA backend (#15782)
    • Use language_model interface for getting text backbone in MM (#16410)

Performance

  • DeepSeek MLA: a new merge_attn_states CUDA kernel, 3x speedup (#16173)
  • MoE: Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366)
  • Add support to modelopt quantization of Mixtral model (#15961)
  • Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (#16537)

V1 Engine Core

  • Enable multi-input by default (#15799)
  • Scatter and gather placeholders in the model runner (#16076)
  • Set structured output backend to auto by default (#15724)
  • Zero-copy tensor/ndarray serialization/transmission (#13790)
  • Eagle Model loading (#16035)
  • KV cache slots for eagle heads (#16370)
  • Add supports_structured_output() method to Platform (#16148)

Developer Facing

What's Changed


v0.8.3

06 Apr 04:11

Highlights

This release features 260 commits from 109 contributors (38 new contributors!).

  • We are excited to announce Day 0 Support for Llama 4 Scout and Maverick (#16104). Please see our blog for a detailed user guide.
    • Please note that Llama4 is only supported in the V1 engine for now.
  • V1 engine now supports native sliding window attention (#14097) with the hybrid memory allocator.

Cluster Scale Serving

  • Single node data parallel with API server support (#13923)
  • Multi-node offline DP+EP example (#15484)
  • Expert parallelism enhancements
    • CUTLASS grouped gemm fp8 MoE kernel (#13972)
    • Fused experts refactor (#15914)
    • Fp8 Channelwise Dynamic Per Token GroupedGEMM (#15587)
    • Adding support for fp8 gemm layer input in fp8 (#14578)
    • Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. (#13932)
  • Support XpYd disaggregated prefill with MooncakeStore (#12957)

Model Support

V1 Engine

  • Collective RPC (#15444)
  • Faster top-k only implementation (#15478)
  • BitsAndBytes support (#15611)
  • Speculative Decoding: metrics (#15151), Eagle Proposer (#15729), n-gram interface update (#15750), EAGLE Architecture with Proper RMS Norms (#14990)

Features

API

  • Support Enum for xgrammar based structured output in V1. (#15594, #15757)
  • A new tags parameter for wake_up (#15500; example below)
  • V1 LoRA support CPU offload (#15843)
  • Prefix caching support: FIPS enabled machines with MD5 hashing (#15299), SHA256 as alternative hashing algorithm (#15297)
  • Addition of http service metrics (#15657)
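
A hedged sketch of the new tags parameter for wake_up; the tag names "weights" and "kv_cache" and the enable_sleep_mode flag follow the sleep-mode feature and should be treated as assumptions for your version:

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)  # placeholder model
llm.sleep(level=1)               # offload weights and discard the KV cache
llm.wake_up(tags=["weights"])    # bring back only the weights first
llm.wake_up(tags=["kv_cache"])   # then re-allocate the KV cache before serving again
```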

Performance

  • LoRA Scheduler optimization bridging V1 and V0 performance (#15422).

Hardware

  • AMD:
    • Add custom allreduce support for ROCM (#14125)
    • Quark quantization documentation (#15861)
    • AITER integration: int8 scaled gemm kernel (#15433), fused moe (#14967)
    • Paged attention for V1 (#15720)
  • CPU:
  • TPU
    • Improve Memory Usage Estimation (#15671)
    • Optimize the all-reduce performance (#15903)
    • Support sliding window and logit soft capping in the paged attention kernel. (#15732)
    • TPU-optimized top-p implementation (avoids scattering). (#15736)

Doc, Build, Ecosystem

  • V1 user guide update: fp8 kv cache support (#15585), multi-modality (#15460)
  • Recommend developing with Python 3.12 in developer guide (#15811)
  • Clean up: move dockerfiles into their own directory (#14549)
  • Add minimum version for huggingface_hub to enable Xet downloads (#15873)
  • TPU CI: Add basic perf regression test (#15414)

What's Changed


v0.8.3rc1

05 Apr 19:46
Pre-release

What's Changed


v0.8.2

23 Mar 21:05

This release contains an important bug fix for the V1 engine's memory usage. We highly recommend upgrading!

Highlights

  • Revert "Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377)
  • Remove openvino support in favor of external plugin (#15339)

V1 Engine

  • Fix V1 Engine crash while handling requests with duplicate request id (#15043)
  • Support FP8 KV Cache (#14570, #15191)
  • Add flag to disable cascade attention (#15243)
  • Scheduler Refactoring: Add Scheduler Interface (#15250)
  • Structured Output
    • Add disable-any-whitespace option support for xgrammar (#15316)
    • guidance backend for structured output + auto fallback mode (#14779)
  • Spec Decode
    • Enable spec decode for top-p & top-k sampling (#15063)
    • Use better defaults for N-gram (#15358)
    • Update target_logits in place for rejection sampling (#15427)
  • AMD
    • Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
  • TPU
    • Support V1 Sampler for ragged attention (#14227)
    • Tensor parallel MP support (#15059)
    • MHA Pallas backend (#15288)

Features

  • Integrate fastsafetensors loader for loading model weights (#10647)
  • Add guidance backend for structured output (#14589)

Others

  • Add Kubernetes deployment guide with CPUs (#14865)
  • Support reset prefix cache by specified device (#15003)
  • Support tool calling and reasoning parser (#14511)
  • Support --disable-uvicorn-access-log parameters (#14754)
  • Support Tele-FLM Model (#15023)
  • Add pipeline parallel support to TransformersModel (#12832)
  • Enable CUDA graph support for llama 3.2 vision (#14917)

What's Changed


v0.8.1

19 Mar 17:40

This release contains important bug fixes for v0.8.0. We highly recommend upgrading!

  • V1 Fixes

    • Ensure using int64 for sampled token ids (#15065)
    • Fix long dtype in topk sampling (#15049)
    • Refactor Structured Output for multiple backends (#14694)
    • Fix size calculation of processing cache (#15114)
    • Optimize Rejection Sampler with Triton Kernels (#14930)
    • Fix oracle for device checking (#15104)
  • TPU

    • Fix chunked prefill with padding (#15037)
    • Enhanced CI/CD (#15054, #14974)
  • Model

    • Re-enable Gemma3 for V1 (#14980)
    • Embedding model support LoRA (#14935)
    • Pixtral: Remove layer instantiation duplication (#15053)

What's Changed

New Contributors

Full Changelog: v0.8.0...v0.8.1

v0.8.0

18 Mar 17:52

v0.8.0 featured 523 commits from 166 total contributors (68 new contributors)!

Highlights

V1

We have now enabled the V1 engine by default (#13726) for supported use cases. Please refer to the V1 user guide for more details. We expect better performance for supported scenarios. If you'd like to disable V1 mode, please set the environment variable VLLM_USE_V1=0, and send us a GitHub issue sharing the reason!
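
For example, to opt back into the V0 engine (the model name is a placeholder):

```python
import os

# Disable the V1 engine, as described above; must be set before creating the engine.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```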

DeepSeek Improvements

We observe state-of-the-art performance running DeepSeek models on the latest version of vLLM:

  • MLA Enhancements:
  • Distributed Expert Parallelism (EP) and Data Parallelism (DP)
    • EP Support for DeepSeek Models (#12583)
    • Add enable_expert_parallel arg (#14305)
    • EP/TP MoE + DP Attention (#13931)
    • Set up data parallel communication (#13591)
  • MTP: Expand DeepSeek MTP code to support k > n_predict (#13626)
  • Pipeline Parallelism:
    • DeepSeek V2/V3/R1 only place lm_head on last pp rank (#13833)
    • Improve pipeline partitioning (#13839)
  • GEMM
    • Add streamK for block-quantized CUTLASS kernels (#12978)
    • Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917)
    • Add more tuned configs for H20 and others (#14877)

New Models

  • Gemma 3 (#14660)
    • Note: You have to install transformers from the main branch (pip install git+https://github.com/huggingface/transformers.git) to use this model. Also, there may be numerical instabilities with the float16/half dtype, so please use bfloat16 (preferred by HF) or float32 (see the sketch after this list).
  • Mistral Small 3.1 (#14957)
  • Phi-4-multimodal-instruct (#14119)
  • Grok1 (#13795)
  • QwQ-32B and tool calling (#14479, #14478)
  • Zamba2 (#13185)
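
Following the Gemma 3 dtype note above, a short sketch (the model id is illustrative):

```python
from vllm import LLM

# Gemma 3 can be numerically unstable in float16; prefer bfloat16 (or float32).
llm = LLM(model="google/gemma-3-4b-it", dtype="bfloat16")
```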

NVIDIA Blackwell

  • Support nvfp4 cutlass gemm (#13571)
  • Add cutlass support for blackwell fp8 gemm (#13798)
  • Update the flash attn tag to support Blackwell (#14244)
  • Add ModelOpt FP4 Checkpoint Support (#12520)

Breaking Changes

  • The default value of seed is now None to align with PyTorch and Hugging Face. Please explicitly set the seed for reproducibility (see the example after this list). (#14274)
  • The kv_cache and attn_metadata arguments of the model's forward method have been removed, as the attention backend has access to these values via forward_context. (#13887)
  • vLLM will now default to the model's generation_config for the chat template and sampling parameters such as temperature. (#12622)
  • Several request time metrics (vllm:time_in_queue_requests, vllm:model_forward_time_milliseconds, vllm:model_execute_time_milliseconds) have been deprecated and are subject to removal. (#14135)
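
A minimal sketch of pinning the seed explicitly after this change (the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# seed now defaults to None; set it explicitly if you need reproducible sampling.
llm = LLM(model="facebook/opt-125m", seed=42)
params = SamplingParams(temperature=0.8, seed=42, max_tokens=16)
print(llm.generate(["Once upon a time"], params)[0].outputs[0].text)
```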

Updates

  • Update to PyTorch 2.6.0 (#12721, #13860)
  • Update to Python 3.9 typing (#14492, #13971)
  • Update to CUDA 12.4 as default for release and nightly wheels (#12098)
  • Update to Ray 2.43 (#13994)
  • Upgrade aiohttp to include CVE fix (#14840)
  • Upgrade jinja2 to get 3 moderate CVE fixes (#14839)

Features

Frontend API

  • API Server
    • Support return_tokens_as_token_id as a request param (#14066)
    • Support Image Embedding as input (#13955)
    • New /load endpoint for load statistics (#13950)
    • New API endpoint /is_sleeping (#14312)
    • Enables /score endpoint for embedding models (#12846)
    • Enable streaming for Transcription API (#13301)
    • Make model param optional in request (#13568)
    • Support SSL Key Rotation in HTTP Server (#13495)
  • Reasoning
    • Support reasoning output (#12955)
    • Support outlines engine with reasoning outputs (#14114)
    • Update reasoning with stream example to use OpenAI library (#14077)
  • CLI
    • Ensure out-of-tree quantization methods are recognized by CLI args (#14328)
    • Add vllm bench CLI (#13993)
  • Make LLM API compatible for torchrun launcher (#13642)

Disaggregated Serving

  • Support KV cache offloading and disagg prefill with LMCache connector (#12953)
  • Support chunked prefill for LMCache connector (#14505)

LoRA

  • Add LoRA support for TransformersModel (#13770)
  • Make the device profiler include LoRA memory. (#14469)
  • Gemma3ForConditionalGeneration supports LoRA (#14797)
  • Retire SGMV and BGMV Kernels (#14685)

VLM

  • Generalized prompt updates for multi-modal processor (#13964)
  • Deprecate legacy input mapper for OOT multimodal models (#13979)
  • Refer code examples for common cases in dev multimodal processor (#14278)

Quantization

  • BaiChuan SupportsQuant (#13710)
  • BartModel SupportsQuant (#14699)
  • Bamba SupportsQuant (#14698)
  • Deepseek GGUF support (#13167)
  • GGUF MoE kernel (#14613)
  • Add GPTQAllSpark Quantization (#12931)
  • Better performance of gptq marlin kernel when n is small (#14138)

Structured Output

  • xgrammar: Expand list of unsupported jsonschema keywords (#13783)

Hardware Support

AMD

  • Faster Custom Paged Attention kernels (#12348)
  • Improved performance for V1 Triton (ROCm) backend (#14152)
  • Chunked prefill/paged attention in MLA on ROCm (#14316)
  • Perf improvement for DSv3 on AMD GPUs (#13718)
  • MoE fp8 block quant tuning support (#14068)

TPU

  • Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
  • Support start_profile/stop_profile in TPU worker (#13988)
  • Add TPU v1 test (#14834)
  • TPU multimodal model support for ragged attention (#14158)
  • Add tensor parallel support via Ray (#13618)
  • Enable prefix caching by default (#14773)

Neuron

  • Add Neuron device communicator for vLLM v1 (#14085)
  • Add custom_ops for neuron backend (#13246)
  • Add reshape_and_cache (#14391)
  • Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245)

CPU

  • Upgrade CPU backend to torch-2.6 (#13381)
  • Support FP8 KV cache in CPU Backend (#14741)

s390x

  • Adding cpu inference with VXE ISA for s390x architecture (#12613)
  • Add documentation for s390x cpu implementation (#14198)

Plugins

  • Remove cuda hard code in models and layers (#13658)
  • Move use allgather to platform (#14010)

Bugfix and Enhancements

  • Illegal memory access for MoE On H20 (#13693)
  • Fix FP16 overflow for DeepSeek V2 (#13232)
  • Illegal Memory Access in the blockwise cutlass fp8 GEMMs (#14396)
  • Pass all driver env vars to ray workers unless excluded (#14099)
  • Use xgrammar shared context to avoid copy overhead for offline engine (#13837)
  • Capture and log the time of loading weights (#13666)

Developer Tooling

Benchmarks

  • Consolidate performance benchmark datasets (#14036)
  • Update benchmarks README (#14646)

CI and Build

  • Add RELEASE.md (#13926)
  • Use env var to control whether to use S3 bucket in CI (#13634)

Documentation

  • Add RLHF document (#14482)
  • Add nsight guide to profiling docs (#14298)
  • Add K8s deployment guide (#14084)
  • Add developer documentation for torch.compile integration (#14437)

What's Changed

  • Update pre-commit's isort version to remove warnings by @hmellor in #13614
  • [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
  • fix neuron performance issue by @ajayvohra2005 in #13589
  • [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
  • [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
  • [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
  • Add llmaz as another integration by @kerthcet in #13643
  • [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
  • [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
  • Use pre-commit to update requirements-test.txt by @hmellor in #13617
  • [Bugfix] Add mm_processor_kwargs to chat-related protocols by @ywang96 in #13644
  • [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
  • Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
  • [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
  • [ci] Fix metrics test model path by @khluu in #13635
  • [Kernel]Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
  • [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
  • fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in https://...

v0.8.0rc2

17 Mar 17:08
Pre-release

What's Changed

New Contributors

Full Changelog: v0.8.0rc1...v0.8.0rc2