v0.15.0rc1
Pre-release
This is the first release candidate of v0.15.0 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- NPU Graph EX (npugraph_ex) Enabled by Default: The npugraph_ex feature is now enabled by default, providing better graph optimization with integrated inductor pass and MatmulAllReduceAddRMSNorm fusion. #6354 #6664 #6006
- 310P MoE and W8A8 Support [Experimental]: 310P now supports MoE models, W8A8 quantization, and the weightNZ feature, significantly expanding hardware capabilities. #6530 #6641 #6454 #6705
- Qwen3-VL-MoE EAGLE Support: Added EAGLE speculative decoding support for Qwen3-VL-MoE model. #6327
- Kimi-K2.5 Model Support: Added support for Kimi-K2.5 models. Please note that vLLM 0.15.0 has a known issue with Kimi-K2.5. To fix this, please apply the changes from the upstream vllm-project/vllm repository, specifically from pull requests #33320 and #34501. #6755
Features
- Auto-detect Quantization Format: Quantization format can now be auto-detected from model files. #6645
- GPT-OSS Attention Support: Added GPT-OSS attention implementation. #5901
- DCP Support for SFA: Added Decode Context Parallel (DCP) support for SFA architecture. #6563
- Mooncake Layerwise PCP Support: Mooncake layerwise connector now supports PCP function. #6627
- Mooncake Connector Remote PTP Size: Mooncake connector can now get remote PTP size. #5822
- KV Pool Sparse Attention: KV pool now supports sparse attention. #6339
- Batch Invariant with AscendC: Implemented batch invariant feature with AscendC. #6590
- Routing Replay: Added routing replay feature. #6696
- Compressed Tensors MoE W4A8 Dynamic Weight: Added support for compressed-tensors MoE W4A8 dynamic weight quantization. #5889
- GLM4.7-Flash W8A8 Quantization: Added W8A8 quantization support for GLM4.7-Flash. #6492
- DispatchGmmCombineDecode Enhancement: DispatchGmmCombineDecode now supports bf16/float16 gmm1/gmm2 weight and ND format weight. #6393
- RMSNorm Dynamic Quant Fusion: Added rmsnorm dynamic quant fusion pass. #6274
- Worker Health Check Interface: Added check_health interface for worker. #6681
Hardware and Operator Support
- 310P Support Expansion: Multiple improvements for 310P hardware:
- ARM-only CPU Binding: Enabled ARM-only CPU binding with NUMA-balanced A3 policy. #6686
- Triton Rope Enhancement: Triton rope now supports index_selecting from cos_sin_cache. #5450
- AscendC Fused Op: Added AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer. #6366
- Rotary_dim Parameter: Added support for rotary_dim parameter when using partial rope in rotary_embedding. #6581
Performance
- Multimodal seq_lens CPU Cache: Use seq_lens CPU cache to avoid frequent D2H copy for better multimodal performance. #6448
- DispatchFFNCombine Optimization: Optimized DispatchFFNCombine kernel performance and resolved vector error caused by unaligned UB access. #6468 #6707
- DeepSeek V3.2 KVCache Optimization: Optimized KV cache usage for DeepSeek V3.2. #6610
- MLA/SFA Weight Prefetch: Refactored MLA/SFA weight prefetch to be consistent with MoE weight prefetch. #6629
- MLP Weight Prefetch: Refactored MLP weight prefetch to be consistent with MoE model's prefetching. #6442
- Adaptive Block Size Selection: Added adaptive block size selection in linear_persistent kernel. #6537
- EPLB Memory Optimization: Reduced memory used for heat aggregation in EPLB. #6729
- Memory Migration and Interrupt Core Binding: Improved binding logic with memory migration and interrupt core binding functions. #6785
- Triton Stability: Improved Triton stability on Ascend for large grids. #6301
Dependencies
- Mooncake: Upgraded to v0.3.8.post1. #6428
Deprecation & Breaking Changes
- ProfileExecuteDuration: Cleaned up and deprecated ProfileExecuteDuration feature. #6461
- Custom rotary_embedding Operator: Removed custom rotary_embedding operator. #6523
- USE_OPTIMIZED_MODEL: Cleaned up the unused environment variable USE_OPTIMIZED_MODEL. #6618
Documentation
- Added AI-assisted model-adaptation workflow documentation for vllm-ascend. #6731
- Added vLLM Ascend development guidelines (AGENTS.md). #6797
- Added GLM5 tutorial documentation. #6709 #6717
- Added Memcache Usage Guide. #6476
- Added request forwarding documentation. #6780
- Added Benchmark Tutorial for Suffix Speculative Decoding. #6323
- Restructured tutorial documentation. #6501
- Added npugraph_ex introduction documentation. #6306
Others
- MTP in PD Fullgraph: Fixed support for ALL D-Nodes in fullgraph when running MTP in PD deployment. #5472
- DeepSeekV3.1 Accuracy: Fixed DeepSeekV3.1 accuracy issue. #6805
- EAGLE Refactor: Routed MTP to EAGLE except for PCP/DCP+MTP cases. #6349
- Speculative Decoding Accuracy: Fixed a speculative decoding acceptance rate problem in vLLM 0.15.0. #6606
- PCP/DCP Accuracy: Fixed accuracy issue in PCP/DCP with speculative decoding. #6491
- Dynamic EPLB: Fixed a bug that made dynamic EPLB ineffective; EPLB no longer depends on a specified model. #6653 #6528
- KV Pool Mooncake Backend: Correctly initialized head_or_tp_rank for mooncake backend. #6498
- Layerwise Connector Recompute Scheduler: Layerwise connector now supports recompute scheduler. #5900
- Memcache Pool: Fixed service startup failure when memcache pool is enabled. #6229
- AddRMSNormQuant: Fixed AddRMSNormQuant not taking effect. #6620
- Pooling Code: Fixed pooling code issues and updated usage guide. #6126
- Context Parallel: Fixed and unified the PD request discrimination logic. #5939
- npugraph_ex: Fixed duplicate pattern issue and added extra check for allreduce rmsnorm fusion pass. #6513 #6430
- RecomputeScheduler: Fixed incompatibility of RecomputeScheduler with vLLM v0.14.1. #6286
New Contributors
- @gjc0824 made their first contribution in #5416
- @pu-zhe made their first contribution in #6270
- @mengchengTang made their first contribution in #6141
- @Sergey-Zlobin made their first contribution in #6327
- @wubin58 made their first contribution in #6310
- @huangazazaz made their first contribution in #6290
- @IWantFight made their first contribution in #6432
- @Zhang-Bryan made their first contribution in #6274
- @acat-rw made their first contribution in #6469
- @GoCHug made their first contribution in #6581
- @luomin2005 made their first contribution in #6681
- @yydyzr made their first contribution in #6642
- @nakairika made their first contribution in #6709
- @huyq made their first contribution in #6664
- @lih827 made their first contribution in #6393
- @mikequan0425 made their first contribution in #5901
- @yejj710 made their first contribution in #6593
- @Spicy-Stick made their first contribution in #6543
- @chenchuw886 made their first contribution in #6686
- @LoganJane made their first contribution in #6755
- @Bowen-Leee made their first contribution in #6514
- @Li-Yongwen made their first contribution in #6696
- @wangbj127 made their first contribution in #6702
Full Changelog: v0.14.0rc1...v0.15.0rc1