Releases: vllm-project/vllm-ascend
v0.19.1rc1
This is the first release candidate of v0.19.1 for vLLM Ascend, based on vLLM v0.19.1. This release includes significant performance optimizations, new model support, hardware expansion, and important bug fixes.
Please follow the official doc to get started.
Highlights
- DFlash Attention Backend: Added DFlash attention backend with FULL_DECODE_ONLY support for improved inference performance (#8118, #8516, #8627)
- Zero Bubble Async Scheduling: Implemented zero bubble optimization for async scheduling and speculative decoding, significantly reducing scheduling overhead (#7640)
- A2/A3 Attention Operator Upgrade: Replaced npu_fusion_attention with _npu_flash_attention_unpad operator for better performance on A2 and A3 hardware (#8671)
- Eagle3 + MiniMax-M2.5 Support: Applied Eagle3 speculative decoding to MiniMax-M2.5 model for faster inference (#7619)
- C8 INT8 KV Cache for GQA: Added C8 (INT8 KV cache) support for GQA attention models, including DeepSeek-V3.1 with PD disaggregation (#7474, #7222)
- Bailing Model Support: Full support for Bailing MoE model including linear adaptation and ModelSlim quantization (#8657, #8709)
Features
- Flash Comm V1 for Qwen3-VL: Support Flash Comm V1 for Qwen3-VL multimodal models (#7897)
- Eagle + PCP + Full Graph Mode: Support Eagle combined with PCP and full graph mode (#7924)
- Multimodal Reasoning with PCP: Support multimodal reasoning when prefill context parallel feature is enabled (#8038)
- Dynamic Chunk for PP: Support Dynamic Chunk for Chunked Pipeline Parallelism (#7896)
- Hamming-based Sparse Attention: Added Hamming-based sparse attention inference framework and operators (#8564, #8346)
- Optimized Causal Conv1d Operator: Added optimized causal conv1d operator (#8215)
- Recurrent AscendC Operators: Added recurrent AscendC operators for specific model architectures (#8055)
- GLM4.7 C8 Support: Support GLM4.7 with C8 (INT8 KV cache) scenarios (#8174)
- Minitron-8B-Base Support: Verified and supported nvidia/Minitron-8B-Base model (#8157)
- Bailing Model Support: Full support for Bailing MoE model with linear adaptation and ModelSlim quantization configuration (#8657, #8709)
- Qwen3.5 MoE Flash Comm: Support Flash Comm for Qwen3.5 MoE models (#7486)
- Initial MoE Support for MRv2: Add initial MoE models support for Model Runner V2 (#7922)
- Xlite Backend Expansion:
- EPLB Enhancements:
- Eagle Improvements for model_runner_v2:
- MTP Merged Graph: Support merged graph for MTP (Multi-Token Prediction) (#6860)
- Unified MoE Expert Placement: Support unified placement for shared & router experts (#7188)
- Dispatch V2 Hierarchy Communication: Support dispatch_v2/combine_v2 hierarchy communication for better MoE performance (#7583)
- Xmask for Dispatch FFN Combine: Add xmask feature for dispatch_ffn_combine operator (w8a8 branch) (#8560)
- Fused W4A8 Kernel: Fuse W4A8 dispatch + FFN + combine into a single fused kernel (#7779)
- KV Cache Memory Accounting: Account for graph capture memory in KV cache planning (#8289)
- Qwen3-Next Hybrid Attention: Support Qwen3-next hybrid attention in piecewise & full_decode_only modes (#7422)
- GDN Optimization: Optimize GDN non-spec prefill fallback metadata (#7756)
- Qwen3-VL Support: Support kv_rmsnorm_mrope for Qwen3-VL (#7762)
- Mamba Prefix Caching: Layerwise connector supports Mamba prefill prefix caching (#7814)
- Yuanrong KV Pool Backend: Add Yuanrong backend support to KV Pool (#6869)
Hardware and Operator Support
- 310P Enhancements:
Performance
- A2/A3 Attention: Replace npu_fusion_attention with _npu_flash_attention_unpad operator for better performance (#8671)
- MLA PCP Prefill Optimization: Optimize MLA PCP prefill attention by avoiding projecting unnecessary tail KV tokens (#8787)
- Async Scheduling Optimization:
- KV Cache Optimization:
- Operator Optimizations:
- Triton Kernel Optimizations (model_runner_v2):
  - Optimize _temperature_kernel and _topk_log_softmax_kernel (#8083)
  - Optimize _min_p_kernel performance (#8243, #7767)
  - Add bad-words-kernel triton kernel (#8030)
  - Optimize bincount_kernel performance (#7757)
  - Optimize _ranks_kernel performance (#7767)
  - Optimize triton recompilation triggered by function parameters (#7480, #7481, #7483)
- HCCL Process Group Reuse: Reuse equivalent HCCL process groups on Ascend (#7654)
- CPU Binding Defer: Defer CPU binding until worker warmup completes (#7829)
- Conv3d to Linear Conversion: Convert conv3d to linear when kernel size equ...
v0.18.0
We're excited to announce the release of v0.18.0 for vLLM Ascend. This is the official release for v0.18.0. Please follow the official doc to get started.
Highlights
Model Support
- Kimi-K2.x Model Support: [Experimental] Added support for Kimi-K2.x models. @aipaes @dragondream-chen @SparrowMu @LoganJane #6755
- Minimax-m2.x Model Support: [Experimental] Added support for Minimax-m2.x models with eagle3. @SparrowMu @GDzhu01 #7105 #7714
- GLM5 Support: [Experimental] Added support for GLM5 models without any code modification!
- Qwen3.x Support: [Experimental] Added support for Qwen3.x models without any code modification!
- DeepseekOCR Support: [Experimental] Added support for the DeepseekOCR model and optimized RelPosAttention and CustomQwen2Decoder. @Wangbei25 #7737
Core Features
- EPLB (Expert Parallelism Load Balance): EPLB is more stable with many bug fixes, and has better performance now. EPLB now works in most cases and is recommended for use. #6528 #7344 #7890 #6477
- ACLGraph Enhancement: ACLGraph now supports capturing a single merged graph for multi-step drafts, which greatly reduces host-bound overhead in multi-step speculative decoding. #5553 #5940
- KV Pooling: The KV pool with the Mooncake connector now supports sparse attention. LMCacheAscendConnector is added as a new KV cache pooling solution for Ascend. FabricMem mode is supported for the HIXL interconnect, yuanrong is supported as a backend for AscendStoreConnector, and MooncakeLayerwiseConnector can now be enabled together with KV Pooling. Compared with previous versions, KV Pooling substantially improves TTFT. #6339 #6882 #6806 #6869 #7032
- PD disaggregation: Mooncake layerwise connector now supports hybrid attention manager and the PCP feature. #7022 #6627
- NPU Graph EX (npugraph_ex) Enabled by Default: The npugraph_ex feature is now enabled by default, providing better graph optimization with integrated inductor pass and MatmulAllReduceAddRMSNorm fusion. #6354 #6664 #6006
- RL (Reinforcement Learning): [Experimental] RL is enhanced with a batch invariant feature implemented with AscendC and Triton ops, and a routing replay feature is added. #6590 #6696
- CPU Binding Enabled by Default: Enabled ARM-only CPU binding with global-slicing A3 policy, improving inference throughput in host-bound scenarios. #6686
Features
- Prefix cache is now supported in hybrid model. #7103
- Flash Comm V1 now supports VL models with MLA, removing a previous limitation for multimodal serving. #7390
- VL MoE models now support SP, and sp_threshold is removed in favor of sp_min_token_num from vLLM. #7044
- [Experimental] Pipeline Parallel now supports async scheduling, improving throughput for PP deployments. #7136
- Eagle3 now supports QuaRot quantization without embedding. #7038
- Refactored eagle3/mtp: eagle3 and mtp now use the same proposer. #6349 #7033
Hardware and Operator Support
- 310P is supported for the first time, with significant performance optimization:
- Custom Operators: Added multiple custom operators including:
- Added AscendC casual_conv1d_fn operator for Qwen3-Next. #6661
- Added Ascend Ops recurrent_gated_delta_rule operator. #6725
- Added GMM custom operator for MoE models. #7010
- Optimize split_qkv_rmsnorm_rope operator. #6827
- Triton rope now supports index_selecting from cos_sin_cache. #5450
- Added AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer. #6366
- Optimized DispatchFFNCombine kernel performance and resolved vector error caused by unaligned UB access. #6468 #6707
- Refactor and optimize CausalConv1d. #7495
Performance
- Initialization Performance: Optimized Triton operator recompilation to reduce redundant rebuilds and unnecessary recompilation triggered by function parameter optimization. #7647 #7645
- Qwen3.x Performance: [Experimental] Optimized Qwen3.x and Qwen3-Next performance by supporting full graph mode, PD disaggregation, mamba prefill prefix-caching and flashcomm1, prebuilding chunk metadata to reduce host-device synchronization overhead, and optimizing multiple ops including chunk_gated_delta_rule, chunk_fwd_kernel_o, solve_tril, recompute_w_u_fwd_kernel, split_qkv_rmsnorm_mrope, etc. @LoganJane @shaopeng-666 @ppppeng @SunnyLee151064 @hust17yixuan @Toneymiller @linfeng-yuan #7487 #6830 #7506 #7796 #7527 #7529 #7495 #7368
- Kimi-K2.x Performance: [Experimental] Optimized Kimi-K2.x performance by supporting eagle3 and flashcomm1, and reducing d2h overhead. @aipaes @dragondream-chen @SparrowMu @LoganJane @GDzhu01 @Yaphets24 @hust17yixuan #7342 #7390 #7521
- Qwen3-VL Performance: Qwen3-VL gets stronger multimodal operator enablement with Flash Comm V1 and qkv_rmsnorm_mrope support, 2.7x faster convolution computation with aclnn BatchMatMulV2, and EAGLE speculative decoding support. #7893 #7852 #7017 #6327
- Qwen3-Omni Performance: Qwen3-Omni quantization adaptation and optimization is now available. #6828
- DeepSeek-V3.2/GLM5 Performance: Performance optimizations, W8A8C8 quantization support, and optimized KV cache usage. @yydyzr @ZYang6263 @rjg-lyh @Nagisa125 #7029 #6610
- GLM4.7-Flash Performance: Added W8A8 quantization support for GLM4.7-Flash. @aipaes #6492
Dependencies
- vLLM: Upgraded to 0.18.0 and dropped 0.17.0 support.
- CANN: Upgraded to 8.5.1. Note: AscendStoreConnector with FabricMem mode, 310P device support, and the Qwen3-Omni model require CANN 9.0.0. If you need these features, please upgrade manually.
- torch-npu: Upgraded to 2.9.0.post1+git4c901a4 because of some known issue. This version can't install by default, pleas...
v0.13.0rc3
What's Changed
- [Doc][Misc] Update release notes and FAQ links for v0.13.0 by @wangxiyuan in #6585
- [BugFix][v0.13.0] fix a bug that patch from PR #5786 does not take effect by @Angazenn in #6615
- [v0.13.0][Ops] Make triton rope support index_selecting from cos_sin_cache by @Angazenn in #6602
- [0.13.0][bugfix]fix profiler initialization bug with calling stack by @linfeng-yuan in #6714
- [0.13.0] modify release note & supported matrix by @zzzzwwjj in #6751
- [v0.13.0][Fusion]add checks to skip fusion where split_rmsnorm_rope is not supported by @Angazenn in #6749
- [DOC] add request forwarding (cherry-pick from #6780) by @starmountain1997 in #6788
- [Bugfix] Fix vllm-ascend 0.13.0 error: TypeError: apply_token_bitmask_inplace_cpu(): incompatible function arguments by @wjunLu in #6823
- [Bugfix] mtp forces eager mode by @zhenwenqi2024 in #6760
- [DOC] add layer_sharding and fix link by @starmountain1997 in #6808
- [v0.13.0][CI] Upgrade to CANN 8.5.1 by @wxsIcey in #6865
- [Bugfix] Resolve operator name collision for DeepSeekV3.2 in RL scena… by @Mind-s in #7034
- [doc] Added Ascend PyTorch Profiler section by @herizhen in #6905
- [0.13.0][cherry-pick][Bugfix][csrc] Add compile-time Ascend950/910_95 compatibility for custom ops between CANN8.5 and 9.0 by @zjchenn in #7116
- [0.13.0][cherry-pick][Bugfix][Triton] Centralize Ascend extension op dispatch in triton_utils by @zjchenn in #7112
- [Doc][Misc][v0.13.0] Updated the document configuration for DeepSeek-V3.2 by @Nagisa125 in #7957
- [CI] Fix Releases/v0.13.0 CI tests by @wjunLu in #7952
- [BugFix]Fix compilation errors for operators dispatch_gmm_combine_decode/moe_combine_normal/moe_dispatch_normal by @wangyibo1005 in #7840
- [v0.13.0][Feature] Add DeepSeek v4 initial support by @wangxiyuan in #8648
New Contributors
Full Changelog: v0.13.0...v0.13.0rc3
v0.18.0rc1
This is the first release candidate of v0.18.0 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- C8 (INT8 KV cache) is now supported for DeepSeek-V3.1 in the PD disaggregation scenario. #7222
- DeepSeek models are now supported on A5 through new MLA operators. #7232
Features
- Flash Comm V1 now supports VL models with MLA, removing a previous limitation for multimodal serving. #7390
- Support separate attention backends for target and draft models in speculative decoding, allowing finer backend tuning per model. #7342
- VL MoE models now support SP, and sp_threshold is removed in favor of sp_min_token_num from vLLM. #7044
- Qwen VL models now support w8a8_mxfp8 quantization. #7417
Performance
- Optimized Triton operator recompilation to reduce redundant rebuilds and unnecessary recompilation triggered by function parameter optimization. #7647 #7645
- Optimized the Qwen3.5 and Qwen3-Next GDN prefill path by prebuilding chunk metadata, reducing host-device synchronization overhead. #7487
- Simplified the FIA prefill context merge path for better runtime efficiency. #7293
Documentation
- Refreshed deployment and model docs for Kimi-K2.5, GLM-4.7, DeepSeek-V3.2, MiniMax-M2.5, and PD disaggregation guides. #7371 #7403 #7292 #7296 #7300
Others
- Fixed a PD separation issue where decode nodes could get stuck because shapes were not aligned across DP nodes. #7534
- Fixed a regression where hybrid attention plus mamba models on Ascend could start with an incorrect block size after the v0.18.0 upgrade. #7528
- Fixed multi-instance serving OOM calculation on single-card deployments. #7427
- Fixed DeepSeek v3.1 C8 when overlaying MTP with full decode and full graph modes. #7571
- Fixed quantization config key mapping in AscendModelSlimConfig by switching from reverse mapping to forward mapping. #7716
Dependencies
- To address issues triggered by multi-stream parallel operations within ACL Graph, we have integrated temporary dependency versions for torch_npu. These fixes are already included in our official Docker images. If you prefer to build your own environment from source, please manually install the specific versions as follows:
# Set environment variables
PYTHON_TAG=$(python3 -c "import sys; print(f'cp{sys.version_info.major}{sys.version_info.minor}')")
ARCH=$(python3 -c "import platform; m=platform.machine().lower(); arch_map={'x86_64':'x86_64','amd64':'x86_64','aarch64':'aarch64','arm64':'aarch64'}; print(arch_map.get(m,m))")
# Select the specific torch_npu wheel based on your environment
if [ "$PYTHON_TAG" = "cp310" ] && [ "$ARCH" = "aarch64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgit4c901a4-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp311" ] && [ "$ARCH" = "x86_64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgitdc51c2d-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp310" ] && [ "$ARCH" = "x86_64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgita74051c-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp311" ] && [ "$ARCH" = "aarch64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgitee7ba04-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
else echo "Unsupported PYTHON_TAG=$PYTHON_TAG ARCH=$ARCH"; exit 1; fi
# Install wheels
python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/${PTA_WHEEL}"
- There is a known issue in the current triton-ascend, as shown in #7782. To avoid it, upgrade triton-ascend to 3.2.0.dev20260322, either by using the official Docker images or by manually installing the specific triton-ascend version as follows:
PYTHON_TAG=$(python3 -c "import sys; print(f'cp{sys.version_info.major}{sys.version_info.minor}')") && \
ARCH=$(python3 -c "import platform; machine = platform.machine().lower(); arch_map = {'x86_64': 'x86_64', 'amd64': 'x86_64', 'aarch64': 'aarch64', 'arm64': 'aarch64'}; print(arch_map.get(machine, machine))") && \
TRITON_ASCEND_WHEEL="triton_ascend-3.2.0.dev20260322-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_27_${ARCH}.manylinux_2_28_${ARCH}.whl" && \
python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/${TRITON_ASCEND_WHEEL}"
Known Issue
- When running DeepSeek-R1 W8A8 with MTP and KV Pool enabled under high concurrency, a ValueError: Counters can only be incremented by non-negative amounts may occur. #7489
- triton-ascend may fail to compile with a g++ internal compiler error (Segmentation fault). Workaround: update to triton-ascend==3.2.0.dev20260322 and clear the Triton cache (rm -rf ~/.triton/cache/*). #7782
- FIA does not support all MHA head dimensions when using tp-size >= 16 on Ascend. Affected models will fail with an error on unsupported head dimensions. This will be resolved in a future release when FIA supports more head dimensions. #7729
- While MiniMax-M2.5 now supports PD Disaggregation, internal testing has identified a 13% regression on the GPQA benchmark when this feature is enabled. We currently do not recommend enabling PD Disaggregation for this model, and we are working on an optimization fix.
New Contributors
- @GGGGua made their first contribution in #7295
- @asunxiao made their first contribution in #7066
- @liuhy1213-cell made their first contribution in #7300
- @jiangmengyu18 made their first contribution in #7383
- @ksiyuan made their first contribution in #7417
- @yesyue-w made their first contribution in #7046
- @lijiahang226 made their first contribution in #7232
- @ZhuQi-seu made their first contribution in #7368
- @GoMarck made their first contribution in #7392
Full Changelog: v0.17.0rc1...0.18.0rc1
v0.17.0rc1
This is the first release candidate of v0.17.0 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- Ascend950 chip is now supported. #7151
- ACLGraph (graph mode) is now supported for Model Runner V2. #7110
- Unified parallelized speculative decoding is supported, enabling parallel draft inference schemes simultaneously. #6766
Features
- Quantization format is now auto-detected from model files, and remote model IDs (e.g., org/model-name) are also supported, so --quantization ascend is no longer required. #7111
- Qwen3.5 is supported from this version on.
- FlashLB algorithm for EPLB: supports per-step heat collection and multi-stage load balancing for better expert parallelism efficiency. #6477
- LoRA with tensor parallel and --fully-sharded-loras is now fixed and working. #6650
- LMCacheAscendConnector is added as a new KV cache pooling solution for Ascend. #6882
- W8A8C8 quantization is now supported for DeepSeek-V3.2 and GLM5 in PD-mix scenario. #7029
- [Experimental] Minimax-m2.5 model is now supported on Ascend NPU. #7105
- [Experimental] Mooncake Layerwise Connector now supports hybrid attention manager with multiple KV cache groups. #7022
- [Experimental] Prefix cache is now supported in hybrid model. #7103
Performance
- Pipeline Parallel now supports async scheduling, improving throughput for PP deployments. #7136
- Improved TTFT when using Mooncake connector by reducing log overhead. #6125
- KV Pool lookup is optimized for short sequences (token length < block_size). #7146
- Fix penalty ops in Model Runner V2, achieving ~10% performance improvement. #7013
Documentation
- Added EPD (Encode-Prefill-Decode) documentation and load-balance proxy example. #6221
- Added Ascend PyTorch Profiler usage guide. #7117
- Fixed DSV3.1 PD configuration documentation. #7187
Others
- Fix drafter crash in full graph mode for speculative decoding. #7158 #7148
- Fix GLM5-W8A8 precision issues caused by rotary quant MTP weights. #7139
- Fix ngram graph replay accuracy error on 310P. #7134
- Fix FIA pad logic in graph mode after upstream vLLM change. #7144
- Fix a precision issue caused by wrong KV cache reshape on Qwen3.5. #7209
- Fix extra processes spawned on rank0 device. #7107
- Graph capture failures now properly raise exceptions for easier debugging. #5644
- Fix Qwen3.5 model by replacing torch_npu.npu_recurrent_gated_delta_rule with fused_recurrent_gated_delta_rule. #7109
- Fix the bug when running Qwen3-Reranker-0.6B with LoRA. #7156
Known Issue
- GLM5 requires transformers==5.2.0. This will be resolved by vllm-project/vllm#30566, which will not be included in v0.17.0.
- There is a precision issue with Qwen3-Next due to the changed tp weight split method. It will be fixed in the next release.
- The minimum prefix length for a prefix cache hit in hybrid models is currently large. The exact number depends on the tp size; e.g., with tp 2, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 tokens will never be cached.
- GLM5 has an issue in the 2-node PD mixed deployment scenario where inference may hang when concurrency exceeds 8 (fixed in PR #7235 #7290).
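The block_size limitation in the prefix cache note can be illustrated with a small arithmetic sketch. It assumes, as the note implies, that the prefix cache only stores whole blocks; the helper name cacheable_tokens is hypothetical and not part of vLLM Ascend.

```shell
#!/usr/bin/env bash
# Sketch only: assumes prefix caching works on whole blocks, so a prefix
# contributes floor(prefix_len / block_size) * block_size cacheable tokens.
# cacheable_tokens is a hypothetical helper, not a vLLM Ascend API.
cacheable_tokens() {
  local prefix_len=$1 block_size=$2
  echo $(( prefix_len / block_size * block_size ))
}

block_size=2048  # value quoted in the note for tp 2
for prefix_len in 1500 2048 5000; do
  echo "prefix=${prefix_len} cacheable=$(cacheable_tokens "$prefix_len" "$block_size")"
done
```

Under this assumption, a 1500-token prefix yields 0 cacheable tokens, a 2048-token prefix yields 2048, and a 5000-token prefix yields 4096.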
New Contributors
- @ppppeng made their first contribution in #7109
- @SparrowMu made their first contribution in #7105
- @tfhddd made their first contribution in #7127
- @drizzlezyk made their first contribution in #7208
- @chloroethylene made their first contribution in #6882
- @bazingazhou233-hub made their first contribution in #7286
Full Changelog: v0.16.0rc1...v0.17.0rc1
v0.16.0rc1
This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- Qwen3-Omni quantization adaptation and optimization is now available. #6828
- GLM5-W8A8 quantization is now supported by parameterizing hardcoded MLA dimensions. #6902
Features
- [Experimental] Support FabricMem Mode for ADXL/HIXL interconnect. #6806
- Qwen3-Next now supports FlashComm1. #6830
- NPUWorker Profiler now supports profile_prefix for better profiling experience. #6968
- EPLB profiling now displays expert hotness comparison and time required for EPLB adjustment. #6877 #7001
- Xlite Qwen3 MoE now supports Data Parallel. #6715
- Mooncake Layerwise Connector now supports kv_pool. #7032
- Eagle3 now supports QuaRot quantization without embedding. #7038
Hardware and Operator Support
- 310P now supports w8a8sc quantization method. #7075
- Added AscendC casual_conv1d_fn operator for Qwen3-Next. #6661
- Added Ascend Ops recurrent_gated_delta_rule operator. #6725
- Added GMM custom operator for MoE models. #7010
Performance
- Faster convolution computation improves TTFT by 0.95% and throughput by 0.59% for Qwen3-VL models. #7017
- Optimize split_qkv_rmsnorm_rope operator. #6827
- Implement global CPU slicing and improve IRQ binding for Ascend NPUs, ensuring non-overlapping CPU partitions and better resource management. #6945
- Optimize MTP execution by reordering state update operation. #6844
- Avoid CPU sync in mrope_positions copy by using full tensor copy. #7014
- Remove H2D synchronization for expert_map in MoE models. #7000
Dependencies
- CANN is upgraded to 8.5.1. Please remember to upgrade manually if you're not using the official image. #6897
Deprecation & Breaking Changes
- The enable_flash_comm_v1 config option has been renamed back to enable_sp. #6883
- The auto-detection of quantization format from model files is reverted. In v0.16.0rc1, you still need to add --quantization ascend to serve a model quantized by modelslim. It will be added back in the next version after the bug with the remote model id is fixed. #6873
Documentation
- Added user/developer guide for CPU binding. #7045
- Added metrics usage documentation and example. #6962
- Added llms.txt for LLM discovery. #6886
- Added GLM4.x multi-node deploy tutorial. #6872
- Added an explanation of the 310P-specific parameter max-model-len. #7065
Others
- Fix openEuler Dockerfile error. #6871
- Many bug fixes including:
  - Fix Eagle speculative decoding with Context Parallel enabled. #6981 #7079
  - Fix LoRA accuracy issue introduced by upstream vLLM changes. #6958
  - Fix streaming content-type in load balance proxy server. #6985
  - Fix metadata execute error: integer modulo by zero. #6521
  - Fix triton rope_siso implementation bug. #7082
  - Fix incorrect layer count for MTP models in update_aclgraph_sizes. #7064
  - Fix compilation errors for CANN versions subsequent to b020. #7059
  - Fix quant config support in GLM4.6V. #7062
  - Fix parameter ordering bug in _merge_multimodal_embeddings. #7068
  - Fix fused mc2 bug in EPLB. #6794
  - Fix kernel block size for computing slot mapping. #7019
  - Fix layerwise stacking MTP error in P/D disaggregation. #7036
  - Fix RoPE dimension for npu_rotary_embedding. #6880
  - Fix Qwen-Omni quantization bugs. #7042 #7007
  - Fix GDN layer accuracy in graph mode. #6822
  - Fix precision bugs for PCP/DCP in PD disaggregate. #6876
  - Fix MTP in PD disaggregation with fullgraph support for all D-Nodes. #6948
  - Fix GQA model error when enabling both DP and DCP. #7012
  - Fix MTP prefill misclassified as decode edge case. #6835
  - Fix Eagle3 acceptance rate for QuaRot quantized models. #6914
  - Fix RoPE shape mismatch for MTP models with FlashComm V1 enabled. #6939
  - Fix Qwen2.5VL accuracy issue. #6975
  - Fix MoE forward error with static kernel enabled. #6964
  - Fix muls_add fusion for GLM5 models. #6928
  - Fix GDN layer detection for multimodal models. #6941
  - Fix 300I unquant model weight nd2nz error. #6851
  - Fix CPU binding logic. #6889
  - Fix Eagle fullgraph shape capture. #6846
Known Issue
- Currently, for DeepSeek v3.2, PCP & DCP do not yet work with FlashComm1 feature, which may cause serve errors or other unknown errors.
- In 4-node A3 PD disaggregation deployment with DeepSeek V3.2, the P-Node may hang when benchmarking in high concurrency scenario, e.g., 2K/2K tokens with 512 concurrent requests.
- MTP with large EP configurations may cause graph capture buffer overflow. This is a bug that needs to be fixed in vLLM. As a workaround, explicitly set --compilation-config '{"max_cudagraph_capture_size": N}' where N = max_concurrency × (1 + num_speculative_tokens).
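The capture-size workaround for MTP with large EP configurations reduces to one multiplication. The sketch below computes N and prints the resulting flag; the values 64 and 3 and the helper name capture_size are illustrative assumptions, not recommended settings.

```shell
#!/usr/bin/env bash
# Sketch only: computes N = max_concurrency * (1 + num_speculative_tokens),
# the formula given in the known-issue note.
# capture_size is a hypothetical helper; 64 and 3 are example values.
capture_size() {
  local max_concurrency=$1 num_speculative_tokens=$2
  echo $(( max_concurrency * (1 + num_speculative_tokens) ))
}

N=$(capture_size 64 3)
# Print the flag so it can be copied into a serve command.
printf '%s\n' "--compilation-config '{\"max_cudagraph_capture_size\": ${N}}'"
```

With 64 concurrent requests and 3 speculative tokens, the computed capture size is 256.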
New Contributors
- @Eric-dot made their first contribution in #6670
- @tanhaoan333 made their first contribution in #6828
- @NJX-njx made their first contribution in #6965
- @Zhujiyang2 made their first contribution in #6939
- @songjianquan made their first contribution in #6740
- @liuchen2026fly made their first contribution in #6928
- @guleo made their first contribution in #6827
- @wanghengkang made their first contribution in #6977
- @xiaocongtou6 made their first contribution in #6940
- @chenxi-hh made their first contribution in #7010
- @wanghuanjun2113 made their first contribution in #7064
- @banxiaduhuo made their first contribution in #6715
- @xmpp777 made their first contribution in #6933
- @s-zk made their first contribution in #6872
- @ZRJ026 made their first contribution in #6985
Full Changelog: v0.15.0rc1...v0.16.0rc1
v0.15.0rc1
This is the first release candidate of v0.15.0 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- NPU Graph EX (npugraph_ex) Enabled by Default: The npugraph_ex feature is now enabled by default, providing better graph optimization with integrated inductor pass and MatmulAllReduceAddRMSNorm fusion. #6354 #6664 #6006
- 310P MoE and W8A8 Support [Experimental]: 310P now supports MoE models, W8A8 quantization, and weightNZ feature, significantly expanding hardware capabilities. #6530 #6641 #6454 #6705
- Qwen3-VL-MoE EAGLE Support: Added EAGLE speculative decoding support for Qwen3-VL-MoE model. #6327
- Kimi-K2.5 Model Support: Added support for Kimi-K2.5 models. Please note that vLLM 0.15.0 has a known issue with Kimi-K2.5. To fix this, please apply the changes from the upstream vllm-project/vllm repository, specifically from pull requests #33320 and #34501. #6755
Features
- Auto-detect Quantization Format: Quantization format can now be auto-detected from model files. #6645
- GPT-OSS Attention Support: Added GPT-OSS attention implementation. #5901
- DCP Support for SFA: Added Decode Context Parallel (DCP) support for SFA architecture. #6563
- Mooncake Layerwise PCP Support: Mooncake layerwise connector now supports PCP function. #6627
- Mooncake Connector Remote PTP Size: Mooncake connector can now get remote PTP size. #5822
- KV Pool Sparse Attention: KV pool now supports sparse attention. #6339
- Batch Invariant with AscendC: Implemented batch invariant feature with AscendC. #6590
- Routing Replay: Added routing replay feature. #6696
- Compressed Tensors MoE W4A8 Dynamic Weight: Added support for compressed tensors moe w4a8 dynamic weight quantization. #5889
- GLM4.7-Flash W8A8 Quantization: Added W8A8 quantization support for GLM4.7-Flash. #6492
- DispatchGmmCombineDecode Enhancement: DispatchGmmCombineDecode now supports bf16/float16 gmm1/gmm2 weight and ND format weight. #6393
- RMSNorm Dynamic Quant Fusion: Added rmsnorm dynamic quant fusion pass. #6274
- Worker Health Check Interface: Added check_health interface for worker. #6681
Hardware and Operator Support
- 310P Support Expansion: Multiple improvements for 310P hardware:
- ARM-only CPU Binding: Enabled ARM-only CPU binding with NUMA-balanced A3 policy. #6686
- Triton Rope Enhancement: Triton rope now supports index_selecting from cos_sin_cache. #5450
- AscendC Fused Op: Added AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer. #6366
- Rotary_dim Parameter: Added support for rotary_dim parameter when using partial rope in rotary_embedding. #6581
Performance
- Multimodal seq_lens CPU Cache: Use seq_lens CPU cache to avoid frequent D2H copy for better multimodal performance. #6448
- DispatchFFNCombine Optimization: Optimized DispatchFFNCombine kernel performance and resolved vector error caused by unaligned UB access. #6468 #6707
- DeepSeek V3.2 KVCache Optimization: Optimized KV cache usage for DeepSeek V3.2. #6610
- MLA/SFA Weight Prefetch: Refactored MLA/SFA weight prefetch to be consistent with MoE weight prefetch. #6629
- MLP Weight Prefetch: Refactored MLP weight prefetch to be consistent with MoE model's prefetching. #6442
- Adaptive Block Size Selection: Added adaptive block size selection in linear_persistent kernel. #6537
- EPLB Memory Optimization: Reduced memory used for heat aggregation in EPLB. #6729
- Memory Migration and Interrupt Core Binding: Improved binding logic with memory migration and interrupt core binding functions. #6785
- Triton Stability: Improved Triton stability on Ascend for large grids. #6301
Dependencies
- Mooncake: Upgraded to v0.3.8.post1. #6428
Deprecation & Breaking Changes
- ProfileExecuteDuration: Cleaned up and deprecated ProfileExecuteDuration feature. #6461
- Custom rotary_embedding Operator: Removed custom rotary_embedding operator. #6523
- USE_OPTIMIZED_MODEL: Cleaned up unused env USE_OPTIMIZED_MODEL. #6618
Documentation
- Added AI-assisted model-adaptation workflow documentation for vllm-ascend. #6731
- Added vLLM Ascend development guidelines (AGENTS.md). #6797
- Added GLM5 tutorial documentation. #6709 #6717
- Added Memcache Usage Guide. #6476
- Added request forwarding documentation. #6780
- Added Benchmark Tutorial for Suffix Speculative Decoding. #6323
- Restructured tutorial documentation. #6501
- Added npugraph_ex introduction documentation. #6306
Others
- MTP in PD Fullgraph: Fixed support for ALL D-Nodes in fullgraph when running MTP in PD deployment. #5472
- DeepSeekV3.1 Accuracy: Fixed DeepSeekV3.1 accuracy issue. #6805
- EAGLE Refactor: Routed MTP to EAGLE except for PCP/DCP+MTP cases. #6349
- Speculative Decoding Accuracy: Fixed spec acceptance rate problem in vLLM 0.15.0. #6606
- PCP/DCP Accuracy: Fixed accuracy issue in PCP/DCP with speculative decoding. #6491
- Dynamic EPLB: Fixed ineffective dynamic EPLB bug and EPLB no longer depends on a specified model. #6653 #6528
- KV Pool Mooncake Backend: Correctly initialized head_or_tp_rank for mooncake backend. #6498
- Layerwise Connector Recompute Scheduler: Layerwise connector now supports recompute scheduler. #5900
- Memcache Pool: Fixed service startup failure when memcache pool is enabled. #6229
- AddRMSNormQuant: Fixed AddRMSNormQuant not taking effect. #6620
- Pooling Code: Fixed pooling code issues and updated usage guide. #6126
- Context Parallel: Fixed and unified the PD request discrimination logic. #5939
- npugraph_ex: Fixed duplicate pattern issue and added extra check for allreduce rmsnorm fus...
v0.13.0
This is the final release of v0.13.0 for vLLM Ascend. Please follow the official doc to get started.
Highlights
Model Support
- DeepSeek-R1 & DeepSeek-V3.2: [Experimental] Performance optimizations and async scheduling enhancements. #3631 #3900 #3908 #4191 #4805
- Qwen3-Next: [Experimental] Full support for Qwen3-Next series including 80B-A3B-Instruct with full graph mode, MTP, quantization (W8A8), NZ optimization, and chunked prefill. Fixed multiple accuracy and stability issues. #3450 #3572 #3428 #3918 #4058 #4245 #4070 #4477 #4770
- InternVL: Added support for InternVL models with comprehensive e2e tests and accuracy evaluation. #3796 #3964
- LongCat-Flash: [Experimental] Added support for LongCat-Flash model. #3833
- minimax_m2: [Experimental] Added support for minimax_m2 model. #5624
- Whisper & Cross-Attention: [Experimental] Added support for cross-attention and Whisper models. #5592
- Pooling Models: [Experimental] Added support for pooling models with PCP adaptation and fixed multiple pooling-related bugs. #3122 #4143 #6056 #6057 #6146
- PanguUltraMoE: [Experimental] Added support for PanguUltraMoE model. #4615
Core Features
- Context Parallel (PCP/DCP): [Experimental] Added comprehensive support for Prefill Context Parallel (PCP) and Decode Context Parallel (DCP) with ACLGraph, MTP, chunked prefill, MLAPO, and Mooncake connector integration. This is an experimental feature - feedback welcome. #3260 #3731 #3801 #3980 #4066 #4098 #4183 #5672
- Full Graph Mode (ACLGraph): [Experimental] Enhanced full graph mode with GQA support, memory optimizations, unified logic between ACLGraph and Torchair, and improved stability. #3560 #3970 #3812 #3879 #3888 #3894 #5118
- Multi-Token Prediction (MTP): Significantly improved MTP support with chunked prefill for DeepSeek, quantization support, full graph mode, PCP/DCP integration, and async scheduling. MTP now works in most cases and is recommended for use. #2711 #2713 #3620 #3845 #3910 #3915 #4102 #4111 #4770 #5477
- Eagle Speculative Decoding: Eagle spec decode now works with full graph mode and is more stable. #5118 #4893 #5804
- PD Disaggregation: Set ADXL engine as default backend for disaggregated prefill with improved performance and stability. Added support for KV NZ feature for DeepSeek decode node. #3761 #3950 #5008 #3072
- KV Pool & Mooncake: Enhanced KV pool with Mooncake connector support for PCP/DCP, multiple input suffixes, and improved performance of Layerwise Connector. #3690 #3752 #3849 #4183 #5303
- EPLB (Expert Parallelism Load Balancer): [Experimental] EPLB is now more stable with many bug fixes. Mix placement now works. #6086
- Full Decode Only Mode: Added support for Qwen3-Next and DeepSeekv32 in full_decode_only mode with bug fixes. #3949 #3986 #3763
- Model Runner V2: [Experimental] Added basic support for Model Runner V2, the next generation of vLLM. It will be used by default in future releases. #5210
Features
- W8A16 Quantization: [Experimental] Added new W8A16 quantization method support. #4541
- UCM Connector: [Experimental] Added UCMConnector for KV Cache Offloading. #4411
- Batch Invariant: [Experimental] Implemented basic framework for batch invariant feature. #5517
- Sampling: Enhanced sampling with async_scheduler and disable_padded_drafter_batch support in Eagle. #4893
Hardware and Operator Support
- Custom Operators: Added multiple custom operators including:
- Operator Fusion: Added AddRmsnormQuant fusion pattern with SP support and inductor fusion for quantization. #5077 #4168
- MLA/SFA: Refactored SFA into MLA architecture for better maintainability. #3769
- FIA Operator: Adapted to npu_fused_infer_attention_score with flash decoding function. To optimize performance in small batch size scenarios, this attention operator is now available. Please refer to item 22 in FAQs to enable it. #4025
- CANN 8.5 Support: Removed redundant CP variables now that the FIA operator is enabled for CANN 8.5. #6039
Performance
Many custom ops and triton kernels were added in this release to speed up model performance:
- DeepSeek Performance: [Experimental] Improved performance for DeepSeek V3.2 by eliminating HD synchronization in async scheduling and optimizing memory usage for MTP. #4805 #2713
- Qwen3-Next Performance: [Experimental] Improved performance with Triton ops and optimizations. #5664 #5984 #5765
- FlashComm: Enhanced FlashComm v2 optimization with o_shared linear and communication domain fixes. #3232 #4188 [#4458](https://github.com/vllm-projec...
v0.14.0rc1
This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow the official doc to get started. This release includes all the changes in v0.13.0rc2, so we list only the differences from v0.13.0rc2. If you are upgrading from v0.13.0rc1, please read both the v0.14.0rc1 and v0.13.0rc2 release notes.
Highlights
- 310P support is back now. In this release, only basic dense and vl models are supported with eager mode. We'll keep improving and maintaining the support for 310P. #5776
- Support compressed tensors moe w8a8-int8 quantization. #5718
- Support Medusa speculative decoding. #5668
- Support Eagle3 speculative decoding for Qwen3vl. #4848
Features
- Xlite Backend supports Qwen3 MoE now. #5951
- Support DSA-CP for PD-mix deployment case. #5702
- Add support of new W4A4_LAOS_DYNAMIC quantization method. #5143
Performance
- The performance of Qwen3-next has been improved. #5664 #5984 #5765
- The CPU bind logic and performance have been improved. #5555
- Merge Q/K split to simplify AscendApplyRotaryEmb for better performance. #5799
- Add Matmul Allreduce Rmsnorm fusion pass. It's disabled by default. Set fuse_allreduce_rms=True in --additional_config to enable it. #5034
- Optimize rope embedding with triton kernel for huge performance gain. #5918
- Support advanced apply_top_k_top_p without top_k constraint. #6098
- Parallelize Q/K/V padding in AscendMMEncoderAttention for better performance. #6204
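The fusion pass opt-in above can be sketched as a small launcher helper. This is a minimal sketch, not the project's own tooling: the helper name build_serve_command and the model name are hypothetical, and placing fuse_allreduce_rms at the top level of the additional config is an assumption, since the note does not say whether the key is nested under another section.

```python
import json
import shlex
from typing import Any, Dict, List


def build_serve_command(model: str, additional_config: Dict[str, Any]) -> List[str]:
    """Build an argv list for `vllm serve` with a JSON-encoded additional config."""
    return [
        "vllm",
        "serve",
        model,
        "--additional_config",
        json.dumps(additional_config, sort_keys=True),
    ]


# Assumption: fuse_allreduce_rms is a top-level key. Check the vLLM Ascend
# additional-config docs in case it belongs under a nested section.
# The model name is only a placeholder.
cmd = build_serve_command("Qwen/Qwen3-8B", {"fuse_allreduce_rms": True})

# shlex.join quotes the JSON so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Building the JSON with json.dumps rather than by hand avoids quoting mistakes such as writing Python's True instead of JSON's true.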
Others
- Model Runner V2 supports Triton-based penalties. #5854
- Model Runner V2 supports Eagle spec decoding. #5840
- Fix multi-modal inference OOM issues by setting expandable_segments:True by default. #5855
- VLLM_ASCEND_ENABLE_MLAPO is set to True by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. #5952
- SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. #5875
- Support --max_model_len=auto. #6193
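For memory-sensitive deployments, the MLAPO opt-out and the automatic max model length above could be combined in a launcher along these lines. This is a sketch under stated assumptions: launch_spec is a hypothetical helper, and the string the environment variable accepts for "False" is not given in the note, so off_value is a parameter whose default of "0" is an assumption.

```python
import os
from typing import Dict, List, Optional, Tuple


def launch_spec(
    model: str,
    *,
    disable_mlapo: bool = False,
    off_value: str = "0",  # assumption: the env var may expect "0" or "False"
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (env, argv) for starting `vllm serve` with these release defaults in mind."""
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    if disable_mlapo:
        # MLAPO is on by default in this release and costs extra memory.
        env["VLLM_ASCEND_ENABLE_MLAPO"] = off_value
    # Flag spelling taken from the release note.
    argv = ["vllm", "serve", model, "--max_model_len=auto"]
    return env, argv


env, argv = launch_spec("my-model", disable_mlapo=True, base_env={})
```

The returned pair can be handed to subprocess.run(argv, env=env) on a machine where vLLM Ascend is installed.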
Dependencies
- torch-npu is upgraded to 2.9.0 #6112
Deprecation & Breaking Changes
- EPLB config options are moved to eplb_config in additional config. The old ones are removed in this release.
- The profiler envs, such as VLLM_TORCH_PROFILER_DIR and VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY, no longer work with vLLM Ascend. Please use the vLLM --profiler-config parameters instead. #5928
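The EPLB option move can be illustrated with a small migration helper for existing additional-config dictionaries. This is only a sketch: nest_eplb_options is a hypothetical function, and the option names used in the example (dynamic_eplb, num_redundant_experts) are placeholders, since the note does not list which keys moved.

```python
from typing import Any, Dict, Set


def nest_eplb_options(additional_config: Dict[str, Any], eplb_keys: Set[str]) -> Dict[str, Any]:
    """Return a copy of the config with the given top-level keys moved under 'eplb_config'."""
    # Keep everything that is not a legacy EPLB option.
    migrated = {k: v for k, v in additional_config.items() if k not in eplb_keys}
    # Start from any eplb_config the user already wrote, then add legacy values.
    nested = dict(additional_config.get("eplb_config", {}))
    for key in eplb_keys:
        if key in additional_config:
            nested[key] = additional_config[key]
    if nested:
        migrated["eplb_config"] = nested
    return migrated


# Placeholder key names, for illustration only.
old = {"dynamic_eplb": True, "num_redundant_experts": 2, "other_option": 1}
new = nest_eplb_options(old, {"dynamic_eplb", "num_redundant_experts"})
```

The helper returns a new dictionary and leaves the input untouched, so the old config can be kept around for comparison while migrating.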
Known Issues
- If you occasionally hit the pickle error from the EngineCore process, please cherry-pick the PR into your local vLLM code. This known issue will be fixed in vLLM in the next release.
New Contributors
- @zhanzy178 made their first contribution in #4587
- @jiazhengyi made their first contribution in #5251
- @Fager10086 made their first contribution in #5458
- @ZCG12345 made their first contribution in #5271
- @hu-qi made their first contribution in #5257
- @chuyuelin made their first contribution in #3833
- @L4-1024 made their first contribution in #2920
- @zhangmuzhibangde made their first contribution in #5415
- @frankie-ys made their first contribution in #5045
- @Debonex made their first contribution in #5516
- @starmountain1997 made their first contribution in #5371
- @wangyibo1005 made their first contribution in #5552
- @pacoxu made their first contribution in #5646
- @zyz111222 made their first contribution in #5556
- @wwwumr made their first contribution in #5711
- @icerain-alt made their first contribution in #4939
- @Feng-xiaosuo made their first contribution in #5624
- @gh924 made their first contribution in #5592
- @Rozwel-dx made their first contribution in #5555
- @taoyao1221 made their first contribution in #4467
- @Tflowers-0129 made their first contribution in #5776
- @aipaes made their first contribution in #5992
- @guanguan0308 made their first contribution in #5866
- @maxmgrdv made their first contribution in #5143
- @simplzyu made their first contribution in #5668
- @Mitchell-xiyunfeng made their first contribution in #6216
- @huangfeifei1995 made their first contribution in #6107
Full Changelog: v0.13.0rc1...v0.14.0rc1
v0.13.0rc2
This is the second release candidate of v0.13.0 for vLLM Ascend. In this rc release, we fixed lots of bugs and improved the performance of many models. Please follow the official doc to get started. Any feedback is welcome to help us to improve the final version of v0.13.0.
Highlights
We mainly focus on quality and performance improvement in this release. The spec decode, graph mode, context parallel and EPLB have been improved significantly. A lot of bugs have been fixed and the performance has been improved for DeepSeek3.1/3.2, Qwen3 Dense/MOE models.
Features
- Implemented the basic framework for the batch invariant feature. #5517
- Eagle spec decode feature now works with full graph mode. #5118
- The Context Parallel (PCP & DCP) feature is more stable now and works in most cases. Please try it out.
- The MTP and Eagle spec decode features now work in most cases, and we recommend using them.
- The EPLB feature is more stable now. Many bugs have been fixed. Mix placement works now. #6086
- Support kv nz feature for DeepSeek decode node in disagg-prefill scenario #3072
Model Support
- LongCat-Flash is supported now. #3833
- minimax_m2 is supported now. #5624
- Support for cross-attention and whisper models #5592
Performance
- Many custom ops and triton kernels were added in this release to speed up model performance, such as RejectSampler, MoeInitRoutingCustom, DispatchFFNCombine and so on.
- Improved the performance of Layerwise Connector #5303
Others
- Added basic support for Model Runner V2, the next generation of vLLM. It will be used by default in a future release. #5210
- Fixed a bug where zmq send/receive could fail #5503
- Supported full-graph mode with Qwen3-Next-MTP #5477
- Fix weight transpose in RL scenarios #5567
- Adapted SP to eagle3 #5562
- Context Parallel (PCP & DCP) supports mlapo #5672
- GLM4.6 supports MTP with fullgraph #5460
- Flashcomm2 now works with oshard generalized feature #4723
- Support setting tp=1 for the Eagle draft model #5804
- Flashcomm1 feature now works with qwen3-vl #5848
- Support fine-grained shared expert overlap #5962
Dependencies
- CANN is upgraded to 8.5.0
- torch-npu is upgraded to 2.8.0.post1. Please note that the post version will not be installed by default. Please install it by hand from pypi mirror.
- triton-ascend is upgraded to 3.2.0
Deprecation & Breaking Changes
- CPUOffloadingConnector is deprecated. We'll remove it in the next release. It'll be replaced by the CPUOffload feature from vLLM in the future.
- EPLB config options are moved to eplb_config in additional config. The old ones will be removed in the next release.
- ProfileExecuteDuration feature is deprecated. It's replaced by ObservabilityConfig from vLLM.
- The value of the VLLM_ASCEND_ENABLE_MLAPO env will be set to True by default in the next release. It'll be enabled in decode node by default. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False.
Known Issue
- The docker image for this release doesn't work by default, because torch-npu 2.8.0.post1 is installed in the docker image while vllm-ascend is compiled with torch-npu 2.8.0. You can either rebuild vllm-ascend with 2.8.0.post1 inside the container, or downgrade torch-npu to 2.8.0.
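To confirm whether a container is affected by the mismatch described above, a quick check such as the following may help. It is a sketch only: the helper names are hypothetical, and the version that vllm-ascend was compiled against is passed in by hand because this snippet does not assume any way to query it from the package.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_mismatch(installed: Optional[str], built_against: str) -> bool:
    """True when a package is installed but differs from the build-time version."""
    return installed is not None and installed != built_against


# Per the known issue: vllm-ascend in the image was compiled with torch-npu 2.8.0.
BUILT_AGAINST = "2.8.0"

found = installed_version("torch-npu")
if is_mismatch(found, BUILT_AGAINST):
    print(f"torch-npu {found} is installed, but vllm-ascend was built with {BUILT_AGAINST}")
```

If the check reports a mismatch, the two remedies are the ones listed in the note: rebuild vllm-ascend inside the container, or downgrade torch-npu to 2.8.0.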
New Contributors
- @zhanzy178 made their first contribution in #4587
- @jiazhengyi made their first contribution in #5251
- @Fager10086 made their first contribution in #5458
- @hu-qi made their first contribution in #5257
- @chuyuelin made their first contribution in #3833
- @L4-1024 made their first contribution in #2920
- @zhangmuzhibangde made their first contribution in #5415
- @frankie-ys made their first contribution in #5045
- @Debonex made their first contribution in #5516
- @wangyibo1005 made their first contribution in #5552
- @pacoxu made their first contribution in #5646
- @zyz111222 made their first contribution in #5556
- @wwwumr made their first contribution in #5711
- @icerain-alt made their first contribution in #4939
- @Feng-xiaosuo made their first contribution in #5624
- @gh924 made their first contribution in #5592
- @brandneway made their first contribution in #5848
- @ichaoren made their first contribution in #5827
Full Changelog: v0.13.0rc1...v0.13.0rc2