Releases: vllm-project/vllm-ascend
v0.13.0
This is the final release of v0.13.0 for vLLM Ascend. Please follow the official doc or the Chinese documentation (中文文档) to get started.
Highlights
Model Support
- DeepSeek-R1 & DeepSeek-V3.2: Improved DeepSeek-V3.2 with MTP support, performance optimizations, and async scheduling enhancements. #3631 #3900 #3908 #4191 #4805
- Qwen3-Next: Full support for Qwen3-Next series including 80B-A3B-Instruct with full graph mode, MTP, quantization (W8A8), NZ optimization, and chunked prefill. Fixed multiple accuracy and stability issues. #3450 #3572 #3428 #3918 #4058 #4245 #4070 #4477 #4770
- InternVL: Added support for InternVL models with comprehensive e2e tests and accuracy evaluation. #3796 #3964
- LongCat-Flash: Added support for the LongCat-Flash model. #3833
- minimax_m2: Added support for the minimax_m2 model. #5624
- Whisper & Cross-Attention: Added support for cross-attention and Whisper models. #5592
- Pooling Models: Added support for pooling models with PCP adaptation and fixed multiple pooling-related bugs. #3122 #4143 #6056 #6057 #6146
- PanguUltraMoE: Added support for the PanguUltraMoE model. #4615
Core Features
- Context Parallel (PCP/DCP): [Experimental] Added comprehensive support for Prefill Context Parallel (PCP) and Decode Context Parallel (DCP) with ACLGraph, MTP, chunked prefill, MLAPO, and Mooncake connector integration. This is an experimental feature; feedback is welcome. #3260 #3731 #3801 #3980 #4066 #4098 #4183 #5672
- Full Graph Mode (ACLGraph): Enhanced full graph mode with GQA support, memory optimizations, unified logic between ACLGraph and Torchair, and improved stability. #3560 #3970 #3812 #3879 #3888 #3894 #5118
- Multi-Token Prediction (MTP): Significantly improved MTP support with chunked prefill for DeepSeek, quantization support, full graph mode, PCP/DCP integration, and async scheduling. MTP now works in most cases and is recommended for use. #2711 #2713 #3620 #3845 #3910 #3915 #4102 #4111 #4770 #5477
- Eagle Speculative Decoding: Eagle spec decode now works with full graph mode and is more stable. #5118 #4893 #5804
- PD Disaggregation: Set ADXL engine as default backend for disaggregated prefill with improved performance and stability. Added support for KV NZ feature for DeepSeek decode node. #3761 #3950 #5008 #3072
- KV Pool & Mooncake: Enhanced KV pool with Mooncake connector support for PCP/DCP, multiple input suffixes, and improved performance of Layerwise Connector. #3690 #3752 #3849 #4183 #5303
- EPLB (Expert Parallel Load Balancing): EPLB is now more stable with many bug fixes. Mix placement now works. #6086
- Full Decode Only Mode: Added support for Qwen3-Next and DeepSeek-V3.2 in full_decode_only mode with bug fixes. #3949 #3986 #3763
- Model Runner V2: Added basic support for Model Runner V2, the next-generation model runner of vLLM. It will be used by default in future releases. #5210
Features
- W8A16 Quantization: Added new W8A16 quantization method support. #4541
- UCM Connector: Added UCMConnector for KV Cache Offloading. #4411
- Batch Invariant: Implemented basic framework for batch invariant feature. #5517
- Sampling: Enhanced sampling with async_scheduler and disable_padded_drafter_batch support in Eagle. #4893
Hardware and Operator Support
- Custom Operators: Added multiple custom operators.
- Operator Fusion: Added AddRmsnormQuant fusion pattern with SP support and inductor fusion for quantization. #5077 #4168
- MLA/SFA: Refactored SFA into MLA architecture for better maintainability. #3769
- FIA Operator: Adapted `npu_fused_infer_attention_score` with the flash decoding function. This attention operator is now available to optimize performance in small-batch-size scenarios; please refer to item 22 in the FAQs to enable it. #4025
- CANN 8.5 Support: Removed redundant CP variables now that the FIA operator is enabled for CANN 8.5. #6039
Performance
Many custom ops and Triton kernels were added in this release to speed up model performance:
- DeepSeek Performance: Improved performance for DeepSeek V3.2 by eliminating HD synchronization in async scheduling and optimizing memory usage for MTP. #4805 #2713
- Qwen3-Next Performance: Improved performance with Triton ops and optimizations. #5664 #5984 #5765
- FlashComm: Enhanced FlashComm v2 optimization with o_shared linear and communication domain fixes. #3232 #4188 #4458 #5848
- MoE Optimization...
v0.14.0rc1
This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow the official doc to get started. This release includes all the changes in v0.13.0rc2, so we list only the differences from v0.13.0rc2. If you are upgrading from v0.13.0rc1, please read both the v0.14.0rc1 and v0.13.0rc2 release notes.
Highlights
- 310P support is back now. In this release, only basic dense and VL models are supported, in eager mode only. We'll keep improving and maintaining the support for 310P. #5776
- Support compressed-tensors MoE W8A8-INT8 quantization. #5718
- Support Medusa speculative decoding. #5668
- Support Eagle3 speculative decoding for Qwen3-VL. #4848
Features
- Xlite Backend supports Qwen3 MoE now. #5951
- Support DSA-CP for PD-mix deployment case. #5702
- Add support for the new W4A4_LAOS_DYNAMIC quantization method. #5143
Performance
- The performance of Qwen3-Next has been improved. #5664 #5984 #5765
- The CPU bind logic and performance have been improved. #5555
- Merge the Q/K split to simplify `AscendApplyRotaryEmb` for better performance. #5799
- Add a Matmul Allreduce Rmsnorm fusion pass. It's disabled by default; set `fuse_allreduce_rms=True` in `--additional-config` to enable it (see the sketch after this list). #5034
- Optimize rope embedding with a Triton kernel for a large performance gain. #5918
- Support advanced `apply_top_k_top_p` without the top_k constraint. #6098
- Parallelize Q/K/V padding in AscendMMEncoderAttention for better performance. #6204
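A minimal sketch of enabling the new fusion pass via `--additional-config`, per the note above; the model name and tensor-parallel size are illustrative placeholders, not a tested configuration:

```bash
# Hedged example: enable the Matmul Allreduce Rmsnorm fusion pass (#5034).
# Model and parallel settings are placeholders for illustration.
vllm serve Qwen/Qwen3-32B \
  --tensor-parallel-size 4 \
  --additional-config '{"fuse_allreduce_rms": true}'
```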
Others
- Model Runner V2 supports the Triton penalty kernel now. #5854
- Model Runner V2 supports Eagle spec decoding now. #5840
- Fix multi-modal inference OOM issues by setting `expandable_segments:True` by default. #5855
- `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default; it's enabled automatically on the decode node in the PD deployment case. Please note that this feature costs more memory. If you are memory sensitive, set it to `False` (see the sketch after this list). #5952
- SSL config can be set via `kv_extra_config` for PD deployment with the Mooncake layerwise connector. #5875
- Support `--max_model_len=auto`. #6193
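Two of the notes above translate directly into flags; a minimal sketch, with the model name as a placeholder:

```bash
# Opt out of MLAPO if you are memory sensitive (on by default per #5952).
export VLLM_ASCEND_ENABLE_MLAPO=0
# Use the new automatic max model length (#6193); model name is a placeholder.
vllm serve Qwen/Qwen3-8B --max_model_len=auto
```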
Dependencies
- torch-npu is upgraded to 2.9.0. #6112
Deprecation & Breaking Changes
- EPLB config options are moved to `eplb_config` in the additional config. The old options are removed in this release.
- The profiler envs, such as `VLLM_TORCH_PROFILER_DIR` and `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY`, do not work with vLLM Ascend anymore. Please use the vLLM `--profiler-config` parameter instead, as shown in the sketch below. #5928
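A hedged sketch of both migrations; the `eplb_config` sub-key and the `--profiler-config` field below are hypothetical placeholders, so check the vLLM Ascend and vLLM docs for the exact schemas:

```bash
# EPLB options now live under eplb_config in --additional-config;
# the sub-key shown is illustrative only.
vllm serve Qwen/Qwen3-8B --additional-config '{"eplb_config": {"enabled": true}}'

# The VLLM_TORCH_PROFILER_* envs no longer apply; use --profiler-config (#5928).
# The field name below is a hypothetical placeholder.
vllm serve Qwen/Qwen3-8B --profiler-config '{"torch_profiler_dir": "/tmp/vllm_profile"}'
```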
Known Issues
- If you sometimes hit a pickle error from the `EngineCore` process, please cherry-pick the PR into your local vLLM code. This known issue will be fixed in vLLM in the next release.
New Contributors
- @zhanzy178 made their first contribution in #4587
- @jiazhengyi made their first contribution in #5251
- @Fager10086 made their first contribution in #5458
- @ZCG12345 made their first contribution in #5271
- @hu-qi made their first contribution in #5257
- @chuyuelin made their first contribution in #3833
- @L4-1024 made their first contribution in #2920
- @zhangmuzhibangde made their first contribution in #5415
- @frankie-ys made their first contribution in #5045
- @Debonex made their first contribution in #5516
- @starmountain1997 made their first contribution in #5371
- @wangyibo1005 made their first contribution in #5552
- @pacoxu made their first contribution in #5646
- @zyz111222 made their first contribution in #5556
- @wwwumr made their first contribution in #5711
- @icerain-alt made their first contribution in #4939
- @Feng-xiaosuo made their first contribution in #5624
- @gh924 made their first contribution in #5592
- @Rozwel-dx made their first contribution in #5555
- @taoyao1221 made their first contribution in #4467
- @Tflowers-0129 made their first contribution in #5776
- @aipaes made their first contribution in #5992
- @guanguan0308 made their first contribution in #5866
- @maxmgrdv made their first contribution in #5143
- @simplzyu made their first contribution in #5668
- @Mitchell-xiyunfeng made their first contribution in #6216
- @huangfeifei1995 made their first contribution in #6107
Full Changelog: v0.13.0rc1...v0.14.0rc1
v0.13.0rc2
This is the second release candidate of v0.13.0 for vLLM Ascend. In this rc release, we fixed lots of bugs and improved the performance of many models. Please follow the official doc to get started. Any feedback is welcome to help us improve the final version of v0.13.0.
Highlights
We mainly focused on quality and performance improvements in this release. Spec decode, graph mode, context parallel, and EPLB have been improved significantly. A lot of bugs have been fixed, and performance has been improved for DeepSeek 3.1/3.2 and Qwen3 dense/MoE models.
Features
- Implement a basic framework for the batch invariant feature. #5517
- Eagle spec decode feature now works with full graph mode. #5118
- The Context Parallel (PCP & DCP) feature is more stable now and works in most cases. Please try it out.
- The MTP and Eagle spec decode features now work in most cases, and we suggest using them.
- The EPLB feature is more stable now; many bugs have been fixed. Mix placement works now. #6086
- Support the KV NZ feature for the DeepSeek decode node in the disaggregated-prefill scenario. #3072
Model Support
- LongCat-Flash is supported now. #3833
- minimax_m2 is supported now. #5624
- Support for cross-attention and Whisper models. #5592
Performance
- Many custom ops and Triton kernels, such as `RejectSampler`, `MoeInitRoutingCustom`, and `DispatchFFNCombine`, are added in this release to speed up model performance.
- Improved the performance of the Layerwise Connector. #5303
Others
- Basic support for Model Runner V2, the next-generation model runner of vLLM. It will be used by default in a future release. #5210
- Fixed a bug where ZMQ send/receive may fail. #5503
- Support using full graph mode with Qwen3-Next-MTP. #5477
- Fix weight transpose in RL scenarios #5567
- Adapted SP to eagle3 #5562
- Context Parallel(PCP&DCP) support mlapo #5672
- GLM-4.6 supports MTP with full graph mode. #5460
- FlashComm2 now works with the oshard generalized feature. #4723
- Support setting tp=1 for the Eagle draft model. #5804
- The FlashComm1 feature now works with Qwen3-VL. #5848
- Support fine-grained shared expert overlap #5962
Dependencies
- CANN is upgraded to 8.5.0
- torch-npu is upgraded to 2.8.0.post1. Please note that the post version will not be installed by default; please install it manually from the PyPI mirror (see the install sketch after this list).
- triton-ascend is upgraded to 3.2.0
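Since the post version is not pulled in automatically, a minimal install sketch; the index URL is left to you, as the exact PyPI mirror hosting the package is not stated here:

```bash
# Pin the exact post release by hand; append "-i <your-pypi-mirror-url>"
# if the package is not available on your default index.
pip install torch-npu==2.8.0.post1
```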
Deprecation & Breaking Changes
- `CPUOffloadingConnector` is deprecated. We'll remove it in the next release; it'll be replaced by the CPUOffload feature from vLLM in the future.
- EPLB config options are moved to `eplb_config` in the additional config. The old options will be removed in the next release.
- The `ProfileExecuteDuration` feature is deprecated. It's replaced by `ObservabilityConfig` from vLLM.
- The value of the `VLLM_ASCEND_ENABLE_MLAPO` env will be set to `True` by default in the next release and will be enabled on the decode node by default. Please note that this feature costs more memory. If you are memory sensitive, please set it to `False`.
Known Issue
- We notice that the docker image for this release doesn't work by default. This is because torch-npu 2.8.0.post1 is installed in the docker image, but vllm-ascend is compiled with torch-npu 2.8.0. You can either rebuild vllm-ascend against 2.8.0.post1 inside the container, or downgrade torch-npu to 2.8.0.
New Contributors
- @zhanzy178 made their first contribution in #4587
- @jiazhengyi made their first contribution in #5251
- @Fager10086 made their first contribution in #5458
- @hu-qi made their first contribution in #5257
- @chuyuelin made their first contribution in #3833
- @L4-1024 made their first contribution in #2920
- @zhangmuzhibangde made their first contribution in #5415
- @frankie-ys made their first contribution in #5045
- @Debonex made their first contribution in #5516
- @wangyibo1005 made their first contribution in #5552
- @pacoxu made their first contribution in #5646
- @zyz111222 made their first contribution in #5556
- @wwwumr made their first contribution in #5711
- @icerain-alt made their first contribution in #4939
- @Feng-xiaosuo made their first contribution in #5624
- @gh924 made their first contribution in #5592
- @brandneway made their first contribution in #5848
- @ichaoren made their first contribution in #5827
Full Changelog: v0.13.0rc1...v0.13.0rc2
v0.13.0rc1
This is the first release candidate of v0.13.0 for vLLM Ascend. We landed lots of bug fixes, performance improvements, and feature support in this release. Any feedback is welcome to help us improve vLLM Ascend. Please follow the official doc to get started.
Highlights
- Improved the performance of DeepSeek V3.2; please refer to the tutorials.
- Qwen3-Next MTP with chunked prefill is supported now (#4770); please refer to the tutorials.
- [Experimental] Prefill Context Parallel and Decode Context Parallel are supported. Note that this is an experimental feature for now; any feedback is welcome. Please refer to the context parallel feature guide.
Features
- Support openPangu Ultra MoE. #4615
- A new quantization method W8A16 is supported now. #4541
- Cross-machine Disaggregated Prefill is supported now. #5008
- Add UCMConnector for KV Cache Offloading. #4411
- Support `async_scheduler` and `disable_padded_drafter_batch` in Eagle. #4893
- Support PCP + MTP in full graph mode. #4572
- Enhance the all-reduce skipping logic for MoE models in NPUModelRunner. #5329
Performance
Some general performance improvement:
- Add l2norm triton kernel #4595
- Add a new pattern for AddRmsnormQuant with SP, which takes effect only in graph mode. #5077
- Add async exponential while model executing. #4501
- Remove the transpose step after attention and switch to transpose_batchmatmul #5390
- To optimize performance in small-batch-size scenarios, an attention operator with a flash decoding function is offered; please refer to item 22 in the FAQs to enable it.
Other
- The OOM error on VL models is fixed now. We'll keep observing it; if you hit an OOM problem again, please submit an issue. #5136
- Fixed an accuracy bug of Qwen3-Next-MTP during batched inference. #4932
- Fix npu-cpu offloading interface change bug. #5290
- Fix an MHA model runtime error in aclgraph mode. #5397
- Fix unsuitable `moe_comm_type` under the ep=1 scenario. #5388
Deprecation & Breaking Changes
- `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is removed; `VLLM_ASCEND_ENABLE_PREFETCH_MLP` is the recommended replacement, as they were always enabled together. #5272
- `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` is dropped now. #5270
- `VLLM_ASCEND_ENABLE_NZ` is disabled for the float weight case, since we noticed that performance is not good in some float cases. Feel free to set it to 2 if you are sure it works for your case. #4878
- `chunked_prefill_for_mla` in `additional_config` is dropped now. #5296
- `dump_config` in `additional_config` is renamed to `dump_config_path`, and its type is changed from `dict` to `string` (see the sketch below). #5296
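A minimal sketch of the renamed dump option under the new schema; the model name and path are placeholders:

```bash
# dump_config (dict) becomes dump_config_path (string) in --additional-config (#5296).
vllm serve Qwen/Qwen3-8B --additional-config '{"dump_config_path": "/tmp/dump_config.json"}'
```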
Dependencies
- The vLLM version has been upgraded to 0.13.0, and 0.12.0 support is dropped. #5146
- The Transformers version requirement has been upgraded to >= 4.57.3. #5250
Known Issues
- Qwen3-Next doesn't support long sequence scenarios, and you should limit `gpu-memory-utilization` according to the doc to run Qwen3-Next. We'll improve this in the next release.
- The functional break on Qwen3-Next when the input/output is around 3.5k/1.5k is fixed, but the fix introduces a performance regression. We'll fix it in the next release. #5357
- There is a precision issue with curl on ultra-short sequences in DeepSeek-V3.2. We'll fix it in the next release. #5370
New Contributors
- @ming1212 made their first contribution in #4607
- @knight0528 made their first contribution in #4902
- @UnifiedCacheManager made their first contribution in #4411
- @Toneymiller made their first contribution in #5053
- @JeffLee1874 made their first contribution in #4615
- @ader47 made their first contribution in #4953
- @YzTongNiar made their first contribution in #5115
- @TingW09 made their first contribution in #5086
- @ZT-AIA made their first contribution in #4788
- @yuxinshan made their first contribution in #5063
- @LICO1314 made their first contribution in #5142
- @yuxingcyx made their first contribution in #4830
- @hukongyi made their first contribution in #4141
- @luluxiu520 made their first contribution in #5167
- @YuhanBai made their first contribution in #4501
- @pisceskkk made their first contribution in #5183
- @OsirisDuan made their first contribution in #4304
- @LJQ142857 made their first contribution in #5228
- @lengrongfu made their first contribution in #5258
- @hzxuzhonghu made their first contribution in #4674
- @TmacAaron made their first contribution in #4541
- @chenaoxuan made their first contribution in #4443
- @changdawei1 made their first contribution in #5305
- @wjunLu made their first contribution in #5287
- @cookieyyds made their first contribution in #5373
- @maoxx241 made their first contribution in #5322
- @jiangkuaixue123 made their first contribution in #5435
Full Changelog: v0.12.0rc1...v0.13.0rc1
v0.11.0
We're excited to announce the release of v0.11.0 for vLLM Ascend. This is the official release for v0.11.0. Please follow the official doc to get started. We'll consider releasing a post version in the future if needed. This release note only contains the important changes and notes since v0.11.0rc3.
Highlights
- Improved the performance for DeepSeek 3/3.1. #3995
- Fixed the accuracy bug for Qwen3-VL. #4811
- Improved the performance of sampling. #4153
- Eagle3 is back now. #4721
Other
- Improved the performance for Kimi-K2. #4555
- Fixed a quantization bug for DeepSeek 3.2-exp. #4797
- Fixed a Qwen3-VL-MoE bug under high concurrency. #4658
- Fixed an accuracy bug for Prefill Decode disaggregation case. #4437
- Fixed some bugs for EPLB #4576 #4777
- Fixed the version incompatibility issue for openEuler docker image. #4745
Deprecation announcement
- The LLMDatadist connector has been deprecated; it'll be removed in v0.12.0rc1.
- Torchair graph has been deprecated; it'll be removed in v0.12.0rc1.
- The Ascend scheduler has been deprecated; it'll be removed in v0.12.0rc1.
Upgrade notice
- torch-npu is upgraded to 2.7.1.post1. Please note that the package is pushed to the PyPI mirror, so it's hard to add it as an automatic dependency. Please install it yourself.
- CANN is upgraded to 8.3.rc2.
Known Issues
- Qwen3-Next doesn't support the expert parallel and MTP features in this release, and it will OOM if the input is too long. We'll improve this in the next release.
- DeepSeek 3.2 only works with Torchair graph mode in this release. We'll make it work with aclgraph mode in the next release.
- Qwen2-audio doesn't work by default. A temporary solution is to set `--gpu-memory-utilization` to a suitable value, such as 0.8.
- The CPU bind feature doesn't work if more than one vLLM instance is running on the same node.
New Contributors
- @sunchendd made their first contribution in #4721
Full Changelog: v0.11.0rc3...v0.11.0
v0.12.0rc1
This is the first release candidate of v0.12.0 for vLLM Ascend. We landed lots of bug fixes, performance improvements, and feature support in this release. Any feedback is welcome to help us improve vLLM Ascend. Please follow the official doc to get started.
Highlights
- DeepSeek 3.2 is stable, and its performance is improved. In this release, you don't need to install any other packages. Follow the official tutorial to start using it.
- The async scheduler is more stable and ready to enable now. Please set `--async-scheduling` to enable it (see the example after this list).
- More new models, such as Qwen3-Omni, DeepSeek OCR, PaddleOCR, and OpenCUA, are supported now.
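A minimal example of the async scheduler flag from the note above; the model name is a placeholder:

```bash
# Enable the async scheduler; substitute any served model here.
vllm serve Qwen/Qwen3-8B --async-scheduling
```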
Core
- [Experimental] Full decode only graph mode is supported now. Although it is not enabled by default, we suggest enabling it with `--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'` in most cases (see the example command after this list). Let us know if you hit any error. We'll keep improving it and enable it by default in the next few releases.
- Lots of Triton kernels are added, improving the performance of vLLM Ascend, especially for Qwen3-Next and DeepSeek 3.2. Please note that Triton is not installed or enabled by default, but we suggest enabling it in most cases. You can download and install it manually from the package URL. If you're running vLLM Ascend on x86, you need to build triton-ascend yourself from source.
- Lots of Ascend ops are added to improve performance. This means that from this release, vLLM Ascend only works with custom ops built, so we removed the `COMPILE_CUSTOM_KERNELS` env; you cannot set it to 0 now.
- The speculative decode method `MTP` is more stable now. It can be enabled in most cases, and the decode token number can be 1, 2, or 3.
- The speculative decode method `suffix` is supported now. Thanks for the contribution from China Merchants Bank.
- The llm-compressor quantization tool with W8A8 works now. You can deploy a model with W8A8 quantization from this tool directly.
- W4A4 quantization works now.
- Support the FlashComm1 and FlashComm2 features from the FlashComm paper. #3004 #3334
- Pooling models, such as BGE and rerankers, are supported now.
- The official doc has been improved: we refactored the tutorials to make them clearer, and the user guide and developer guide are more complete now. We'll keep improving them.
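As referenced in the full decode only item above, a minimal command sketch; the flag value is taken verbatim from that note, while the model name is a placeholder:

```bash
# Enable the experimental full decode only graph mode.
vllm serve Qwen/Qwen3-8B \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
```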
Other
- [Experimental] Mooncake layerwise connector is supported now.
- [Experimental] KV cache pool feature is added
- [Experimental] A new graph mode `xlite` is introduced. It performs well with some models. Follow the official tutorial to start using it.
- The LLMDatadist KV connector is removed. Please use the Mooncake connector instead.
- The Ascend scheduler is removed. `--additional-config '{"ascend_scheduler": {"enabled": true}}'` doesn't work anymore.
- Torchair graph mode is removed. `--additional-config '{"torchair_graph_config": {"enabled": true}}'` doesn't work anymore. Please use aclgraph instead.
- The `VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION` env is removed. This feature is stable enough, so we enable it by default now.
- The speculative decode method `Ngram` is back now.
- The msprobe tool is added to help users check model accuracy. Please follow the official doc to get started.
- The msserviceprofiler tool is added to help users profile model performance. Please follow the official doc to get started.
Upgrade Note
- The vLLM Ascend self-maintained modeling files have been removed, along with the related Python entry point. Please uninstall the old version of vLLM Ascend in your env before upgrading.
- CANN is upgraded to 8.3.RC2; PyTorch and torch-npu are upgraded to 2.8.0. Don't forget to install them.
- Python 3.9 support is dropped to stay in line with vLLM v0.12.0.
Known Issues
- DeepSeek 3/3.1 and Qwen3 don't work with the FULL_DECODE_ONLY graph mode. We'll fix it in the next release. #4990
- Hunyuan OCR doesn't work. We'll fix it in the next release. #4989 #4992
- DeepSeek 3.2 doesn't work with chat templates. This is because vLLM v0.12.0 doesn't support it. We'll support it in the next v0.13.0rc1 version.
- DeepSeek 3.2 doesn't work with high concurrency in some cases. We'll fix it in the next release. #4996
- We notice that bf16/fp16 models don't perform well, mainly because `VLLM_ASCEND_ENABLE_NZ` is enabled by default. Please set `VLLM_ASCEND_ENABLE_NZ=0` to disable it (see the one-liner after this list). We'll add an auto-detection mechanism in the next release.
- The speculative decode method `suffix` doesn't work. We'll fix it in the next release. You can pick this commit to fix the issue: #5010
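Per the bf16/fp16 note above, the workaround as a one-liner:

```bash
# Disable the NZ weight format for bf16/fp16 models until auto-detection lands.
export VLLM_ASCEND_ENABLE_NZ=0
```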
New Contributors
- @huangdong2022 made their first contribution in #3205
- @kiscad made their first contribution in #3226
- @jiangyunfan1 made their first contribution in #3370
- @dsxsteven made their first contribution in #3381
- @elilzhu made their first contribution in #3426
- @yuzhup made their first contribution in #3203
- @DreamerLeader made their first contribution in #3476
- @yechao237 made their first contribution in #3473
- @leijie-ww made their first contribution in #3519
- @Anionex made their first contribution in #3311
- @drslark made their first contribution in #3549
- @KyrieDrewWang made their first contribution in #3490
- @HF-001 made their first contribution in #3433
- @yzy1996 made their first contribution in #3615
- @destinysky made their first contribution in #2888
- @HuaJiaHeng made their first contribution in #3676
- @lio1226 made their first contribution in #3434
- @yenuo26 made their first contribution in #3707
- @gcanlin made their first contribution in #3729
- @QilaiZhang made their first contribution in #3572
- @ck-hw-1018 made their first contribution in #3757
- @Meihan-chen made their first contribution in #3861
- @Liwx1014 made their first contribution in #3870
- @ForBetterCodeNine made their first contribution in #3937
- @Pz1116 made their first contribution in #3752
- @Liziqi-77 made their first contribution in #3690
- @herizhen made their first contribution in #4089
- @Apocalypse990923-qshi made their first contribution in #3801
- @thonean made their first contribution in #3756
- @845473182 made their first contribution in #4144
- @wangxiaochao6 made their first contribution in #4183
- @Delphine-Nic made their first contribution in #4209
- @InSec made their first contribution in #4245
- @Tjh-UKN made their first contribution in #4241
- @zjchenn made their first contribution in #4354
- @LHXuuu made their first contribution in #4036
- @ChenxiQ made their first contribution in #3804
- @swy20190 made their first contribution in #4550
- @fluctlux made their first contribution in #4045
- @coder-fny made their first contribution in #4529
- @MingYang119 made their first contribution in #4625
- @amy-why-3459 made their first contribution in #4176
- @h1074112368 made their...
v0.11.0rc3
This is the third release candidate of v0.11.0 for vLLM Ascend. For quality reasons, we released a new rc before the official release. Thanks for all your feedback. Please follow the official doc to get started.
Highlights
- torch-npu is upgraded to 2.7.1.post1. Please note that the package is pushed to the PyPI mirror, so it's hard to add it as an automatic dependency. Please install it yourself.
- Disable the NZ weight loader to speed up dense models. Please note that this is a temporary solution. If you find the performance becomes bad, please let us know; we'll keep improving it. #4495
- mooncake is installed in the official docker image now. You can use it directly in the container. #4506
Other
- Fix an OOM issue for MoE models. #4367
- Fix a hang issue of multimodal models when running with DP>1. #4393
- Fix some bugs for EPLB. #4416
- Fix a bug for the mtp>1 + lm_head_tp>1 case. #4360
- Fix an accuracy issue when running vLLM serve for a long time. #4117
- Fix a functional bug when running Qwen2.5-VL under high concurrency. #4553
Full Changelog: v0.11.0rc2...v0.11.0rc3
v0.11.0rc2
This is the second release candidate of v0.11.0 for vLLM Ascend. In this release, we solved many bugs to improve the quality. Thanks for all your feedback. We'll keep working on bug fixes and performance improvements. The v0.11.0 official release will come soon. Please follow the official doc to get started.
Highlights
- CANN is upgraded to 8.3.RC2. #4332
- Ngram spec decode method is back now. #4092
- The performance of aclgraph is improved by updating default capture size. #4205
Core
- Speed up vLLM startup time. #4099
- Kimi k2 with quantization works now. #4190
- Fix a bug for Qwen3-Next. It's more stable now. #4025
Other
- Fix an issue for full decode only mode. Full graph mode is more stable now. #4106 #4282
- Fix an allgather ops bug for DeepSeek V3 series models. #3711
- Fix some bugs for EPLB feature. #4150 #4334
- Fix a bug that vl model doesn't work on x86 machine. #4285
- Support ipv6 for prefill disaggregation proxy. Please note that mooncake connector doesn't work with ipv6. We're working on it. #4242
- Add a check to ensure EPLB supports only the W8A8 quantization method. #4315
- Add a check to ensure the FLASHCOMM feature is not used with VL models; it'll be supported in 2025 Q4. #4222
- The required audio library is installed in the container. #4324
Known Issues
- Ray + EP doesn't work; if you run vLLM Ascend with Ray, please disable expert parallelism. #4123
- The `response_format` parameter is not supported yet. We'll support it soon. #4175
- The CPU bind feature doesn't work for the multi-instance case (such as multiple DP on one node). We'll fix it in the next release.
Full Changelog: v0.11.0rc1...v0.11.0rc2
v0.11.0rc1
This is the first release candidate of v0.11.0 for vLLM Ascend. Please follow the official doc to get started.
v0.11.0 will be the next official release version of vLLM Ascend. We'll release it in the next few days. Any feedback is welcome to help us improve v0.11.0.
Highlights
- CANN is upgraded to 8.3.RC1, and torch-npu is upgraded to 2.7.1. #3945 #3896
- PrefixCache and Chunked Prefill are enabled by default. #3967
- W4A4 quantization is supported now. #3427 Official tutorial is available at here.
- The official documentation has now been switched to https://docs.vllm.ai/projects/ascend.
Core
- The performance of Qwen3 and DeepSeek V3 series models is improved.
- Mooncake layerwise connector is supported now #2602. Find tutorial here.
- MTP > 1 is supported now. #2708
- [Experimental] Graph mode `FULL_DECODE_ONLY` is supported now, and `FULL` will land in the next few weeks. #2128
- Pooling models, such as bge-m3, are supported now. #3171
Other
- Refactored the MoE module to make it clearer and easier to understand; performance has improved in both quantized and non-quantized scenarios.
- Refactored the model register module to make it easier to maintain. We'll remove this module in Q4 2025. #3004
- The LLMDatadist KV Connector is deprecated. We'll remove it in Q1 2026.
- Refactored the linear module to support the FlashComm1 and FlashComm2 features from the FlashComm paper. #3004 #3334
Known issue
- With the PD disaggregation + fullgraph case, memory may leak and the service may get stuck after serving for a long time. This is a bug in torch-npu; we'll upgrade and fix it soon.
- The accuracy of Qwen2.5-VL is not very good with BF16 on the VideoBench data collection. This is a bug caused by CANN; we'll fix it soon.
- For long sequence input cases (>32k), there is sometimes no response, and KV cache usage grows higher. This is a bug in the vLLM scheduler; we are working on it. A temporary solution is to set `max-model-len` to a suitable value (the workarounds in this list are combined into a sketch below).
- Qwen2-audio doesn't work by default; we're fixing it. A temporary solution is to set `--gpu-memory-utilization` to a suitable value, such as 0.8.
- When running Qwen3-Next with expert parallel enabled, please set the `HCCL_BUFFSIZE` environment variable to a suitable value, such as 1024.
- The accuracy of DeepSeek 3.2 with aclgraph is not correct. A temporary solution is to set `cudagraph_capture_sizes` to a suitable value depending on the batch size of the input.
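The workarounds above, combined into one hedged sketch; all values are illustrative, and the model name is a placeholder:

```bash
# Qwen3-Next with expert parallel: enlarge the HCCL buffer.
export HCCL_BUFFSIZE=1024
# Cap max-model-len for the long-sequence scheduler bug, and lower
# gpu-memory-utilization for the Qwen2-audio issue.
vllm serve Qwen/Qwen3-8B --max-model-len 32768 --gpu-memory-utilization 0.8
```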
New Contributors
- @huangdong2022 made their first contribution in #3205
- @kiscad made their first contribution in #3226
- @dsxsteven made their first contribution in #3381
- @elilzhu made their first contribution in #3426
- @yuzhup made their first contribution in #3203
- @DreamerLeader made their first contribution in #3476
- @yechao237 made their first contribution in #3473
- @leijie-cn made their first contribution in #3519
- @Anionex made their first contribution in #3311
- @Semmer2 made their first contribution in #4041
Full Changelog: v0.11.0rc0...v0.11.0rc1
v0.11.0rc0
This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- DeepSeek V3.2 is supported now. #3270 Please follow the official guide to give it a try.
- Qwen3-vl is supported now. #3103
Core
- DeepSeek works with aclgraph now. #2707
- MTP works with aclgraph now. #2932
- EPLB is supported now. #2956
- The Mooncake store KV cache connector is supported now. #2913
- CPU offload connector is supported now. #1659
Other
- Qwen3-Next is stable now. #3007
- Fixed a lot of bugs introduced in v0.10.2 by Qwen3-next. #2964 #2781 #3070 #3113
- The LoRA feature is back now. #3044
- Eagle3 spec decode method is back now. #2949
New Contributors
- @offline893 made their first contribution in #2956
- @1Fire4 made their first contribution in #2869
- @jesse996 made their first contribution in #2796
- @Lucaskabela made their first contribution in #2969
- @qyqc731 made their first contribution in #2962
- @Mercykid-bash made their first contribution in #3042
- @MaoJianwei made their first contribution in #3116
- @booker123456 made their first contribution in #3071
- @Csrayz made their first contribution in #2372
- @Clorist33 made their first contribution in #3035
- @clrs97 made their first contribution in #2931
- @zzhx1 made their first contribution in #3027
- @mfyCn-1204 made their first contribution in #3123
- @dragondream-chen made their first contribution in #3132
- @florenceCH made their first contribution in #3126
- @slippersss made their first contribution in #3153
- @socrahow made their first contribution in #3151
Full Changelog: v0.10.2rc1...v0.11.0rc0