Releases · vllm-project/vllm-ascend

27 Dec 10:50

v0.13.0rc1

1b5d5ab

v0.13.0rc1 Pre-release

This is the first release candidate of v0.13.0 for vLLM Ascend. This release includes many bug fixes, performance improvements, and new features. Feedback is welcome and helps us improve vLLM Ascend. Please follow the official doc to get started.

Highlights

Improved the performance of DeepSeek V3.2. Please refer to the tutorials.
Qwen3-Next MTP with chunked prefill is supported now. #4770 Please refer to the tutorials.
[Experimental] Prefill Context Parallel and Decode Context Parallel are supported. Note that this is still an experimental feature, and any feedback is welcome. Please refer to the context parallel feature guide.

Features

Support openPangu Ultra MoE. #4615
A new quantization method W8A16 is supported now. #4541
Cross-machine Disaggregated Prefill is supported now. #5008
Add UCMConnector for KV Cache Offloading. #4411
Support async_scheduler and disable_padded_drafter_batch in eagle. #4893
Support pcp + mtp in full graph mode. #4572
Enhance all-reduce skipping logic for MoE models in NPUModelRunner #5329

Performance

General performance improvements:

Add l2norm triton kernel #4595
Add a new pattern for AddRmsnormQuant with SP, which only takes effect in graph mode. #5077
Add async exponential while the model is executing. #4501
Remove the transpose step after attention and switch to transpose_batchmatmul #5390
To improve performance in small-batch scenarios, an attention operator with flash decoding is provided. Please refer to item 22 in the FAQs to enable it.

Other

The OOM error on VL models is fixed now. We are still monitoring it. If you hit an OOM problem again, please submit an issue. #5136
Fixed an accuracy bug of Qwen3-Next-MTP during batched inference. #4932
Fix a bug caused by the npu-cpu offloading interface change. #5290
Fix the MHA model runtime error in aclgraph mode #5397
Fix the unsuitable moe_comm_type in the ep=1 scenario #5388

Deprecation & Breaking Changes

VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE is removed. Use VLLM_ASCEND_ENABLE_PREFETCH_MLP instead, since the two were always enabled together. #5272
VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP is dropped now. #5270
VLLM_ASCEND_ENABLE_NZ is now disabled for float weights, because performance was poor in some float cases. Set it to 2 if you have verified that it works for your case. #4878
chunked_prefill_for_mla in additional_config is dropped now. #5296
dump_config in additional_config is renamed to dump_config_path, and its type is changed from dict to string. #5296
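
The removals listed above can be caught before an upgrade with a small check. The following is a minimal sketch, not part of vLLM Ascend: the function name `find_removed_settings` and the hint wording are hypothetical, and only the option names come from the list above.

```python
# Option names are taken from the v0.13.0rc1 notes; everything else is illustrative.
REMOVED_ENV_VARS = {
    "VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE": "use VLLM_ASCEND_ENABLE_PREFETCH_MLP instead",
    "VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP": "dropped, no replacement listed",
}
REMOVED_CONFIG_KEYS = {
    "chunked_prefill_for_mla": "dropped, no replacement listed",
    "dump_config": "renamed to dump_config_path (now a string, not a dict)",
}


def find_removed_settings(env, additional_config):
    """Return one warning string per setting that v0.13.0rc1 no longer accepts."""
    warnings = []
    for name, hint in REMOVED_ENV_VARS.items():
        if name in env:
            warnings.append(f"env {name}: {hint}")
    for key, hint in REMOVED_CONFIG_KEYS.items():
        if key in additional_config:
            warnings.append(f"additional_config[{key!r}]: {hint}")
    return warnings


# Hypothetical settings carried over from an older deployment.
old_env = {"VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE": "1"}
old_config = {"dump_config": {}}
for line in find_removed_settings(old_env, old_config):
    print(line)
```

Passing `os.environ` as `env` checks the current shell environment instead of the sample dict.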

Dependencies

vLLM is upgraded to 0.13.0, and 0.12.0 support is dropped. #5146
Transformers is upgraded to >= 4.57.3. #5250

Known Issues

Qwen3-Next doesn't support long-sequence scenarios. To run Qwen3-Next, limit gpu-memory-utilization as described in the doc. We'll improve this in the next release.
The functional break on Qwen3-Next when the input/output is around 3.5k/1.5k is fixed, but the fix introduces a performance regression. We'll fix it in the next release. #5357
There is a precision issue with curl on ultra-short sequences in DeepSeek-V3.2. We'll fix it in the next release. #5370

New Contributors

@ming1212 made their first contribution in #4607
@knight0528 made their first contribution in #4902
@UnifiedCacheManager made their first contribution in #4411
@Toneymiller made their first contribution in #5053
@JeffLee1874 made their first contribution in #4615
@ader47 made their first contribution in #4953
@YzTongNiar made their first contribution in #5115
@TingW09 made their first contribution in #5086
@ZT-AIA made their first contribution in #4788
@yuxinshan made their first contribution in #5063
@LICO1314 made their first contribution in #5142
@yuxingcyx made their first contribution in #4830
@hukongyi made their first contribution in #4141
@luluxiu520 made their first contribution in #5167
@YuhanBai made their first contribution in #4501
@pisceskkk made their first contribution in #5183
@OsirisDuan made their first contribution in #4304
@LJQ142857 made their first contribution in #5228
@lengrongfu made their first contribution in #5258
@hzxuzhonghu made their first contribution in #4674
@TmacAaron made their first contribution in #4541
@chenaoxuan made their first contribution in #4443
@changdawei1 made their first contribution in #5305
@wjunLu made their first contribution in #5287
@cookieyyds made their first contribution in #5373
@maoxx241 made their first contribution in #5322
@jiangkuaixue123 made their first contribution in #5435

Full Changelog: v0.12.0rc1...v0.13.0rc1

Contributors

hzxuzhonghu, lengrongfu, and 25 other contributors

16 Dec 09:43

wangxiyuan

v0.11.0

2f1aed9

v0.11.0

We're excited to announce the release of v0.11.0 for vLLM Ascend. This is the official release for v0.11.0. Please follow the official doc to get started. We'll consider publishing a post release in the future if needed. This release note only covers the important changes and notes since v0.11.0rc3.

Highlights

Improved the performance for deepseek 3/3.1. #3995
Fixed the accuracy bug for qwen3-vl. #4811
Improved the performance of sample. #4153
Eagle3 is back now. #4721

Other

Improved the performance for kimi-k2. #4555
Fixed a quantization bug for deepseek3.2-exp. #4797
Fixed qwen3-vl-moe bug under high concurrency. #4658
Fixed an accuracy bug for Prefill Decode disaggregation case. #4437
Fixed some bugs for EPLB #4576 #4777
Fixed the version incompatibility issue for openEuler docker image. #4745

Deprecation announcement

The LLMdatadist connector is deprecated and will be removed in v0.12.0rc1.
The Torchair graph is deprecated and will be removed in v0.12.0rc1.
The Ascend scheduler is deprecated and will be removed in v0.12.0rc1.

Upgrade notice

torch-npu is upgraded to 2.7.1.post1. Note that the package is published to a PyPI mirror, so it is hard to add it as an automatic dependency. Please install it manually.
CANN is upgraded to 8.3.rc2.

Known Issues

Qwen3-Next doesn't support the expert parallel and MTP features in this release. It also runs out of memory (OOM) if the input is too long. We'll improve this in the next release.
Deepseek 3.2 only works with torchair graph mode in this release. We'll make it work with aclgraph mode in the next release.
Qwen2-audio doesn't work by default. As a temporary workaround, set --gpu-memory-utilization to a suitable value, such as 0.8.
CPU bind feature doesn't work if more than one vLLM instance is running on the same node.

New Contributors

@sunchendd made their first contribution in #4721

Full Changelog: v0.11.0rc3...v0.11.0

Contributors

sunchendd

13 Dec 14:10

wangxiyuan

v0.12.0rc1

42ceaf0

v0.12.0rc1 Pre-release

This is the first release candidate of v0.12.0 for vLLM Ascend. This release includes many bug fixes, performance improvements, and new features. Feedback is welcome and helps us improve vLLM Ascend. Please follow the official doc to get started.

Highlights

DeepSeek 3.2 is stable and its performance is improved. In this release, you no longer need to install any other packages. Follow the official tutorial to start using it.
Async scheduler is more stable and ready to be enabled now. Set --async-scheduling to enable it.
More new models, such as Qwen3-omni, DeepSeek OCR, PaddleOCR, and OpenCUA, are supported now.

Core

[Experimental] Full decode only graph mode is supported now. It is not enabled by default, but we suggest enabling it with --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' in most cases. Let us know if you hit any errors. We'll keep improving it and plan to enable it by default in the next few releases.
Many triton kernels are added. The performance of vLLM Ascend, especially for Qwen3-Next and DeepSeek 3.2, is improved. Note that triton is not installed or enabled by default, but we suggest enabling it in most cases. You can download and install it manually from the package url. If you're running vLLM Ascend on x86, you need to build triton ascend from source yourself.
Many Ascend ops are added to improve performance. As a result, from this release on vLLM Ascend only works with the custom ops built, so we removed the COMPILE_CUSTOM_KERNELS env. It can no longer be set to 0.
The MTP speculative decoding method is more stable now. It can be enabled in most cases, and the number of decode tokens can be 1, 2, or 3.
The suffix speculative decoding method is supported now. Thanks to China Merchants Bank for the contribution.
The llm-compressor quantization tool works with W8A8 now. You can deploy a model quantized to W8A8 by this tool directly.
W4A4 quantization works now.
Support the flashcomm1 and flashcomm2 features from the flashcomm paper. #3004 #3334
Pooling models, such as bge and reranker models, are supported now.
The official doc has been improved. We refactored the tutorial to make it clearer. The user guide and developer guide are more complete now. We'll keep improving them.
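
The `--compilation-config` value above is JSON passed on the command line, where quoting and brace mistakes are easy to make. A minimal sketch of one way to build the arguments, assuming the server is launched with `vllm serve`; the helper name `build_serve_args` and the model path are placeholders, not something defined by these notes.

```python
import json
import shlex


def build_serve_args(model, full_decode_only=True, async_scheduling=True):
    """Build an argv list for `vllm serve` using flags named in these notes."""
    args = ["vllm", "serve", model]
    if full_decode_only:
        # json.dumps guarantees well-formed JSON (balanced braces, double quotes).
        config = json.dumps({"cudagraph_mode": "FULL_DECODE_ONLY"})
        args += ["--compilation-config", config]
    if async_scheduling:
        args.append("--async-scheduling")
    return args


# "/path/to/model" is a placeholder; substitute a real model name or path.
argv = build_serve_args("/path/to/model")
# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Keeping the arguments as a list also allows passing them to `subprocess.run` directly, which avoids shell quoting altogether.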

Other

[Experimental] Mooncake layerwise connector is supported now.
[Experimental] KV cache pool feature is added
[Experimental] A new graph mode, xlite, is introduced. It performs well with some models. Follow the official tutorial to start using it.
The LLMdatadist kv connector is removed. Please use the mooncake connector instead.
The Ascend scheduler is removed. --additional-config {"ascend_scheduler": {"enabled": true}} doesn't work anymore.
Torchair graph mode is removed. --additional-config {"torchair_graph_config": {"enabled": true}} doesn't work anymore. Please use aclgraph instead.
The VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION env is removed. This feature is stable enough, so it is enabled by default now.
The Ngram speculative decoding method is back now.
The msprobe tool is added to help users check model accuracy. Please follow the official doc to get started.
The msserviceprofiler tool is added to help users profile model performance. Please follow the official doc to get started.

Upgrade Note

The modeling files maintained by vLLM Ascend itself have been removed, along with the related python entrypoint. Please uninstall the old version of vLLM Ascend from your environment before upgrading.
CANN is upgraded to 8.3.RC2. PyTorch and torch-npu are upgraded to 2.8.0. Don't forget to install them.
Python 3.9 support is dropped to stay consistent with vLLM v0.12.0.

Known Issues

DeepSeek 3/3.1 and Qwen3 don't work with FULL_DECODE_ONLY graph mode. We'll fix it in the next release. #4990
Hunyuan OCR doesn't work. We'll fix it in the next release. #4989 #4992
DeepSeek 3.2 doesn't work with chat template. This is because vLLM v0.12.0 doesn't support it. We'll support it in the next version, v0.13.0rc1.
DeepSeek 3.2 doesn't work with high concurrency in some cases. We'll fix it in the next release. #4996
bf16/fp16 models don't perform well, mainly because VLLM_ASCEND_ENABLE_NZ is enabled by default. Please set VLLM_ASCEND_ENABLE_NZ=0 to disable it. We'll add an auto-detection mechanism in the next release.
The suffix speculative decoding method doesn't work. We'll fix it in the next release. You can cherry-pick this commit to fix the issue: #5010

New Contributors

@huangdong2022 made their first contribution in #3205
@kiscad made their first contribution in #3226
@jiangyunfan1 made their first contribution in #3370
@dsxsteven made their first contribution in #3381
@elilzhu made their first contribution in #3426
@yuzhup made their first contribution in #3203
@DreamerLeader made their first contribution in #3476
@yechao237 made their first contribution in #3473
@leijie-ww made their first contribution in #3519
@Anionex made their first contribution in #3311
@drslark made their first contribution in #3549
@KyrieDrewWang made their first contribution in #3490
@HF-001 made their first contribution in #3433
@yzy1996 made their first contribution in #3615
@destinysky made their first contribution in #2888
@HuaJiaHeng made their first contribution in #3676
@lio1226 made their first contribution in #3434
@yenuo26 made their first contribution in #3707
@gcanlin made their first contribution in #3729
@QilaiZhang made their first contribution in #3572
@ck-hw-1018 made their first contribution in #3757
@Meihan-chen made their first contribution in #3861
@Liwx1014 made their first contribution in #3870
@ForBetterCodeNine made their first contribution in #3937
@Pz1116 made their first contribution in #3752
@Liziqi-77 made their first contribution in #3690
@herizhen made their first contribution in #4089
@Apocalypse990923-qshi made their first contribution in #3801
@thonean made their first contribution in #3756
@845473182 made their first contribution in #4144
@wangxiaochao6 made their first contribution in #4183
@Delphine-Nic made their first contribution in #4209
@InSec made their first contribution in #4245
@Tjh-UKN made their first contribution in #4241
@zjchenn made their first contribution in #4354
@LHXuuu made their first contribution in #4036
@ChenxiQ made their first contribution in #3804
@swy20190 made their first contribution in #4550
@fluctlux made their first contribution in #4045
@coder-fny made their first contribution in #4529
@MingYang119 made their first contribution in #4625
@amy-why-3459 made their first contribution in #4176
@h1074112368 made their...

Contributors

kiscad, lulina, and 58 other contributors

03 Dec 03:54

wangxiyuan

v0.11.0rc3

b6d63bb

v0.11.0rc3 Pre-release

This is the third release candidate of v0.11.0 for vLLM Ascend. For quality reasons, we released a new rc before the official release. Thanks for all your feedback. Please follow the official doc to get started.

Highlights

torch-npu is upgraded to 2.7.1.post1. Note that the package is published to a PyPI mirror, so it is hard to add it as an automatic dependency. Please install it manually.
Disable the NZ weight loader to speed up dense models. Please note that this is a temporary solution. If you notice a performance regression, please let us know. We'll keep improving it. #4495
mooncake is now installed in the official docker image, so you can use it directly in the container. #4506

Other

Fix an OOM issue for moe models. #4367
Fix hang issue of multimodal model when running with DP>1 #4393
Fix some bugs for EPLB #4416
Fix bug for mtp>1 + lm_head_tp>1 case #4360
Fix an accuracy issue when running vLLM serve for a long time. #4117
Fix a functional bug when running qwen2.5 vl under high concurrency. #4553

Full Changelog: v0.11.0rc2...v0.11.0rc3

21 Nov 15:10

wangxiyuan

v0.11.0rc2

a2e4c3f

v0.11.0rc2 Pre-release

This is the second release candidate of v0.11.0 for vLLM Ascend. In this release, we fixed many bugs to improve quality. Thanks for all your feedback. We'll keep working on bug fixes and performance improvements. The official v0.11.0 release will come soon. Please follow the official doc to get started.

Highlights

CANN is upgraded to 8.3.RC2. #4332
Ngram spec decode method is back now. #4092
The performance of aclgraph is improved by updating default capture size. #4205

Core

Speed up vLLM startup time. #4099
Kimi k2 with quantization works now. #4190
Fix a bug for qwen3-next. It's more stable now. #4025

Other

Fix an issue for full decode only mode. Full graph mode is more stable now. #4106 #4282
Fix an allgather ops bug for DeepSeek V3 series models. #3711
Fix some bugs for EPLB feature. #4150 #4334
Fix a bug where VL models don't work on x86 machines. #4285
Support ipv6 for prefill disaggregation proxy. Please note that mooncake connector doesn't work with ipv6. We're working on it. #4242
Add a check to ensure that EPLB only supports the w8a8 method in the quantization case. #4315
Add a check to ensure that the FLASHCOMM feature is not used with VL models. Support is planned for 2025 Q4. #4222
The libraries required for audio are installed in the container. #4324

Known Issues

Ray + EP doesn't work. If you run vLLM Ascend with Ray, please disable expert parallelism. #4123
The response_format parameter is not supported yet. We'll support it soon. #4175
The CPU bind feature doesn't work in the multi-instance case (such as multiple DP instances on one node). We'll fix it in the next release.

Full Changelog: v0.11.0rc1...v0.11.0rc2

10 Nov 13:02

wangxiyuan

v0.11.0rc1

c5fe179

v0.11.0rc1 Pre-release

This is the first release candidate of v0.11.0 for vLLM Ascend. Please follow the official doc to get started.
v0.11.0 will be the next official release of vLLM Ascend, and we'll release it in the next few days. Feedback is welcome and helps us improve v0.11.0.

Highlights

CANN is upgraded to 8.3.RC1. Torch-npu is upgraded to 2.7.1. #3945 #3896
PrefixCache and Chunked Prefill are enabled by default. #3967
W4A4 quantization is supported now. #3427 The official tutorial is available here.
The official documentation has now been switched to https://docs.vllm.ai/projects/ascend.

Core

The performance of Qwen3 and Deepseek V3 series models is improved.
Mooncake layerwise connector is supported now #2602. Find tutorial here.
MTP > 1 is supported now. #2708
[Experimental] Graph mode FULL_DECODE_ONLY is supported now! And FULL will be landing in the next few weeks. #2128
Pooling models, such as bge-m3, are supported now. #3171

Other

Refactor the MOE module to make it clearer and easier to understand. Performance is improved in both quantized and non-quantized scenarios.
Refactor model register module to make it easier to maintain. We'll remove this module in Q4 2025. #3004
LLMDatadist KV Connector is deprecated. We'll remove it in Q1 2026.
Refactor the linear module to support features flashcomm1 and flashcomm2 in paper flashcomm #3004 #3334

Known issue

In the PD disaggregation + fullgraph case, memory may leak and the service may get stuck after serving for a long time. This is a torch-npu bug. We'll upgrade and fix it soon.
The accuracy of qwen2.5 VL is not very good with BF16 on the videobench dataset. This bug is caused by CANN, and we'll fix it soon.
For long sequence inputs (>32k), there is sometimes no response and the kv cache usage becomes higher. This is a bug in the vLLM scheduler, and we are working on it. As a temporary workaround, set max-model-len to a suitable value.
Qwen2-audio doesn't work by default, and we're fixing it. As a temporary workaround, set --gpu-memory-utilization to a suitable value, such as 0.8.
When running Qwen3-Next with expert parallel enabled, please set the HCCL_BUFFSIZE environment variable to a suitable value, such as 1024.
The accuracy of DeepSeek3.2 with aclgraph is not correct. As a temporary workaround, set cudagraph_capture_sizes to a suitable value depending on the batch size of the input.
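
Several of the workarounds above are either an environment variable or a command-line flag. A minimal sketch that collects them in one place, assuming the server is started with `vllm serve` from Python; the function name `workaround_launch` and its parameters are hypothetical, and the values 0.8 and 1024 are the examples given above rather than tuned recommendations.

```python
import os


def workaround_launch(model, qwen2_audio=False, qwen3_next_ep=False, max_model_len=None):
    """Return (argv, env) applying the temporary workarounds listed above."""
    argv = ["vllm", "serve", model]
    env = dict(os.environ)  # copy so the current process is not modified
    if qwen2_audio:
        # Qwen2-audio: lower the memory fraction, e.g. 0.8.
        argv += ["--gpu-memory-utilization", "0.8"]
    if qwen3_next_ep:
        # Qwen3-Next with expert parallel: set the HCCL buffer size, e.g. 1024.
        env["HCCL_BUFFSIZE"] = "1024"
    if max_model_len is not None:
        # Long inputs (>32k): cap the model length at a suitable value.
        argv += ["--max-model-len", str(max_model_len)]
    return argv, env


# The model path and the 32768 limit are placeholders for illustration only.
argv, env = workaround_launch("/path/to/model", qwen3_next_ep=True, max_model_len=32768)
print(argv, env["HCCL_BUFFSIZE"])
# To launch: subprocess.run(argv, env=env, check=True)
```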

New Contributors

@huangdong2022 made their first contribution in #3205
@kiscad made their first contribution in #3226
@dsxsteven made their first contribution in #3381
@elilzhu made their first contribution in #3426
@yuzhup made their first contribution in #3203
@DreamerLeader made their first contribution in #3476
@yechao237 made their first contribution in #3473
@leijie-cn made their first contribution in #3519
@Anionex made their first contribution in #3311
@Semmer2 made their first contribution in #4041

Full Changelog: v0.11.0rc0...v0.11.0rc1

Contributors

kiscad, Semmer2, and 8 other contributors

29 Sep 19:37

wangxiyuan

v0.11.0rc0

00ba071

v0.11.0rc0 Pre-release

This is a special release candidate of v0.11.0 for vLLM Ascend. Please follow the official doc to get started.

Highlights

DeepSeek V3.2 is supported now. #3270 Please follow the official guide to try it out.
Qwen3-vl is supported now. #3103

Core

DeepSeek works with aclgraph now. #2707
MTP works with aclgraph now. #2932
EPLB is supported now. #2956
Mooncake store kvcache connector is supported now. #2913
CPU offload connector is supported now. #1659

Other

Qwen3-next is stable now. #3007
Fixed a lot of bugs introduced in v0.10.2 by Qwen3-next. #2964 #2781 #3070 #3113
The LoRA feature is back now. #3044
Eagle3 spec decode method is back now. #2949

New Contributors

@offline893 made their first contribution in #2956
@1Fire4 made their first contribution in #2869
@jesse996 made their first contribution in #2796
@Lucaskabela made their first contribution in #2969
@qyqc731 made their first contribution in #2962
@Mercykid-bash made their first contribution in #3042
@MaoJianwei made their first contribution in #3116
@booker123456 made their first contribution in #3071
@Csrayz made their first contribution in #2372
@Clorist33 made their first contribution in #3035
@clrs97 made their first contribution in #2931
@zzhx1 made their first contribution in #3027
@mfyCn-1204 made their first contribution in #3123
@dragondream-chen made their first contribution in #3132
@florenceCH made their first contribution in #3126
@slippersss made their first contribution in #3153
@socrahow made their first contribution in #3151

Full Changelog: v0.10.2rc1...v0.11.0rc0

Contributors

MaoJianwei, jesse996, and 15 other contributors

15 Sep 17:22

wangxiyuan

v0.10.2rc1

048bfd5

v0.10.2rc1 Pre-release

This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the official doc to get started.

Highlights

Add support for Qwen3 Next. Please note that the expert parallel and MTP features don't work in this release. We'll make them work soon. Follow the official guide to get started. #2917
Add quantization support for aclgraph #2841

Core

Aclgraph now works with Ray backend. #2589
MTP now works with more than one token. #2708
Qwen2.5 VL now works with quantization. #2778
Improved the performance with async scheduler enabled. #2783
Fixed the performance regression with non-MLA models when using the default scheduler. #2894

Other

The performance of w8a8 quantization is improved. #2275
The performance of moe model is improved. #2689 #2842
Fixed the resource limit error when using speculative decoding together with aclgraph. #2472
Fixed the git config error in docker images. #2746
Fixed the sliding window attention bug with prefill. #2758
The official doc for Prefill Decode Disaggregation with Qwen3 is added. #2751
VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP env works again. #2740
A new improvement for oproj in deepseek is added. Set oproj_tensor_parallel_size to enable this feature. #2167
Fix a bug where deepseek with torchair doesn't work as expected when graph_batch_sizes is set. #2760
Avoid duplicate generation of sin_cos_cache in rope when kv_seqlen > 4k. #2744
The performance of Qwen3 dense model is improved with flashcomm_v1. Set VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1 and VLLM_ASCEND_ENABLE_FLASHCOMM=1 to enable it. #2779
The performance of Qwen3 dense model is improved with prefetch feature. Set VLLM_ASCEND_ENABLE_PREFETCH_MLP=1 to enable it. #2816
The performance of Qwen3 MoE model is improved with rope ops update. #2571
Fix the weight load error for RLHF case. #2756
Add warm_up_atb step to speed up the inference. #2823
Fixed the aclgraph stream error for moe model. #2827

Known issue

The server hangs when running Prefill Decode Disaggregation with different TP sizes for P and D. This is fixed by a vLLM commit that is not included in v0.10.2. You can cherry-pick this commit to fix the issue.
The HBM usage of Qwen3 Next is higher than expected. It's a known issue and we're working on it. You can set max_model_len and gpu_memory_utilization to suitable values based on your parallel config to avoid OOM errors.
LoRA doesn't work with this release due to the refactor of the kv cache. We'll fix it soon. #2941
Please do not enable chunked prefill with prefix cache when running with the Ascend scheduler. Performance is poor and accuracy is not correct. #2943

New Contributors

@WithHades made their first contribution in #2589
@vllm-ascend-ci made their first contribution in #2755
@1092626063 made their first contribution in #2708
@marcobarlo made their first contribution in #2039
@realliujiaxu made their first contribution in #2719
@machenglong2025 made their first contribution in #2805
@fffrog made their first contribution in #2815
@anon189Ty made their first contribution in #2619
@zhaozx-cn made their first contribution in #2787
@wenba0 made their first contribution in #2778
@wuweiqiang24 made their first contribution in #2814
@wyu0-0 made their first contribution in #2857
@nwpu-zxr made their first contribution in #2824

Full Changelog: v0.10.1rc1...v0.10.2rc1

Contributors

machenglong2025, WithHades, and 11 other contributors

04 Sep 03:30

MengqingCao

v0.10.1rc1

7e16b4a

v0.10.1rc1 Pre-release

This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the official doc to get started.

Highlights

LoRA performance is much improved by custom kernels contributed by China Merchants Bank. #2325
Support Mooncake TransferEngine for kv cache register and pull_blocks style disaggregate prefill implementation. #1568
Support capture custom ops into aclgraph now. #2113

Core

Add MLP tensor parallel to improve performance, but note that this will increase memory usage. #2120
openEuler is upgraded to 24.03. #2631
Add custom lmhead tensor parallel to achieve reduced memory consumption and improved TPOT performance. #2309
Qwen3 MoE/Qwen2.5 support torchair graph now. #2403
Support Sliding Window Attention with AscendScheduler, thus fixing the Gemma3 accuracy issue. #2528

Other

Bug fixes:
- Update the graph capture size calculation, which somewhat alleviates the problem of insufficient npu streams in some scenarios. #2511
- Fix bugs and refactor cached mask generation logic. #2442
- Fix the bug that the nz format does not work in quantization scenarios. #2549
- Fix accuracy issue on Qwen series caused by enabling enable_shared_expert_dp by default. #2457
- Fix accuracy issue on models whose rope dim is not equal to head dim, e.g., GLM4.5. #2601
Performance improved through many PRs:
- Remove torch.cat and replace it by List[0]. #2153
- Convert the format of gmm to nz. #2474
- Optimize parallel strategies to reduce communication overhead #2198
- Optimize reject sampler in greedy situation #2137
A batch of refactoring PRs to enhance the code architecture:
- Refactor on MLA. #2465
- Refactor on torchair fused_moe. #2438
- Refactor on allgather/mc2-related fused_experts. #2369
- Refactor on torchair model runner. #2208
- Refactor on CI. #2276
Parameters changes:
- Add lmhead_tensor_parallel_size in additional_config, set it to enable lmhead tensor parallel. #2309
- Some unused environment variables (HCCN_PATH, PROMPT_DEVICE_ID, DECODE_DEVICE_ID, LLMDATADIST_COMM_PORT and LLMDATADIST_SYNC_CACHE_WAIT_TIME) are removed. #2448
- Environment variable VLLM_LLMDD_RPC_PORT is renamed to VLLM_ASCEND_LLMDD_RPC_PORT now. #2450
- Add the environment variable VLLM_ASCEND_ENABLE_MLP_OPTIMIZE, which controls whether to enable the mlp optimization when tensor parallel is enabled. This feature gives better performance in eager mode. #2120
- Remove the environment variables MOE_ALL2ALL_BUFFER and VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ. #2612
- Add enable_prefetch in additional_config, which controls whether to enable weight prefetch. #2465
- Add mode in additional_config.torchair_graph_config. It must be set when using reduce-overhead mode for torchair. #2461
- enable_shared_expert_dp in additional_config is disabled by default now. Enabling it is recommended when running inference with deepseek. #2457
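
A deployment script that still exports the old variable names can be adapted mechanically. A minimal sketch, assuming the environment is held in a plain dict; `migrate_env` is a hypothetical helper and the sample values are made up, while the variable names come from the parameter changes above.

```python
RENAMED = {"VLLM_LLMDD_RPC_PORT": "VLLM_ASCEND_LLMDD_RPC_PORT"}
REMOVED = {
    "HCCN_PATH", "PROMPT_DEVICE_ID", "DECODE_DEVICE_ID",
    "LLMDATADIST_COMM_PORT", "LLMDATADIST_SYNC_CACHE_WAIT_TIME",
    "MOE_ALL2ALL_BUFFER", "VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ",
}


def migrate_env(env):
    """Return a new dict with renamed variables moved and removed ones dropped."""
    migrated = {}
    for name, value in env.items():
        if name in REMOVED:
            continue
        new_name = RENAMED.get(name, name)
        # If the new name is already set explicitly, it wins over the old one.
        if new_name == name or new_name not in env:
            migrated[new_name] = value
    return migrated


# Sample values are placeholders for illustration only.
old = {"VLLM_LLMDD_RPC_PORT": "5557", "HCCN_PATH": "/tmp/hccn", "OTHER": "1"}
print(migrate_env(old))
```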

Known Issues

Sliding window attention does not currently support chunked prefill, so it can only be run with AscendScheduler enabled. #2729
There is a bug with creating mc2_mask when MultiStream is enabled. We'll fix it in the next release. #2681

New Contributors

@lidenghui1110 made their first contribution in #1917
@haojiangzheng made their first contribution in #1772
@QwertyJack made their first contribution in #2298
@LCAIZJ made their first contribution in #1568
@liuchenbing made their first contribution in #2325
@gameofdimension made their first contribution in #2407
@NicholasTao made their first contribution in #2403
@ZhaoJiangJiang made their first contribution in #2453
@s-jiayang made their first contribution in #2373
@NSDie made their first contribution in #2528
@panchao-hub made their first contribution in #2639
@zzy-ContiLearn made their first contribution in #2541
@baxingpiaochong made their first contribution in #2664

Full Changelog: v0.10.0rc1...v0.10.1rc1

Contributors

QwertyJack, NicholasTao, and 11 other contributors

03 Sep 10:05

wangxiyuan

v0.9.1

0740d10

v0.9.1

We are excited to announce the newest official release of vLLM Ascend. This release includes many new features, performance improvements, and bug fixes. We recommend that users upgrade from 0.7.3 to this version. Please always set VLLM_USE_V1=1 to use the V1 engine.

In this release, we added many enhancements for large scale expert parallel case. It's recommended to follow the official guide.

Please note that this release note will list all the important changes from last official release(v0.7.3)

Highlights

DeepSeek V3/R1 is supported with high quality and performance. MTP works with DeepSeek as well. Please refer to the multi-node tutorials and Large Scale Expert Parallelism.
Qwen series models work with graph mode now. It works by default with V1 Engine. Please refer to Qwen tutorials.
Disaggregated Prefilling is supported for the V1 Engine. Please refer to the Large Scale Expert Parallelism tutorials.
Automatic prefix caching and chunked prefill are supported.
Speculative decoding works with the Ngram and MTP methods.
MOE and dense w4a8 quantization are supported now. Please refer to the quantization guide.
Sleep Mode feature is supported for V1 engine. Please refer to Sleep mode tutorials.
Dynamic and Static EPLB support is added. This feature is still experimental.

Note

The following notes are especially for reference when upgrading from last final release (v0.7.3):

V0 Engine is not supported from this release. Please always set VLLM_USE_V1=1 to use V1 engine with vLLM Ascend.
Mindie Turbo is not needed with this release, and old versions of Mindie Turbo are not compatible. Please do not install it. All of its functions and enhancements are already included in vLLM Ascend. We'll consider adding it back in the future if needed.
Torch-npu is upgraded to 2.5.1.post1. CANN is upgraded to 8.2.RC1. Don't forget to upgrade them.

Core

The Ascend scheduler is added for the V1 engine. This scheduler is better suited to Ascend hardware.
Structured output feature works now on V1 Engine.
A batch of custom ops are added to improve the performance.

Changes

EPLB support for Qwen3-moe model. #2000
Fix the bug that MTP doesn't work well with Prefill Decode Disaggregation. #2610 #2554 #2531
Fix a few bugs to make sure Prefill Decode Disaggregation works well. #2538 #2509 #2502
Fix file not found error with shutil.rmtree in torchair mode. #2506

Known Issues

When running MoE models, Aclgraph mode only works with tensor parallel. DP/EP doesn't work in this release.
Pipeline parallelism is not supported in this release for V1 engine.
If you use w4a8 quantization with eager mode, please set VLLM_ASCEND_MLA_PARALLEL=1 to avoid oom error.
Accuracy test with some tools may not be correct. It doesn't affect the real user case. We'll fix it in the next post release. #2654
We notice that there are still some problems when running vLLM Ascend with Prefill Decode Disaggregation. For example, memory may leak and the service may get stuck. These are caused by known issues in vLLM and vLLM Ascend. We'll fix them in the next post release. #2650 #2604 vLLM#22736 vLLM#23554 vLLM#23981
