Release v0.16.0rc1 · vllm-project/vllm-ascend

This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow the official doc to get started.

Highlights

Qwen3-Omni quantization adaptation and optimization is now available. #6828
GLM5-W8A8 quantization is now supported by parameterizing hardcoded MLA dimensions. #6902

Features

[Experimental] Support FabricMem Mode for ADXL/HIXL interconnect. #6806
Qwen3-Next now supports FlashComm1. #6830
NPUWorker Profiler now supports profile_prefix for better profiling experience. #6968
EPLB profiling now displays expert hotness comparison and the time required for EPLB adjustment. #6877 #7001
Xlite Qwen3 MoE now supports Data Parallel. #6715
Mooncake Layerwise Connector now supports kv_pool. #7032
Eagle3 now supports QuaRot quantization without embedding. #7038

Hardware and Operator Support

310P now supports w8a8sc quantization method. #7075
Added AscendC casual_conv1d_fn operator for Qwen3-Next. #6661
Added Ascend Ops recurrent_gated_delta_rule operator. #6725
Added GMM custom operator for MoE models. #7010

Performance

Faster convolution computation improves TTFT by 0.95% and throughput by 0.59% for Qwen3-VL models. #7017
Optimize split_qkv_rmsnorm_rope operator. #6827
Implement global CPU slicing and improve IRQ binding for Ascend NPUs, ensuring non-overlapping CPU partitions and better resource management. #6945
Optimize MTP execution by reordering state update operation. #6844
Avoid CPU sync in mrope_positions copy by using full tensor copy. #7014
Remove H2D synchronization for expert_map in MoE models. #7000

Dependencies

CANN is upgraded to 8.5.1. If you are not using the official image, remember to upgrade it manually. #6897

Deprecation & Breaking Changes

enable_flash_comm_v1 config option has been renamed back to enable_sp. #6883
Auto-detection of the quantization format from model files has been reverted. In v0.16.0rc1, you still need to pass --quantization ascend to serve a model quantized by modelslim. Auto-detection will be added back in the next version, once the bug with the remote model ID is fixed. #6873
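
As a minimal sketch of the explicit flag described above, the following Python helper assembles a `vllm serve` command line that always includes `--quantization ascend`. The helper name `build_serve_command` and the model path are hypothetical and used only for illustration; the only parts taken from these notes are the `--quantization ascend` flag and the fact that it must be passed explicitly in this release.

```python
import shlex
from typing import List, Sequence


def build_serve_command(model: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Return an argv list for `vllm serve` with the quantization flag set.

    `build_serve_command` is a hypothetical helper, not part of vLLM Ascend.
    """
    # In v0.16.0rc1 the quantization format is not auto-detected,
    # so the flag is always added explicitly.
    cmd = ["vllm", "serve", model, "--quantization", "ascend"]
    cmd.extend(extra_args)
    return cmd


if __name__ == "__main__":
    # The model path below is a placeholder, not a real checkpoint.
    print(shlex.join(build_serve_command("/path/to/quantized-model")))
```

Any additional serving options can be passed through `extra_args` and are appended after the quantization flag.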

Documentation

Added user/developer guide for CPU binding. #7045
Added metrics usage documentation and example. #6962
Added llms.txt for LLM discovery. #6886
Added GLM4.x multi-node deploy tutorial. #6872
Added explanation of the 310P-specific parameter max-model-len. #7065

Others

Fix openEuler Dockerfile error. #6871
Many bug fixes including:
- Fix Eagle speculative decoding with Context Parallel enabled. #6981 #7079
- Fix LoRA accuracy issue introduced by upstream vLLM changes. #6958
- Fix streaming content-type in load balance proxy server. #6985
- Fix metadata execute error: integer modulo by zero. #6521
- Fix triton rope_siso implementation bug. #7082
- Fix incorrect layer count for MTP models in update_aclgraph_sizes. #7064
- Fix compilation errors for CANN versions subsequent to b020. #7059
- Fix quant config support in GLM4.6V. #7062
- Fix parameter ordering bug in _merge_multimodal_embeddings. #7068
- Fix fused mc2 bug in EPLB. #6794
- Fix kernel block size for computing slot mapping. #7019
- Fix layerwise stacking MTP error in P/D disaggregation. #7036
- Fix RoPE dimension for npu_rotary_embedding. #6880
- Fix Qwen-Omni quantization bugs. #7042 #7007
- Fix GDN layer accuracy in graph mode. #6822
- Fix precision bugs for PCP/DCP in PD disaggregate. #6876
- Fix MTP in PD disaggregation with fullgraph support for all D-Nodes. #6948
- Fix GQA model error when enabling both DP and DCP. #7012
- Fix MTP prefill misclassified as decode edge case. #6835
- Fix Eagle3 acceptance rate for QuaRot quantized models. #6914
- Fix RoPE shape mismatch for MTP models with FlashComm V1 enabled. #6939
- Fix Qwen2.5VL accuracy issue. #6975
- Fix MoE forward error with static kernel enabled. #6964
- Fix muls_add fusion for GLM5 models. #6928
- Fix GDN layer detection for multimodal models. #6941
- Fix 300I unquant model weight nd2nz error. #6851
- Fix CPU binding logic. #6889
- Fix Eagle fullgraph shape capture. #6846

Known Issue

Currently, for DeepSeek V3.2, PCP and DCP do not yet work with the FlashComm1 feature; combining them may cause serve errors or other unknown errors.
In a 4-node A3 PD disaggregation deployment with DeepSeek V3.2, the P-Node may hang when benchmarking in high-concurrency scenarios, e.g., 2K/2K tokens with 512 concurrent requests.
MTP with large EP configurations may cause a graph capture buffer overflow. This is a bug that needs to be fixed in vLLM. As a workaround, explicitly set --compilation-config '{"max_cudagraph_capture_size": N}', where N = max_concurrency × (1 + num_speculative_tokens).
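
The arithmetic behind the MTP workaround can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the example values (64 concurrent requests, 3 speculative tokens) are assumptions chosen to show the calculation, not recommended settings.

```python
import json
import shlex


def capture_size(max_concurrency: int, num_speculative_tokens: int) -> int:
    """Compute N = max_concurrency * (1 + num_speculative_tokens)."""
    if max_concurrency < 1 or num_speculative_tokens < 0:
        raise ValueError("expected max_concurrency >= 1 and num_speculative_tokens >= 0")
    return max_concurrency * (1 + num_speculative_tokens)


def compilation_config_flag(max_concurrency: int, num_speculative_tokens: int) -> str:
    """Format the --compilation-config flag with a shell-quoted JSON value."""
    n = capture_size(max_concurrency, num_speculative_tokens)
    config = json.dumps({"max_cudagraph_capture_size": n})
    return "--compilation-config " + shlex.quote(config)


if __name__ == "__main__":
    # Illustrative values: 64 concurrent requests, 3 speculative tokens.
    print(compilation_config_flag(64, 3))
    # prints: --compilation-config '{"max_cudagraph_capture_size": 256}'
```

With these example values, N = 64 × (1 + 3) = 256, and the printed string can be appended to the serve command.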

New Contributors

@Eric-dot made their first contribution in #6670
@tanhaoan333 made their first contribution in #6828
@NJX-njx made their first contribution in #6965
@Zhujiyang2 made their first contribution in #6939
@songjianquan made their first contribution in #6740
@liuchen2026fly made their first contribution in #6928
@guleo made their first contribution in #6827
@wanghengkang made their first contribution in #6977
@xiaocongtou6 made their first contribution in #6940
@chenxi-hh made their first contribution in #7010
@wanghuanjun2113 made their first contribution in #7064
@banxiaduhuo made their first contribution in #6715
@xmpp777 made their first contribution in #6933
@s-zk made their first contribution in #6872
@ZRJ026 made their first contribution in #6985

Full Changelog: v0.15.0rc1...v0.16.0rc1
