v0.16.0rc1
Pre-release
This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- Qwen3-Omni quantization adaptation and optimization is now available. #6828
- GLM5-W8A8 quantization is now supported by parameterizing hardcoded MLA dimensions. #6902
Features
- [Experimental] Support FabricMem Mode for ADXL/HIXL interconnect. #6806
- Qwen3-Next now supports FlashComm1. #6830
- NPUWorker Profiler now supports profile_prefix for better profiling experience. #6968
- EPLB profiling now displays an expert hotness comparison and the time required for EPLB adjustment. #6877 #7001
- Xlite Qwen3 MoE now supports Data Parallel. #6715
- Mooncake Layerwise Connector now supports kv_pool. #7032
- Eagle3 now supports QuaRot quantization without embedding. #7038
Hardware and Operator Support
- 310P now supports w8a8sc quantization method. #7075
- Added AscendC causal_conv1d_fn operator for Qwen3-Next. #6661
- Added Ascend Ops recurrent_gated_delta_rule operator. #6725
- Added GMM custom operator for MoE models. #7010
Performance
- Faster convolution computation improves TTFT by 0.95% and throughput by 0.59% for Qwen3-VL models. #7017
- Optimize split_qkv_rmsnorm_rope operator. #6827
- Implement global CPU slicing and improve IRQ binding for Ascend NPUs, ensuring non-overlapping CPU partitions and better resource management. #6945
- Optimize MTP execution by reordering state update operation. #6844
- Avoid CPU sync in mrope_positions copy by using full tensor copy. #7014
- Remove H2D synchronization for expert_map in MoE models. #7000
Dependencies
- CANN is upgraded to 8.5.1. Please remember to upgrade it manually if you are not using the official image. #6897
Deprecation & Breaking Changes
- The enable_flash_comm_v1 config option has been renamed back to enable_sp. #6883
- Auto-detection of the quantization format from model files has been reverted. In v0.16.0rc1, you still need to add --quantization ascend to serve a model quantized by modelslim. Auto-detection will be added back in the next version after the bug with remote model IDs is fixed. #6873
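
Because the quantization format is no longer auto-detected, the flag has to be passed explicitly for modelslim-quantized models. The sketch below shows one way a launcher script might assemble the serve command. The helper name build_serve_command and the model path are hypothetical and only illustrate where --quantization ascend goes; they are not part of vLLM Ascend.

```python
import shlex
from typing import List


def build_serve_command(model: str, quantized_by_modelslim: bool) -> List[str]:
    """Assemble a `vllm serve` argument list (hypothetical helper).

    In v0.16.0rc1 the quantization format is not auto-detected, so a model
    quantized by modelslim needs `--quantization ascend` passed explicitly.
    """
    cmd = ["vllm", "serve", model]
    if quantized_by_modelslim:
        cmd += ["--quantization", "ascend"]
    return cmd


if __name__ == "__main__":
    # The model path is a placeholder, not a real checkpoint location.
    print(shlex.join(build_serve_command("/path/to/quantized-model", True)))
```

Keeping the flag behind a boolean makes it easy to drop once auto-detection returns in a later release.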
Documentation
- Added user/developer guide for CPU binding. #7045
- Added metrics usage documentation and example. #6962
- Added llms.txt for LLM discovery. #6886
- Added GLM4.x multi-node deploy tutorial. #6872
- Added an explanation of the 310P-specific parameter max-model-len. #7065
Others
- Fix openEuler Dockerfile error. #6871
- Many bug fixes including:
- Fix Eagle speculative decoding with Context Parallel enabled. #6981 #7079
- Fix LoRA accuracy issue introduced by upstream vLLM changes. #6958
- Fix streaming content-type in load balance proxy server. #6985
- Fix metadata execute error: integer modulo by zero. #6521
- Fix triton rope_siso implementation bug. #7082
- Fix incorrect layer count for MTP models in update_aclgraph_sizes. #7064
- Fix compilation errors for CANN versions subsequent to b020. #7059
- Fix quant config support in GLM4.6V. #7062
- Fix parameter ordering bug in _merge_multimodal_embeddings. #7068
- Fix fused mc2 bug in EPLB. #6794
- Fix kernel block size for computing slot mapping. #7019
- Fix layerwise stacking MTP error in P/D disaggregation. #7036
- Fix RoPE dimension for npu_rotary_embedding. #6880
- Fix Qwen-Omni quantization bugs. #7042 #7007
- Fix GDN layer accuracy in graph mode. #6822
- Fix precision bugs for PCP/DCP in PD disaggregation. #6876
- Fix MTP in PD disaggregation with fullgraph support for all D-Nodes. #6948
- Fix GQA model error when enabling both DP and DCP. #7012
- Fix MTP prefill misclassified as decode edge case. #6835
- Fix Eagle3 acceptance rate for QuaRot quantized models. #6914
- Fix RoPE shape mismatch for MTP models with FlashComm V1 enabled. #6939
- Fix Qwen2.5VL accuracy issue. #6975
- Fix MoE forward error with static kernel enabled. #6964
- Fix muls_add fusion for GLM5 models. #6928
- Fix GDN layer detection for multimodal models. #6941
- Fix 300I unquant model weight nd2nz error. #6851
- Fix CPU binding logic. #6889
- Fix Eagle fullgraph shape capture. #6846
Known Issue
- Currently, for DeepSeek V3.2, PCP and DCP do not yet work with the FlashComm1 feature. Combining them may cause serving errors or other unknown errors.
- In a 4-node A3 PD disaggregation deployment with DeepSeek V3.2, the P-Node may hang when benchmarking in high-concurrency scenarios, e.g., 2K/2K tokens with 512 concurrent requests.
- MTP with large EP configurations may cause a graph capture buffer overflow. This is a bug that needs to be fixed in vLLM. As a workaround, explicitly set --compilation-config '{"max_cudagraph_capture_size": N}', where N = max_concurrency × (1 + num_speculative_tokens).
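
The capture-size formula in the workaround above can be sketched as a small script. This is a minimal illustration, not part of vLLM Ascend: the helper names and the example values (512 concurrent requests, 1 speculative token) are assumptions chosen for demonstration.

```python
import json
import shlex
from typing import List


def max_capture_size(max_concurrency: int, num_speculative_tokens: int) -> int:
    """Return N = max_concurrency * (1 + num_speculative_tokens)."""
    if max_concurrency < 1 or num_speculative_tokens < 0:
        raise ValueError("expected max_concurrency >= 1 and num_speculative_tokens >= 0")
    return max_concurrency * (1 + num_speculative_tokens)


def compilation_config_args(max_concurrency: int, num_speculative_tokens: int) -> List[str]:
    """Build the --compilation-config arguments for the workaround (hypothetical helper)."""
    n = max_capture_size(max_concurrency, num_speculative_tokens)
    return ["--compilation-config", json.dumps({"max_cudagraph_capture_size": n})]


if __name__ == "__main__":
    # Example values only: 512 concurrent requests with 1 speculative token.
    print(shlex.join(compilation_config_args(512, 1)))
```

Each request can occupy one slot for its regular token plus one per speculative token, which is why the capture size scales with (1 + num_speculative_tokens).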
New Contributors
- @Eric-dot made their first contribution in #6670
- @tanhaoan333 made their first contribution in #6828
- @NJX-njx made their first contribution in #6965
- @Zhujiyang2 made their first contribution in #6939
- @songjianquan made their first contribution in #6740
- @liuchen2026fly made their first contribution in #6928
- @guleo made their first contribution in #6827
- @wanghengkang made their first contribution in #6977
- @xiaocongtou6 made their first contribution in #6940
- @chenxi-hh made their first contribution in #7010
- @wanghuanjun2113 made their first contribution in #7064
- @banxiaduhuo made their first contribution in #6715
- @xmpp777 made their first contribution in #6933
- @s-zk made their first contribution in #6872
- @ZRJ026 made their first contribution in #6985
Full Changelog: v0.15.0rc1...v0.16.0rc1