v0.16.0rc1

Pre-release

@MengqingCao MengqingCao released this 10 Mar 14:51
a78a00e

This is the first release candidate of v0.16.0 for vLLM Ascend. Please follow the official doc to get started.

Highlights

  • Qwen3-Omni quantization adaptation and optimization are now available. #6828
  • GLM5-W8A8 quantization is now supported by parameterizing hardcoded MLA dimensions. #6902

Features

  • [Experimental] Support FabricMem Mode for ADXL/HIXL interconnect. #6806
  • Qwen3-Next now supports FlashComm1. #6830
  • NPUWorker Profiler now supports profile_prefix for a better profiling experience. #6968
  • EPLB profiling now displays an expert hotness comparison and the time required for EPLB adjustment. #6877 #7001
  • Xlite Qwen3 MoE now supports Data Parallel. #6715
  • Mooncake Layerwise Connector now supports kv_pool. #7032
  • Eagle3 now supports QuaRot quantization without embedding. #7038

Hardware and Operator Support

  • 310P now supports w8a8sc quantization method. #7075
  • Added AscendC casual_conv1d_fn operator for Qwen3-Next. #6661
  • Added Ascend Ops recurrent_gated_delta_rule operator. #6725
  • Added GMM custom operator for MoE models. #7010

Performance

  • Faster convolution computation improves TTFT by 0.95% and throughput by 0.59% for Qwen3-VL models. #7017
  • Optimize split_qkv_rmsnorm_rope operator. #6827
  • Implement global CPU slicing and improve IRQ binding for Ascend NPUs, ensuring non-overlapping CPU partitions and better resource management. #6945
  • Optimize MTP execution by reordering state update operation. #6844
  • Avoid CPU sync in mrope_positions copy by using full tensor copy. #7014
  • Remove H2D synchronization for expert_map in MoE models. #7000

Dependencies

  • CANN is upgraded to 8.5.1. If you are not using the official image, remember to upgrade it manually. #6897

Deprecation & Breaking Changes

  • enable_flash_comm_v1 config option has been renamed back to enable_sp. #6883
  • Auto-detection of the quantization format from model files has been reverted. In v0.16.0rc1, you still need to pass --quantization ascend to serve a model quantized by modelslim. Auto-detection will be added back in the next version after the bug with the remote model ID is fixed. #6873
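
Until auto-detection returns, the quantization flag has to be passed explicitly. The following is a minimal sketch of assembling the serve command in Python, assuming the standard `vllm serve` entry point. The model path is a hypothetical placeholder, and `build_serve_command` is an illustrative helper, not part of vLLM Ascend.

```python
import shlex


def build_serve_command(model_path: str, quantization: str = "ascend") -> list[str]:
    """Return the argv for serving a modelslim-quantized model.

    `model_path` is whatever local directory holds the quantized weights;
    the value used below is only a placeholder.
    """
    return ["vllm", "serve", model_path, "--quantization", quantization]


# Hypothetical path, for illustration only.
cmd = build_serve_command("/path/to/quantized-model")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)` as-is; keeping it as a list avoids shell-quoting problems in the model path.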

Documentation

  • Added user/developer guide for CPU binding. #7045
  • Added metrics usage documentation and example. #6962
  • Added llms.txt for LLM discovery. #6886
  • Added GLM4.x multi-node deploy tutorial. #6872
  • Added an explanation of the 310P-specific parameter max-model-len. #7065

Others

  • Fix openEuler Dockerfile error. #6871
  • Many bug fixes including:
    • Fix Eagle speculative decoding with Context Parallel enabled. #6981 #7079
    • Fix LoRA accuracy issue introduced by upstream vLLM changes. #6958
    • Fix streaming content-type in load balance proxy server. #6985
    • Fix metadata execute error: integer modulo by zero. #6521
    • Fix triton rope_siso implementation bug. #7082
    • Fix incorrect layer count for MTP models in update_aclgraph_sizes. #7064
    • Fix compilation errors for CANN versions subsequent to b020. #7059
    • Fix quant config support in GLM4.6V. #7062
    • Fix parameter ordering bug in _merge_multimodal_embeddings. #7068
    • Fix fused mc2 bug in EPLB. #6794
    • Fix kernel block size for computing slot mapping. #7019
    • Fix layerwise stacking MTP error in P/D disaggregation. #7036
    • Fix RoPE dimension for npu_rotary_embedding. #6880
    • Fix Qwen-Omni quantization bugs. #7042 #7007
    • Fix GDN layer accuracy in graph mode. #6822
    • Fix precision bugs for PCP/DCP in PD disaggregation. #6876
    • Fix MTP in PD disaggregation with fullgraph support for all D-Nodes. #6948
    • Fix GQA model error when enabling both DP and DCP. #7012
    • Fix MTP prefill misclassified as decode edge case. #6835
    • Fix Eagle3 acceptance rate for QuaRot quantized models. #6914
    • Fix RoPE shape mismatch for MTP models with FlashComm V1 enabled. #6939
    • Fix Qwen2.5VL accuracy issue. #6975
    • Fix MoE forward error with static kernel enabled. #6964
    • Fix muls_add fusion for GLM5 models. #6928
    • Fix GDN layer detection for multimodal models. #6941
    • Fix 300I unquantized model weight nd2nz error. #6851
    • Fix CPU binding logic. #6889
    • Fix Eagle fullgraph shape capture. #6846

Known Issues

  • For DeepSeek V3.2, PCP and DCP do not yet work with the FlashComm1 feature. Enabling them together may cause serving errors or other unknown errors.
  • In a 4-node A3 PD disaggregation deployment with DeepSeek V3.2, the P-Node may hang when benchmarking in high-concurrency scenarios, e.g., 2K/2K tokens with 512 concurrent requests.
  • MTP with large EP configurations may cause a graph capture buffer overflow. This is a bug that needs to be fixed in vLLM. As a workaround, explicitly set --compilation-config '{"max_cudagraph_capture_size": N}', where N = max_concurrency × (1 + num_speculative_tokens).
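
The capture size for the MTP workaround above can be computed rather than worked out by hand. This is a minimal sketch: the helper names `capture_size` and `compilation_config_flag` are hypothetical, and the example values (256 concurrent requests, 3 speculative tokens) are illustrative rather than taken from a tested deployment.

```python
import json
import shlex


def capture_size(max_concurrency: int, num_speculative_tokens: int) -> int:
    """N = max_concurrency * (1 + num_speculative_tokens)."""
    if max_concurrency < 1 or num_speculative_tokens < 0:
        raise ValueError("expected max_concurrency >= 1 and num_speculative_tokens >= 0")
    return max_concurrency * (1 + num_speculative_tokens)


def compilation_config_flag(max_concurrency: int, num_speculative_tokens: int) -> str:
    """Build the shell-quoted --compilation-config argument for the workaround."""
    n = capture_size(max_concurrency, num_speculative_tokens)
    payload = json.dumps({"max_cudagraph_capture_size": n})
    return f"--compilation-config {shlex.quote(payload)}"


# Illustrative values only: 256 concurrent requests, 3 speculative tokens.
flag = compilation_config_flag(256, 3)
print(flag)
```

Each request contributes one token plus its speculative tokens per decode step, which is why the size grows with both factors.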

Full Changelog: v0.15.0rc1...v0.16.0rc1