v0.18.0

@zzzzwwjj zzzzwwjj released this 30 Apr 10:48
· 21 commits to releases/v0.18.0 since this release
e18643f

We're excited to announce v0.18.0 of vLLM Ascend. This is the official release of v0.18.0. Please follow the official documentation to get started.

Highlights

Model Support

  • Kimi-K2.x Model Support: [Experimental] Added support for Kimi-K2.x models. @aipaes @dragondream-chen @SparrowMu @LoganJane #6755
  • Minimax-m2.x Model Support: [Experimental] Added support for Minimax-m2.x models with eagle3. @SparrowMu @GDzhu01 #7105 #7714
  • GLM5 Support: [Experimental] Added support for GLM5 models without any code modification!
  • Qwen3.x Support: [Experimental] Added support for Qwen3.x models without any code modification!
  • DeepseekOCR Support: [Experimental] Added support for the DeepseekOCR model and optimized RelPosAttention and CustomQwen2Decoder. @Wangbei25 #7737

Core Features

  • EPLB (Expert Parallelism Load Balance): EPLB is now more stable thanks to many bug fixes, and it performs better. It works in most cases and is recommended for use. #6528 #7344 #7890 #6477
  • ACLGraph Enhancement: ACLGraph now supports capturing a single merged graph for multi-step drafts, which greatly reduces host-bound overhead in multi-step speculative decoding! #5553 #5940
  • KV Pooling: The KV pool with the Mooncake connector now supports sparse attention. LMCacheAscendConnector is added as a new KV cache pooling solution for Ascend. FabricMem mode is supported for the HIXL interconnect, yuanrong is supported as a backend for AscendStoreConnector, and MooncakeLayerwiseConnector can now be activated together with KV Pooling. Compared with previous versions, KV Pooling brings a large TTFT improvement! #6339 #6882 #6806 #6869 #7032
  • PD disaggregation: The Mooncake layerwise connector now supports the hybrid attention manager and the PCP feature. #7022 #6627
  • NPU Graph EX (npugraph_ex) Enabled by Default: The npugraph_ex feature is now enabled by default, providing better graph optimization with integrated inductor pass and MatmulAllReduceAddRMSNorm fusion. #6354 #6664 #6006
  • RL (Reinforcement Learning): [Experimental] RL is enhanced with a batch-invariant feature implemented with AscendC and Triton ops, and a routing replay feature is added. #6590 #6696
  • CPU Binding Enabled by Default: Enabled ARM-only CPU binding with global-slicing A3 policy, improving inference throughput in host-bound scenarios. #6686

Features

  • Prefix cache is now supported for hybrid models. #7103
  • Flash Comm V1 now supports VL models with MLA, removing a previous limitation for multimodal serving. #7390
  • VL MoE models now support SP, and sp_threshold is removed in favor of sp_min_token_num from vLLM. #7044
  • [Experimental] Pipeline Parallel now supports async scheduling, improving throughput for PP deployments. #7136
  • Eagle3 now supports QuaRot quantization without embedding. #7038
  • Refactored eagle3/mtp: eagle3 and mtp now use the same proposer. #6349 #7033

Hardware and Operator Support

  • First-time support for 310P, with major performance optimizations:
    • Support W8A8 quantization. #6641 #6454
    • Support weightNZ in both quantized and unquantized cases. #6705
    • Support W8A8SC quantization. #7075
    • Fix post-sampling not working in graph mode. #8077
    • Added addrmsnorm support for 300I DUO. #6704
    • Fix ngram graph replay accuracy error on 310P. #7134
  • Custom Operators: Added multiple custom operators including:
    • Added AscendC causal_conv1d_fn operator for Qwen3-Next. #6661
    • Added Ascend Ops recurrent_gated_delta_rule operator. #6725
    • Added GMM custom operator for MoE models. #7010
    • Optimized split_qkv_rmsnorm_rope operator. #6827
    • Triton rope now supports index_selecting from cos_sin_cache. #5450
    • Added AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer. #6366
    • Optimized DispatchFFNCombine kernel performance and resolved vector error caused by unaligned UB access. #6468 #6707
    • Refactored and optimized CausalConv1d. #7495

Performance

  • Initialization Performance: Optimized Triton operator recompilation to reduce redundant rebuilds and unnecessary recompilation triggered by function parameter optimization. #7647 #7645
  • Qwen3.x Performance: [Experimental] Optimized Qwen3.x and Qwen3-Next performance by supporting full graph mode, PD disaggregation, mamba prefill prefix-caching and flashcomm1, prebuilding chunk metadata to reduce host-device synchronization overhead, and optimizing multiple ops, including chunk_gated_delta_rule, chunk_fwd_kernel_o, solve_tril, recompute_w_u_fwd_kernel, split_qkv_rmsnorm_mrope, etc. @LoganJane @shaopeng-666 @ppppeng @SunnyLee151064 @hust17yixuan @Toneymiller @linfeng-yuan #7487 #6830 #7506 #7796 #7527 #7529 #7495 #7368
  • Kimi-K2.x Performance: [Experimental] Optimized Kimi-K2.x performance by supporting eagle3 and flashcomm1, and reducing d2h overhead. @aipaes @dragondream-chen @SparrowMu @LoganJane @GDzhu01 @Yaphets24 @hust17yixuan #7342 #7390 #7521
  • Qwen3-VL Performance: Qwen3-VL gets stronger multimodal operator enablement with Flash Comm V1 and qkv_rmsnorm_mrope support, 2.7x faster convolution computation with aclnn BatchMatMulV2, and support for EAGLE speculative decoding. #7893 #7852 #7017 #6327
  • Qwen3-Omni Performance: Qwen3-Omni quantization adaptation and optimization are now available. #6828
  • DeepSeek-V3.2/GLM5 Performance: Added performance optimizations, W8A8C8 quantization support, and optimized KV cache usage. @yydyzr @ZYang6263 @rjg-lyh @Nagisa125 #7029 #6610
  • GLM4.7-Flash Performance: Added W8A8 quantization support for GLM4.7-Flash. @aipaes #6492

Dependencies

${PYTHON_TAG} is the Python version tag, and ${ARCH} is the CPU architecture.

For example, with Python 3.11 on aarch64: ${PYTHON_TAG}=cp311, ${ARCH}=aarch64.
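
As a rough illustration, the two values can be derived from the running interpreter. This is a minimal sketch, not part of vLLM Ascend: the helper names `python_tag` and `current_tags` are hypothetical, and it assumes `platform.machine()` reports the architecture string in the form used above (on Linux it typically does, but other platforms may report a different spelling).

```python
import platform
import sys
from typing import Tuple


def python_tag(major: int, minor: int) -> str:
    """Build a CPython version tag such as 'cp311' from a version pair."""
    return f"cp{major}{minor}"


def current_tags() -> Tuple[str, str]:
    """Return (PYTHON_TAG, ARCH) for the interpreter running this script."""
    tag = python_tag(sys.version_info.major, sys.version_info.minor)
    # Assumption: the machine string matches the ${ARCH} value expected here.
    arch = platform.machine()
    return tag, arch


if __name__ == "__main__":
    tag, arch = current_tags()
    print(f"PYTHON_TAG={tag} ARCH={arch}")
```

Running the script on the target machine prints both values, which can then be substituted for ${PYTHON_TAG} and ${ARCH}.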

Deprecation & Breaking Changes

  • Cleaned up and deprecated ProfileExecuteDuration feature. #6461
  • Removed custom rotary_embedding operator. #6523
  • Cleaned up unused env USE_OPTIMIZED_MODEL. #6618
  • enable_flash_comm_v1 config option has been renamed back to enable_sp. #6883
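
For setups that still carry the old option name, the rename from enable_flash_comm_v1 to enable_sp could be handled with a small migration step. The sketch below is an assumption-laden illustration: it treats the configuration as a plain dict, and `migrate_sp_option` is a hypothetical helper, not a vLLM Ascend API. Where the option actually lives in a given deployment should be checked against the official documentation.

```python
from typing import Any, Dict

OLD_KEY = "enable_flash_comm_v1"  # name used before this release
NEW_KEY = "enable_sp"             # name restored in this release


def migrate_sp_option(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with the old key renamed to the new one.

    If both keys are present, the explicitly set new key wins and the
    old key is dropped.
    """
    migrated = dict(config)  # shallow copy, the input is left untouched
    if OLD_KEY in migrated:
        old_value = migrated.pop(OLD_KEY)
        migrated.setdefault(NEW_KEY, old_value)
    return migrated
```

A config without the old key passes through unchanged, so the helper is safe to apply unconditionally.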

Documentation

  • Add a new introduction for MiniMax-M2.5 and MiniMax-M2.7. #8169
  • Add preemption guidance in FAQs. #8136
  • Update the deployment and support documentation for GLM5, including parameter descriptions, best practices, and FAQs. #7963 #7909
  • Update the Qwen3.5 user guide. #7934
  • Update the document configuration for DeepSeek-V3.2. #7970
  • Clean up documentation wording and grammar. #8073
  • Refreshed deployment and model docs for Kimi-K2.5, GLM-4.7, DeepSeek-V3.2, MiniMax-M2.5, and PD disaggregation guides. #7371 #7403 #7292 #7296 #7300
  • Added user/developer guide for CPU binding. #7045
  • Added Memcache Usage Guide. #6476
  • Added Benchmark Tutorial for Suffix Speculative Decoding. #6323
  • Added npugraph_ex introduction documentation. #6306

Others

  • Add async communication checks for capturing mode. #8149
  • Fix KV Pool conflicts between pooling scenarios and fix missing KV cache placement on vLLM v0.18.0. #8101 #7874
  • Fix short-prompt forwarding by correcting attention state handling. #8088
  • Restore global_bs=0 and mc2_mask for uniform-token dispatching, and support inter-node RoCE hierarchical MC2 communication. #8040
  • Fix the weights mapper bug of Qwen3-VL. #7868
  • Fixed quantization config key mapping in AscendModelSlimConfig by switching from reverse mapping to forward mapping. #7716
  • Fixed support for ALL D-Nodes in full graph when running MTP in PD deployment. #5472
  • Layerwise connector now supports recompute scheduler. #5900
  • Fixed pooling code issues and updated usage guide. #6126
  • NPUWorker Profiler now supports profile_prefix for better profiling experience. #6968

Known Issues

  • Currently, VLLM_ASCEND_ENABLE_FUSED_MC2 is not recommended for multi-DP deployments with a large number of tokens (kv_producer or kv_both). This case may create a large number of padded tokens across DP ranks, which are routed to certain experts and overload certain ranks with tokens, resulting in accuracy and performance issues. #8320
  • Currently, EPLB does not support the minimax_m2 model or W4A8 quantization. #8341
  • Combining PCP with eagle3 may produce an error when a prefill request's scheduled token count is smaller than 1 + num_speculative_tokens, because the prefill request is then treated as a decode request. #8402
  • NPU soft partitioning + CUDAGraphMode.PIECEWISE is not supported. #8585
  • Qwen3.x currently has an accuracy issue with PD disaggregation. #8421
  • There is currently a known issue on the x86 architecture, which is resolved in CANN 9.0.0. To deploy vllm-ascend on x86, please upgrade CANN manually. #7993
  • The P/D proxy may leak resources on a recomputed retry and mask metaserver errors afterwards. #8852
  • When GLM5 and DeepSeek-V3.2 are deployed with PD disaggregation, the output is occasionally empty or garbled. #8853
  • For GLM 5/5.1 under PD disaggregation with the D node set up with TP16 DP2 parallelism, GPQA accuracy falls short of the standard. #8844
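
The condition behind the PCP and eagle3 issue (#8402) can be written as a simple check. This only illustrates the inequality stated in that item and is not the scheduler's actual code; the function name `prefill_looks_like_decode` is hypothetical.

```python
def prefill_looks_like_decode(num_scheduled_tokens: int,
                              num_speculative_tokens: int) -> bool:
    """Return True when a prefill request falls into the problematic range.

    Per the known issue, a prefill request whose scheduled token count is
    smaller than 1 + num_speculative_tokens is treated as a decode request.
    """
    decode_threshold = 1 + num_speculative_tokens
    return num_scheduled_tokens < decode_threshold


# With 3 speculative tokens the threshold is 4 scheduled tokens.
print(prefill_looks_like_decode(2, 3))  # prefill chunk of 2 tokens is affected
print(prefill_looks_like_decode(4, 3))  # chunk of 4 tokens is not affected
```

A check like this may help when deciding whether a given chunk size and speculative-token setting could trigger the issue.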

New Contributors

Full Changelog: v0.18.0rc1...v0.18.0