v0.18.0rc1

Pre-release

@yiz-liu yiz-liu released this 01 Apr 15:30
· 129 commits to releases/v0.18.0 since this release
99e1ea0

This is the first release candidate of v0.18.0 for vLLM Ascend. Please follow the official doc to get started.

Highlights

  • C8 (INT8 KV cache) is now supported for DeepSeek-V3.1 in PD disaggregation scenarios. #7222
  • DeepSeek models are now supported on A5 through new MLA operators. #7232

Features

  • Flash Comm V1 now supports VL models with MLA, removing a previous limitation for multimodal serving. #7390
  • Support separate attention backends for target and draft models in speculative decoding, allowing finer backend tuning per model. #7342
  • VL MoE models now support SP, and sp_threshold is removed in favor of sp_min_token_num from vLLM. #7044
  • Qwen VL models now support w8a8_mxfp8 quantization. #7417

Performance

  • Reduced redundant Triton operator rebuilds, including unnecessary recompilation triggered by function parameter optimization. #7647 #7645
  • Optimized the Qwen3.5 and Qwen3-Next GDN prefill path by prebuilding chunk metadata, reducing host-device synchronization overhead. #7487
  • Simplified the FIA prefill context merge path for better runtime efficiency. #7293

Documentation

  • Refreshed deployment and model docs for Kimi-K2.5, GLM-4.7, DeepSeek-V3.2, MiniMax-M2.5, and PD disaggregation guides. #7371 #7403 #7292 #7296 #7300

Others

  • Fixed a PD disaggregation issue where decode nodes could get stuck because shapes were not aligned across DP nodes. #7534
  • Fixed a regression where hybrid attention plus mamba models on Ascend could start with an incorrect block size after the v0.18.0 upgrade. #7528
  • Fixed multi-instance serving OOM calculation on single-card deployments. #7427
  • Fixed DeepSeek-V3.1 C8 when combining MTP with full decode and full graph modes. #7571
  • Fixed quantization config key mapping in AscendModelSlimConfig by switching from reverse mapping to forward mapping. #7716

Dependencies

  • To address issues triggered by multi-stream parallel operations within ACL Graph, we have integrated temporary dependency versions for torch_npu. These fixes are already included in our official Docker images. If you prefer to build your own environment from source, please manually install the specific versions as follows:
# Set environment variables
PYTHON_TAG=$(python3 -c "import sys; print(f'cp{sys.version_info.major}{sys.version_info.minor}')")
ARCH=$(python3 -c "import platform; m=platform.machine().lower(); arch_map={'x86_64':'x86_64','amd64':'x86_64','aarch64':'aarch64','arm64':'aarch64'}; print(arch_map.get(m,m))")

# Select the specific torch_npu wheel based on your environment
if [ "$PYTHON_TAG" = "cp310" ] && [ "$ARCH" = "aarch64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgit4c901a4-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp311" ] && [ "$ARCH" = "x86_64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgitdc51c2d-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp310" ] && [ "$ARCH" = "x86_64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgita74051c-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp311" ] && [ "$ARCH" = "aarch64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgitee7ba04-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
else echo "Unsupported PYTHON_TAG=$PYTHON_TAG ARCH=$ARCH"; exit 1; fi

# Install wheels
python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/${PTA_WHEEL}"
  • There is a known issue in the current triton-ascend, as shown in #7782. To avoid it, upgrade triton-ascend to 3.2.0.dev20260322, either by using the official Docker images or by manually installing that version as follows:
PYTHON_TAG=$(python3 -c "import sys; print(f'cp{sys.version_info.major}{sys.version_info.minor}')") && \
ARCH=$(python3 -c "import platform; machine = platform.machine().lower(); arch_map = {'x86_64': 'x86_64', 'amd64': 'x86_64', 'aarch64': 'aarch64', 'arm64': 'aarch64'}; print(arch_map.get(machine, machine))") && \
TRITON_ASCEND_WHEEL="triton_ascend-3.2.0.dev20260322-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_27_${ARCH}.manylinux_2_28_${ARCH}.whl" && \
python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/${TRITON_ASCEND_WHEEL}"
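
After a manual install, it can help to confirm that the expected triton-ascend version is the one actually in the environment. The following is a minimal sketch, not part of the official instructions: the helper names `extract_version` and `check_triton_ascend` are hypothetical, and the check relies only on the `Version:` line that `pip show` prints.

```shell
#!/bin/sh
# Expected version, taken from the release notes above.
EXPECTED_TRITON_ASCEND="3.2.0.dev20260322"

# Hypothetical helper: read `pip show` output on stdin and print
# the value of its "Version:" field (empty if the field is absent).
extract_version() {
    awk '/^Version:/ { print $2; exit }'
}

# Hypothetical helper: compare the installed triton-ascend version
# with the expected one. Returns 0 on a match, 1 on a mismatch or
# when the package is not installed.
check_triton_ascend() {
    installed=$(python3 -m pip show triton-ascend 2>/dev/null | extract_version)
    if [ "$installed" = "$EXPECTED_TRITON_ASCEND" ]; then
        echo "triton-ascend ${installed}: OK"
    else
        echo "triton-ascend: found '${installed:-none}', expected ${EXPECTED_TRITON_ASCEND}" >&2
        return 1
    fi
}
```

Calling `check_triton_ascend` after the install step prints an OK line on a match and returns a non-zero status otherwise, so it can gate a build script.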

Known Issues

  • When running DeepSeek-R1 W8A8 with MTP and KV Pool enabled under high concurrency, a ValueError: Counters can only be incremented by non-negative amounts may occur. #7489
  • triton-ascend may fail to compile with a g++ internal compiler error (Segmentation fault). Workaround: update to triton-ascend==3.2.0.dev20260322 and clear the Triton cache (rm -rf ~/.triton/cache/*). #7782
  • FIA does not support all MHA head dimensions when using tp-size >= 16 on Ascend. Affected models will fail with an error on unsupported head dimensions. This will be resolved in a future release when FIA supports more head dimensions. #7729
  • While MiniMax-M2.5 now supports PD disaggregation, internal testing has identified a 13% regression on the GPQA benchmark when this feature is enabled. We currently do not recommend enabling PD disaggregation for this model, and we are working on a fix.
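
The cache-clearing step in the triton-ascend workaround above can be wrapped in a small function so it is safe to run when the directory does not exist. This is a sketch under assumptions: `clear_triton_cache` is a hypothetical name, and the default path is the `~/.triton/cache` location named in the workaround. Unlike `rm -rf ~/.triton/cache/*`, it also removes hidden entries.

```shell
#!/bin/sh
# Hypothetical helper: empty a Triton cache directory.
# Defaults to ~/.triton/cache, the path named in the workaround above.
clear_triton_cache() {
    cache_dir="${1:-$HOME/.triton/cache}"
    if [ -d "$cache_dir" ]; then
        # Remove every entry below the directory but keep the directory itself.
        find "$cache_dir" -mindepth 1 -delete
        echo "Cleared $cache_dir"
    else
        echo "No cache directory at $cache_dir; nothing to clear"
    fi
}
```

Running `clear_triton_cache` with no argument targets the default location; passing a path as the first argument targets a different directory.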

Full Changelog: v0.17.0rc1...0.18.0rc1