Release v0.18.0rc1 · vllm-project/vllm-ascend

This is the first release candidate of v0.18.0 for vLLM Ascend. Please follow the official doc to get started.

Highlights

C8 (INT8 KV cache) is now supported for DeepSeek-V3.1 in the PD disaggregation scenario. #7222
DeepSeek models are now supported on A5 through new MLA operators. #7232

Features

Flash Comm V1 now supports VL models with MLA, removing a previous limitation for multimodal serving. #7390
Support separate attention backends for target and draft models in speculative decoding, allowing finer backend tuning per model. #7342
VL MoE models now support SP, and sp_threshold is removed in favor of sp_min_token_num from vLLM. #7044
Qwen VL models now support w8a8_mxfp8 quantization. #7417

Performance

Optimized Triton operator recompilation to reduce redundant rebuilds and unnecessary recompilation triggered by function parameter optimization. #7647 #7645
Optimized the Qwen3.5 and Qwen3-Next GDN prefill path by prebuilding chunk metadata, reducing host-device synchronization overhead. #7487
Simplified the FIA prefill context merge path for better runtime efficiency. #7293

Documentation

Refreshed deployment and model docs for Kimi-K2.5, GLM-4.7, DeepSeek-V3.2, MiniMax-M2.5, and PD disaggregation guides. #7371 #7403 #7292 #7296 #7300

Others

Fixed a PD separation issue where decode nodes could get stuck because shapes were not aligned across DP nodes. #7534
Fixed a regression where hybrid attention plus mamba models on Ascend could start with an incorrect block size after the v0.18.0 upgrade. #7528
Fixed multi-instance serving OOM calculation on single-card deployments. #7427
Fixed DeepSeek-V3.1 C8 when MTP is combined with full decode and full graph modes. #7571
Fixed quantization config key mapping in AscendModelSlimConfig by switching from reverse mapping to forward mapping. #7716

Dependencies

To address issues triggered by multi-stream parallel operations within ACL Graph, we have integrated temporary dependency versions for torch_npu. These fixes are already included in our official Docker images. If you prefer to build your own environment from source, please manually install the specific versions as follows:

# Set environment variables
PYTHON_TAG=$(python3 -c "import sys; print(f'cp{sys.version_info.major}{sys.version_info.minor}')")
ARCH=$(python3 -c "import platform; m=platform.machine().lower(); arch_map={'x86_64':'x86_64','amd64':'x86_64','aarch64':'aarch64','arm64':'aarch64'}; print(arch_map.get(m,m))")

# Select the specific torch_npu wheel based on your environment
if [ "$PYTHON_TAG" = "cp310" ] && [ "$ARCH" = "aarch64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgit4c901a4-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp311" ] && [ "$ARCH" = "x86_64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgitdc51c2d-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp310" ] && [ "$ARCH" = "x86_64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgita74051c-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp311" ] && [ "$ARCH" = "aarch64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgitee7ba04-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
else echo "Unsupported PYTHON_TAG=$PYTHON_TAG ARCH=$ARCH"; exit 1; fi

# Install wheels
python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/${PTA_WHEEL}"
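
The selection logic above can also be sketched as a small function, which makes it possible to print the resolved wheel URL before installing anything. The function name `pta_wheel_for` is hypothetical and not part of vllm-ascend; the wheel names and the base URL are copied from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map (PYTHON_TAG, ARCH) to the torch_npu wheel name listed above.
pta_wheel_for() {
  local python_tag="$1" arch="$2" git_hash
  case "${python_tag}-${arch}" in
    cp310-aarch64) git_hash="git4c901a4" ;;
    cp311-x86_64)  git_hash="gitdc51c2d" ;;
    cp310-x86_64)  git_hash="gita74051c" ;;
    cp311-aarch64) git_hash="gitee7ba04" ;;
    *)
      # Same failure case as the if/elif chain above, reported on stderr.
      echo "Unsupported PYTHON_TAG=${python_tag} ARCH=${arch}" >&2
      return 1
      ;;
  esac
  echo "torch_npu-2.9.0.post1%2B${git_hash}-${python_tag}-${python_tag}-manylinux_2_28_${arch}.whl"
}

# Dry run: print the URL that would be installed for one combination.
echo "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/$(pta_wheel_for cp311 aarch64)"
```

Returning a non-zero status instead of calling `exit 1` keeps an interactive shell open when the combination is unsupported.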

There is a known issue in the current triton-ascend release, as shown in #7782. To avoid it, upgrade triton-ascend to 3.2.0.dev20260322, either by using the official Docker images or by manually installing that version as follows:

PYTHON_TAG=$(python3 -c "import sys; print(f'cp{sys.version_info.major}{sys.version_info.minor}')") && \
ARCH=$(python3 -c "import platform; machine = platform.machine().lower(); arch_map = {'x86_64': 'x86_64', 'amd64': 'x86_64', 'aarch64': 'aarch64', 'arm64': 'aarch64'}; print(arch_map.get(machine, machine))") && \
TRITON_ASCEND_WHEEL="triton_ascend-3.2.0.dev20260322-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_27_${ARCH}.manylinux_2_28_${ARCH}.whl" && \
python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/${TRITON_ASCEND_WHEEL}"
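
After the install, a quick check that the environment really picked up the expected version can save a debugging round. The following is a minimal sketch; the helper names are hypothetical, and it assumes the package is registered with pip under the name `triton-ascend`.

```shell
#!/usr/bin/env bash
# Version named in the release notes above.
EXPECTED_TRITON_ASCEND="3.2.0.dev20260322"

# Print the installed triton-ascend version, or nothing if it is not installed.
installed_triton_ascend_version() {
  python3 -m pip show triton-ascend 2>/dev/null | awk '/^Version:/ {print $2}'
}

# Hypothetical helper: succeed only if the given version is the expected one.
check_triton_ascend_version() {
  local installed="$1"
  if [ "$installed" = "$EXPECTED_TRITON_ASCEND" ]; then
    echo "triton-ascend ${installed}: OK"
  else
    echo "triton-ascend is '${installed:-not installed}', expected ${EXPECTED_TRITON_ASCEND}" >&2
    return 1
  fi
}
```

Running `check_triton_ascend_version "$(installed_triton_ascend_version)"` reports a mismatch on stderr and returns a non-zero status, so it can gate a startup script.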

Known Issue

When running DeepSeek-R1 W8A8 with MTP and KV Pool enabled under high concurrency, a ValueError: Counters can only be incremented by non-negative amounts may occur. #7489
triton-ascend may fail to compile with a g++ internal compiler error (Segmentation fault). Workaround: update to triton-ascend==3.2.0.dev20260322 and clear the Triton cache (rm -rf ~/.triton/cache/*). #7782
FIA does not support all MHA head dimensions when using tp-size >= 16 on Ascend. Affected models will fail with an error on unsupported head dimensions. This will be resolved in a future release when FIA supports more head dimensions. #7729
While MiniMax-M2.5 now supports PD disaggregation, internal testing has identified a 13% regression on the GPQA benchmark when this feature is enabled. We currently do not recommend enabling PD disaggregation for this model, and we are working on a fix.
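
For the triton-ascend compile failure listed above (#7782), the cache-clearing half of the workaround can be wrapped in a small guard so that an empty variable never expands to `rm -rf /*`. This is a sketch only: `clear_triton_cache` is a hypothetical helper, and it assumes the cache lives in `~/.triton/cache` unless the `TRITON_CACHE_DIR` environment variable points elsewhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: empty the Triton cache directory.
# Assumption: the cache is ~/.triton/cache unless TRITON_CACHE_DIR overrides it.
clear_triton_cache() {
  local cache_dir="${1:-${TRITON_CACHE_DIR:-$HOME/.triton/cache}}"
  if [ ! -d "$cache_dir" ]; then
    echo "No Triton cache at ${cache_dir}; nothing to clear"
    return 0
  fi
  # ${cache_dir:?} aborts instead of expanding to "/*" if the variable is empty.
  rm -rf -- "${cache_dir:?}"/*
  echo "Cleared ${cache_dir}"
}
```

Calling `clear_triton_cache` with no argument targets the default location; passing a path is useful when each serving instance uses its own cache directory.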

New Contributors

@GGGGua made their first contribution in #7295
@asunxiao made their first contribution in #7066
@liuhy1213-cell made their first contribution in #7300
@jiangmengyu18 made their first contribution in #7383
@ksiyuan made their first contribution in #7417
@yesyue-w made their first contribution in #7046
@lijiahang226 made their first contribution in #7232
@ZhuQi-seu made their first contribution in #7368
@GoMarck made their first contribution in #7392

Full Changelog: v0.17.0rc1...0.18.0rc1
