v0.18.0rc1
Pre-release
This is the first release candidate of v0.18.0 for vLLM Ascend. Please follow the official doc to get started.
Highlights
- C8 (INT8 KV cache) is now supported for DeepSeek-V3.1 in the PD disaggregation scenario. #7222
- DeepSeek models are now supported on A5 through new MLA operators. #7232
Features
- Flash Comm V1 now supports VL models with MLA, removing a previous limitation for multimodal serving. #7390
- Support separate attention backends for target and draft models in speculative decoding, allowing finer backend tuning per model. #7342
- VL MoE models now support SP, and sp_threshold is removed in favor of sp_min_token_num from vLLM. #7044
- Qwen VL models now support w8a8_mxfp8 quantization. #7417
Performance
- Optimized Triton operator recompilation to reduce redundant rebuilds and unnecessary recompilation triggered by function parameter optimization. #7647 #7645
- Optimized the Qwen3.5 and Qwen3-Next GDN prefill path by prebuilding chunk metadata, reducing host-device synchronization overhead. #7487
- Simplified the FIA prefill context merge path for better runtime efficiency. #7293
Documentation
- Refreshed deployment and model docs for Kimi-K2.5, GLM-4.7, DeepSeek-V3.2, MiniMax-M2.5, and PD disaggregation guides. #7371 #7403 #7292 #7296 #7300
Others
- Fixed a PD disaggregation issue where decode nodes could get stuck because shapes were not aligned across DP nodes. #7534
- Fixed a regression where hybrid attention plus mamba models on Ascend could start with an incorrect block size after the v0.18.0 upgrade. #7528
- Fixed multi-instance serving OOM calculation on single-card deployments. #7427
- Fixed DeepSeek-V3.1 C8 when overlaying MTP with full decode and full graph modes. #7571
- Fixed quantization config key mapping in AscendModelSlimConfig by switching from reverse mapping to forward mapping. #7716
Dependencies
- To address issues triggered by multi-stream parallel operations within ACL Graph, we have integrated temporary dependency versions for torch_npu. These fixes are already included in our official Docker images. If you prefer to build your own environment from source, please manually install the specific versions as follows:
# Set environment variables
PYTHON_TAG=$(python3 -c "import sys; print(f'cp{sys.version_info.major}{sys.version_info.minor}')")
ARCH=$(python3 -c "import platform; m=platform.machine().lower(); arch_map={'x86_64':'x86_64','amd64':'x86_64','aarch64':'aarch64','arm64':'aarch64'}; print(arch_map.get(m,m))")
# Select the specific torch_npu wheel based on your environment
if [ "$PYTHON_TAG" = "cp310" ] && [ "$ARCH" = "aarch64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgit4c901a4-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp311" ] && [ "$ARCH" = "x86_64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgitdc51c2d-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp310" ] && [ "$ARCH" = "x86_64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgita74051c-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
elif [ "$PYTHON_TAG" = "cp311" ] && [ "$ARCH" = "aarch64" ]; then PTA_WHEEL="torch_npu-2.9.0.post1%2Bgitee7ba04-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl"; \
else echo "Unsupported PYTHON_TAG=$PYTHON_TAG ARCH=$ARCH"; exit 1; fi
# Install wheels
python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/${PTA_WHEEL}"
- There is a known issue in the current triton-ascend, as shown in #7782. To avoid it, upgrade triton-ascend to 3.2.0.dev20260322, either by using the official Docker images or by manually installing that version as follows:
PYTHON_TAG=$(python3 -c "import sys; print(f'cp{sys.version_info.major}{sys.version_info.minor}')") && \
ARCH=$(python3 -c "import platform; machine = platform.machine().lower(); arch_map = {'x86_64': 'x86_64', 'amd64': 'x86_64', 'aarch64': 'aarch64', 'arm64': 'aarch64'}; print(arch_map.get(machine, machine))") && \
TRITON_ASCEND_WHEEL="triton_ascend-3.2.0.dev20260322-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_27_${ARCH}.manylinux_2_28_${ARCH}.whl" && \
python3 -m pip install "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/${TRITON_ASCEND_WHEEL}"
Known Issue
- When running DeepSeek-R1 W8A8 with MTP and KV Pool enabled under high concurrency, a ValueError: Counters can only be incremented by non-negative amounts may occur. #7489
- triton-ascend may fail to compile with a g++ internal compiler error (Segmentation fault). Workaround: update to triton-ascend==3.2.0.dev20260322 and clear the Triton cache (rm -rf ~/.triton/cache/*). #7782
- FIA does not support all MHA head dimensions when using tp-size >= 16 on Ascend. Affected models will fail with an error on unsupported head dimensions. This will be resolved in a future release when FIA supports more head dimensions. #7729
- MiniMax-M2.5 now supports PD disaggregation, but internal testing has identified a 13% regression on the GPQA benchmark when this feature is enabled. We currently do not recommend enabling PD disaggregation for this model, and we are working on a fix.
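The triton-ascend workaround listed above (pin the version, then clear the Triton cache) can be checked with a small script. The following is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes the package is installed under the pip name triton-ascend and that the cache lives in the default ~/.triton/cache directory mentioned in the workaround.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of vLLM Ascend.

# Version named in the release notes as the one that avoids #7782.
REQUIRED_TRITON_ASCEND="3.2.0.dev20260322"

# Print the installed version of a pip package (empty output if not installed).
installed_version() {
    python3 -m pip show "$1" 2>/dev/null | awk '/^Version:/ {print $2}'
}

# Succeed only when the first version string is non-empty and equals the second.
version_matches() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Delete everything inside the Triton cache directory, keeping the directory.
# Defaults to ~/.triton/cache; pass another path as the first argument.
clear_triton_cache() {
    local cache_dir="${1:-$HOME/.triton/cache}"
    if [ -d "$cache_dir" ]; then
        find "$cache_dir" -mindepth 1 -delete
    fi
}

# Clear stale kernels if the expected version is installed; otherwise report
# the mismatch on stderr and return a non-zero status.
check_triton_ascend() {
    local found
    found="$(installed_version triton-ascend)"
    if version_matches "$found" "$REQUIRED_TRITON_ASCEND"; then
        clear_triton_cache
        echo "triton-ascend $found OK; Triton cache cleared"
    else
        echo "triton-ascend is '${found:-not installed}', expected $REQUIRED_TRITON_ASCEND" >&2
        return 1
    fi
}
```

Sourcing the file and calling check_triton_ascend runs the check. Unlike rm -rf ~/.triton/cache/*, the find-based cleanup also removes hidden entries, which is a design choice of this sketch rather than a requirement of the workaround.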
New Contributors
- @GGGGua made their first contribution in #7295
- @asunxiao made their first contribution in #7066
- @liuhy1213-cell made their first contribution in #7300
- @jiangmengyu18 made their first contribution in #7383
- @ksiyuan made their first contribution in #7417
- @yesyue-w made their first contribution in #7046
- @lijiahang226 made their first contribution in #7232
- @ZhuQi-seu made their first contribution in #7368
- @GoMarck made their first contribution in #7392
Full Changelog: v0.17.0rc1...0.18.0rc1