v0.19.1rc1

Pre-release

@MengqingCao MengqingCao released this 30 Apr 10:48
· 218 commits to main since this release
da421af

This is the first release candidate of v0.19.1 for vLLM Ascend, based on vLLM v0.19.1. This release includes significant performance optimizations, new model support, hardware expansion, and important bug fixes.

Please follow the official doc to get started.

Highlights

  • DFlash Attention Backend: Added DFlash attention backend with FULL_DECODE_ONLY support for improved inference performance (#8118, #8516, #8627)
  • Zero Bubble Async Scheduling: Implemented zero bubble optimization for async scheduling and speculative decoding, significantly reducing scheduling overhead (#7640)
  • A2/A3 Attention Operator Upgrade: Replaced npu_fusion_attention with _npu_flash_attention_unpad operator for better performance on A2 and A3 hardware (#8671)
  • Eagle3 + MiniMax-M2.5 Support: Applied Eagle3 speculative decoding to MiniMax-M2.5 model for faster inference (#7619)
  • C8 INT8 KV Cache for GQA: Added C8 (INT8 KV cache) support for GQA attention models, including DeepSeek-V3.1 with PD disaggregation (#7474, #7222)
  • Bailing Model Support: Full support for Bailing MoE model including linear adaptation and ModelSlim quantization (#8657, #8709)

Features

  • Flash Comm V1 for Qwen3-VL: Support Flash Comm V1 for Qwen3-VL multimodal models (#7897)
  • Eagle + PCP + Full Graph Mode: Support Eagle combined with PCP and full graph mode (#7924)
  • Multimodal Reasoning with PCP: Support multimodal reasoning when prefill context parallel feature is enabled (#8038)
  • Dynamic Chunk for PP: Support Dynamic Chunk for Chunked Pipeline Parallelism (#7896)
  • Hamming-based Sparse Attention: Added Hamming-based sparse attention inference framework and operators (#8564, #8346)
  • Optimized Causal Conv1d Operator: Added optimized causal conv1d operator (#8215)
  • Recurrent AscendC Operators: Added recurrent AscendC operators for specific model architectures (#8055)
  • GLM4.7 C8 Support: Support GLM4.7 with C8 (INT8 KV cache) scenarios (#8174)
  • Minitron-8B-Base Support: Verified and supported nvidia/Minitron-8B-Base model (#8157)
  • Bailing Model Support: Full support for Bailing MoE model with linear adaptation and ModelSlim quantization configuration (#8657, #8709)
  • Qwen3.5 MoE Flash Comm: Support Flash Comm for Qwen3.5 MoE models (#7486)
  • Initial MoE Support for MRv2: Add initial MoE models support for Model Runner V2 (#7922)
  • Xlite Backend Expansion:
    • XLite GLM-4.7 support (#7935)
    • Support Qwen3VLMoeForConditionalGeneration in xlite backend (#8046)
  • EPLB Enhancements:
    • Swift balancer policy supports mix placement (#8035)
    • EPLB adaptation to multimodal models (#7743)
  • Eagle Improvements for model_runner_v2:
    • Fixed Eagle's acceptance rate problem in graph mode (#8365)
    • Fixed Eagle's precision problems (#8230, #8033)
    • Adapted Eagle for model_runner_v2 (#7885)
  • MTP Merged Graph: Support merged graph for MTP (Multi-Token Prediction) (#6860)
  • Unified MoE Expert Placement: Support unified placement for shared & router experts (#7188)
  • Dispatch V2 Hierarchy Communication: Support dispatch_v2/combine_v2 hierarchy communication for better MoE performance (#7583)
  • Xmask for Dispatch FFN Combine: Add xmask feature for dispatch_ffn_combine operator (w8a8 branch) (#8560)
  • Fused W4A8 Kernel: Fuse W4A8 dispatch + FFN + combine into a single fused kernel (#7779)
  • KV Cache Memory Accounting: Account for graph capture memory in KV cache planning (#8289)
  • Qwen3-Next Hybrid Attention: Support Qwen3-next hybrid attention in piecewise & full_decode_only modes (#7422)
  • GDN Optimization: Optimize GDN non-spec prefill fallback metadata (#7756)
  • Qwen3-VL Support: Support kv_rmsnorm_mrope for Qwen3-VL (#7762)
  • Mamba Prefix Caching: Layerwise connector supports Mamba prefill prefix caching (#7814)
  • Yuanrong KV Pool Backend: Add Yuanrong backend support to KV Pool (#6869)
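
The "KV Cache Memory Accounting" item above can be pictured with a small budget calculation. This is only an illustrative sketch, not the vLLM Ascend implementation: the function name, its parameters, and the simple subtraction model are all assumptions made for this example. It shows why reserving memory for graph capture up front yields fewer, but safer, KV cache blocks.

```python
def plan_kv_cache_blocks(total_bytes, utilization, weights_bytes,
                         activation_peak_bytes, graph_capture_bytes,
                         block_bytes):
    """Hypothetical planner: number of KV cache blocks that fit in the budget.

    All parameter names are assumptions for illustration only.
    """
    # Memory the process is allowed to use on the device.
    budget = int(total_bytes * utilization)
    # Subtract everything that is not KV cache, including graph capture memory.
    available = budget - weights_bytes - activation_peak_bytes - graph_capture_bytes
    # Never return a negative block count.
    return max(available, 0) // block_bytes


# Toy numbers (arbitrary units) comparing planning with and without the reservation.
without_graph = plan_kv_cache_blocks(1000, 0.5, 100, 50, 0, 10)
with_graph = plan_kv_cache_blocks(1000, 0.5, 100, 50, 50, 10)
print(without_graph, with_graph)
```

If graph capture memory is ignored during planning, the extra blocks it appears to free are later contested by the captured graphs, which is the kind of out-of-memory risk this accounting is meant to avoid.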

Hardware and Operator Support

  • 310P Enhancements:
    • Qwen3.5 model adaptation synchronized with main (#8009)
    • Support W8A8 dynamic linear method (#7725)
    • Support shared experts path in fused MoE for Qwen3.5 (#7674)
    • Add npu_causal_conv1d_310 AscendC Custom Op (#7798)
    • Add recurrent_gated_delta_rule_310 AscendC Custom Op (#7926)

Performance

  • A2/A3 Attention: Replace npu_fusion_attention with _npu_flash_attention_unpad operator for better performance (#8671)
  • MLA PCP Prefill Optimization: Optimize MLA PCP prefill attention by avoiding projecting unnecessary tail KV tokens (#8787)
  • Async Scheduling Optimization:
    • Asynchronous scheduling issuance bubble optimization (#8766)
    • Zero bubble async scheduling and spec decoding (#7640)
  • KV Cache Optimization:
    • Batch KV cache offloading via aclrtMemcpyBatchAsync (#7819)
    • Optimize KV cache gathering by selecting blocks before all-gather (#8050)
  • Operator Optimizations:
    • Optimize split_qkv_tp_rmsnorm_rope ops (#8059)
    • Optimize host-device sync problem in prefill phase for Qwen3Next/Qwen3.5 (#7967)
    • Reduce prefill KV all-gather communication for PCP/DCP (SFA) (#8043)
    • Add penalty-related Triton kernel for better performance of penalties (#7569)
  • Triton Kernel Optimizations (model_runner_v2):
    • Optimize _temperature_kernel and _topk_log_softmax_kernel (#8083)
    • Optimize _min_p_kernel performance (#8243, #7767)
    • Add bad-words-kernel triton kernel (#8030)
    • Optimize bincount_kernel performance (#7757)
    • Optimize _ranks_kernel performance (#7767)
    • Optimize triton recompilation triggered by function parameters (#7480, #7481, #7483)
  • HCCL Process Group Reuse: Reuse equivalent HCCL process groups on Ascend (#7654)
  • CPU Binding Defer: Defer CPU binding until worker warmup completes (#7829)
  • Conv3d to Linear Conversion: Convert conv3d to linear when kernel size equals stride (#8318)
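
The "Conv3d to Linear Conversion" item relies on a general identity: when the stride equals the kernel size, the convolution windows do not overlap, so each output value is a dot product between one flattened patch and the flattened kernel. The following pure-Python sketch checks that identity on a single-channel toy volume. It is not the code from #8318, and the helper names are hypothetical.

```python
from itertools import product


def conv3d_strided(x, w):
    """Direct single-channel 3D convolution with stride equal to the kernel size."""
    kd, kh, kw = len(w), len(w[0]), len(w[0][0])
    od, oh, ow = len(x) // kd, len(x[0]) // kh, len(x[0][0]) // kw
    out = [[[0] * ow for _ in range(oh)] for _ in range(od)]
    for i, j, k in product(range(od), range(oh), range(ow)):
        out[i][j][k] = sum(
            x[i * kd + a][j * kh + b][k * kw + c] * w[a][b][c]
            for a, b, c in product(range(kd), range(kh), range(kw))
        )
    return out


def patches_as_rows(x, kd, kh, kw):
    """Cut x into non-overlapping kd*kh*kw patches and flatten each into a row."""
    od, oh, ow = len(x) // kd, len(x[0]) // kh, len(x[0][0]) // kw
    return [
        [x[i * kd + a][j * kh + b][k * kw + c]
         for a, b, c in product(range(kd), range(kh), range(kw))]
        for i, j, k in product(range(od), range(oh), range(ow))
    ]


def linear(rows, weight_vec):
    """Bias-free linear layer with one output feature: a dot product per row."""
    return [sum(v * wv for v, wv in zip(row, weight_vec)) for row in rows]


# Toy 4x4x4 volume and 2x2x2 kernel, stride 2 in every dimension.
x = [[[d * 16 + h * 4 + c for c in range(4)] for h in range(4)] for d in range(4)]
w = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

direct = [v for plane in conv3d_strided(x, w) for row in plane for v in row]
as_linear = linear(patches_as_rows(x, 2, 2, 2), [v for p in w for r in p for v in r])
assert direct == as_linear
```

With multiple input and output channels the same reshaping turns the operation into one matrix multiplication, which hardware generally handles more efficiently than a general convolution.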

Dependencies

  • vLLM: Upgraded to vLLM v0.19.1 (#8448)
  • Transformers: Upgraded to transformers 5.5.3 (from 4.57.4), a major version upgrade with significant improvements and API changes (#8448)
  • lm-eval: Upgraded to lm-eval 0.4.11 for compatibility with transformers 5.5.3 (#8448)
  • New Dependencies: Added memcache and memfabric into requirements (#8747)

Documentation

  • PD Disaggregation Guides:
    • PD Disaggregation with UCM and Mooncake (#8338)
    • Dynamic chunked pipeline parallel guide (#8728)
  • Model Documentation:
    • GLM-5.1 model tutorial (#8054)
    • GLM4.7 documentation update (#8450)
    • GLM5 documentation with parameters and FAQs (#7958, #7850)
    • Qwen3.5 user guide update (#7866)
    • Kimi-K2.5 documentation update (#7901)
    • Qwen3-Omni-30B-A3B-Thinking documentation (#8628)
    • DeepSeekOCR2 documentation (#8573)
    • Hunyuan-A13B-Instruct verification and documentation (#7381)
    • LLaVA-OneVision-Qwen2-0.5B-OV tutorial (#7912)
  • Documentation Improvements:
    • Enable MathJax rendering for Markdown formulas (#8793)
    • Update version policy (#8656)
    • Add preemption description in FAQs (#8131)
    • Update supported vLLM versions (#7923)
    • Parameterize versioning policy compatibility matrix (#8002)
    • Avoid A2 CPU binding overlap from hidden NPUs and doc updates (#8792)

Others

Important Bug Fixes:

  • GQA C8 Fullgraph: Fixed a bug in GQA C8 fullgraph mode (#8779)
  • DSV3.1 W4A8 TTFT: Revert change of balance_flag to fix DSV3.1 W4A8 TTFT degradation (#8675)
  • DSV3.1 Service Startup: Fix DeepSeek-V3.1 service failing to start (#8208)
  • Qwen3.5 MoE High Concurrency: Fix Qwen3.5 MoE FC1 error under high concurrency when dp>1 (#8396)
  • Qwen3.5 MoE Flash Comm: Fix Qwen3.5 MoE flash comm v1 shared expert shape error of mtp layer on A2 (#7683)
  • Graph Capture OOM: Fix the graph capturing OOM in model_runner_v2 (#8111)
  • DeepSeek 3.2 C8 Precision: Fix DeepSeek 3.2 C8 precision by reverting quantization layers (#7628)
  • DeepSeek 3.2 DCP MTP: Fix ds3.2 dcp mtp issues (#7617)
  • MTP1 Concurrent Crash: Fix MTP1 crashing in multiple concurrent scenarios (#7459)
  • Spec Decode + Async: Fix spec decode and async bugs (#8461)
  • Spec Decode + Logprobs: Fix spec decode + logprobs crash when async scheduling is disabled (#7861)
  • Repetition Penalty: Fix repetition_penalty not taking effect under asynchronous scheduling (#7789)
  • P/D KV Cache: Fix KV cache at MTP layer when TP is not equal in P/D scenarios (#8540)
  • P/D Short Sequence: Fix short sequences getting no response in P/D mode (#8104)
  • P/D Retry Mechanism: Add retry mechanism to prevent packet loss in P/D (#8166)
  • Layerwise Connector OOM: Fix layerwise connector OOM during large buffer transfer (#7834)
  • KV Pool Put Logic: Fix KV Pool not putting KV cache and fix KV transfer Put Logic (#7875, #7717)
  • KV Pool PCP/DCP: Fix PCP and DCP bugs for KV Pool (#8099)
  • Mooncake Backend: MooncakeBackend now handles protocols other than Ascend (#8514)
  • FlashComm Server Init: Fix server init error when max_num_seqs is not a multiple of tp with FLASHCOMM (#7801)
  • Triton Reinstall: Reinstall triton-ascend after vllm-ascend install (#7790)
  • DBO Compatibility: Add compatibility guard for --enable-dbo on Ascend NPU (#8507)
  • GPU Params on NPU: Guard GPU-specific parallel config params on Ascend NPU (#8703)
  • A2 CPU Binding: Avoid A2 CPU binding overlap from hidden NPUs (#8792)
  • FIA Pad Bug: Fix FIA pad bug under max concurrency for EAGLE (#7740)
  • MoE Load Precision: Fix moe_load precision in allgather (#7887)
  • FlashComm1 + DCP for Qwen: Support FlashComm1 & DCP for Qwen models (#7673)
  • Block Verify: Disable block verify to avoid incorrect verification on NPU (#7603)
  • Model Runner V2 Full Graph: Fix model_runner_v2 in full graph mode (#7945)
  • MRv2 Spec Decode: Fix mrv2 runtime error with speculative decoding (#8209)
  • GLM Tool Call Streaming: Fix GLM tool call streaming issues (#8832)
  • Forced Tool Choice: Fix forced tool choice none-content handling (#8833)
  • MiniMax Reasoning Usage: Fix MiniMax reasoning usage accounting (#8831)

Other Bug Fixes:

  • MTP recurrent batch size after lmhead TP logits truncation (#8718)
  • Fix error caused by layer_sharding in DSV32 mixed deployment (#8717)
  • EPLB topk_ids uses logical experts count (#8501)
  • EPLB validation logic optimization and MTP support for redundant experts (#8710)
  • SP: Preserve graph stringification in MoE sequence parallel (#8780)
  • SpecDecode: Fix draft quarot model loading timeout (#8736)
  • Fix _dummy_run warmup mismatch with --language-model-only (#8556)
  • Fix AttributeError in AscendYaRNRotaryEmbedding (#8734)
  • Eagle3: Add fullgraph case and check mock function (#8668)
  • Fix atten_mask in npu_fused_infer_attention_score_v2 (#8387)
  • Fix conflicts between eagle and dflash about pcp (#8598)
  • Fix incorrect slot mapping for DeepSeek 3.2 PCP+MTP (#8547)
  • dispatch_ffn_combine kernel rollback (#8539)
  • Require kv producer for layer sharding (#8562)
  • 310P: Use CPU generator cache for sampling (#8495)
  • Fix compute_slot_mapping triton for pcp+eagle3 (#8435)
  • Handle enum-based MoE activation in fuse_moe (#8465)
  • Gate recompute/balance/fused_mc2 by PD mode (#8373)
  • w8a8 dispatch ffn combine bias param adapt (#8342)
  • Fix quant_bias missing in w8a8_static for GLM-5 with flashcomm1 (#8220)
  • Fix DSA-CP PD role gating for deepseek v3.2 (#8290)
  • Require piecewise cudagraph for layerwise AscendStorConnector (#8283)
  • Fix remote KV waiting promotion in patch balance scheduler (#8279)
  • Enforce C locale for CPU binding subprocess parsing (#8251)
  • Add wait_for_kv_layer_from_connector in mlapo branch SFA (#8195)
  • Fix dimension mismatch when SP padding (#7858)
  • 310P: Fix Triton kernel block_table crash (#8144)
  • Fix attention state of short prompt (#8029)
  • 310P: Fix post-sampling not working in graph mode (#8017)
  • 310P: Align GDN state semantics with vLLM (#7902)
  • 310P: Handle null quant config in ShardedStateLoader310 (#7546)
  • Unpad block table when enable_sp and eagle3 are used in eager mode (#7986)
  • Fix qwen3-next compilation error (#7936)
  • Fix the weightsmapper bug of qwen3-vl (#7869)
  • Fix quant config attribute error (#7736)
  • Remove unnecessary weight_scale wrap behavior for eplb (#7733)
  • Adapt to main2main for model_runner_v2 and add gc in sleep mode (#7709)
  • Fix prefix caching support for embedding models (#7452)
  • Reuse weight address in graph + RL scenario (#7473)

Known Issues

  • When running GLM-5 / GLM-5.1 models in single-node (non-PD-disaggregated) scenarios, incorrect results or runtime errors may occur. See #8843 for details and workarounds.
  • triton-ascend may fail to compile with a g++ internal compiler error (Segmentation fault). Workaround: update to triton-ascend==3.2.0.dev20260322 and clear the Triton cache (rm -rf ~/.triton/cache/*). See #7782.
  • torch-npu: Upgrade to 2.9.0.post1+git4c901a4 to avoid known issues. This version is not installed by default, so it must be upgraded manually. The installation package is available at: https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/torch_npu-2.9.0.post1%2Bgit4c901a4-${PYTHON_TAG}-${PYTHON_TAG}-manylinux_2_28_${ARCH}.whl. Note: if CANN has been upgraded to version 9.0.0, upgrade torch-npu to 2.9.0.post2 at the same time.
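
The torch-npu wheel link above contains the placeholders `${PYTHON_TAG}` and `${ARCH}`. As a convenience, the sketch below fills them in from the running interpreter. It assumes that `PYTHON_TAG` is a CPython tag such as `cp311` and that `ARCH` matches the machine name reported by the OS (for example `aarch64` or `x86_64`); neither assumption is confirmed by these notes, and the helper name is hypothetical. Whether a wheel exists for a given combination should be checked before installing.

```python
import platform
import sys

# URL prefix and version string copied from the known-issue note above.
BASE = "https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/"
VERSION = "2.9.0.post1%2Bgit4c901a4"


def torch_npu_wheel_url(python_tag=None, arch=None):
    """Fill in the PYTHON_TAG and ARCH placeholders of the wheel URL.

    Defaults are guesses derived from the current interpreter and machine.
    """
    if python_tag is None:
        # Assumption: the tag follows the CPython convention, e.g. "cp311".
        python_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    if arch is None:
        # Assumption: ARCH is the machine name, e.g. "aarch64" or "x86_64".
        arch = platform.machine()
    return (f"{BASE}torch_npu-{VERSION}-"
            f"{python_tag}-{python_tag}-manylinux_2_28_{arch}.whl")


print(torch_npu_wheel_url())
```

The printed URL can then be passed to `pip install` to perform the manual upgrade.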

Full Changelog: v0.17.0rc1...v0.19.1rc1