Releases: EnflameTechnology/vllm-gcu

v0.11.0

12 May 02:41

🚀 Release Summary

This release delivers a broad set of vLLM-GCU enhancements across model enablement, DeepSeek 3.2 optimization, distributed inference, GCU custom operators, KV-cache transfer, FP8/INT8 inference paths, async scheduling, and stability fixes. It also includes updated model documentation and build improvements for GCU Docker-based source compilation.


✨ Highlights

  • Added GCU Docker source compilation support.
  • Added or improved support for multiple model families, including Qwen2.5-VL, Qwen3-VL, Qwen3-Next, Qwen3 dense models, Libra, DeepSeek 3.2, DeepSeek-OCR, Hunyuan-OCR, Hunyuan A13B, MiMo-V2-Flash, MiniCPM-V, Paddle-VL, Step3-VL, GLM-4.5-Air, and Tongyi-DeepResearch.
  • Introduced major DeepSeek 3.2 runtime optimizations, including async scheduling, MTP support, DBO adaptation, FlashMLA enhancements, DS fusion improvements, indexer optimizations, and FP8 KV support.
  • Added first-token reuse and layer-wise KV cache transfer infrastructure.
  • Added new GCU custom operators and native op support for vLLM-GCU.
  • Improved INT8 KV, FP8 linear/MoE, W8A8 INT8 MoE, Split-Q, and fused MoE support.
  • Fixed multiple runtime correctness issues around MRoPE, rotary embedding, LoRA with graph mode, async + MTP, DP/SP/TP behavior, dtype mismatch, and model-specific inference paths.

🎁 Features

Build and Packaging

  • Added source code compilation support with GCU Docker.
  • Updated FlashAttention release rules for GCU release packaging.
  • Improved coverage build and package uninstall handling in Dolphin-based workflows.
  • Avoided unnecessary topsfactor and tops-sdk installation during tests to streamline release/test environments.

Model Support and Enablement

  • Added MRoPE support for Qwen2.5-VL and Libra.
  • Added Paddle-VL support for vLLM 0.11.0.
  • Added support for MiMo-V2-Flash.
  • Added Step3-VL support on vLLM 0.11.0 Libra.
  • Added Step3-VL model integration.
  • Optimized Qwen3-VL inference performance.
  • Improved Qwen2.5-VL performance.
  • Added DeepSeek 3.2 MTP support.
  • Added DeepSeek 3.2 support for mtp > 1.
  • Added DeepSeek 3.2 support for dp2tp4sp4ep8.
  • Added DeepSeek 3.2 async scheduler support.
  • Added DeepSeek 3.2 MTP disablement path for EPLB.
  • Added DS-V3.2 chat-template support.
  • Added Hunyuan-OCR support in vLLM 0.11.0.
  • Added DS-OCR2 support in vLLM 0.11.0.
  • Added reasoning parser support.
  • Added repetition penalties operator support for S60.
  • Added repetition penalties operator support in application runtime.
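
For readers unfamiliar with the repetition penalties operator, the following is a minimal pure-Python sketch of the usual repetition-penalty rule: logits of tokens already seen are divided by the penalty when positive and multiplied by it otherwise. It only illustrates the arithmetic. The function name `apply_repetition_penalty` and its parameters are hypothetical and are not the GCU operator's actual interface.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of `logits` with a repetition penalty applied.

    Hypothetical reference helper, not the vLLM-GCU operator API.
    Tokens that already appeared (in the prompt or the output so far)
    become less likely when penalty > 1.0.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    out = list(logits)
    # Each distinct seen token is penalized exactly once.
    for tok in set(seen_token_ids):
        if out[tok] > 0:
            out[tok] /= penalty  # shrink positive logits
        else:
            out[tok] *= penalty  # push non-positive logits further down
    return out


logits = [2.0, -1.0, 0.5, 3.0]
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 1, 0], penalty=2.0)
print(penalized)  # [1.0, -2.0, 0.5, 3.0]
```

A fused device operator would typically apply the same rule to a whole batch of logits at once. The loop above is only meant to make the per-token behavior easy to check.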

GCU Runtime and Operator Enhancements

  • Added Torch native custom op support in vLLM-GCU.
  • Added custom op support in the native op module.
  • Added convert_req_index_to_global_index operator.
  • Added cp_gather_indexer_k_quant_cache operator for DeepSeek V3.2.
  • Added topk_per_row operators for DeepSeek V3.2.
  • Enabled normalized parameter support for topk_softmax.
  • Added topsvllmMRotaryEmbedding operator for interleaved RoPE.
  • Added new attention backend for INT8 KV on Scorpio.
  • Added W8A8 INT8 MoE inference support.
  • Added FusedMoE bias support.
  • Added swigluoai_and_mul.
  • Added FP8-per-tensor support for linear and MoE layers.
  • Added Split-Q support for decode (D) in DeepSeek 3.2.
  • Added token bin count return support.
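
As an illustration of what a normalization parameter on a top-k softmax typically does in MoE routing, here is a small pure-Python sketch. It assumes the option behaves like the common "renormalize" flag, where the selected expert weights are rescaled to sum to 1. The function name and signature are hypothetical and do not describe the GCU operator's interface.

```python
import math


def topk_softmax(router_logits, top_k, renormalize=False):
    """Select the top_k experts for one token and return (weights, expert_ids).

    Hypothetical reference helper for illustration only.
    """
    if not 1 <= top_k <= len(router_logits):
        raise ValueError("top_k must be between 1 and the number of experts")
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Highest probability first; ties go to the lower expert index.
    expert_ids = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:top_k]
    weights = [probs[i] for i in expert_ids]
    if renormalize:
        # Rescale so the selected weights sum to 1.
        selected_mass = sum(weights)
        weights = [w / selected_mass for w in weights]
    return weights, expert_ids


raw_weights, expert_ids = topk_softmax([1.0, 3.0, 2.0, 0.0], top_k=2)
norm_weights, _ = topk_softmax([1.0, 3.0, 2.0, 0.0], top_k=2, renormalize=True)
print(expert_ids)         # [1, 2]
print(sum(raw_weights))   # less than 1: mass of the unselected experts is dropped
print(sum(norm_weights))  # approximately 1
```

Without renormalization the selected weights keep their share of the full softmax, so the combined expert output is scaled down by the dropped mass. With it, the selected experts form a proper convex combination.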

DeepSeek 3.2 Optimization

  • Added DeepSeek V3.2 optimization series.
  • Added DS3.2 operator elimination.
  • Added DS3.2 per-tensor KV FP8 support.
  • Added DS3.2 indexer module using inplace rotary embedding.
  • Added DS3.2 preprocessing for indexer.weights_proj.
  • Added DS3.2 rope_with_kvcache support.
  • Added DS3.2 TopSTX device annotation support.
  • Added DS3.2 MQA logits threshold support.
  • Added DS3.2 top-k threshold support.
  • Added FlashMLA threshold support.
  • Added FlashMLA sparse and mixed runtime improvements.
  • Added DeepEP / DeepGEMM integration into vLLM.
  • Added DeepEP HT chunk support.
  • Added MQA logits cleanup behavior.
  • Added DCP support for MLA.

Distributed and Parallel Inference

  • Added support for prefill (P) tensor_parallel_size > decode (D) tensor_parallel_size.
  • Added DecodeBenchConnector.
  • Enabled layer-wise KV cache transfer using the NIXL backend.
  • Added first-token reuse for layer-wise KV cache transfer.
  • Moved proxy code into the vLLM-GCU repository.
  • Upgraded toy_proxy_server.py to an enhanced version.
  • Added toy_proxy_server.py from vLLM 0.11.0.
  • Added DBO adaptation.
  • Added DBO graph compatibility improvements.

Pipeline and Infrastructure

  • Added pipeline commit configuration support in .pipeline/mod_version.config.
  • Added SAST workflow integration.
  • Added clone step for SAST source code.
  • Updated push branch information handling.
  • Updated CI workflow by removing SAST result settings.
  • Added CI workflow foundation.

🛠️ Fixes

Model Runtime Fixes

  • Fixed XDRoPE error.
  • Fixed MRoPE behavior with async execution.
  • Fixed FP8 logits path to use V1.
  • Fixed missing num_q_heads in MLA V1.
  • Fixed indexer RoPE issue.
  • Fixed Qwen3-Next chunk_delta_h op grid dimension.
  • Fixed DP error for Qwen3-Next.
  • Fixed DP=2 behavior for MiniCPM-V-4.
  • Fixed Python 3.10 incompatibility in Libra code.
  • Fixed Qwen3-VL 8B Instruct documentation/runtime alignment issue.
  • Fixed DS3.2 check_impl.
  • Removed slice_scatter after rotary embedding for DS3.2.
  • Fixed DS-V3.2 shared experts layer name import issue.
  • Fixed DS-V3.2 attention SP-to-TP behavior.
  • Fixed DS-V3.2 MTP weight loading.
  • Fixed MTP2 + graph non-uniform behavior.
  • Fixed fused MTP none usqz issue.
  • Fixed advanced step behavior when chunked execution is enabled.
  • Fixed async + embedding mismatch.
  • Fixed async + MTP max_model_len.
  • Fixed LoRA support with graph mode.
  • Fixed W8A8 INT8 bias type for 3.0.
  • Fixed GCU model runner profiling patch for record_function_or_nullcontext.

GCU and Distributed Runtime Fixes

  • Fixed GPT-OSS compile error.
  • Fixed all-gather input dtype mismatch.
  • Fixed unnecessary speculative decoding computation.
  • Added s_aux support.
  • Fixed DBO graph compatibility.
  • Prevented adding zero full-graph when mode is unsupported.
  • Fixed coverage build error in Dolphin.
  • Fixed coverage test packaging uninstall failure.

v0.9.2

23 Dec 10:20

Summary

This release focuses on stability, performance, and expanded inference capabilities, particularly around vLLM GCU, MoE, quantization, CI automation, and distributed execution. It also includes multiple bug fixes, documentation updates, and internal refactors to improve robustness and maintainability.


Highlights

  • Expanded quantized inference support (W8A8, INT8, AWQ, GPTQ, MoE).
  • Improved distributed and parallel execution stability (DP/TP, sampler behavior).
  • New asynchronous scheduling and execution support.
  • Enhanced vLLM GCU operators and kernels, including TopK/TopP and GELU optimizations.
  • CI automation improvements and new test/benchmark capabilities.
  • Documentation and README updates aligned with v0.9.2.

New Features

  • Added support for asynchronous scheduling and asynchronous execution.

  • CI automation enhancements:

    • First token reuse.
    • Proxy modification and execution robustness fixes.
  • vLLM GCU:

    • W8A8 INT8 MoE inference support.
    • TopK / TopP operator support.
    • New TopkToppRandomSamplerFromLogits operator.
    • Migrated the quick GELU implementation to an optimized backend.
  • MLA:

    • Quantization support for chunked prefill with KV cache.
  • vLLM:

    • LMCache operators added.
  • Improved handling of PCI/NIC configuration logging.
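
To make the TopK / TopP sampling items above concrete, the sketch below shows one common ordering of the two filters in pure Python: keep the top_k most likely tokens, renormalize, then keep the smallest prefix whose cumulative probability reaches top_p, and sample from what remains. This is an assumption about typical sampler behavior, not a description of the TopkToppRandomSamplerFromLogits operator. All function names here are hypothetical.

```python
import math
import random


def top_k_top_p_filter(logits, top_k, top_p):
    """Return (token_ids, probs) for the candidates that survive both filters.

    Hypothetical reference helper for illustration only.
    """
    if not 1 <= top_k <= len(logits):
        raise ValueError("top_k must be between 1 and the vocabulary size")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Numerically stable softmax over the full vocabulary.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep the k most likely tokens, most likely first.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:top_k]
    k_mass = sum(probs[i] for i in order)
    # Top-p: walk the renormalized top-k list until the mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return kept, [probs[i] / kept_mass for i in kept]


logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
token_ids, token_probs = top_k_top_p_filter(logits, top_k=3, top_p=0.8)
print(token_ids)  # [0, 1]

# Draw one token from the filtered distribution (seeded for repeatability).
sampled = random.Random(0).choices(token_ids, weights=token_probs, k=1)[0]
```

A fused operator would do the filtering and the random draw in a single device kernel. The two-step version here only shows which tokens remain eligible.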


Bug Fixes

  • Fixed execute_dummy_batch_patch error under DP=2 / TP=1 configurations.

  • Corrected B_scale dimension handling when dimension equals 2.

  • Fixed MoE:

    • Quant/unquant mismatch.
    • Equality and DeepSeek VL compatibility issues.
  • Fixed AWQ and GPTQ inference issues.

  • Disabled EPLB for MTP layers where it does not apply.

  • Corrected weight scale exchange behavior for W4A8 during EPLB.

  • Fixed flash MLA KV-cache bridging issues.

  • Ensured correct behavior when network configuration is missing.

  • Forced DP sampler disable on specific hardware targets.

  • Backported fixes from higher vLLM versions into 0.9.x.


Refactors & Workarounds

  • Refactored EPLB physical mapping and recording operations.
  • Implemented reduce-scatter / all-gather workaround for specific distributed paths.
  • Internal cleanup: removed unused code paths and legacy logic.

Documentation

  • Updated installation documentation.
  • README updated for v0.9.2.
  • Minor documentation corrections and clarifications.

Maintenance

  • Branch merges and release stabilization for v0.9.2.
  • General cleanup and CI reliability improvements.

0.8.0

17 Sep 08:30

v0.8.0

Keyword replacement