Releases: EnflameTechnology/vllm-gcu
v0.11.0
🚀 Release Summary
This release delivers a broad set of vLLM-GCU enhancements across model enablement, DeepSeek 3.2 optimization, distributed inference, GCU custom operators, KV-cache transfer, FP8/INT8 inference paths, async scheduling, and stability fixes. It also includes updated model documentation and build improvements for GCU Docker-based source compilation.
✨ Highlights
- Added GCU Docker source compilation support.
- Added or improved support for multiple model families, including Qwen2.5-VL, Qwen3-VL, Qwen3-Next, Qwen3 dense models, Libra, DeepSeek 3.2, DeepSeek-OCR, Hunyuan-OCR, Hunyuan A13B, MiMo-V2-Flash, MiniCPM-V, Paddle-VL, Step3-VL, GLM-4.5-Air, and Tongyi-DeepResearch.
- Introduced major DeepSeek 3.2 runtime optimizations, including async scheduling, MTP support, DBO adaptation, FlashMLA enhancements, DS fusion improvements, indexer optimizations, and FP8 KV support.
- Added first-token reuse and layer-wise KV cache transfer infrastructure.
- Added new GCU custom operators and native op support for vLLM-GCU.
- Improved INT8 KV, FP8 linear/MoE, W8A8 INT8 MoE, Split-Q, and fused MoE support.
- Fixed multiple runtime correctness issues around MRoPE, rotary embedding, LoRA with graph mode, async + MTP, DP/SP/TP behavior, dtype mismatch, and model-specific inference paths.
🎁 Features
Build and Packaging
- Added source code compilation support with GCU Docker.
- Updated FlashAttention release rules for GCU release packaging.
- Improved coverage build and package uninstall handling in Dolphin-based workflows.
- Avoided unnecessary `topsfactor` and `tops-sdk` installation during tests to streamline release/test environments.
Model Support and Enablement
- Added MRoPE support for Qwen2.5-VL and Libra.
- Added Paddle-VL support for vLLM 0.11.0.
- Added support for MiMo-V2-Flash.
- Added Step3-VL support on vLLM 0.11.0 Libra.
- Added Step3-VL model integration.
- Added Qwen3-VL inference performance optimization.
- Added Qwen2.5-VL performance improvements.
- Added DeepSeek 3.2 MTP support.
- Added DeepSeek 3.2 support for `mtp > 1`.
- Added DeepSeek 3.2 support for `dp2tp4sp4ep8`.
- Added DeepSeek 3.2 async scheduler support.
- Added DeepSeek 3.2 MTP disablement path for EPLB.
- Added DS-V3.2 chat-template support.
- Added Hunyuan-OCR support in vLLM 0.11.0.
- Added DS-OCR2 support in vLLM 0.11.0.
- Added reasoning parser support.
- Added repetition penalties operator support for S60.
- Added repetition penalties operator support in application runtime.
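As a rough illustration of what a repetition-penalty operator computes, here is a minimal pure-Python sketch of the commonly used rule: a logit for an already generated token is divided by the penalty if positive and multiplied by it otherwise. The function name and signature are hypothetical and are not the vLLM-GCU operator interface; the actual GCU kernel may differ in details.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a new logits list with previously seen tokens penalized.

    Hypothetical helper for illustration only (not the vLLM-GCU API).
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    out = list(logits)
    for tok in set(seen_token_ids):
        # Positive logits shrink and negative logits become more negative,
        # so a penalty > 1 always lowers the token's probability.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.0, -1.0, 0.5, 3.0]
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 1, 0], penalty=2.0)
print(penalized)  # tokens 0 and 1 are penalized; tokens 2 and 3 are unchanged
```

A penalty of 1.0 leaves the logits untouched, which is why it is the usual default.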
GCU Runtime and Operator Enhancements
- Added Torch native custom op support in vLLM-GCU.
- Added custom op support in the native op module.
- Added `convert_req_index_to_global_index` operator.
- Added `cp_gather_indexer_k_quant_cache` operator for DeepSeek V3.2.
- Added `topk_per_row` operators for DeepSeek V3.2.
- Enabled normalized parameter support for `topk_softmax`.
- Added `topsvllmMRotaryEmbedding` operator for interleaved RoPE.
- Added new attention backend for INT8 KV on Scorpio.
- Added W8A8 INT8 MoE inference support.
- Added FusedMoE bias support.
- Added `swigluoai_and_mul`.
- Added FP8-per-tensor support for linear and MoE layers.
- Added Split-Q support for D in DeepSeek 3.2.
- Added token bin count return support.
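To make the `topk_softmax` item above more concrete, the following is a minimal pure-Python sketch of top-k expert routing with optional renormalization, as typically used in MoE layers. The parameter name `renormalize` and the function signature are assumptions for illustration; the release notes only state that a normalized parameter is now supported, and the GCU operator's actual interface is not shown here.

```python
import math


def topk_softmax(router_logits, top_k, renormalize=False):
    """Softmax over expert logits, then keep the top_k experts.

    Illustrative sketch only; `renormalize` is an assumed parameter name.
    """
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the top_k largest probabilities, highest first.
    ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in ids]

    if renormalize:
        # Rescale so the selected experts' weights sum to 1.
        kept = sum(weights)
        weights = [w / kept for w in weights]
    return weights, ids


raw_weights, raw_ids = topk_softmax([1.0, 3.0, 2.0, 0.0], top_k=2)
norm_weights, norm_ids = topk_softmax([1.0, 3.0, 2.0, 0.0], top_k=2, renormalize=True)
```

Without renormalization the selected weights sum to less than 1 because the probability mass of the dropped experts is discarded; with it, the selected weights form a proper distribution.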
DeepSeek 3.2 Optimization
- Added DeepSeek V3.2 optimization series.
- Added DS3.2 operator elimination.
- Added DS3.2 per-tensor KV FP8 support.
- Added DS3.2 indexer module using inplace rotary embedding.
- Added DS3.2 preprocessing for `indexer.weights_proj`.
- Added DS3.2 `rope_with_kvcache` support.
- Added DS3.2 TopSTX device annotation support.
- Added DS3.2 MQA logits threshold support.
- Added DS3.2 top-k threshold support.
- Added FlashMLA threshold support.
- Added FlashMLA sparse and mixed runtime improvements.
- Added DeepEP / DeepGEMM integration into vLLM.
- Added DeepEP HT chunk support.
- Added MQA logits cleanup behavior.
- Added DCP support for MLA.
Distributed and Parallel Inference
- Added support for `P tensor_parallel_size > D tensor_parallel_size`.
- Added DecodeBenchConnector.
- Enabled layer-wise KV cache transfer using the NIXL backend.
- Added first-token reuse for layer-wise KV cache transfer.
- Moved proxy code into the vLLM-GCU repository.
- Upgraded `toy_proxy_server.py` to an enhanced version.
- Added `toy_proxy_server.py` from vLLM 0.11.0.
- Added DBO adaptation.
- Added DBO graph compatibility improvements.
Pipeline and Infrastructure
- Added pipeline commit configuration support in `.pipeline/mod_version.config`.
- Added SAST workflow integration.
- Added clone step for SAST source code.
- Updated push branch information handling.
- Updated CI workflow by removing SAST result settings.
- Added CI workflow foundation.
🛠️ Fixes
Model Runtime Fixes
- Fixed XDRoPE error.
- Fixed MRoPE behavior with async execution.
- Fixed FP8 logits path to use V1.
- Fixed missing `num_q_heads` in MLA V1.
- Fixed indexer RoPE issue.
- Fixed Qwen3-Next `chunk_delta_h` op grid dimension.
- Fixed DP error for Qwen3-Next.
- Fixed DP=2 behavior for MiniCPM-V-4.
- Fixed Python 3.10 incompatibility in Libra code.
- Fixed Qwen3-VL 8B Instruct documentation/runtime alignment issue.
- Fixed DS3.2 `check_impl`.
- Removed `slice_scatter` after rotary embedding for DS3.2.
- Fixed DS-V3.2 shared experts layer name import issue.
- Fixed DS-V3.2 attention SP-to-TP behavior.
- Fixed DS-V3.2 MTP weight loading.
- Fixed MTP2 + graph non-uniform behavior.
- Fixed fused MTP `none usqz` issue.
- Fixed advanced step behavior when chunked execution is enabled.
- Fixed async + embedding mismatch.
- Fixed async + MTP `max_model_len`.
- Fixed LoRA support with graph mode.
- Fixed W8A8 INT8 bias type for 3.0.
- Fixed GCU model runner profiling patch for `record_function_or_nullcontext`.
GCU and Distributed Runtime Fixes
- Fixed GPT-OSS compile error.
- Fixed all-gather input dtype mismatch.
- Fixed unnecessary speculative decoding computation.
- Added `s_aux` support.
- Fixed DBO graph compatibility.
- Prevented adding zero full-graph when mode is unsupported.
- Fixed coverage build error in Dolphin.
- Fixed coverage test packaging uninstall failure.
v0.9.2
Summary
This release focuses on stability, performance, and expanded inference capabilities, particularly around vLLM GCU, MoE, quantization, CI automation, and distributed execution. It also includes multiple bug fixes, documentation updates, and internal refactors to improve robustness and maintainability.
Highlights
- Expanded quantized inference support (W8A8, INT8, AWQ, GPTQ, MoE).
- Improved distributed and parallel execution stability (DP/TP, sampler behavior).
- New asynchronous scheduling and execution support.
- Enhanced vLLM GCU operators and kernels, including TopK/TopP and GELU optimizations.
- CI automation improvements and new test/benchmark capabilities.
- Documentation and README updates aligned with v0.9.2.
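For readers unfamiliar with the INT8 paths mentioned in the highlights, the sketch below shows symmetric per-tensor INT8 quantization in pure Python. It is a simplified illustration under stated assumptions: production W8A8 kernels commonly use per-channel or per-token scales, and nothing here reflects the actual vLLM-GCU implementation. All function names are hypothetical.

```python
def quantize_int8_per_tensor(values):
    """Symmetric per-tensor quantization to the range [-127, 127].

    Hypothetical helper for illustration; returns (int8 codes, scale).
    """
    amax = max(abs(v) for v in values)
    # A single scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


tensor = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8_per_tensor(tensor)
restored = dequantize_int8(codes, scale)
```

The round-trip error of each element is bounded by half the scale, which is the basic trade-off that finer-grained (per-channel or per-token) scales reduce.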
New Features
- Support asynchronous scheduling and asynchronous execution.
- CI automation enhancements:
  - First token reuse.
  - Proxy modification and execution robustness fixes.
- vLLM GCU:
  - W8A8 INT8 MoE inference support.
  - TopK / TopP operator support.
  - New `TopkToppRandomSamplerFromLogits` operator.
  - GELU quick implementation migrated to optimized backend.
- MLA:
  - Quantization support for chunked prefill with KV cache.
- vLLM:
  - LMCache operators added.
- Improved handling of PCI/NIC configuration logging.
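The TopK / TopP items in the list above refer to a standard sampling technique. The following pure-Python sketch shows the general idea of combined top-k and top-p (nucleus) sampling from logits. It is not the `TopkToppRandomSamplerFromLogits` operator itself, whose interface is not described in these notes; the function name and parameters are hypothetical.

```python
import math
import random


def sample_top_k_top_p(logits, top_k, top_p, rng):
    """Sample one token id after top-k and then top-p filtering.

    Illustrative sketch only; `rng` is a random.Random instance.
    """
    # Numerically stable softmax over the full vocabulary.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights, so no explicit rescaling is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)
greedy = sample_top_k_top_p([0.1, 5.0, 0.2], top_k=1, top_p=1.0, rng=rng)
draws = [sample_top_k_top_p([1.0, 2.0, 3.0, 4.0], top_k=2, top_p=1.0, rng=rng)
         for _ in range(100)]
```

Setting `top_k=1` reduces the procedure to greedy decoding, and `top_p=1.0` disables the nucleus cutoff so only the top-k filter applies.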
Bug Fixes
- Fixed execute_dummy_batch_patch error under DP=2 / TP=1 configurations.
- Corrected B_scale dimension handling when dimension equals 2.
- Fixed MoE:
  - Quant/unquant mismatch.
  - Equality and DeepSeek VL compatibility issues.
- Fixed AWQ and GPTQ inference issues.
- Disabled EPLB for MTP layer where inappropriate.
- Corrected weight scale exchange behavior for W4A8 during EPLB.
- Fixed flash MLA KV-cache bridging issues.
- Ensured correct behavior when network configuration is missing.
- Forced DP sampler disable on specific hardware targets.
- Backported fixes from higher vLLM versions into 0.9.x.
Refactors & Workarounds
- Refactored EPLB physical mapping and recording operations.
- Implemented reduce-scatter / all-gather workaround for specific distributed paths.
- Internal cleanup: removed unused code paths and legacy logic.
Documentation
- Updated installation documentation.
- README updated for v0.9.2.
- Minor documentation corrections and clarifications.
Maintenance
- Branch merges and release stabilization for v0.9.2.
- General cleanup and CI reliability improvements.