Releases · EnflameTechnology/vllm-gcu

12 May 02:41

v0.11.0

f81a741

v0.11.0 Latest

🚀 Release Summary

This release delivers a broad set of vLLM-GCU enhancements across model enablement, DeepSeek 3.2 optimization, distributed inference, GCU custom operators, KV-cache transfer, FP8/INT8 inference paths, async scheduling, and stability fixes. It also includes updated model documentation and build improvements for GCU Docker-based source compilation.

✨ Highlights

Added GCU Docker source compilation support.
Added or improved support for multiple model families, including Qwen2.5-VL, Qwen3-VL, Qwen3-Next, Qwen3 dense models, Libra, DeepSeek 3.2, DeepSeek-OCR, Hunyuan-OCR, Hunyuan A13B, MiMo-V2-Flash, MiniCPM-V, Paddle-VL, Step3-VL, GLM-4.5-Air, and Tongyi-DeepResearch.
Introduced major DeepSeek 3.2 runtime optimizations, including async scheduling, MTP support, DBO adaptation, FlashMLA enhancements, DS fusion improvements, indexer optimizations, and FP8 KV support.
Added first-token reuse and layer-wise KV cache transfer infrastructure.
Added new GCU custom operators and native op support for vLLM-GCU.
Improved INT8 KV, FP8 linear/MoE, W8A8 INT8 MoE, Split-Q, and fused MoE support.
Fixed multiple runtime correctness issues around MRoPE, rotary embedding, LoRA with graph mode, async + MTP, DP/SP/TP behavior, dtype mismatch, and model-specific inference paths.

🎁 Features

Build and Packaging

Added source code compilation support with GCU Docker.
Updated FlashAttention release rules for GCU release packaging.
Improved coverage build and package uninstall handling in Dolphin-based workflows.
Avoided unnecessary topsfactor and tops-sdk installation during tests to streamline release/test environments.

Model Support and Enablement

Added MRoPE support for Qwen2.5-VL and Libra.
Added Paddle-VL support for vLLM 0.11.0.
Added support for MiMo-V2-Flash.
Added Step3-VL support on vLLM 0.11.0 Libra.
Added Step3-VL model integration.
Added Qwen3-VL inference performance optimization.
Added Qwen2.5-VL performance improvements.
Added DeepSeek 3.2 MTP support.
Added DeepSeek 3.2 support for mtp > 1.
Added DeepSeek 3.2 support for dp2tp4sp4ep8.
Added DeepSeek 3.2 async scheduler support.
Added DeepSeek 3.2 MTP disablement path for EPLB.
Added DS-V3.2 chat-template support.
Added Hunyuan-OCR support in vLLM 0.11.0.
Added DS-OCR2 support in vLLM 0.11.0.
Added reasoning parser support.
Added repetition penalties operator support for S60.
Added repetition penalties operator support in application runtime.
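For readers unfamiliar with the repetition penalty that these operators accelerate, the following is a minimal pure-Python sketch of the commonly used semantics: logits of already-seen tokens are divided by the penalty when positive and multiplied by it otherwise. The function name and signature are hypothetical and chosen for illustration; this is not the GCU operator itself.

```python
from typing import Iterable, List


def apply_repetition_penalty(
    logits: List[float], seen_token_ids: Iterable[int], penalty: float
) -> List[float]:
    """Return a copy of `logits` with a repetition penalty applied.

    Hypothetical reference helper: for each token id that was already seen,
    a positive logit is divided by `penalty` and a non-positive logit is
    multiplied by it, so penalty > 1 always lowers that token's score.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    out = list(logits)
    for tok in set(seen_token_ids):
        if 0 <= tok < len(out):
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.0, -1.0, 0.5, 3.0]
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 1, 1], penalty=2.0)
print(penalized)
```

Tokens 0 and 1 are penalized once each even though token 1 appears twice in the history, because the penalty depends on presence rather than count.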

GCU Runtime and Operator Enhancements

Added Torch native custom op support in vLLM-GCU.
Added custom op support in the native op module.
Added convert_req_index_to_global_index operator.
Added cp_gather_indexer_k_quant_cache operator for DeepSeek V3.2.
Added topk_per_row operators for DeepSeek V3.2.
Enabled normalized parameter support for topk_softmax.
Added topsvllmMRotaryEmbedding operator for interleaved RoPE.
Added new attention backend for INT8 KV on Scorpio.
Added W8A8 INT8 MoE inference support.
Added FusedMoE bias support.
Added swigluoai_and_mul.
Added FP8-per-tensor support for linear and MoE layers.
Added Split-Q support for decode (D) in DeepSeek 3.2.
Added token bin count return support.
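To illustrate what a normalization switch on a top-k softmax routing operator typically changes, here is a small pure-Python reference. It assumes the usual MoE routing definition (softmax over gating logits, select the k largest, optionally rescale the selected weights to sum to 1). The function name and the `renormalize` flag are illustrative assumptions, not the signature of the GCU `topk_softmax` operator.

```python
import math
from typing import List, Tuple


def topk_softmax_reference(
    gating_logits: List[float], k: int, renormalize: bool
) -> Tuple[List[float], List[int]]:
    """Hypothetical reference: softmax over experts, then keep the top k.

    When `renormalize` is True the k selected weights are rescaled so that
    they sum to 1; otherwise the raw softmax probabilities are returned.
    """
    m = max(gating_logits)
    exps = [math.exp(x - m) for x in gating_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k most probable experts, highest first.
    ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in ids]
    if renormalize:
        s = sum(weights)
        weights = [w / s for w in weights]
    return weights, ids


gating = [1.0, 3.0, 2.0, 0.0]
raw_weights, raw_ids = topk_softmax_reference(gating, k=2, renormalize=False)
norm_weights, norm_ids = topk_softmax_reference(gating, k=2, renormalize=True)
print(raw_ids, raw_weights)
print(norm_ids, norm_weights)
```

Both calls select the same experts; only the scale of the returned weights differs.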

DeepSeek 3.2 Optimization

Added DeepSeek V3.2 optimization series.
Added DS3.2 operator elimination.
Added DS3.2 per-tensor KV FP8 support.
Added DS3.2 indexer module using inplace rotary embedding.
Added DS3.2 preprocessing for indexer.weights_proj.
Added DS3.2 rope_with_kvcache support.
Added DS3.2 TopSTX device annotation support.
Added DS3.2 MQA logits threshold support.
Added DS3.2 top-k threshold support.
Added FlashMLA threshold support.
Added FlashMLA sparse and mixed runtime improvements.
Added DeepEP / DeepGEMM integration into vLLM.
Added DeepEP HT chunk support.
Added MQA logits cleanup behavior.
Added DCP support for MLA.
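As background for the per-tensor KV FP8 item above, the sketch below shows how a single per-tensor scale is commonly derived. It assumes the float8 E4M3 format, whose largest finite value is 448; whether the GCU path uses this exact format is an assumption. The helper names are hypothetical, and the code only computes the scale and clamps the scaled values. It does not emulate FP8 rounding.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite value of float8 E4M3 (assumed format)


def per_tensor_fp8_scale(values: List[float]) -> float:
    """Hypothetical helper: one scale for the whole tensor, amax / FP8 max."""
    amax = max((abs(v) for v in values), default=0.0)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_and_clamp(values: List[float], scale: float) -> List[float]:
    """Divide by the scale and clamp into the representable FP8 range.

    A real kernel would additionally round each result to the nearest
    FP8 value; that step is intentionally left out of this sketch.
    """
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


kv_values = [0.5, -2.0, 1.25]
kv_scale = per_tensor_fp8_scale(kv_values)
kv_scaled = scale_and_clamp(kv_values, kv_scale)
kv_restored = [v * kv_scale for v in kv_scaled]  # dequantize with the same scale
print(kv_scale, kv_scaled)
```

Using one scale per tensor keeps the metadata small, at the cost of precision when a few outliers dominate the maximum.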

Distributed and Parallel Inference

Added support for a prefill (P) tensor_parallel_size larger than the decode (D) tensor_parallel_size.
Added DecodeBenchConnector.
Enabled layer-wise KV cache transfer using the NIXL backend.
Added first-token reuse for layer-wise KV cache transfer.
Moved proxy code into the vLLM-GCU repository.
Upgraded toy_proxy_server.py to an enhanced version.
Added toy_proxy_server.py from vLLM 0.11.0.
Added DBO adaptation.
Added DBO graph compatibility improvements.

Pipeline and Infrastructure

Added pipeline commit configuration support in .pipeline/mod_version.config.
Added SAST workflow integration.
Added clone step for SAST source code.
Updated push branch information handling.
Updated CI workflow by removing SAST result settings.
Added CI workflow foundation.

🛠️ Fixes

Model Runtime Fixes

Fixed XDRoPE error.
Fixed MRoPE behavior with async execution.
Fixed FP8 logits path to use V1.
Fixed missing num_q_heads in MLA V1.
Fixed indexer RoPE issue.
Fixed Qwen3-Next chunk_delta_h op grid dimension.
Fixed DP error for Qwen3-Next.
Fixed DP=2 behavior for MiniCPM-V-4.
Fixed Python 3.10 incompatibility in Libra code.
Fixed Qwen3-VL 8B Instruct documentation/runtime alignment issue.
Fixed DS3.2 check_impl.
Removed slice_scatter after rotary embedding for DS3.2.
Fixed DS-V3.2 shared experts layer name import issue.
Fixed DS-V3.2 attention SP-to-TP behavior.
Fixed DS-V3.2 MTP weight loading.
Fixed MTP2 + graph non-uniform behavior.
Fixed fused MTP none usqz issue.
Fixed advanced step behavior when chunked execution is enabled.
Fixed async + embedding mismatch.
Fixed async + MTP max_model_len.
Fixed LoRA support with graph mode.
Fixed W8A8 INT8 bias type for 3.0.
Fixed GCU model runner profiling patch for record_function_or_nullcontext.

GCU and Distributed Runtime Fixes

Fixed GPT-OSS compile error.
Fixed all-gather input dtype mismatch.
Fixed unnecessary speculative decoding computation.
Added s_aux support.
Fixed DBO graph compatibility.
Prevented adding zero full-graph when mode is unsupported.
Fixed coverage build error in Dolphin.
Fixed coverage test packaging uninstall failure.

23 Dec 10:20

guoqingbao

v0.9.2

34cdc6c

v0.9.2

Summary

This release focuses on stability, performance, and expanded inference capabilities, particularly around vLLM GCU, MoE, quantization, CI automation, and distributed execution. It also includes multiple bug fixes, documentation updates, and internal refactors to improve robustness and maintainability.

Highlights

Expanded quantized inference support (W8A8, INT8, AWQ, GPTQ, MoE).
Improved distributed and parallel execution stability (DP/TP, sampler behavior).
New asynchronous scheduling and execution support.
Enhanced vLLM GCU operators and kernels, including TopK/TopP and GELU optimizations.
CI automation improvements and new test/benchmark capabilities.
Documentation and README updates aligned with v0.9.2.
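The W8A8 term in the highlights means that both weights and activations are stored as 8-bit values. The sketch below illustrates the idea with symmetric per-tensor INT8 quantization, integer accumulation, and a final rescale. It is a simplified illustration under those assumptions (production kernels often use per-channel or per-token scales), and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor INT8 quantization: returns (int values, scale)."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def w8a8_dot(activations: List[float], weights: List[float]) -> float:
    """Dot product with 8-bit activations and 8-bit weights (sketch)."""
    qa, sa = quantize_int8(activations)
    qw, sw = quantize_int8(weights)
    acc = sum(a * w for a, w in zip(qa, qw))  # exact integer accumulation
    return acc * sa * sw  # rescale back to floating point


x = [0.5, -1.0, 0.25]
w = [0.2, 0.4, -0.8]
exact = sum(a * b for a, b in zip(x, w))
approx = w8a8_dot(x, w)
print(exact, approx)
```

The quantized result is close to, but not identical to, the floating-point dot product; the gap is the quantization error.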

New Features

Added support for asynchronous scheduling and asynchronous execution.
CI automation enhancements:
- First token reuse.
- Proxy modification and execution robustness fixes.
vLLM GCU:
- W8A8 INT8 MoE inference support.
- TopK / TopP operator support.
- New TopkToppRandomSamplerFromLogits operator.
- Quick GELU implementation migrated to an optimized backend.
MLA:
- Quantization support for chunked prefill with KV cache.
vLLM:
- LMCache operators added.
Improved handling of PCI/NIC configuration logging.
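The TopK / TopP items above refer to the standard sampling filters. As a rough pure-Python illustration of that technique (not the GCU operator or its interface), the sketch below keeps the `top_k` most probable tokens, then keeps the smallest prefix of them whose renormalized cumulative probability reaches `top_p`, and finally samples from the result. All names are hypothetical.

```python
import math
import random
from typing import List


def top_k_top_p_filter(logits: List[float], top_k: int, top_p: float) -> List[float]:
    """Return per-token probabilities after top-k then top-p filtering."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens (top_k <= 0 disables the filter).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    k_mass = sum(probs[i] for i in order)

    # Top-p: keep the shortest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


filtered = top_k_top_p_filter([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.9)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
token = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(filtered, token)
```

With these inputs only the two most probable tokens survive both filters, so the sampled token is always one of them.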

Bug Fixes

Fixed execute_dummy_batch_patch error under DP=2 / TP=1 configurations.
Corrected B_scale dimension handling when dimension equals 2.
Fixed MoE:
- Quant/unquant mismatch.
- Equality and DeepSeek VL compatibility issues.
Fixed AWQ and GPTQ inference issues.
Disabled EPLB for MTP layer where inappropriate.
Corrected weight scale exchange behavior for W4A8 during EPLB.
Fixed flash MLA KV-cache bridging issues.
Ensured correct behavior when network configuration is missing.
Forced DP sampler disable on specific hardware targets.
Backported fixes from higher vLLM versions into 0.9.x.

Refactors & Workarounds

Refactored EPLB physical mapping and recording operations.
Implemented reduce-scatter / all-gather workaround for specific distributed paths.
Internal cleanup: removed unused code paths and legacy logic.

Documentation

Updated installation documentation.
Updated README for v0.9.2.
Minor documentation corrections and clarifications.

Maintenance

Branch merges and release stabilization for v0.9.2.
General cleanup and CI reliability improvements.

17 Sep 08:30

allen7lee

v0.8.0

2ef904c

0.8.0

v0.8.0

Keyword replacement
