v0.11.0 #272

xyDong0223 · 2026-03-13T12:08:23Z

xyDong0223
Mar 13, 2026
Maintainer

vLLM-Kunlun v0.11.0

vLLM-Kunlun v0.11.0 features 154 commits from 31 contributors, including first-time contributors!

✨ Highlights

🤖 DeepSeek-V3/R1/V3.2 Full Support

vLLM-Kunlun v0.11.0 delivers complete support for the DeepSeek model family on Kunlun hardware:

🆕 Full inference support for DeepSeek-V3, R1, and V3.2-Exp ([Feature] support deepseek v3/r1/v3.2 #78)
🚀 Multi-Token Prediction (MTP) support for DeepSeek-V3.2, with performance improvements in both Full and PieceWise modes (clean pr for dsv3.2 mtp support #164)
⚡ Enabled full CUDA Graph for DeepSeek models (enable full cudagraph for deepseek #106)
🔧 Removed MLA patch; --compilation-config is no longer required for DeepSeek-V3.1 ([Bugfix]remove mla patch, server args with no --compilation-config for ds v3.1 #145)
⚡ Added kernels to optimize RoPE and the decoding stage for DeepSeek-V3.2 ([Feature][DS32] Add kernels to optimize RoPE and the decoding stage #143)

🔀 Multi-LoRA Inference Optimization

🆕 Full multi-LoRA inference support on Kunlun hardware ([Feature] support multi-lora inference,latest xspeedgate needed #133)
🚀 Further optimized multi-LoRA performance; LoRA-enabled inference now achieves 80%+ of non-LoRA performance (Further optimize multi-lora inference,LoRA-enabled performance achieves 80%+ of non-LoRA performance #190)
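For readers unfamiliar with what multi-LoRA serving computes, the following is a minimal pure-Python sketch of the general idea: each request in a batch adds its own low-rank update B(Ax) to a shared base projection. It is not the Kunlun or xspeedgate implementation. The names `matvec`, `lora_forward`, `adapter_a`, and `adapter_b`, along with all weights, are hypothetical and chosen only for illustration.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x, base_w, adapters, adapter_id):
    """Base projection plus the low-rank update of the selected adapter."""
    y = matvec(base_w, x)
    if adapter_id is None:
        # Request without LoRA: only the shared base weight is applied.
        return y
    a, b, scaling = adapters[adapter_id]
    delta = matvec(b, matvec(a, x))  # B @ (A @ x); the rank is len(a)
    return [yi + scaling * di for yi, di in zip(y, delta)]


# Shared 2x2 base weight (identity, assumed for illustration).
base_w = [[1.0, 0.0], [0.0, 1.0]]

# Two rank-1 adapters: A is 1x2, B is 2x1, plus a scaling factor.
adapters = {
    "adapter_a": ([[1.0, 1.0]], [[1.0], [0.0]], 0.5),
    "adapter_b": ([[1.0, -1.0]], [[0.0], [2.0]], 1.0),
}

# One batch mixing requests that use different adapters, or none.
batch = [([2.0, 4.0], "adapter_a"), ([2.0, 4.0], "adapter_b"), ([2.0, 4.0], None)]
outputs = [lora_forward(x, base_w, adapters, aid) for x, aid in batch]
```

The extra B(Ax) work per request is the overhead that a LoRA-enabled deployment pays relative to the base model, which is the gap the 80%+ figure above refers to.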

🔍 Embedding Model Support

🆕 Support for BGE embedding models on Kunlun hardware, enabling vector retrieval and RAG use cases ([Feature] Support BGE embedding models #267)
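As a rough illustration of the vector retrieval use case, the sketch below ranks documents by cosine similarity to a query vector. The vectors are made-up toy values rather than real BGE outputs, and `cosine_similarity` and `rank_documents` are hypothetical helper names, not part of vLLM-Kunlun.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional "embeddings" (assumed values for illustration only).
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]
ranking = rank_documents(query, docs)
```

In a RAG pipeline, the top-ranked documents would then be passed to a generation model as context.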

🗜️ Quantization Enhancements

🆕 Support for Compressed-Tensors W8A8 quantization ([dev] support compressed-tensors w8a8 quantization #75)
🆕 Support for Compressed-Tensors W4A16 quantization ([Feature] support compressed-tensors w4a16 quantization #154)
🆕 Support for AWQ MoE W4A16 quantization ([Feature] Support AWQ MoE W4A16 Quantization #142)
🆕 Support for Mixed-Precision Quantization for MoE models ([Feature] Support Mixed-Precision Quantization for MoE #112)
🆕 Added INT8 quantized model list for DeepSeek, Qwen, and MiniMax series ([Doc] add int8 model list #254, [Models]add Qwen3.5 and MinMax INT8 models #264)
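To make the naming concrete: W8A8 refers to 8-bit weights with 8-bit activations, and W4A16 to 4-bit weights with 16-bit activations. The sketch below shows plain symmetric per-tensor quantization as one simple way such integer weights can be derived. It is an assumption-laden illustration, not the Compressed-Tensors or AWQ algorithm, and `quantize_symmetric` and `dequantize` are hypothetical helper names.

```python
def quantize_symmetric(values, num_bits):
    """Symmetric per-tensor quantization to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [x * scale for x in q]


weights = [0.5, -1.0, 0.25, 0.0]
q8, s8 = quantize_symmetric(weights, 8)  # 8-bit codes and their scale
q4, s4 = quantize_symmetric(weights, 4)  # 4-bit codes and their scale
```

With fewer bits the scale step is coarser, so the reconstruction error per weight grows; that trade-off is why mixed-precision schemes keep some layers at higher precision.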

⚡ Kernel Optimizations

🔄 Migrated XTorch operations to native Kunlun operations, accelerating iteration ([Misc] Migrate XTorch operations to Kunlun operations (accelerating iteration) #177)
🚀 Added topk_per_row kernel to optimize Top-K index calculation ([Kernel] add topk_per_row to optimize the calculation of topk_indexes #168)
🚀 Added flashinfer_rotary_embedding and fast_topkv2 kernels; optimized int8_paged_mqa_logits with parallelism ([Feature][DS32] add 2 kernels and optimize the calculation of topk_indices #134, [Feature][DS32] Add kernels to optimize RoPE and the decoding stage #143)
🚀 Enabled fast random sampling on the Kunlun3 platform via hardware generators ([Kernel] Enable fast random sample on Kunlun3 Platform with generators #73)
🚀 Optimized Fused MoE kernels for small batch inference ([kernel] optimize the fuse moe with small batches #196)
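For context on the Top-K index calculation mentioned above, here is a plain-Python reference of the general per-row top-k operation: for every row of a score matrix, return the indices of the k largest entries. The function reuses the name `topk_per_row` only for readability. Its signature and behavior are assumptions and do not describe the Kunlun kernel.

```python
import heapq


def topk_per_row(scores, k):
    """For each row, return indices of the k largest values, highest first."""
    result = []
    for row in scores:
        k_row = min(k, len(row))  # guard against rows shorter than k
        top = heapq.nlargest(k_row, range(len(row)), key=row.__getitem__)
        result.append(top)
    return result


# Toy score matrix with two rows (assumed values for illustration only).
logits = [
    [0.1, 0.9, 0.3, 0.7],
    [5.0, 1.0, 4.0, 2.0],
]
indices = topk_per_row(logits, 2)
```

A hardware kernel would process all rows in parallel; the loop here only pins down the expected result.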

🆕 New Models

🤖 DeepSeek-V3 / R1 / V3.2-Exp: Full support for the DeepSeek series on Kunlun ([Feature] support deepseek v3/r1/v3.2 #78, clean pr for dsv3.2 mtp support #164)
🤖 Qwen3-Next: Support for Qwen3-Next and Qwen-next architecture ([Feature] Merge branch 'Qwen3-Next' into main && Support Qwen-next #222)
🤖 Qwen3.5-397B-A17B / Qwen3.5-122B-A10B: Support for Qwen3.5 MoE series with INT8 quantization ([Models]add Qwen3.5 and MinMax INT8 models #264)
🤖 GLM-4.7 / GLM-x: Support for GLM-4.7 with MTP, and GLM-x model family ([Feature] support GLM-4.7 MTP #187, [Model] GLM adaptation #194)
🖼️ InternVL2.5: Multimodal InternVL2.5 support on vLLM-Kunlun v0.11.0 ([Model] Supporet InternVL2_5 on v0.11.0 #72)
🖼️ XiaoMi MIMO Flash V2: Support for XiaoMi MIMO Flash V2 ([Feature] Support XiaoMi MIMO Flash V2 #62)
🤖 MiniMax-M2.1 / MiniMax-M2.5: Support for MiniMax series with INT8 quantization ([Models]add Qwen3.5 and MinMax INT8 models #264)
🤖 GPT-OSS: Support for GPT-OSS and updated model list ([Feature] Support gpt-oss and update model list #71)
🔍 BGE Embedding Models: Support for BGE embedding models for vector retrieval and RAG ([Feature] Support BGE embedding models #267)
🛠️ GLM-4.7 Tool Parser: Added GLM-4.7 tool parser with thinking/non-thinking mode toggle (【Feature】: add GLM-47 tool parser and support thinking/non-thinking mode toggle #151)

🔧 Features

🔍 Embedding

🆕 Support BGE embedding models on Kunlun; remove unnecessary params in attention implementation interfaces ([Feature] Support BGE embedding models #267) by @lishaobing448

🗜️ Quantization

🆕 Support Compressed-Tensors W8A8 quantization ([dev] support compressed-tensors w8a8 quantization #75) by @liwei109
🆕 Support Compressed-Tensors W4A16 quantization ([Feature] support compressed-tensors w4a16 quantization #154) by @liwei109
🆕 Support AWQ MoE W4A16 quantization ([Feature] Support AWQ MoE W4A16 Quantization #142) by @tangshiwen
🆕 Support Mixed-Precision Quantization for MoE ([Feature] Support Mixed-Precision Quantization for MoE #112) by @tangshiwen

⚡ Kernels

🚀 Add kernels to optimize RoPE and the decoding stage for DeepSeek-V3.2 ([Feature][DS32] Add kernels to optimize RoPE and the decoding stage #143) by @fromck
🚀 Add topk_per_row to optimize Top-K index calculation ([Kernel] add topk_per_row to optimize the calculation of topk_indexes #168) by @fromck
🚀 Add 2 kernels (flashinfer_rotary_embedding, fast_topkv2) and optimize topk_indices calculation ([Feature][DS32] add 2 kernels and optimize the calculation of topk_indices #134) by @fromck
🚀 Enable fast random sampling on Kunlun3 platform with hardware generators ([Kernel] Enable fast random sample on Kunlun3 Platform with generators #73) by @yuqilinaa
🚀 Optimize Fused MoE kernels for small batch scenarios ([kernel] optimize the fuse moe with small batches #196) by @ldh2020
🆕 Add gemma_rmsnorm, moe_pre_small, and split_norm_rope kernels ([Feature] Add gemma_rmsnorm, moe_pre_small and split_norm_rope. #180) by @Hanyu-Jin
🆕 Add rejection sampler kernel ([Kernel] Add rejection_sampler. #215) by @Hanyu-Jin
🆕 Enable INT8 BMM ([Feature]enable int8 bmm #91) by @zhihui96

🔀 Multi-LoRA

🆕 Full multi-LoRA inference support, requires latest xspeedgate ([Feature] support multi-lora inference,latest xspeedgate needed #133) by @15050188022
🚀 Further optimize multi-LoRA inference; LoRA performance achieves 80%+ of non-LoRA (Further optimize multi-lora inference,LoRA-enabled performance achieves 80%+ of non-LoRA performance #190) by @15050188022

🔮 MTP (Multi-Token Prediction)

🆕 MTP support for DeepSeek-V3.2 in Full and PieceWise modes (clean pr for dsv3.2 mtp support #164) by @15050188022
🆕 MTP support for GLM-4.7 ([Feature] support GLM-4.7 MTP #187) by @fromck
🆕 MTP support for Qwen3-Next; optimize apply_top_k_top_p ([Model] Support Qwen3-Next MTP #268) by @ldh2020
🚀 Optimize MTP ([Misc] optimize mtp #232) by @fromck
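MTP heads are commonly used as the draft mechanism in speculative decoding: several tokens are proposed in one step, then checked against the main model. The sketch below shows a simplified greedy acceptance rule under that assumption. It is not the rejection sampler added in this release, and `verify_draft` and the token IDs are hypothetical.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy acceptance: keep draft tokens while they match the target model.

    target_tokens has one more entry than draft_tokens: the target model's
    greedy choice at each draft position, plus one position beyond.
    """
    accepted = []
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            # First mismatch: take the target's token instead and stop.
            accepted.append(expected)
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the extra target token is kept as well.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted


out_partial = verify_draft([11, 22, 33], [11, 22, 99, 44])
out_full = verify_draft([11, 22, 33], [11, 22, 33, 44])
```

Each verification step therefore yields at least one token and at most one more than the number of drafted tokens, which is where the speedup comes from when drafts are usually right.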

🏗️ Infrastructure

🔄 Migrate XTorch operations to Kunlun operations ([Misc] Migrate XTorch operations to Kunlun operations (accelerating iteration) #177) by @xyDong0223
🔄 Unify custom operator registration to torch.ops using OOT method ([Update] 1/N Unified the registration of custom operators to torch.ops and fixed some minor issues #203, [Update] 1/N for v0.15.1 Implement and register Fused MoE Kunlun kernels using OOT method #209) by @xyDong0223
🔄 Register layernorm, rotary_embedding, and vocab_parallel_embedding via @CustomOp.register_oot ([Feature]Using @CustomOp.register_oot to register layernorm/rotary_embedding/vocab_parallel_embdding #234) by @lishiyong110
⚡ Enable full CUDA Graph for DeepSeek models (enable full cudagraph for deepseek #106) by @baoqian426
🆕 Use data parallelism (DP) for distributed inference (use for dp #90) by @baoqian426
🆕 Eager mode support for expert parallelism ([Bugfix] eager mode support expert parallel #260) by @Wfd567
🚀 Reduce host-device sync overhead in Qwen3.5 ([Misc] Reduce Host and device sync in Qwen3.5 #265) by @xyDong0223
🔄 Restore use of the reshape-and-cache kernel to update the Mamba cache ([Model] Recover use reshape and cache kernal to update mamba cache #261) by @xyDong0223
🛠️ Add collect_env feature for environment diagnostics ([Misc] add collect_env feat #218) by @Lidang-Jiang

🐛 Bug Fixes

🐛 Fix Kunlun Graph failure ([Bugfix] Fixed Kunlun Graph Failed #193) by @xyDong0223
🐛 Fix CUDA Graph weak reference binding; bind to torch.ops._C instead of _kunlun ([Bugfix] fix error for cudagraph, bind weak_ref_tensor to torch.ops._C instead of _kunlun #220) by @lishiyong110
🐛 Fix long-context chunked attention crash (longcontext chunk make attention crash, fix it #117) by @baoqian426
🐛 Fix kunlun_scale_mm bias bug ([fix]bias bug in kunlun_scale_mm #126) by @liwei109
🐛 Fix cutlass_scaled_mm inference error ([fix] resolve cutlass_scaled_mm inference error #82) by @tangshiwen
🐛 Fix MoE when bias is absent ([Bugs] Fix moe when without bias #76) by @xyDong0223
🐛 Fix InternVL KeyError: ((1, 1, 3), '<i8') ([Bug] Fix InternVL KeyError: ((1, 1, 3), '<i8') #108) by @Lidang-Jiang
🐛 Fix issue where apply_top_k_top_p was not applied ([Bug] Fix no apply_top_k_top_p issue. #101) by @Hanyu-Jin
🐛 Fix Qwen2-VL for v0.11.0 (fix qwen2_vl for 0.11.0 #94) by @roger-lcc
🐛 Fix compressed_tensors import error ([Bugfix] fix can not import compressed_tensors #87) by @baoqian426
🐛 Fix cocopod ops not found ([Bugfix] cocopod ops can't be finded #242) by @liwei109
🐛 Fix missing xspeedgate_ops import in Kunlun ops and FLA chunk ([Bugfix] fix miss import xspeedgate_ops in fla chunk #237, [Bugfix] fix miss import xspeedgate_ops in kunlun ops #238) by @xyDong0223
🐛 Fix distributed environment initialization issue ([Bugfix] Fixed the distributed environment initialization issue #231) by @xyDong0223
🐛 Adapt GLM5 config for transformers 4.57 ([BugFix] Adapt GLM5 config for transformers 4.57 #207) by @tangshiwen
🐛 Fix eager mode LayerNorm failure ([Bugfix] fix eager mode layernorm failed #247) by @Hyfreadom
🐛 Register apply_repetition_penalties_ in custom op (register apply_repetition_penalties_ in custom_op #110) by @roger-lcc
🐛 Fix function-call failure when invoking xgrammar ([Bugfix] fix function call call xgrammar falied #262) by @xyDong0223
🐛 Fix Qwen3.5 reasoning parser in thinking + non-streaming + tool call scenario ([Bugfix] update qwen3.5 reasoning parser #257) by @ljayx
🐛 Fix expert parallelism bug in eager mode ([Bugfix] eager mode support expert parallel #260) by @Wfd567

🔬 CI / Build

🆕 Add CI end-to-end (E2E) tests ([CI/Build] Add CI end-to-end (E2E) tests #139) by @1916hcc
🆕 Add Unit Test (UT) CI ([CI] Add UT CI #157) by @Joeegin
🔄 Refactor E2E CI: split monolithic workflow into modular scripts ([CI/Build] Refactor E2E CI: split monolithic workflow into modular sc… #162) by @1916hcc
🔧 Update .pre-commit-config.yaml, add _pylint.yml ([CI/Build] update .pre-commit-config.yaml && add _pylint.yml && updat… #155) by @WeiJie-520
🆕 Add foundational GitHub Actions configuration (Add foundational configuration #57) by @tanjunchen
🆕 Add PULL_REQUEST_TEMPLATE.md and ISSUE_TEMPLATE (【Docs】add PULL_REQUEST_TEMPLATE.md and ISSUE_TEMPLATE #56) by @tanjunchen
🆕 Add CODE_OF_CONDUCT.md, MAINTAINERS.md, and contributing guide (【Docs】update readme and contributing guide #55) by @tanjunchen

📝 Documentation

📖 Add vLLM-Kunlun New Model Adaptation Manual and update model support list ([Docs] Add vLLM-Kunlun New Model Adaptation Manual and Update Model Support #211) by @xyDong0223
📖 Add XPU tutorials for Qwen and InternVL ([Docs] Add XPU tutorials for Qwen / InternVL #140) by @Joeegin
📖 Add DeepSeek-V3.2-Exp-w8a8 to installation guide and tutorials ([Doc] add DeepSeek-V3.2-Exp-w8a8 to installation.md and tutorials #186) by @WeiJie-520
🔧 Update base image URL: replace conda with uv; integrate xpytorch and ops into image ([Doc] update base image url（1.Replace conda with uv; 2.Integrate xpyt… #146) by @WeiJie-520
📖 Update quantization guide documentation ([doc] update quantization guide doc #88) by @liwei109
📖 Optimize documentation structure ([Doc] Optimize the document #136) by @Lidang-Jiang
📖 Update xspeedgate_ops documentation ([Doc] update xspeedgate_ops (20260130) #188) by @WeiJie-520
🐛 Fix Read the Docs build configuration ([Docs] Fix app.readthedocs buliding #210, [Doc] Fix 5 Sphinx warnings causing Read the Docs build failure #251) by @xyDong0223, @Lidang-Jiang
📖 Update README with latest model support and environment information ([Docs] Update README #206) by @xyDong0223
📖 Add INT8 quantized model list for DeepSeek, Qwen, MiniMax series ([Doc] add int8 model list #254) by @liwei109
🔧 Remove --compilation-config from all documentation; P800 no longer requires this parameter ([Doc] Remove --compilation-config from all docs #253) by @Lidang-Jiang

📋 What's Changed

PR	Title	Author
#268	[Model] Support Qwen3-Next MTP	@ldh2020
#267	[Feature] Support BGE embedding models	@lishaobing448
#265	[Misc] Reduce Host and device sync in Qwen3.5	@xyDong0223
#264	[Models] Add Qwen3.5 and MiniMax INT8 models	@liwei109
#262	[Bugfix] Fix function call invoking xgrammar failed	@xyDong0223
#261	[Model] Recover use reshape and cache kernel to update mamba cache	@xyDong0223
#260	Eager mode support expert parallel	@Wfd567
#257	[Bugfix] Update Qwen3.5 reasoning parser	@ljayx
#254	[Doc] Add INT8 model list	@liwei109
#253	[Doc] Remove --compilation-config from all docs	@Lidang-Jiang
#251	[Doc] Fix 5 Sphinx warnings causing Read the Docs build failure	@Lidang-Jiang
#252	[Bugfix] Fix cache indices problem for Qwen3.5-MoE	@xyDong0223
#247	[Bugfix] Fix eager mode layernorm failed	@Hyfreadom
#244	[Bugfix] use cuda visible	@lishaobing448
#242	[Bugfix] cocopod ops can't be finded	@liwei109
#241	[Model] Support qwen3.5 moe	@roger-lcc
#240	[Misc] Remove qwen3 and qwen3moe redundant code	@xyDong0223
#239	[Doc] Update dependencies for Feb	@Joeegin
#238	[Bugfix] Fix miss import xspeedgate_ops in kunlun ops	@xyDong0223
#237	[Bugfix] Fix miss import xspeedgate_ops in fla chunk	@xyDong0223
#234	[Feature] Register layernorm/rotary_embedding via @CustomOp.register_oot	@lishiyong110
#233	[Model] Support qwen3-next model	@xyDong0223
#232	[Misc] Optimize mtp	@fromck
#231	[Bugfix] Fixed distributed environment initialization issue	@xyDong0223
#229	[Misc] Temporarily work around Torch compatibility issues	@xyDong0223
#228	[Update] Update dependencies for v0.15.1	@xyDong0223
#227	[Update] Partially supports torch compile	@xyDong0223
#225	[Doc] Update dependencies	@Joeegin
#224	[Kernel] Register custom_op for kunlun graph (torch compile)	@xyDong0223
#222	[Feature] Support Qwen3-Next	@chanzhennan
#220	[Bugfix] Fix cudagraph weak_ref_tensor binding	@lishiyong110
#218	[Misc] Add collect_env feature	@Lidang-Jiang
#215	[Kernel] Add rejection_sampler	@Hanyu-Jin
#214	[Model] Support qwen3_next_mtp with eager mode	@ldh2020
#212	[Update] Remove V0 code and fix circular reference	@xyDong0223
#211	[Docs] Add New Model Adaptation Manual	@xyDong0223
#209	[Update] Implement and register Fused MoE Kunlun kernels via OOT	@xyDong0223
#207	[BugFix] Adapt GLM5 config for transformers 4.57	@tangshiwen
#206	[Docs] Update README	@xyDong0223
#203	[Update] Unify custom operator registration to torch.ops	@xyDong0223
#202	[Update] Optimize utils, remove VLLM_USE_V1 check	@xyDong0223
#201	[Update] Fix Kunlun plugin circular reference in v0.15.1	@xyDong0223
#198	[Bugfix] Fix run failed	@xyDong0223
#196	[Kernel] Optimize fused MoE for small batches	@ldh2020
#194	[Model] GLM adaptation (GLM-x)	@liwei109
#193	[Bugfix] Fixed Kunlun Graph Failed	@xyDong0223
#190	Further optimize multi-lora inference (80%+ non-LoRA perf)	@15050188022
#188	[Doc] Update xspeedgate_ops	@WeiJie-520
#187	[Feature] Support GLM-4.7 MTP	@fromck
#186	[Doc] Add DeepSeek-V3.2-Exp-w8a8 to installation guide	@WeiJie-520
#182	[Attention] Optimize build of attn_metadata	@ldh2020
#180	[Feature] Add gemma_rmsnorm, moe_pre_small, split_norm_rope	@Hanyu-Jin
#177	[Misc] Migrate XTorch ops to Kunlun ops	@xyDong0223
#169	[CI/Build] ruff format checks only	@WeiJie-520
#168	[Kernel] Add topk_per_row to optimize topk_indexes	@fromck
#164	DeepSeek-V3.2 MTP support (Full & PieceWise)	@15050188022
#162	[CI/Build] Refactor E2E CI into modular scripts	@1916hcc
#159	Update CI workflow	@tanjunchen
#157	[CI] Add UT CI	@Joeegin
#155	[CI/Build] Update pre-commit-config and add pylint	@WeiJie-520
#154	[Feature] Support compressed-tensors W4A16 quantization	@liwei109
#151	[Feature] Add GLM-4.7 tool parser	@astrophel0
#147	[Doc] Remove internal pip index from requirements	@1916hcc
#146	[Doc] Update base image URL (conda → uv)	@WeiJie-520
#145	[Bugfix] Remove MLA patch for DeepSeek-V3.1	@baoqian426
#143	[Feature] Add RoPE and decoding stage kernels for DS32	@fromck
#142	[Feature] Support AWQ MoE W4A16 quantization	@tangshiwen
#140	[Docs] Add XPU tutorials for Qwen / InternVL	@Joeegin
#139	[CI/Build] Add CI E2E tests	@1916hcc
#136	[Doc] Optimize documentation	@Lidang-Jiang
#134	[Feature] Add 2 kernels and optimize topk_indices	@fromck
#133	[Feature] Full multi-LoRA support	@15050188022
#132	Delete redundant GlmForCausalLM register	@kurkol
#126	[Fix] Bias bug in kunlun_scale_mm	@liwei109
#122	[Refactor] Update Kunlun classes with monkey patch	@liwei109
#117	Fix long-context chunk attention crash	@baoqian426
#112	[Feature] Support Mixed-Precision Quantization for MoE	@tangshiwen
#110	Register apply_repetition_penalties_ in custom_op	@roger-lcc
#108	[Bug] Fix InternVL KeyError	@Lidang-Jiang
#106	Enable full CUDA Graph for DeepSeek	@baoqian426
#101	[Bug] Fix no apply_top_k_top_p issue	@Hanyu-Jin
#94	[Bugs] Fix qwen2_vl for v0.11.0	@roger-lcc
#91	Enable INT8 BMM	@zhihui96
#90	[Feature] Use data parallelism (DP)	@baoqian426
#88	[Doc] Update quantization guide	@liwei109
#87	[Bugfix] Fix cannot import compressed_tensors	@baoqian426
#82	[Fix] Resolve cutlass_scaled_mm inference error	@tangshiwen
#78	[Feature] Support DeepSeek-V3/R1/V3.2	@baoqian426
#76	[Bugs] Fix MoE when without bias	@xyDong0223
#75	[Dev] Support compressed-tensors W8A8 quantization	@liwei109
#73	[Kernel] Enable fast random sample on Kunlun3	@yuqilinaa
#72	[Model] Support InternVL2.5 on v0.11.0	@Joeegin
#71	[Feature] Support GPT-OSS and update model list	@xyDong0223
#62	[Feature] Support XiaoMi MIMO Flash V2	@xyDong0223

🎉 New Contributors

We warmly welcome all first-time contributors to vLLM-Kunlun!

🌟 @yuqilinaa made their first contribution in [Kernel] Enable fast random sample on Kunlun3 Platform with generators #73
🌟 @astrophel0 made their first contribution in 【Feature】: add GLM-47 tool parser and support thinking/non-thinking mode toggle #151
🌟 @Hyfreadom made their first contribution in [Bugfix] fix eager mode layernorm failed #247
🌟 @lishaobing448 made their first contribution in [Bugfix] use cuda visible #244
🌟 @haoli5009-debug made their first contribution in [CI/Build] Modify biweekly report readme files #131
🌟 @callmelaoyi made their first contribution in [Kernel] Optimize recompute_w_u_fwd & chunk_fwd_o in Qwen3-next #74
🌟 @kurkol made their first contribution in delete GlmForCausalLM register #132
🌟 @Wfd567 made their first contribution in [Bugfix] eager mode support expert parallel #260
🌟 @ljayx made their first contribution in [Bugfix] update qwen3.5 reasoning parser #257

Full Changelog: https://github.com/xyDong0223/vLLM-Kunlun-kunlunops/commits/main

What's Changed

【Docs】add PULL_REQUEST_TEMPLATE.md and ISSUE_TEMPLATE by @tanjunchen in #56
【Docs】update readme and contributing guide by @tanjunchen in #55
Add foundational configuration by @tanjunchen in #57
[fix]remove weight_loader_v2 to suport cuda graph by @liwei109 in #59
【Docs】update readme.md by @tanjunchen in #60
[Doc] Update base image path in Installation.md by @WeiJie-520 in #63
[Feature] Support XiaoMi MIMO Flash V2 by @xyDong0223 in #62
[Feature] remove qwen2.py llama.py fix llama output by @baoqian426 in #66
【Docs】update readme.md by @tanjunchen in #68
[Docs] Update torch and ops for mimo v2 by @xyDong0223 in #67
[Docs] : update readme.md by @chanzhennan in #69
[Model] Supporet InternVL2_5 on v0.11.0 by @Joeegin in #72
[Feature] Support gpt-oss and update model list by @xyDong0223 in #71
[Kernel] Optimize recompute_w_u_fwd & chunk_fwd_o in Qwen3-next by @callmelaoyi in #74
[Bugs] Fix moe when without bias by @xyDong0223 in #76
[Feature] support deepseek v3/r1/v3.2 by @baoqian426 in #78
[dev] support compressed-tensors w8a8 quantization by @liwei109 in #75
[fix]matmul not support cuda graph by @liwei109 in #80
[fix] resolve cutlass_scaled_mm inference error by @tangshiwen in #82
[Feature] DeepSeek Support MTP by @xyDong0223 in #84
[fix] update compressed-tensors scheme by @liwei109 in #85
[Bugfix] fix can not import compressed_tensors by @baoqian426 in #87
[doc] update quantization guide doc by @liwei109 in #88
use for dp by @baoqian426 in #90
fix qwen2_vl for 0.11.0 by @roger-lcc in #94
[Docs] Fix v0.11.0 Docs config by @xyDong0223 in #96
[Bugs] Fix Docs Build Problem by @xyDong0223 in #97
[Docs] Upate URL by @xyDong0223 in #98
【Docs】update maintainer for vllm-kunlun by @tanjunchen in #100
[Bug] Fix no apply_top_k_top_p issue. by @Hanyu-Jin in #101
enable full cudagraph for deepseek by @baoqian426 in #106
register apply_repetition_penalties_ in custom_op by @roger-lcc in #110
[Bug] Fix InternVL KeyError: ((1, 1, 3), '<i8') by @Lidang-Jiang in #108
[Feature]enable int8 bmm by @zhihui96 in #91
[Feature] Support Mixed-Precision Quantization for MoE by @tangshiwen in #112
[Misc]Specify that DS32 only supports --kv-cache-dtype bfloat16 by @fromck in #119
longcontext chunk make attention crash, fix it by @baoqian426 in #117
[refactor]update Kunlun classes with monkey patch by @liwei109 in #122
support glm47 in 0.11.0 version by @roger-lcc in #116
Revert "support glm47 in 0.11.0 version" by @liwei109 in #123
[fix]bias bug in kunlun_scale_mm by @liwei109 in #126
[CI/Build] Modify biweekly report readme files by @haoli5009-debug in #131
delete GlmForCausalLM register by @kurkol in #132
[Feature] support multi-lora inference,latest xspeedgate needed by @15050188022 in #133
[Kernel] Enable fast random sample on Kunlun3 Platform with generators by @yuqilinaa in #73
[Feature][DS32] add 2 kernels and optimize the calculation of topk_indices by @fromck in #134
[Docs] Add XPU tutorials for Qwen / InternVL by @Joeegin in #140
[Doc] Optimize the document by @Lidang-Jiang in #136
[Feature][DS32] Add kernels to optimize RoPE and the decoding stage by @fromck in #143
[Bugfix]remove mla patch, server args with no --compilation-config for ds v3.1 by @baoqian426 in #145
[Doc] docs: remove internal pip index from requirements by @1916hcc in #147
[Doc] update base image url（1.Replace conda with uv; 2.Integrate xpyt… by @WeiJie-520 in #146
[Feature] Support AWQ MoE W4A16 Quantization by @tangshiwen in #142
[Feature] support compressed-tensors w4a16 quantization by @liwei109 in #154
[CI/Build] update .pre-commit-config.yaml && add _pylint.yml && updat… by @WeiJie-520 in #155
[CI] Add UT CI by @Joeegin in #157
[CI/Build] Add CI end-to-end (E2E) tests by @1916hcc in #139
update ci workflow by @tanjunchen in #159
[CI/Build] Refactor E2E CI: split monolithic workflow into modular sc… by @1916hcc in #162
[Kernel] add topk_per_row to optimize the calculation of topk_indexes by @fromck in #168
[CI/Build] ruff performs only format checks and does not restrict merging. by @WeiJie-520 in #169
clean pr for dsv3.2 mtp support by @15050188022 in #164
[Doc] add DeepSeek-V3.2-Exp-w8a8 to installation.md and tutorials by @WeiJie-520 in #186
[Doc] update xspeedgate_ops (20260130) by @WeiJie-520 in #188
Further optimize multi-lora inference,LoRA-enabled performance achieves 80%+ of non-LoRA performance by @15050188022 in #190
[Feature] support GLM-4.7 MTP by @fromck in #187
[Bugfix] Fixed Kunlun Graph Failed by @xyDong0223 in #193
[Model] GLM adaptation by @liwei109 in #194
[Misc] Migrate XTorch operations to Kunlun operations (accelerating iteration) by @xyDong0223 in #177
[Bugsfix] Fix run failed by @xyDong0223 in #198
[Docs] Update README by @xyDong0223 in #206
[Docs] Fix quantization support description in README by @xyDong0223 in #208
[Docs] Fix app.readthedocs buliding by @xyDong0223 in #210
[BugFix] Adapt GLM5 config for transformers 4.57 by @tangshiwen in #207
[Docs] Add vLLM-Kunlun New Model Adaptation Manual and Update Model Support by @xyDong0223 in #211
[Misc] add collect_env feat by @Lidang-Jiang in #218
[Feature] Merge branch 'Qwen3-Next' into main && Support Qwen-next by @chanzhennan in #222
[Doc] Update dependencies by @Joeegin in #225
[Bugfix] cocopod ops can't be finded by @liwei109 in #242
[Bugfix] use cuda visible by @lishaobing448 in #244
[Misc] optimize mtp by @fromck in #232
[Doc] Update dependiencies for Feb by @Joeegin in #239
[Doc] add int8 model list by @liwei109 in #254
[Models]add Qwen3.5 and MinMax INT8 models by @liwei109 in #264
[Feature] Support BGE embedding models by @lishaobing448 in #267
[Model] Support Qwen3-Next MTP by @ldh2020 in #268

New Contributors

@baoqian426 made their first contribution in [Feature] remove qwen2.py llama.py fix llama output #66
@chanzhennan made their first contribution in [Docs] : update readme.md #69
@Joeegin made their first contribution in [Model] Supporet InternVL2_5 on v0.11.0 #72
@callmelaoyi made their first contribution in [Kernel] Optimize recompute_w_u_fwd & chunk_fwd_o in Qwen3-next #74
@tangshiwen made their first contribution in [fix] resolve cutlass_scaled_mm inference error #82
@zhihui96 made their first contribution in [Feature]enable int8 bmm #91
@fromck made their first contribution in [Misc]Specify that DS32 only supports --kv-cache-dtype bfloat16 #119
@haoli5009-debug made their first contribution in [CI/Build] Modify biweekly report readme files #131
@kurkol made their first contribution in delete GlmForCausalLM register #132
@yuqilinaa made their first contribution in [Kernel] Enable fast random sample on Kunlun3 Platform with generators #73
@1916hcc made their first contribution in [Doc] docs: remove internal pip index from requirements #147
@lishaobing448 made their first contribution in [Bugfix] use cuda visible #244

Full Changelog: v0.11.0rc1...v0.11.0rc2

This discussion was created from the release v0.11.0.
