[Feature][Bugfix]Support Kimi-K2.5 tool/reasoning parser, fix MLA attention correctness, and backport KV admission control on Kunlun XPU by GAtties · Pull Request #354 · baidu/vLLM-Kunlun

GAtties · 2026-05-08T10:42:45Z

PR Description

This PR brings Kimi-K2.5 model support to vLLM-Kunlun and fixes a series of correctness and scheduling issues observed in production on Kunlun XPU.

New model support

Added tool parser and reasoning parser for Kimi-K2.5, with per-channel support and performance optimizations.

Attention correctness (Kunlun XPU)

Fixed three MLA bugs that caused silent hallucinations and wrong outputs under chunked prefill and multi-concurrency: LSE transposition mismatch, wrong prefill attention scale (~7% error), and missing KV LOD in cross-attention.
Fixed FlashMLA output buffer (torch.ones → torch.empty) and added shape guard in merge_attn_states to catch ABI mismatches early.

Scheduling stability

Backported two-gate KV admission control from vLLM 0.19: limits partial-prefill concurrency and verifies full-sequence fit before admission, resolving a throughput collapse (68 tok/s → 12 tok/s) under high-concurrency long-context workloads.

Reasoning parser streaming

Fixed KimiK2ReasoningParser streaming output (was silently dropping all chunks) and corrected no_thinking mode routing.

…e optimization and per-channel support Signed-off-by: zhouzijian01 <zhouzijian01@baidu.com>

…n XPU

1. merge_attn_states: transpose LSE before/after kunlun_ops.attention_merge_stage
   - vLLM convention: [num_heads, num_tokens]
   - kunlun_ops expects: [num_tokens, num_heads]
   - Mismatch caused wrong merge weights in chunked prefill, leading to hallucinations on long-context (30k token) inputs
2. prefill attention scale: replace hardcoded alpha=1.8738 with dynamic formula
   - XFA kernel divides internally by sqrt(d), so alpha must be pre-scaled: alpha = softmax_scale * sqrt(qk_head_dim)
   - Old value gave effective_scale≈0.1352 vs target 0.14468 (~7% error)
3. context cross-attention KV LOD: pass context_kvlen_lod to kunlun_ops.attention
   - In chunked prefill, Q (new tokens) and KV (cached context) have different per-request sequence boundaries
   - Without context_kvlen_lod, the kernel used Q boundaries to split the KV tensor, causing cross-request attention pollution in multi-concurrency

Also fix compressed_tensors_moe weight scale processing order: convert to float32 before mul_(7.0) to avoid in-place multiply on lower-precision tensor.

Signed-off-by: GAtties <gatties@qq.com>
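The scale arithmetic in item 2 can be checked numerically. The following is a minimal sketch, not code from the PR. It assumes qk_head_dim = 192, which the commit message does not state; that value is chosen only because it reproduces the quoted effective scale of about 0.1352. The helper names are hypothetical.

```python
import math

# Assumption: qk_head_dim is not given in the commit message. 192 is used here
# only because it reproduces the quoted effective scale of about 0.1352.
QK_HEAD_DIM = 192
TARGET_SOFTMAX_SCALE = 0.14468  # target value quoted in the commit message
OLD_ALPHA = 1.8738              # hardcoded value that the commit replaces


def effective_scale(alpha: float, head_dim: int) -> float:
    """Scale actually applied if the kernel divides alpha by sqrt(head_dim)."""
    return alpha / math.sqrt(head_dim)


def prescaled_alpha(softmax_scale: float, head_dim: int) -> float:
    """alpha = softmax_scale * sqrt(qk_head_dim), as stated in the commit message."""
    return softmax_scale * math.sqrt(head_dim)


old_effective = effective_scale(OLD_ALPHA, QK_HEAD_DIM)
new_alpha = prescaled_alpha(TARGET_SOFTMAX_SCALE, QK_HEAD_DIM)
new_effective = effective_scale(new_alpha, QK_HEAD_DIM)
relative_error = 1.0 - old_effective / TARGET_SOFTMAX_SCALE

print(f"old effective scale: {old_effective:.4f}")
print(f"new alpha:           {new_alpha:.4f}")
print(f"new effective scale: {new_effective:.5f}")
print(f"relative error:      {relative_error:.1%}")
```

Under this assumption the old constant undershoots the target scale by roughly 6.5 percent, in line with the "~7% error" quoted above, while the pre-scaled alpha recovers the target exactly after the kernel's internal division.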

Monkey-patch vLLM 0.11 (Kunlun/P800) with two scheduling improvements backported from vLLM 0.19, and fix two attention operation bugs.

KV cache admission control (patches/kv_admission.py, new file):
- Gate 1: partial-prefill concurrency limit (VLLM_MAX_PARTIAL_PREFILLS, default=1); blocks new admissions when too many requests are in chunked-prefill state, preventing decode starvation
- Gate 2: full-sequence admission gate; before admitting a waiting request, verify that prompt + max_output_tokens fit in free KV blocks, preventing the preemption loop (68 tok/s → 12-16 tok/s) observed under high concurrency with long contexts
  Fix: use num_prompt_tokens + max_tokens (full sequence) instead of num_tokens (prompt-only at admission time)
- Patches are applied lazily via import hook after target modules load, with logging.warning on failure instead of silent except-pass

FlashMLA output buffer (ops/attention/flashmla.py):
- Replace torch.ones with torch.empty for the output buffer; unwritten elements defaulting to 1.0 caused silent correctness bugs

merge_attn_states shape guard (ops/attention/merge_attn_states.py):
- Add assert for prefix_lse / suffix_lse shape [num_heads, num_tokens] to catch XPU kernel ABI mismatches early

Signed-off-by: GAtties <gatties@qq.com>
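The two admission gates described above can be sketched in isolation. This is an illustrative model only, not the contents of patches/kv_admission.py: the `Request` class, `blocks_needed`, `can_admit`, and the `>=` comparison in Gate 1 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for a waiting request."""
    num_prompt_tokens: int
    max_tokens: int  # maximum number of output tokens


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of KV blocks required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // block_size)


def can_admit(
    req: Request,
    free_blocks: int,
    block_size: int,
    num_partial_prefills: int,
    max_partial_prefills: int = 1,
) -> bool:
    # Gate 1: refuse new admissions while too many requests are still in
    # chunked prefill (the exact threshold comparison is an assumption).
    if num_partial_prefills >= max_partial_prefills:
        return False
    # Gate 2: the full sequence (prompt + maximum output) must fit in the
    # free KV blocks, not just the prompt.
    full_sequence = req.num_prompt_tokens + req.max_tokens
    return blocks_needed(full_sequence, block_size) <= free_blocks


req = Request(num_prompt_tokens=3000, max_tokens=1000)
# A prompt-only check would need 188 blocks; the full sequence needs 250.
print(blocks_needed(req.num_prompt_tokens, 16), blocks_needed(4000, 16))
print(can_admit(req, free_blocks=200, block_size=16, num_partial_prefills=0))
```

With 200 free blocks a prompt-only check would admit this request even though its full sequence cannot fit, which is the situation the commit message associates with the preemption loop.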

- Add extract_reasoning_content_streaming() to match vllm base class API; the missing method caused all streaming chunks to return None and be skipped, resulting in empty stream output.
- Fix DeltaMessage field name: reasoning -> reasoning_content (3 places), so reasoning content is correctly populated in stream deltas.
- Fix extract_reasoning_content() to treat output as pure content when neither <think> nor </think> appears (thinking=false mode).
- Add _content_mode state tracking in extract_reasoning_streaming() to detect non-thinking mode from the first generated token, correctly routing content instead of reasoning_content in stream responses.

Usage: pass chat_template_kwargs={"thinking": false} to disable thinking.

Signed-off-by: GAtties <gatties@qq.com>

Copilot

Pull request overview

Adds Kimi-K2.5 model-specific parsing support and multiple Kunlun XPU correctness/scheduling fixes, including MLA attention fixes and a backported KV-cache admission-control mechanism to stabilize high-concurrency long-context workloads.

Changes:

Implement Kimi-K2 tool parser + reasoning parser and register them for OpenAI serving integration.
Fix Kunlun XPU MLA attention correctness (scaling, KV LOD for cross-attn) and tighten merge-attention-state ABI/layout handling.
Backport KV admission control + partial-prefill concurrency limiting via runtime patches applied through the Kunlun import hook.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

File	Description
vllm_kunlun/v1/attention/backends/mla/common.py	Adjusts XPU attention scaling and passes separate KV LOD for correct cross-attention in chunked prefill.
vllm_kunlun/reasoning/kimi_k2_reasoning_parser.py	Adds Kimi-K2 reasoning parser with streaming support and “no thinking” routing.
vllm_kunlun/reasoning/identity_reasoning_parser.py	Adds identity reasoning parser used as a fallback when thinking is disabled.
vllm_kunlun/reasoning/__init__.py	Registers the new Kimi-K2 reasoning parser.
vllm_kunlun/patches/kv_admission.py	Introduces KV admission + partial-prefill concurrency gates via monkey patches.
vllm_kunlun/ops/quantization/compressed_tensors/compressed_tensors_moe.py	Extends Kunlun WNA16 MoE method initialization to support multiple strategies and aligns imports/behavior.
vllm_kunlun/ops/attention/merge_attn_states.py	Fixes LSE layout expectations by transposing into/out of the Kunlun kernel and adds shape guards.
vllm_kunlun/ops/attention/flashmla.py	Uses `torch.empty` for output buffer allocation.
vllm_kunlun/ops/_kunlun_ops.py	Removes per-channel max buffer plumbing for `moe_fc_v3` by passing `None`.
vllm_kunlun/entrypoints/openai/tool_parsers/kimi_k2_tool_parser.py	Adds Kimi-K2 tool-call extraction with streaming state management and marker handling.
vllm_kunlun/entrypoints/openai/tool_parsers/__init__.py	Registers the new Kimi-K2 tool parser.
vllm_kunlun/__init__.py	Enhances module remap error visibility and applies KV admission/scheduler patches post-import.

+        # Check if thinking is disabled via chat_template_kwargs
+        chat_kwargs = kwargs.get("chat_template_kwargs", {}) or {}
+        thinking = bool(chat_kwargs.get("thinking", True))
+
+        # If thinking is not enabled, use identity parser to fall through
+        self._identity_parser: IdentityReasoningParser | None
+        if not thinking:
+            self._identity_parser = IdentityReasoningParser(tokenizer, *args, **kwargs)
+        else:
+            self._identity_parser = None


+from vllm.entrypoints.openai.protocol import (
+    ChatCompletionRequest,
+    DeltaMessage,
+    ResponsesRequest,
+)
+from vllm.reasoning.abs_reasoning_parsers import ReasoningParser
+
+from vllm_kunlun.reasoning.identity_reasoning_parser import IdentityReasoningParser
+
+if TYPE_CHECKING:
+    from vllm.entrypoints.openai.protocol import ChatCompletionRequest, ResponsesRequest
+


+    def extract_reasoning_streaming(
+        self,
+        previous_text: str,
+        current_text: str,
+        delta_text: str,
+        previous_token_ids: Sequence[int],
+        current_token_ids: Sequence[int],
+        delta_token_ids: Sequence[int],
+    ) -> DeltaMessage | None:
+        # Just wrap delta_text as content, ignore reasoning
+        if delta_text:
+            return DeltaMessage(content=delta_text)
+        return None


+    def extract_reasoning(
+        self, model_output: str, request: "ChatCompletionRequest | ResponsesRequest"
+    ) -> tuple[str | None, str | None]:
+        # No reasoning separation: return None for reasoning,
+        # and full model_output as content
+        return None, model_output


+                if diff:
+                    diff = (
+                        diff.encode("utf-8").decode("unicode_escape")
+                        if diff is str
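In the hunk above, `diff is str` compares the value's identity with the `str` type object, so it is False for every string value; `isinstance(diff, str)` is the usual type check. A small standalone illustration follows (the `unescape` helper is hypothetical and is not part of the PR):

```python
def unescape(diff):
    """Decode backslash escapes, but only when diff is actually a string."""
    if isinstance(diff, str):
        # Note: this round-trip is only safe for ASCII text.
        return diff.encode("utf-8").decode("unicode_escape")
    return diff


value = "line1\\nline2"        # contains a literal backslash followed by 'n'
print(value is str)            # identity check against the type object
print(isinstance(value, str))  # type check
print(unescape(value) == "line1\nline2")
```

With the identity check, the decoding branch would never run for string input, so escaped sequences would pass through unchanged.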


+        if self.tool_calls_start_token not in model_output:
+            return ExtractedToolCallInformation(
+                tools_called=False, tool_calls=[], content=model_output
+            )
+
+        else:
+            try:
+                # there are two possible captures - between tags, or between a
+                # tag and end-of-string so the result of
+                # findall is an array of tuples where one is a function call and
+                # the other is None
+                function_call_tuples = self.tool_call_regex.findall(model_output)
+
+                logger.debug("function_call_tuples: %s", function_call_tuples)
+
+                tool_calls = []
+                for match in function_call_tuples:
+                    function_id, function_args = match
+                    # function_id: functions.get_weather:0 or get_weather:0
+                    function_name = function_id.split(":")[0].split(".")[-1]
+
+                    # Validate function name against available tools
+                    if request and hasattr(request, "tools") and request.tools:
+                        valid_names = {
+                            tool.function.name
+                            for tool in request.tools
+                            if hasattr(tool, "function")
+                        }
+                        if function_name not in valid_names:
+                            logger.warning(
+                                "Tool '%s' not found in available tools, skipping",
+                                function_name,
+                            )
+                            continue  # Skip this tool call
+
+                    tool_calls.append(
+                        ToolCall(
+                            id=function_id,
+                            type="function",
+                            function=FunctionCall(
+                                name=function_name, arguments=function_args
+                            ),
+                        )
+                    )
+
+                content = model_output[: model_output.find(self.tool_calls_start_token)]
+                return ExtractedToolCallInformation(
+                    tools_called=True,
+                    tool_calls=tool_calls,
+                    content=content if content else None,
+                )
+
+            except Exception:
+                logger.exception("Error in extracting tool call from response.")
+                return ExtractedToolCallInformation(
+                    tools_called=False, tool_calls=[], content=model_output
+                )
+


+    out_lse_kernel = torch.empty(num_tokens, num_heads,
+                                 dtype=torch.float32,
+                                 device=output.device)
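The buffer above is allocated as [num_tokens, num_heads] because, per the commit message, the Kunlun kernel expects that layout while vLLM passes LSE as [num_heads, num_tokens]. The sketch below models the layout conversion around a merge step in pure Python with nested lists. It is not the kunlun_ops kernel: `merge_token_major` is a hypothetical stand-in, and each head's output is reduced to a single float for brevity.

```python
import math


def transpose(matrix):
    """Swap the two axes of a nested list."""
    return [list(row) for row in zip(*matrix)]


def merge_token_major(prefix_out, prefix_lse, suffix_out, suffix_lse):
    """Stand-in kernel: every argument is indexed [token][head]."""
    out, lse = [], []
    for t in range(len(prefix_out)):
        out_row, lse_row = [], []
        for h in range(len(prefix_out[t])):
            p, s = prefix_lse[t][h], suffix_lse[t][h]
            m = max(p, s)  # subtract the max for numerical stability
            wp, ws = math.exp(p - m), math.exp(s - m)
            out_row.append(
                (wp * prefix_out[t][h] + ws * suffix_out[t][h]) / (wp + ws)
            )
            lse_row.append(m + math.log(wp + ws))
        out.append(out_row)
        lse.append(lse_row)
    return out, lse


def merge_attn_states_sketch(prefix_out, prefix_lse, suffix_out, suffix_lse):
    """Outputs are [token][head]; LSE inputs and result are [head][token]."""
    out, lse_token_major = merge_token_major(
        prefix_out, transpose(prefix_lse), suffix_out, transpose(suffix_lse)
    )
    return out, transpose(lse_token_major)


# 2 tokens, 3 heads; LSE tensors are head-major as in the vLLM convention.
prefix_out = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
suffix_out = [[3.0, 4.0, 5.0], [8.0, 7.0, 6.0]]
prefix_lse = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
suffix_lse = [[0.0, math.log(3.0)], [0.0, 0.0], [0.0, 0.0]]
out, lse = merge_attn_states_sketch(prefix_out, prefix_lse, suffix_out, suffix_lse)
print(out[1][0], lse[0][1])
```

Skipping the two transposes would pair each output element with the LSE of a different (token, head) position whenever the two dimensions differ, which is the kind of wrong merge weight the commit message describes.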



 from .platforms import current_platform
 import sys
 import importlib
+import logging
 import warnings
 import builtins
 import os
 import time
 import vllm.envs as envs


+    assert prefix_lse.shape == (num_heads, num_tokens), (
+        f"prefix_lse must be [num_heads, num_tokens]=({num_heads}, {num_tokens}), "
+        f"got {tuple(prefix_lse.shape)}")
+    assert suffix_lse.shape == (num_heads, num_tokens), (
+        f"suffix_lse must be [num_heads, num_tokens]=({num_heads}, {num_tokens}), "
+        f"got {tuple(suffix_lse.shape)}")
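Python removes `assert` statements when run with `-O`, so the guards above would not fire in an optimized run; a later commit in this PR replaces them with RuntimeError. A minimal sketch of that pattern, using a hypothetical `check_lse_shape` helper that takes a plain shape tuple instead of a tensor:

```python
def check_lse_shape(name: str, shape, num_heads: int, num_tokens: int) -> None:
    """Raise RuntimeError unless shape equals (num_heads, num_tokens)."""
    expected = (num_heads, num_tokens)
    if tuple(shape) != expected:
        raise RuntimeError(
            f"{name} must be [num_heads, num_tokens]={expected}, "
            f"got {tuple(shape)}"
        )


check_lse_shape("prefix_lse", (8, 128), num_heads=8, num_tokens=128)  # passes
try:
    check_lse_shape("suffix_lse", (128, 8), num_heads=8, num_tokens=128)
except RuntimeError as exc:
    print(exc)
```

An explicit raise behaves the same with or without `-O`, so a transposed tensor is rejected before it reaches the kernel.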


- identity_reasoning_parser: rename extract_reasoning -> extract_reasoning_content and extract_reasoning_streaming -> extract_reasoning_content_streaming to match the interface called by serving_chat.py (runtime AttributeError otherwise)
- kimi_k2_tool_parser: fix extract_tool_calls to check all section-start token variants (plural + singular) and slice content at earliest occurrence; previously silently dropped tool calls when model emitted the singular variant
- kimi_k2_reasoning_parser: remove duplicate runtime imports of ChatCompletionRequest/ResponsesRequest; keep only under TYPE_CHECKING since they appear only in string type hints
- merge_attn_states: replace assert with RuntimeError so shape guards are enforced even under python -O
- __init__: remove unused imports (current_platform, warnings, os, time, vllm.envs)

Signed-off-by: GAtties <gatties@qq.com>

…infer_schema compatibility Signed-off-by: GAtties <gatties@qq.com>

…thinking -> thinking Signed-off-by: GAtties <gatties@qq.com>

zhouzijian01 and others added 4 commits April 14, 2026 03:42

[Feature]Support kimi-k2.5 tool parser & reasoning parser; performanc…

6db17e0

xyDong0223 requested review from Copilot and liwei109 and removed request for liwei109 May 8, 2026 10:57

Copilot started reviewing on behalf of xyDong0223 May 8, 2026 10:58

xyDong0223 approved these changes May 8, 2026

Copilot AI reviewed May 8, 2026

GAtties added 3 commits May 8, 2026 19:43

[Bugfix] restore import hooks and vllm_utils_wrapper for PyTorch 2.5 …

98ee816

[Docs] fix incorrect parameter name in KimiK2ReasoningParser: enable_…

c643706

liwei109 approved these changes May 19, 2026

liwei109 merged commit 35798ee into baidu:releases/v0.11.0 May 19, 2026
2 of 3 checks passed

liwei109 merged 7 commits into baidu:releases/v0.11.0 from GAtties:kimi-op-5-fix-mla-05

4 participants
