[Feature][Bugfix] Support Kimi-K2.5 tool/reasoning parser, fix MLA attention correctness, and backport KV admission control on Kunlun XPU #354
Merged
…e optimization and per-channel support Signed-off-by: zhouzijian01 <zhouzijian01@baidu.com>
…n XPU
1. merge_attn_states: transpose LSE before/after kunlun_ops.attention_merge_stage
- vLLM convention: [num_heads, num_tokens]
- kunlun_ops expects: [num_tokens, num_heads]
- Mismatch caused wrong merge weights in chunked prefill, leading to
hallucinations on long-context (30k token) inputs
2. prefill attention scale: replace hardcoded alpha=1.8738 with dynamic formula
- XFA kernel divides internally by sqrt(d), so alpha must be pre-scaled:
alpha = softmax_scale * sqrt(qk_head_dim)
- Old value gave effective_scale≈0.1352 vs target 0.14468 (~7% error)
3. context cross-attention KV LOD: pass context_kvlen_lod to kunlun_ops.attention
- In chunked prefill, Q (new tokens) and KV (cached context) have different
per-request sequence boundaries
- Without context_kvlen_lod, the kernel used Q boundaries to split the KV
tensor, causing cross-request attention pollution in multi-concurrency
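The boundary mix-up in item 3 can be illustrated without any kernel code. The sketch below assumes a LOD is a list of cumulative sequence offsets; `build_lod`, `split_by_lod`, and the sequence lengths are made up for illustration and are not the kunlun_ops API.

```python
from itertools import accumulate


def build_lod(seq_lens):
    """Cumulative offsets: [0, len0, len0 + len1, ...]."""
    return [0, *accumulate(seq_lens)]


def split_by_lod(items, lod):
    """Slice a packed sequence into per-request chunks using LOD offsets."""
    return [items[lod[i]:lod[i + 1]] for i in range(len(lod) - 1)]


# Two requests in one chunked-prefill batch (hypothetical sizes).
q_lens = [2, 3]    # new tokens per request
kv_lens = [5, 1]   # cached context tokens per request

# Packed KV tensor, labelled by the request that owns each entry.
packed_kv = ["req0"] * kv_lens[0] + ["req1"] * kv_lens[1]

q_lod = build_lod(q_lens)     # [0, 2, 5]
kv_lod = build_lod(kv_lens)   # [0, 5, 6]

# Bug: splitting KV with the Q boundaries.
wrong = split_by_lod(packed_kv, q_lod)
# Fix: splitting KV with its own boundaries.
right = split_by_lod(packed_kv, kv_lod)

print("with Q LOD: ", wrong)
print("with KV LOD:", right)
```

With the Q boundaries, the second request is handed entries that belong to the first request's context, which is the cross-request pollution described above.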
Also fix compressed_tensors_moe weight scale processing order: convert to
float32 before mul_(7.0) to avoid in-place multiply on lower-precision tensor.
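To show why the order matters, the sketch below emulates a half-precision in-place multiply with the standard library's `struct` half-float format. It assumes the stored scale is float16 and uses a made-up scale value of 0.1; the actual dtype and values handled in compressed_tensors_moe are not stated here.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


scale_fp16 = to_fp16(0.1)  # hypothetical stored weight scale

# In-place multiply on the fp16 tensor: the product is rounded back to fp16.
mul_then_widen = to_fp16(scale_fp16 * 7.0)

# Widen first, then multiply: no second rounding step. Python floats are
# doubles; they stand in for float32, which also holds this product exactly.
widen_then_mul = scale_fp16 * 7.0

rounding_error = abs(widen_then_mul - mul_then_widen)
print(mul_then_widen, widen_then_mul, rounding_error)
```

The two results differ in the fourth decimal place, a small per-scale error that the float32-first ordering avoids.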
Signed-off-by: GAtties <gatties@qq.com>
Monkey-patch vLLM 0.11 (Kunlun/P800) with two scheduling improvements backported from vLLM 0.19, and fix two attention operation bugs.
KV cache admission control (patches/kv_admission.py, new file):
- Gate 1: partial-prefill concurrency limit (VLLM_MAX_PARTIAL_PREFILLS, default=1); blocks new admissions when too many requests are in chunked-prefill state, preventing decode starvation
- Gate 2: full-sequence admission gate; before admitting a waiting request, verify that prompt + max_output_tokens fit in free KV blocks, preventing the preemption loop (68 tok/s → 12-16 tok/s) observed under high concurrency with long contexts
  Fix: use num_prompt_tokens + max_tokens (full sequence) instead of num_tokens (prompt-only at admission time)
- Patches are applied lazily via import hook after target modules load, with logging.warning on failure instead of silent except-pass
FlashMLA output buffer (ops/attention/flashmla.py):
- Replace torch.ones with torch.empty for the output buffer; unwritten elements defaulting to 1.0 caused silent correctness bugs
merge_attn_states shape guard (ops/attention/merge_attn_states.py):
- Add assert for prefix_lse / suffix_lse shape [num_heads, num_tokens] to catch XPU kernel ABI mismatches early
Signed-off-by: GAtties <gatties@qq.com>
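A minimal sketch of the two admission gates described in this commit message, written as plain Python rather than as the actual scheduler patch. `WaitingRequest`, `can_admit`, the block size of 16, and the `>=` comparison in Gate 1 are assumptions for illustration; only the field names `num_prompt_tokens` and `max_tokens` and the default limit of 1 come from the text above.

```python
from dataclasses import dataclass


@dataclass
class WaitingRequest:
    """Hypothetical stand-in for a scheduler request."""
    num_prompt_tokens: int
    max_tokens: int


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Ceiling division: KV blocks required to hold num_tokens."""
    return -(-num_tokens // block_size)


def can_admit(req, free_blocks, block_size,
              num_partial_prefills, max_partial_prefills=1):
    # Gate 1: too many requests are still in chunked prefill.
    if num_partial_prefills >= max_partial_prefills:
        return False
    # Gate 2: the full sequence (prompt + output budget) must fit.
    full_len = req.num_prompt_tokens + req.max_tokens
    return blocks_needed(full_len, block_size) <= free_blocks


req = WaitingRequest(num_prompt_tokens=30_000, max_tokens=2_048)
block_size = 16  # assumed block size

prompt_only = blocks_needed(req.num_prompt_tokens, block_size)
full_seq = blocks_needed(req.num_prompt_tokens + req.max_tokens, block_size)
print(prompt_only, full_seq)

# 1900 free blocks cover the prompt but not the full sequence.
print(can_admit(req, free_blocks=1900, block_size=block_size,
                num_partial_prefills=0))
```

A prompt-only check would admit this request with 1900 free blocks and later run out of KV space during decode; the full-sequence check rejects it up front.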
- Add extract_reasoning_content_streaming() to match vllm base class API;
the missing method caused all streaming chunks to return None and be
skipped, resulting in empty stream output.
- Fix DeltaMessage field name: reasoning -> reasoning_content (3 places),
so reasoning content is correctly populated in stream deltas.
- Fix extract_reasoning_content() to treat output as pure content when
neither <think> nor </think> appears (thinking=false mode).
- Add _content_mode state tracking in extract_reasoning_streaming() to
detect non-thinking mode from the first generated token, correctly
routing content instead of reasoning_content in stream responses.
Usage: pass chat_template_kwargs={"thinking": false} to disable thinking.
Signed-off-by: GAtties <gatties@qq.com>
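The non-streaming behaviour described in this commit message can be sketched as a standalone function. This is an illustration under stated assumptions, not the parser from the PR: `split_reasoning` is a hypothetical name, and treating output that contains only `</think>` as "reasoning, then content" is an assumption about the intended behaviour.

```python
def split_reasoning(model_output, start_token="<think>", end_token="</think>"):
    """Return (reasoning_content, content) for a complete model output."""
    # thinking=false mode: neither marker appears, so everything is content.
    if start_token not in model_output and end_token not in model_output:
        return None, model_output

    # Drop everything up to and including the start token, if present.
    if start_token in model_output:
        model_output = model_output.partition(start_token)[2]

    # No end token yet: the whole remainder is reasoning.
    if end_token not in model_output:
        return (model_output or None), None

    reasoning, _, content = model_output.partition(end_token)
    return (reasoning or None), (content or None)


print(split_reasoning("plain answer"))
print(split_reasoning("<think>check units</think>42 km"))
```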
xyDong0223
approved these changes
May 8, 2026
Contributor
Pull request overview
Adds Kimi-K2.5 model-specific parsing support and multiple Kunlun XPU correctness/scheduling fixes, including MLA attention fixes and a backported KV-cache admission-control mechanism to stabilize high-concurrency long-context workloads.
Changes:
- Implement Kimi-K2 tool parser + reasoning parser and register them for OpenAI serving integration.
- Fix Kunlun XPU MLA attention correctness (scaling, KV LOD for cross-attn) and tighten merge-attention-state ABI/layout handling.
- Backport KV admission control + partial-prefill concurrency limiting via runtime patches applied through the Kunlun import hook.
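The scaling fix listed above replaces a hardcoded alpha with `softmax_scale * sqrt(qk_head_dim)`; the commit message quotes an effective scale of about 0.1352 against a target of 0.14468. The arithmetic can be checked as follows, assuming `qk_head_dim = 192`, a value that is not stated in the PR but is consistent with its numbers (1.8738 / 0.1352 ≈ √192). The function names are illustrative only.

```python
import math

QK_HEAD_DIM = 192        # assumption inferred from the quoted numbers
TARGET_SCALE = 0.14468   # target softmax scale quoted in the commit message
OLD_ALPHA = 1.8738       # previously hardcoded value


def effective_scale(alpha: float, qk_head_dim: int) -> float:
    """Scale actually applied when the kernel divides alpha by sqrt(d)."""
    return alpha / math.sqrt(qk_head_dim)


def dynamic_alpha(softmax_scale: float, qk_head_dim: int) -> float:
    """Pre-scaled alpha so the kernel's internal division cancels out."""
    return softmax_scale * math.sqrt(qk_head_dim)


old_effective = effective_scale(OLD_ALPHA, QK_HEAD_DIM)
new_alpha = dynamic_alpha(TARGET_SCALE, QK_HEAD_DIM)
relative_error = (TARGET_SCALE - old_effective) / TARGET_SCALE

print(f"old effective scale: {old_effective:.4f}")
print(f"new alpha:           {new_alpha:.4f}")
print(f"relative error:      {relative_error:.1%}")
```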
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| vllm_kunlun/v1/attention/backends/mla/common.py | Adjusts XPU attention scaling and passes separate KV LOD for correct cross-attention in chunked prefill. |
| vllm_kunlun/reasoning/kimi_k2_reasoning_parser.py | Adds Kimi-K2 reasoning parser with streaming support and “no thinking” routing. |
| vllm_kunlun/reasoning/identity_reasoning_parser.py | Adds identity reasoning parser used as a fallback when thinking is disabled. |
| vllm_kunlun/reasoning/__init__.py | Registers the new Kimi-K2 reasoning parser. |
| vllm_kunlun/patches/kv_admission.py | Introduces KV admission + partial-prefill concurrency gates via monkey patches. |
| vllm_kunlun/ops/quantization/compressed_tensors/compressed_tensors_moe.py | Extends Kunlun WNA16 MoE method initialization to support multiple strategies and aligns imports/behavior. |
| vllm_kunlun/ops/attention/merge_attn_states.py | Fixes LSE layout expectations by transposing into/out of the Kunlun kernel and adds shape guards. |
| vllm_kunlun/ops/attention/flashmla.py | Uses torch.empty for output buffer allocation. |
| vllm_kunlun/ops/_kunlun_ops.py | Removes per-channel max buffer plumbing for moe_fc_v3 by passing None. |
| vllm_kunlun/entrypoints/openai/tool_parsers/kimi_k2_tool_parser.py | Adds Kimi-K2 tool-call extraction with streaming state management and marker handling. |
| vllm_kunlun/entrypoints/openai/tool_parsers/__init__.py | Registers the new Kimi-K2 tool parser. |
| vllm_kunlun/__init__.py | Enhances module remap error visibility and applies KV admission/scheduler patches post-import. |
Comment on lines +42 to +51
    # Check if thinking is disabled via chat_template_kwargs
    chat_kwargs = kwargs.get("chat_template_kwargs", {}) or {}
    thinking = bool(chat_kwargs.get("thinking", True))

    # If thinking is not enabled, use identity parser to fall through
    self._identity_parser: IdentityReasoningParser | None
    if not thinking:
        self._identity_parser = IdentityReasoningParser(tokenizer, *args, **kwargs)
    else:
        self._identity_parser = None
Comment on lines +8 to +19
    from vllm.entrypoints.openai.protocol import (
        ChatCompletionRequest,
        DeltaMessage,
        ResponsesRequest,
    )
    from vllm.reasoning.abs_reasoning_parsers import ReasoningParser

    from vllm_kunlun.reasoning.identity_reasoning_parser import IdentityReasoningParser

    if TYPE_CHECKING:
        from vllm.entrypoints.openai.protocol import ChatCompletionRequest, ResponsesRequest
Comment on lines +48 to +60
    def extract_reasoning_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> DeltaMessage | None:
        # Just wrap delta_text as content, ignore reasoning
        if delta_text:
            return DeltaMessage(content=delta_text)
        return None
Comment on lines +62 to +67
    def extract_reasoning(
        self, model_output: str, request: "ChatCompletionRequest | ResponsesRequest"
    ) -> tuple[str | None, str | None]:
        # No reasoning separation: return None for reasoning,
        # and full model_output as content
        return None, model_output

    if diff:
        diff = (
            diff.encode("utf-8").decode("unicode_escape")
            if diff is str
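In the last fragment, `diff is str` compares the value with the `str` type object itself, so it is false for any ordinary string and the decode branch never runs. A minimal sketch of the presumably intended check uses `isinstance`; the helper name `unescape_if_str` is hypothetical, and because the original expression is truncated above, the fall-through branch is an assumption.

```python
def unescape_if_str(diff):
    """Decode backslash escapes only when diff is actually a string."""
    if isinstance(diff, str):
        return diff.encode("utf-8").decode("unicode_escape")
    return diff  # assumed fall-through; the original branch is not shown


value = "a\\nb"  # backslash followed by 'n', as it might arrive in a stream

print(value is str)            # identity check against the type object
print(isinstance(value, str))  # type check on the instance
print(repr(unescape_if_str(value)))
```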
Comment on lines +152 to +209
    if self.tool_calls_start_token not in model_output:
        return ExtractedToolCallInformation(
            tools_called=False, tool_calls=[], content=model_output
        )

    else:
        try:
            # there are two possible captures - between tags, or between a
            # tag and end-of-string so the result of
            # findall is an array of tuples where one is a function call and
            # the other is None
            function_call_tuples = self.tool_call_regex.findall(model_output)

            logger.debug("function_call_tuples: %s", function_call_tuples)

            tool_calls = []
            for match in function_call_tuples:
                function_id, function_args = match
                # function_id: functions.get_weather:0 or get_weather:0
                function_name = function_id.split(":")[0].split(".")[-1]

                # Validate function name against available tools
                if request and hasattr(request, "tools") and request.tools:
                    valid_names = {
                        tool.function.name
                        for tool in request.tools
                        if hasattr(tool, "function")
                    }
                    if function_name not in valid_names:
                        logger.warning(
                            "Tool '%s' not found in available tools, skipping",
                            function_name,
                        )
                        continue  # Skip this tool call

                tool_calls.append(
                    ToolCall(
                        id=function_id,
                        type="function",
                        function=FunctionCall(
                            name=function_name, arguments=function_args
                        ),
                    )
                )

            content = model_output[: model_output.find(self.tool_calls_start_token)]
            return ExtractedToolCallInformation(
                tools_called=True,
                tool_calls=tool_calls,
                content=content if content else None,
            )

        except Exception:
            logger.exception("Error in extracting tool call from response.")
            return ExtractedToolCallInformation(
                tools_called=False, tool_calls=[], content=model_output
            )
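The ID handling and name validation in the snippet above can be exercised in isolation. The sketch below reuses the same `split` expression on the two ID shapes named in the code comment; `filter_known_calls` and the plain-dict output are simplifications for illustration, not the `ToolCall` objects used in the PR.

```python
def function_name_from_id(function_id: str) -> str:
    """'functions.get_weather:0' or 'get_weather:0' -> 'get_weather'."""
    return function_id.split(":")[0].split(".")[-1]


def filter_known_calls(calls, valid_names):
    """Keep only calls whose function name is among the offered tools."""
    kept = []
    for function_id, function_args in calls:
        name = function_name_from_id(function_id)
        if valid_names and name not in valid_names:
            continue  # unknown tool: skip, as in the snippet above
        kept.append({"id": function_id, "name": name, "arguments": function_args})
    return kept


calls = [
    ("functions.get_weather:0", '{"city": "Beijing"}'),
    ("functions.delete_everything:1", "{}"),
]
kept = filter_known_calls(calls, valid_names={"get_weather"})
print(kept)
```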
Comment on lines 28 to 31
    out_lse_kernel = torch.empty(num_tokens, num_heads,
                                 dtype=torch.float32,
                                 device=output.device)
Comment on lines 2 to 10
    from .platforms import current_platform
    import sys
    import importlib
    import logging
    import warnings
    import builtins
    import os
    import time
    import vllm.envs as envs
Comment on lines +22 to +27
    assert prefix_lse.shape == (num_heads, num_tokens), (
        f"prefix_lse must be [num_heads, num_tokens]=({num_heads}, {num_tokens}), "
        f"got {tuple(prefix_lse.shape)}")
    assert suffix_lse.shape == (num_heads, num_tokens), (
        f"suffix_lse must be [num_heads, num_tokens]=({num_heads}, {num_tokens}), "
        f"got {tuple(suffix_lse.shape)}")
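Because `assert` statements are stripped when Python runs with `-O`, a guard like the one above can be written as an explicit check that raises. The sketch below pairs such a guard with the layout transposition described earlier in this PR, using nested lists instead of tensors. `check_lse_shape`, `transpose`, and `prefix_weight` are illustrative names; the weight formula is the usual log-sum-exp softmax weighting and is assumed, not taken from the kernel.

```python
import math


def check_lse_shape(name, lse, num_heads, num_tokens):
    """Raise instead of assert so the guard still runs under `python -O`."""
    shape = (len(lse), len(lse[0]) if lse else 0)
    if shape != (num_heads, num_tokens):
        raise RuntimeError(
            f"{name} must be [num_heads, num_tokens]="
            f"({num_heads}, {num_tokens}), got {shape}"
        )


def transpose(matrix):
    """Swap [num_heads, num_tokens] <-> [num_tokens, num_heads]."""
    return [list(row) for row in zip(*matrix)]


def prefix_weight(prefix_lse, suffix_lse):
    """Softmax weight of the prefix state for one (head, token) pair."""
    m = max(prefix_lse, suffix_lse)
    p = math.exp(prefix_lse - m)
    s = math.exp(suffix_lse - m)
    return p / (p + s)


num_heads, num_tokens = 2, 3
prefix_lse = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]  # [num_heads, num_tokens]
suffix_lse = [[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]]

check_lse_shape("prefix_lse", prefix_lse, num_heads, num_tokens)
check_lse_shape("suffix_lse", suffix_lse, num_heads, num_tokens)

# Hypothetical kernel boundary: token-major in, token-major out.
prefix_k = transpose(prefix_lse)
suffix_k = transpose(suffix_lse)
weights_k = [
    [prefix_weight(p, s) for p, s in zip(p_row, s_row)]
    for p_row, s_row in zip(prefix_k, suffix_k)
]
weights = transpose(weights_k)  # back to [num_heads, num_tokens]
print(weights)
```

Passing the token-major `prefix_k` to `check_lse_shape` raises `RuntimeError`, which is the kind of layout mismatch the guard is meant to surface.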
- identity_reasoning_parser: rename extract_reasoning -> extract_reasoning_content and extract_reasoning_streaming -> extract_reasoning_content_streaming to match the interface called by serving_chat.py (runtime AttributeError otherwise)
- kimi_k2_tool_parser: fix extract_tool_calls to check all section-start token variants (plural + singular) and slice content at earliest occurrence; previously silently dropped tool calls when model emitted the singular variant
- kimi_k2_reasoning_parser: remove duplicate runtime imports of ChatCompletionRequest/ResponsesRequest; keep only under TYPE_CHECKING since they appear only in string type hints
- merge_attn_states: replace assert with RuntimeError so shape guards are enforced even under python -O
- __init__: remove unused imports (current_platform, warnings, os, time, vllm.envs)
Signed-off-by: GAtties <gatties@qq.com>
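The "check all variants, slice at earliest occurrence" fix for kimi_k2_tool_parser can be sketched as a small helper. The two marker strings below are assumptions standing in for the plural and singular section-start tokens, and `split_at_first_marker` is a hypothetical name, not the function in the PR.

```python
# Assumed marker strings for the plural and singular variants.
SECTION_START_VARIANTS = (
    "<|tool_calls_section_begin|>",
    "<|tool_call_section_begin|>",
)


def split_at_first_marker(model_output, markers=SECTION_START_VARIANTS):
    """Return (content_before, tool_section) cut at the earliest marker."""
    positions = [
        pos for pos in (model_output.find(m) for m in markers) if pos != -1
    ]
    if not positions:
        return model_output, None  # no tool-call section at all
    cut = min(positions)
    return (model_output[:cut] or None), model_output[cut:]


singular = "Checking.<|tool_call_section_begin|>functions.get_weather:0"
content, section = split_at_first_marker(singular)
print(repr(content))
print(repr(section))
```

Checking only the plural token would return the whole string as content for this input, which matches the silently dropped tool calls described in the commit message.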
…infer_schema compatibility Signed-off-by: GAtties <gatties@qq.com>
…thinking -> thinking Signed-off-by: GAtties <gatties@qq.com>
liwei109
approved these changes
May 19, 2026
PR Description
This PR brings Kimi-K2.5 model support to vLLM-Kunlun and fixes a series of correctness and scheduling issues observed in production on Kunlun XPU.
New model support
Attention correctness (Kunlun XPU)
torch.ones → torch.empty) and added shape guard in merge_attn_states to catch ABI mismatches early.
Scheduling stability
Reasoning parser streaming
KimiK2ReasoningParser streaming output (was silently dropping all chunks) and corrected no_thinking mode routing.
Checklist (Required)
Before submitting this PR, please ensure that all the following items are completed:
pre-commit checks.
git commit -s.
PR Type
Please prefix the PR title with one or more of the following labels to help reviewers quickly understand the nature of the change:
- [Feature] – New features or enhancements (e.g. Attention, Communicator, Kernel, Worker, etc.)
- [Bugfix] – Bug fixes
- [CI/Build] – CI, build system, or infrastructure improvements
- [Doc] – Documentation updates or fixes
- [Misc] – Other changes that do not fit the above categories (use sparingly)
Detailed Checklist
Thank you for contributing to vLLM Kunlun! To help us maintain high code quality and streamline the review process, please ensure your PR meets the following requirements.
1. Code Quality
pre-commit).
2. Testing
3. DCO Compliance
This project follows the Developer Certificate of Origin (DCO).
Signed-off-by: line.
git commit -s to automatically add the sign-off.
4. Review Expectations
During the review process, maintainers may:
We appreciate your patience and collaboration throughout the review process!