[Feature][Bugfix] Support Kimi-K2.5 tool/reasoning parser, fix MLA attention correctness, and backport KV admission control on Kunlun XPU #354

Merged
liwei109 merged 7 commits into baidu:releases/v0.11.0 from GAtties:kimi-op-5-fix-mla-05 on May 19, 2026

GAtties commented May 8, 2026

PR Description

This PR brings Kimi-K2.5 model support to vLLM-Kunlun and fixes a series of correctness and scheduling issues observed in production on Kunlun XPU.

New model support

  • Added tool parser and reasoning parser for Kimi-K2.5, with per-channel support and performance optimizations.

Attention correctness (Kunlun XPU)

  • Fixed three MLA bugs that caused silent hallucinations and wrong outputs under chunked prefill and multi-concurrency: LSE transposition mismatch, wrong prefill attention scale (~7% error), and missing KV LOD in cross-attention.
  • Fixed FlashMLA output buffer allocation (torch.ones → torch.empty) and added a shape guard in merge_attn_states to catch ABI mismatches early.
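
To make the "missing KV LOD" bullet concrete, here is a minimal sketch in plain Python. It assumes a LOD is a list of cumulative offsets marking per-request boundaries in a packed tensor; the helper names and the batch sizes are made up for illustration and are not taken from kunlun_ops.

```python
from itertools import accumulate


def build_lod(lengths):
    """Cumulative offsets [0, l0, l0+l1, ...] marking per-request boundaries."""
    return [0, *accumulate(lengths)]


def split_by_lod(rows, lod):
    """Slice a packed list of rows into per-request segments."""
    return [rows[lod[i]:lod[i + 1]] for i in range(len(lod) - 1)]


# Two requests in one chunked-prefill batch (illustrative sizes).
q_lens = [2, 3]    # new query tokens per request
kv_lens = [5, 1]   # cached context tokens per request

# Tag every cached KV row with the request that owns it: [0, 0, 0, 0, 0, 1].
kv_rows = [req for req, n in enumerate(kv_lens) for _ in range(n)]

q_lod = build_lod(q_lens)    # boundaries of the Q tensor
kv_lod = build_lod(kv_lens)  # boundaries of the KV tensor

# Splitting KV with the Q boundaries hands request 1 rows owned by request 0.
wrong = split_by_lod(kv_rows, q_lod)
# Splitting KV with its own boundaries keeps each request isolated.
right = split_by_lod(kv_rows, kv_lod)
```

In this toy batch, the second segment of `wrong` contains only rows tagged with request 0, which is the kind of cross-request mixing the fix is meant to prevent.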

Scheduling stability

  • Backported two-gate KV admission control from vLLM 0.19: it limits partial-prefill concurrency and verifies that the full sequence fits before admission, resolving a throughput collapse (68 tok/s → 12-16 tok/s) under high-concurrency long-context workloads.

Reasoning parser streaming

  • Fixed KimiK2ReasoningParser streaming output (was silently dropping all chunks) and corrected no_thinking mode routing.

Checklist (Required)

Before submitting this PR, please ensure that all the following items are completed:

  • All code changes pass the pre-commit checks.
  • Commits are signed off using git commit -s.
  • The PR title is properly classified (see below).

PR Type

Please prefix the PR title with one or more of the following labels to help reviewers quickly understand the nature of the change:

  • [Feature] – New features or enhancements (e.g. Attention, Communicator, Kernel, Worker, etc.)
  • [Bugfix] – Bug fixes
  • [CI/Build] – CI, build system, or infrastructure improvements
  • [Doc] – Documentation updates or fixes
  • [Misc] – Other changes that do not fit the above categories (use sparingly)

Note: If the PR spans multiple categories, include all relevant prefixes.


Detailed Checklist

Thank you for contributing to vLLM Kunlun! To help us maintain high code quality and streamline the review process, please ensure your PR meets the following requirements.

1. Code Quality

  • All linting and formatting checks pass (pre-commit).
  • The code is well-structured and sufficiently documented.
  • The change is designed with maintainability and readability in mind.

2. Testing

  • Relevant unit tests are added or updated.
  • Integration tests are included when applicable.
  • Existing tests continue to pass.

3. DCO Compliance

This project follows the Developer Certificate of Origin (DCO).

  • All commits include a Signed-off-by: line.
  • Use git commit -s to automatically add the sign-off.

4. Review Expectations

During the review process, maintainers may:

  • Request code refactoring or additional tests.
  • Ask for clarifications on design decisions.
  • Suggest performance, stability, or maintainability improvements.

We appreciate your patience and collaboration throughout the review process!

zhouzijian01 and others added 4 commits April 14, 2026 03:42
…e optimization and per-channel support

Signed-off-by: zhouzijian01 <zhouzijian01@baidu.com>
…n XPU

1. merge_attn_states: transpose LSE before/after kunlun_ops.attention_merge_stage
   - vLLM convention: [num_heads, num_tokens]
   - kunlun_ops expects: [num_tokens, num_heads]
   - Mismatch caused wrong merge weights in chunked prefill, leading to
     hallucinations on long-context (30k token) inputs

2. prefill attention scale: replace hardcoded alpha=1.8738 with dynamic formula
   - XFA kernel divides internally by sqrt(d), so alpha must be pre-scaled:
     alpha = softmax_scale * sqrt(qk_head_dim)
   - Old value gave effective_scale≈0.1352 vs target 0.14468 (~7% error)

3. context cross-attention KV LOD: pass context_kvlen_lod to kunlun_ops.attention
   - In chunked prefill, Q (new tokens) and KV (cached context) have different
     per-request sequence boundaries
   - Without context_kvlen_lod, the kernel used Q boundaries to split the KV
     tensor, causing cross-request attention pollution in multi-concurrency

Also fix compressed_tensors_moe weight scale processing order: convert to
float32 before mul_(7.0) to avoid in-place multiply on lower-precision tensor.

Signed-off-by: GAtties <gatties@qq.com>
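
The arithmetic behind the prefill-scale fix in item 2 can be checked with a few lines of Python. This is a sketch only: `qk_head_dim = 192` is an assumption (it is the value consistent with the numbers quoted in the commit message), and the helper names are illustrative.

```python
import math


def xfa_alpha(softmax_scale: float, qk_head_dim: int) -> float:
    """Pre-scaled alpha for a kernel that divides by sqrt(d) internally."""
    return softmax_scale * math.sqrt(qk_head_dim)


def effective_scale(alpha: float, qk_head_dim: int) -> float:
    """Scale actually applied to QK^T once the kernel divides by sqrt(d)."""
    return alpha / math.sqrt(qk_head_dim)


QK_HEAD_DIM = 192        # assumption, inferred from the quoted numbers
TARGET_SCALE = 0.14468   # target softmax scale from the commit message
OLD_ALPHA = 1.8738       # previously hardcoded value

old_scale = effective_scale(OLD_ALPHA, QK_HEAD_DIM)
new_scale = effective_scale(xfa_alpha(TARGET_SCALE, QK_HEAD_DIM), QK_HEAD_DIM)
relative_error = abs(old_scale - TARGET_SCALE) / TARGET_SCALE

print(f"old effective scale: {old_scale:.4f}")
print(f"new effective scale: {new_scale:.5f}")
print(f"relative error of old value: {relative_error:.1%}")
```

Under that assumption the old alpha yields an effective scale of about 0.1352, roughly 6.5% below the target, which matches the "~7% error" figure.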
Monkey-patch vLLM 0.11 (Kunlun/P800) with two scheduling improvements
backported from vLLM 0.19, and fix two attention operation bugs.

KV cache admission control (patches/kv_admission.py, new file):
- Gate 1: partial-prefill concurrency limit (VLLM_MAX_PARTIAL_PREFILLS,
  default=1); blocks new admissions when too many requests are in
  chunked-prefill state, preventing decode starvation
- Gate 2: full-sequence admission gate; before admitting a waiting request,
  verify that prompt + max_output_tokens fit in free KV blocks, preventing
  the preemption loop (68 tok/s → 12-16 tok/s) observed under high
  concurrency with long contexts
  Fix: use num_prompt_tokens + max_tokens (full sequence) instead of
  num_tokens (prompt-only at admission time)
- Patches are applied lazily via import hook after target modules load,
  with logging.warning on failure instead of silent except-pass
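
The two gates described above can be sketched as a single admission predicate. This is a simplified illustration, not the code in patches/kv_admission.py: the `Request` class, `can_admit`, and the block-size arithmetic are hypothetical, and only the VLLM_MAX_PARTIAL_PREFILLS name and its default of 1 come from the commit message.

```python
import os
from dataclasses import dataclass


@dataclass
class Request:
    num_prompt_tokens: int
    max_tokens: int  # maximum number of output tokens


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Ceiling division: KV blocks required to hold num_tokens."""
    return -(-num_tokens // block_size)


def can_admit(req, *, num_partial_prefills, free_blocks, block_size,
              max_partial_prefills=None):
    """Return True if a waiting request passes both admission gates."""
    if max_partial_prefills is None:
        max_partial_prefills = int(os.environ.get("VLLM_MAX_PARTIAL_PREFILLS", "1"))

    # Gate 1: too many requests already in chunked-prefill state.
    if num_partial_prefills >= max_partial_prefills:
        return False

    # Gate 2: the full sequence (prompt + max output), not just the prompt,
    # must fit in the free KV blocks.
    full_len = req.num_prompt_tokens + req.max_tokens
    return blocks_needed(full_len, block_size) <= free_blocks


req = Request(num_prompt_tokens=30_000, max_tokens=2_000)
# The prompt alone needs 1875 blocks of 16 tokens; the full sequence needs 2000.
fits = can_admit(req, num_partial_prefills=0, free_blocks=2000,
                 block_size=16, max_partial_prefills=1)
prompt_only_fits = can_admit(req, num_partial_prefills=0, free_blocks=1900,
                             block_size=16, max_partial_prefills=1)
```

With 1900 free blocks the prompt would fit but the full sequence would not, so the request is held back instead of being admitted and later preempted.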

FlashMLA output buffer (ops/attention/flashmla.py):
- Replace torch.ones with torch.empty for the output buffer; unwritten
  elements defaulting to 1.0 caused silent correctness bugs

merge_attn_states shape guard (ops/attention/merge_attn_states.py):
- Add assert for prefix_lse / suffix_lse shape [num_heads, num_tokens]
  to catch XPU kernel ABI mismatches early

Signed-off-by: GAtties <gatties@qq.com>
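
The layout mismatch that this shape guard targets can be reproduced without any XPU code. The sketch below uses plain Python lists and a stand-in "kernel" that reads a flat buffer as row-major [num_tokens, num_heads]; the function names are hypothetical and the real kunlun_ops.attention_merge_stage signature is not shown here.

```python
import math


def flatten(matrix):
    """Row-major flattening, mimicking a contiguous buffer."""
    return [x for row in matrix for x in row]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def kernel_prefix_weight(prefix_lse_flat, suffix_lse_flat, num_tokens, num_heads):
    """Stand-in kernel: interprets flat LSE buffers as [num_tokens, num_heads]
    and returns the softmax merge weight of the prefix for each (token, head)."""
    weights = []
    for t in range(num_tokens):
        row = []
        for h in range(num_heads):
            p = prefix_lse_flat[t * num_heads + h]
            s = suffix_lse_flat[t * num_heads + h]
            m = max(p, s)  # subtract the max for numerical stability
            ep, es = math.exp(p - m), math.exp(s - m)
            row.append(ep / (ep + es))
        weights.append(row)
    return weights


# Caller-side layout: [num_heads][num_tokens], here 2 heads and 3 tokens.
prefix_lse = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
suffix_lse = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# Fixed path: transpose into the kernel's layout before the call.
fixed = kernel_prefix_weight(flatten(transpose(prefix_lse)),
                             flatten(transpose(suffix_lse)), 3, 2)
# Buggy path: pass the caller's buffer unchanged.
buggy = kernel_prefix_weight(flatten(prefix_lse), flatten(suffix_lse), 3, 2)
```

For token 0, head 1 the fixed path uses an LSE of 3.0 while the buggy path reads 1.0 from the wrong slot, so the merge weights differ even though no error is raised.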
- Add extract_reasoning_content_streaming() to match vllm base class API;
  the missing method caused all streaming chunks to return None and be
  skipped, resulting in empty stream output.
- Fix DeltaMessage field name: reasoning -> reasoning_content (3 places),
  so reasoning content is correctly populated in stream deltas.
- Fix extract_reasoning_content() to treat output as pure content when
  neither <think> nor </think> appears (thinking=false mode).
- Add _content_mode state tracking in extract_reasoning_streaming() to
  detect non-thinking mode from the first generated token, correctly
  routing content instead of reasoning_content in stream responses.

Usage: pass chat_template_kwargs={"thinking": false} to disable thinking.
Signed-off-by: GAtties <gatties@qq.com>
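
A minimal sketch of the non-streaming routing rule described in this commit, in plain Python. The function name `split_reasoning` is hypothetical, and the handling of a lone `</think>` (treated as if the opening tag had been supplied by the chat template) is an assumption rather than confirmed parser behavior.

```python
from typing import Optional, Tuple

THINK_START = "<think>"
THINK_END = "</think>"


def split_reasoning(model_output: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning_content, content) for a complete model output."""
    # thinking=false mode: no markers at all, so everything is plain content.
    if THINK_START not in model_output and THINK_END not in model_output:
        return None, model_output

    # Text after an optional opening tag (the whole text if the tag is absent).
    body = model_output.split(THINK_START, 1)[-1]

    # Output ended while still reasoning: no final content yet.
    if THINK_END not in body:
        return body or None, None

    reasoning, _, content = body.partition(THINK_END)
    return reasoning or None, content or None
```

For example, `split_reasoning("Hello")` returns `(None, "Hello")`, while `split_reasoning("<think>plan</think>answer")` returns `("plan", "answer")`.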
xyDong0223 requested review from Copilot and liwei109 and removed request for liwei109 May 8, 2026 10:57
Copilot AI left a comment

Pull request overview

Adds Kimi-K2.5 model-specific parsing support and multiple Kunlun XPU correctness/scheduling fixes, including MLA attention fixes and a backported KV-cache admission-control mechanism to stabilize high-concurrency long-context workloads.

Changes:

  • Implement Kimi-K2 tool parser + reasoning parser and register them for OpenAI serving integration.
  • Fix Kunlun XPU MLA attention correctness (scaling, KV LOD for cross-attn) and tighten merge-attention-state ABI/layout handling.
  • Backport KV admission control + partial-prefill concurrency limiting via runtime patches applied through the Kunlun import hook.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| vllm_kunlun/v1/attention/backends/mla/common.py | Adjusts XPU attention scaling and passes separate KV LOD for correct cross-attention in chunked prefill. |
| vllm_kunlun/reasoning/kimi_k2_reasoning_parser.py | Adds Kimi-K2 reasoning parser with streaming support and “no thinking” routing. |
| vllm_kunlun/reasoning/identity_reasoning_parser.py | Adds identity reasoning parser used as a fallback when thinking is disabled. |
| vllm_kunlun/reasoning/__init__.py | Registers the new Kimi-K2 reasoning parser. |
| vllm_kunlun/patches/kv_admission.py | Introduces KV admission + partial-prefill concurrency gates via monkey patches. |
| vllm_kunlun/ops/quantization/compressed_tensors/compressed_tensors_moe.py | Extends Kunlun WNA16 MoE method initialization to support multiple strategies and aligns imports/behavior. |
| vllm_kunlun/ops/attention/merge_attn_states.py | Fixes LSE layout expectations by transposing into/out of the Kunlun kernel and adds shape guards. |
| vllm_kunlun/ops/attention/flashmla.py | Uses torch.empty for output buffer allocation. |
| vllm_kunlun/ops/_kunlun_ops.py | Removes per-channel max buffer plumbing for moe_fc_v3 by passing None. |
| vllm_kunlun/entrypoints/openai/tool_parsers/kimi_k2_tool_parser.py | Adds Kimi-K2 tool-call extraction with streaming state management and marker handling. |
| vllm_kunlun/entrypoints/openai/tool_parsers/__init__.py | Registers the new Kimi-K2 tool parser. |
| vllm_kunlun/__init__.py | Enhances module remap error visibility and applies KV admission/scheduler patches post-import. |

Comment on lines +42 to +51

    # Check if thinking is disabled via chat_template_kwargs
    chat_kwargs = kwargs.get("chat_template_kwargs", {}) or {}
    thinking = bool(chat_kwargs.get("thinking", True))

    # If thinking is not enabled, use identity parser to fall through
    self._identity_parser: IdentityReasoningParser | None
    if not thinking:
        self._identity_parser = IdentityReasoningParser(tokenizer, *args, **kwargs)
    else:
        self._identity_parser = None

Comment on lines +8 to +19

    from vllm.entrypoints.openai.protocol import (
        ChatCompletionRequest,
        DeltaMessage,
        ResponsesRequest,
    )
    from vllm.reasoning.abs_reasoning_parsers import ReasoningParser

    from vllm_kunlun.reasoning.identity_reasoning_parser import IdentityReasoningParser

    if TYPE_CHECKING:
        from vllm.entrypoints.openai.protocol import ChatCompletionRequest, ResponsesRequest

Comment on lines +48 to +60

    def extract_reasoning_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> DeltaMessage | None:
        # Just wrap delta_text as content, ignore reasoning
        if delta_text:
            return DeltaMessage(content=delta_text)
        return None

Comment on lines +62 to +67

    def extract_reasoning(
        self, model_output: str, request: "ChatCompletionRequest | ResponsesRequest"
    ) -> tuple[str | None, str | None]:
        # No reasoning separation: return None for reasoning,
        # and full model_output as content
        return None, model_output

    if diff:
        diff = (
            diff.encode("utf-8").decode("unicode_escape")
            if diff is str

Comment on lines +152 to +209

    if self.tool_calls_start_token not in model_output:
        return ExtractedToolCallInformation(
            tools_called=False, tool_calls=[], content=model_output
        )

    else:
        try:
            # there are two possible captures - between tags, or between a
            # tag and end-of-string so the result of
            # findall is an array of tuples where one is a function call and
            # the other is None
            function_call_tuples = self.tool_call_regex.findall(model_output)

            logger.debug("function_call_tuples: %s", function_call_tuples)

            tool_calls = []
            for match in function_call_tuples:
                function_id, function_args = match
                # function_id: functions.get_weather:0 or get_weather:0
                function_name = function_id.split(":")[0].split(".")[-1]

                # Validate function name against available tools
                if request and hasattr(request, "tools") and request.tools:
                    valid_names = {
                        tool.function.name
                        for tool in request.tools
                        if hasattr(tool, "function")
                    }
                    if function_name not in valid_names:
                        logger.warning(
                            "Tool '%s' not found in available tools, skipping",
                            function_name,
                        )
                        continue  # Skip this tool call

                tool_calls.append(
                    ToolCall(
                        id=function_id,
                        type="function",
                        function=FunctionCall(
                            name=function_name, arguments=function_args
                        ),
                    )
                )

            content = model_output[: model_output.find(self.tool_calls_start_token)]
            return ExtractedToolCallInformation(
                tools_called=True,
                tool_calls=tool_calls,
                content=content if content else None,
            )

        except Exception:
            logger.exception("Error in extracting tool call from response.")
            return ExtractedToolCallInformation(
                tools_called=False, tool_calls=[], content=model_output
            )

Comment on lines 28 to 31

    out_lse_kernel = torch.empty(num_tokens, num_heads,
                                 dtype=torch.float32,
                                 device=output.device)

Comment thread vllm_kunlun/__init__.py Outdated
Comment on lines 2 to 10

    from .platforms import current_platform
    import sys
    import importlib
    import logging
    import warnings
    import builtins
    import os
    import time
    import vllm.envs as envs

Comment on lines +22 to +27

    assert prefix_lse.shape == (num_heads, num_tokens), (
        f"prefix_lse must be [num_heads, num_tokens]=({num_heads}, {num_tokens}), "
        f"got {tuple(prefix_lse.shape)}")
    assert suffix_lse.shape == (num_heads, num_tokens), (
        f"suffix_lse must be [num_heads, num_tokens]=({num_heads}, {num_tokens}), "
        f"got {tuple(suffix_lse.shape)}")

GAtties added 3 commits May 8, 2026 19:43
- identity_reasoning_parser: rename extract_reasoning ->
  extract_reasoning_content and extract_reasoning_streaming ->
  extract_reasoning_content_streaming to match the interface
  called by serving_chat.py (runtime AttributeError otherwise)
- kimi_k2_tool_parser: fix extract_tool_calls to check all
  section-start token variants (plural + singular) and slice
  content at earliest occurrence; previously silently dropped
  tool calls when model emitted the singular variant
- kimi_k2_reasoning_parser: remove duplicate runtime imports of
  ChatCompletionRequest/ResponsesRequest; keep only under
  TYPE_CHECKING since they appear only in string type hints
- merge_attn_states: replace assert with RuntimeError so shape
  guards are enforced even under python -O
- __init__: remove unused imports (current_platform, warnings,
  os, time, vllm.envs)

Signed-off-by: GAtties <gatties@qq.com>
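
The kimi_k2_tool_parser fix in this commit (check every section-start variant and slice content at the earliest occurrence) can be sketched as follows. The two marker spellings are assumptions used for illustration and `split_content` is a hypothetical helper; only the function-name expression is taken from the reviewed code above.

```python
# Assumed marker spellings (plural and singular); the real parser defines its own.
SECTION_START_VARIANTS = (
    "<|tool_calls_section_begin|>",
    "<|tool_call_section_begin|>",
)


def split_content(model_output: str):
    """Return (content_before_tool_section, has_tool_section), slicing at the
    earliest occurrence of any marker variant."""
    positions = [
        pos
        for pos in (model_output.find(v) for v in SECTION_START_VARIANTS)
        if pos != -1
    ]
    if not positions:
        return model_output, False
    content = model_output[: min(positions)]
    return (content or None), True


def function_name_from_id(function_id: str) -> str:
    """'functions.get_weather:0' or 'get_weather:0' -> 'get_weather'."""
    return function_id.split(":")[0].split(".")[-1]
```

Checking only the plural marker would return the whole string as content when the model emits the singular form, which corresponds to the silently dropped tool calls described in the commit message.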
…infer_schema compatibility

Signed-off-by: GAtties <gatties@qq.com>
…thinking -> thinking

Signed-off-by: GAtties <gatties@qq.com>
liwei109 merged commit 35798ee into baidu:releases/v0.11.0 May 19, 2026
2 of 3 checks passed
4 participants