
Conversation

Contributor

@baxingpiaochong baxingpiaochong commented Dec 15, 2025

What this PR does / why we need it?

When KV caches are evicted from the key-value pool, it is possible that the KV cache for pp0 is still present while the KV cache for pp1 has already been evicted. A unified check across all PP stages is therefore needed during the get operation.
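A minimal sketch of the unified-check idea described above (the `pool`, `cache_hit`, and `(block_hash, pp_rank)` key scheme are hypothetical simplifications, not the actual vllm_ascend/distributed/kvpool code): a lookup should only report a hit if the block is still present for every pipeline-parallel stage.

```python
# Hedged sketch of a unified cross-PP-stage check; all names here are
# hypothetical, simplified from the behavior this PR describes.

def cache_hit(pool: dict, block_hash: str, pp_size: int) -> bool:
    # A get is valid only if the KV cache block survives on ALL pipeline
    # stages; if pp1's copy was evicted while pp0's remains, report a miss.
    return all((block_hash, pp_rank) in pool for pp_rank in range(pp_size))

pool = {("blk0", 0): "kv_pp0", ("blk1", 0): "kv_pp0", ("blk1", 1): "kv_pp1"}
print(cache_hit(pool, "blk0", pp_size=2))  # False: pp1's copy was evicted
print(cache_hit(pool, "blk1", pp_size=2))  # True: present on both stages
```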

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: baxingpiaochong <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a commit message that fulfills the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix a bug in the key-value pool's get operation under pipeline parallelism by introducing a unified check. However, the change in vllm_ascend/distributed/kvpool/pool_worker.py introduces a critical issue: it will cause an IndexError when both tensor and pipeline parallelism are enabled, because the list of keys being checked is not constructed to cover all parallel ranks. The change assumes the key list is larger than it actually is, leading to an out-of-bounds access. The root cause lies in the key-generation logic, which must be fixed for this change to be effective.

Comment on lines +575 to 577

```python
for i in range(
        min(self.tp_size, self.num_kv_head) * self.pp_size)
]
```

critical

This change is likely to cause an IndexError when both tensor parallelism and pipeline parallelism are enabled (i.e., self.tp_size > 1 and self.pp_size > 1).

The res list's size is determined by multi_tp_keys. The construction of multi_tp_keys on lines 554-565 does not generate all combinations of TP and PP ranks. It only generates keys for (tp_rank=any, pp_rank=0) and (tp_rank=0, pp_rank>0), missing cross-combinations.

As a result, len(res) will be num_block * (tp_factor + self.pp_size - 1), where tp_factor = min(self.tp_size, self.num_kv_head). This loop, however, attempts to access up to num_block * tp_factor * self.pp_size elements. An IndexError will occur if tp_factor * self.pp_size > tp_factor + self.pp_size - 1, which is true when tp_factor > 1 and self.pp_size > 1.

For this unified check to work correctly, the multi_tp_keys list must be populated with keys for all combinations of TP and PP ranks. The logic for generating multi_tp_keys needs to be corrected first.
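The mismatch can be reproduced with a toy model (helper name and key tuples are hypothetical, simplified from the review's description rather than the real pool_worker.py):

```python
# Toy reproduction of the size mismatch: the key list covers only
# (any tp_rank, pp_rank=0) plus (tp_rank=0, pp_rank>0), while the get
# loop iterates over the full TP x PP cross product.

def build_multi_tp_keys(tp_factor: int, pp_size: int) -> list[tuple[int, int]]:
    # Flawed generation per the review: cross-combinations are missing.
    keys = [(tp, 0) for tp in range(tp_factor)]
    keys += [(0, pp) for pp in range(1, pp_size)]
    return keys  # len == tp_factor + pp_size - 1

tp_factor, pp_size = 2, 2          # TP and PP both enabled
res = build_multi_tp_keys(tp_factor, pp_size)
print(len(res))                    # 3 == tp_factor + pp_size - 1

try:
    for i in range(tp_factor * pp_size):  # the loop expects 4 entries
        _ = res[i]
except IndexError:
    print("IndexError: loop assumes all TP x PP key combinations exist")

# The fix the review calls for: generate the full cross product.
full = [(tp, pp) for pp in range(pp_size) for tp in range(tp_factor)]
print(len(full))                   # 4 == tp_factor * pp_size
```

With `tp_factor = pp_size = 2`, the list holds 3 entries but the loop indexes 4, triggering the IndexError; generating all TP×PP combinations makes the lengths agree.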

@jianzs jianzs added ready read for review ready-for-test start test by label for PR labels Dec 15, 2025
@LCAIZJ
Contributor

LCAIZJ commented Dec 15, 2025

For the PP (Pipeline Parallelism) scenario, consider the query pattern of each PP stage.

@jianzs jianzs merged commit 95e6400 into vllm-project:main Dec 15, 2025
14 of 16 checks passed