[KVPool]Fix PP get bug #5007
Conversation
Signed-off-by: baxingpiaochong <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to fix a bug in the key-value pool's get operation for pipeline parallelism by introducing a unified check. However, the change in vllm_ascend/distributed/kvpool/pool_worker.py introduces a critical issue. It will cause an IndexError when both tensor and pipeline parallelism are enabled, because the list of keys being checked is not constructed correctly to cover all parallel ranks. This change assumes the key list is larger than it is, leading to an out-of-bounds access. The root cause lies in the key generation logic which needs to be fixed for this change to be effective.
        for i in range(
            min(self.tp_size, self.num_kv_head) * self.pp_size)
    ]
This change is likely to cause an IndexError when both tensor parallelism and pipeline parallelism are enabled (i.e., self.tp_size > 1 and self.pp_size > 1).
The res list's size is determined by multi_tp_keys. The construction of multi_tp_keys on lines 554-565 does not generate all combinations of TP and PP ranks. It only generates keys for (tp_rank=any, pp_rank=0) and (tp_rank=0, pp_rank>0), missing cross-combinations.
As a result, len(res) will be num_block * (tp_factor + self.pp_size - 1), where tp_factor = min(self.tp_size, self.num_kv_head). This loop, however, attempts to access up to num_block * tp_factor * self.pp_size elements. An IndexError will occur if tp_factor * self.pp_size > tp_factor + self.pp_size - 1, which is true when tp_factor > 1 and self.pp_size > 1.
For this unified check to work correctly, the multi_tp_keys list must be populated with keys for all combinations of TP and PP ranks. The logic for generating multi_tp_keys needs to be corrected first.
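To illustrate what "all combinations" means here, a minimal sketch follows that builds keys over the full TP x PP cross product so the flat list has num_block * tp_factor * pp_size entries, matching the loop bound in the diff. The key layout and the make_key/build_multi_tp_keys names are assumptions for illustration, not the actual pool_worker.py code.

```python
# Hypothetical sketch: populate the key list with every (tp_rank, pp_rank)
# pair, not just (tp_rank, pp_rank=0) and (tp_rank=0, pp_rank). Key format
# and function names are illustrative, not the real KV-pool API.
from itertools import product
from typing import List


def make_key(block_hash: str, tp_rank: int, pp_rank: int) -> str:
    # Illustrative key layout; the real layout lives in pool_worker.py.
    return f"{block_hash}_tp{tp_rank}_pp{pp_rank}"


def build_multi_tp_keys(block_hashes: List[str], tp_size: int,
                        num_kv_head: int, pp_size: int) -> List[str]:
    tp_factor = min(tp_size, num_kv_head)
    keys = []
    for block_hash in block_hashes:
        # Full cross product: len(keys) == num_block * tp_factor * pp_size,
        # so the get loop's indexing stays in bounds.
        for tp_rank, pp_rank in product(range(tp_factor), range(pp_size)):
            keys.append(make_key(block_hash, tp_rank, pp_rank))
    return keys


if __name__ == "__main__":
    keys = build_multi_tp_keys(["h0", "h1"], tp_size=2, num_kv_head=8,
                               pp_size=2)
    assert len(keys) == 2 * 2 * 2  # num_block * tp_factor * pp_size
```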
For the PP (Pipeline Parallelism) scenario, consider the query pattern of each PP stage.
What this PR does / why we need it?
When KV caches are evicted from the key-value pool, the KV cache for pp0 may still be present while the KV cache for pp1 has already been evicted. Therefore, a unified check across all PP stages is needed during the get operation.
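As a rough illustration of that unified check, the sketch below treats a block as a hit only when the keys for every TP/PP rank are still present in the pool. The key format and the exists()/lookup_hit_blocks names are hypothetical, not the actual KV-pool API.

```python
# Minimal sketch, assuming a per-rank key layout: a block counts as a hit
# only if the cache entries for all TP/PP ranks survived eviction.
from typing import Callable, List


def lookup_hit_blocks(block_hashes: List[str], tp_factor: int, pp_size: int,
                      exists: Callable[[str], bool]) -> List[str]:
    hits = []
    for block_hash in block_hashes:
        all_present = all(
            exists(f"{block_hash}_tp{tp}_pp{pp}")
            for tp in range(tp_factor)
            for pp in range(pp_size))
        # Skip blocks whose cache survived on pp0 but was evicted on pp1.
        if all_present:
            hits.append(block_hash)
    return hits


if __name__ == "__main__":
    pool = {"h0_tp0_pp0", "h0_tp0_pp1", "h1_tp0_pp0"}  # h1 evicted on pp1
    print(lookup_hit_blocks(["h0", "h1"], tp_factor=1, pp_size=2,
                            exists=pool.__contains__))  # -> ['h0']
```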
Does this PR introduce any user-facing change?
How was this patch tested?