Pcp dcp refactor #5001
Conversation
Signed-off-by: zhenwenqi2024 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the Prefill Context Parallelism (PCP) and Decode Context Parallelism (DCP) logic by moving it from NPUModelRunner into a new PCPManager class. While this is a good structural improvement, the refactoring has introduced several critical bugs, including typos, incorrect method calls with missing or wrong arguments, and usage of undefined attributes. These issues will lead to runtime errors and must be addressed.
self.dcp_rank = 0
self.pcp_size = 1
self.pcp_rank = 0
max_buffer_num_tokens = self.max_num_token
self.pcp_manager = PCPManager(
    self.pcp_size,
    self.pcp_rank,
    self.dcp_size,
    self.dcp_rank,
    max_buffer_num_tokens,
    self.max_num_reqs,
    self.device,
    self.pin_memory,
)
The constructor for PCPManager is called with incorrect arguments. The arguments decode_threshold and vllm_config are missing, which will lead to a TypeError at runtime.
self.pcp_manager = PCPManager(
    self.pcp_size,
    self.pcp_rank,
    self.dcp_size,
    self.dcp_rank,
    max_buffer_num_tokens,
    self.max_num_reqs,
    self.decode_threshold,
    self.device,
    self.vllm_config,
    self.pin_memory,
)
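The mismatch is easier to see against a constructor whose parameter list matches the corrected call. A minimal sketch only: the names below mirror the corrected call site in this review, and the real PCPManager in vllm_ascend may define its signature differently.

```python
class PCPManager:
    """Illustrative stand-in whose __init__ matches the corrected call site.

    Dropping decode_threshold and vllm_config from the call, as the original
    code did, raises TypeError at construction time.
    """

    def __init__(self, pcp_size, pcp_rank, dcp_size, dcp_rank,
                 max_buffer_num_tokens, max_num_reqs, decode_threshold,
                 device, vllm_config, pin_memory):
        self.pcp_size = pcp_size
        self.pcp_rank = pcp_rank
        self.dcp_size = dcp_size
        self.dcp_rank = dcp_rank
        self.max_buffer_num_tokens = max_buffer_num_tokens
        self.max_num_reqs = max_num_reqs
        self.decode_threshold = decode_threshold
        self.device = device
        self.vllm_config = vllm_config
        self.pin_memory = pin_memory
```

Because the two missing arguments sit in the middle of the list, Python does not silently shift the remaining positionals; it fails immediately with a TypeError naming the missing parameters, which is why this surfaces at runtime rather than producing subtly wrong state.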
total_num_pcp_pads = sum(self.num_pcp_pads)
max_num_scheduled_tokens = max(tokens)
if self.pcp__world_szie > 1:
discard_request_indices = np.nonzero(
    discard_requests_mask.np[:num_reqs])[0]
slot_mapping = self.pcp_manager.get_padded_slot_mapping(
    num_tokens,
    slot_mapping,
)
vllm_ascend/worker/pcp_utils.py
Outdated
    return dcp_local_seq_lens

def generate_kv_idx(self, scheduler_output, input_batch):
    if not self.pcp_size > 1:
vllm_ascend/worker/pcp_utils.py
Outdated
def generate_kv_idx(self, scheduler_output, input_batch):
    if not self.pcp_size > 1:
        return
    self.cp_kv_recover_idx_for_chunk = [[] for _ in range(self.pcp_size)]
The attribute self.pcp_size is not defined in this class. This will cause an AttributeError. You probably meant to use self.pcp_world_size.
- self.cp_kv_recover_idx_for_chunk = [[] for _ in range(self.pcp_size)]
+ self.cp_kv_recover_idx_for_chunk = [[] for _ in range(self.pcp_world_size)]
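For context, an attribute that was never assigned fails only at the moment the line executes, so this kind of typo passes import and class construction and only surfaces when the method runs. A minimal illustration with a hypothetical stand-in class (not the real PCPManager):

```python
class Manager:
    """Hypothetical stand-in to show the failure mode flagged above."""

    def __init__(self, pcp_world_size):
        self.pcp_world_size = pcp_world_size  # the only attribute defined

    def generate_kv_idx_buggy(self):
        # Pattern from the review: self.pcp_size is never assigned,
        # so this raises AttributeError the first time it is called.
        return [[] for _ in range(self.pcp_size)]

    def generate_kv_idx_fixed(self):
        # Fix: reference the attribute that actually exists.
        return [[] for _ in range(self.pcp_world_size)]
```

Instantiating `Manager(2)` and calling the buggy method raises AttributeError, while the fixed method returns the expected list of two empty lists.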
vllm_ascend/worker/pcp_utils.py
Outdated
is_prefill = input_batch.num_computed_tokens_cpu[
    i] < self.input_batch.num_prompt_tokens[i]
The attribute self.input_batch is not defined in PCPManager. You should use the input_batch argument passed to the method.
- is_prefill = input_batch.num_computed_tokens_cpu[
-     i] < self.input_batch.num_prompt_tokens[i]
+ is_prefill = input_batch.num_computed_tokens_cpu[
+     i] < input_batch.num_prompt_tokens[i]
vllm_ascend/worker/pcp_utils.py
Outdated
kv_req_offset = 0
q_head_chunk_id = self.pcp_world_rank
q_tail_chunk_id = self.pcp_world_size * 2 - 1 - self.pcp_world_rank
for i, seq_len in enumerate(self.query_lens):
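The head/tail indexing in this hunk appears to follow the common load-balancing scheme for causal attention under context parallelism: the sequence is split into 2*P chunks and rank r takes chunk r plus its mirror chunk 2*P-1-r, so every rank sees a comparable amount of causal work. A standalone sketch of that assignment (illustration only, not the code under review):

```python
def chunk_assignment(pcp_world_size):
    """Map each rank to its (head, tail) chunk ids out of 2*P chunks.

    head = rank, tail = 2*P - 1 - rank, mirroring the hunk above.
    """
    return {
        rank: (rank, pcp_world_size * 2 - 1 - rank)
        for rank in range(pcp_world_size)
    }
```

Under causal masking, chunk i attends to chunks 0..i, i.e. i+1 units of work; pairing chunk r with chunk 2*P-1-r gives every rank the same total (r+1) + (2*P-r) = 2*P+1 units, which is the point of the head/tail split.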
vllm_ascend/worker/pcp_utils.py
Outdated
tail_attn_nomask_seqlens = torch.tensor(
    [chunk_seqlens, kv_with_q_tail_nomask_seqlens],
    dtype=torch.int32)
pcp_prefill_mask = self.attn_mask
vllm_ascend/worker/pcp_utils.py
Outdated
@@ -0,0 +1,592 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from typing import List
Rather than creating a separate new pcp_utils file, follow the approach of the GPU commn_utils and create a shared utility file.
OK, I will change it.
Is it possible to refactor the pcp_dcp logic in attention so that the pcp and non-pcp paths share as much common processing as possible, minimizing major branching differences?
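One way to approach this is to make the non-pcp path the pcp_size == 1 degenerate case of a single shared routine, so both paths execute the same code and the "branch" reduces to how the sequence is partitioned. A rough sketch under that assumption; the function and its names are hypothetical, not the vLLM Ascend API:

```python
def partition_tokens(token_ids, pcp_size, pcp_rank):
    """Shared path: split tokens into 2*pcp_size chunks and keep this
    rank's head/tail chunk pair.

    With pcp_size == 1 the two chunks together cover the whole sequence,
    so the non-pcp path falls out as the degenerate case and needs no
    separate branch.
    """
    n_chunks = 2 * pcp_size
    chunk = max(1, (len(token_ids) + n_chunks - 1) // n_chunks)
    chunks = [token_ids[i * chunk:(i + 1) * chunk] for i in range(n_chunks)]
    head, tail = pcp_rank, 2 * pcp_size - 1 - pcp_rank
    return chunks[head] + chunks[tail]
```

The attention kernels would then always operate on a head chunk plus a tail chunk, with pcp_size == 1 simply making that pair the full sequence; only the collective communication would remain conditional on pcp_size > 1.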
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?