Pcp dcp refactor #5001
Conversation
Signed-off-by: zhenwenqi2024 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the Prefill Context Parallelism (PCP) and Decode Context Parallelism (DCP) logic by moving it from NPUModelRunner into a new PCPManager class. While this is a good structural improvement, the refactoring has introduced several critical bugs, including typos, incorrect method calls with missing or wrong arguments, and usage of undefined attributes. These issues will lead to runtime errors and must be addressed.
self.dcp_rank = 0
self.pcp_size = 1
self.pcp_rank = 0
max_buffer_num_tokens = self.max_num_token
self.pcp_manager = PCPManager(
    self.pcp_size,
    self.pcp_rank,
    self.dcp_size,
    self.dcp_rank,
    max_buffer_num_tokens,
    self.max_num_reqs,
    self.device,
    self.pin_memory,
)
The constructor for PCPManager is called with incorrect arguments. The arguments decode_threshold and vllm_config are missing, which will lead to a TypeError at runtime.
self.pcp_manager = PCPManager(
    self.pcp_size,
    self.pcp_rank,
    self.dcp_size,
    self.dcp_rank,
    max_buffer_num_tokens,
    self.max_num_reqs,
    self.decode_threshold,
    self.device,
    self.vllm_config,
    self.pin_memory,
)
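The mismatch is easier to see against a constructor whose parameter list matches the corrected call. A minimal sketch only: the names below mirror the corrected call site in this review, and the real PCPManager in vllm_ascend may define its signature differently.

```python
class PCPManager:
    """Illustrative stand-in whose __init__ matches the corrected call site.

    Dropping decode_threshold and vllm_config from the call, as the original
    code did, raises TypeError at construction time.
    """

    def __init__(self, pcp_size, pcp_rank, dcp_size, dcp_rank,
                 max_buffer_num_tokens, max_num_reqs, decode_threshold,
                 device, vllm_config, pin_memory):
        self.pcp_size = pcp_size
        self.pcp_rank = pcp_rank
        self.dcp_size = dcp_size
        self.dcp_rank = dcp_rank
        self.max_buffer_num_tokens = max_buffer_num_tokens
        self.max_num_reqs = max_num_reqs
        self.decode_threshold = decode_threshold
        self.device = device
        self.vllm_config = vllm_config
        self.pin_memory = pin_memory
```

Because the two missing arguments sit in the middle of the list, Python does not silently shift the remaining positionals; it fails immediately with a TypeError naming the missing parameters, which is why this surfaces at runtime rather than producing subtly wrong state.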
total_num_pcp_pads = sum(self.num_pcp_pads)
max_num_scheduled_tokens = max(tokens)
if self.pcp__world_szie > 1:
discard_request_indices = np.nonzero(
    discard_requests_mask.np[:num_reqs])[0]
slot_mapping = self.pcp_manager.get_padded_slot_mapping(
    num_tokens,
    slot_mapping,
)
vllm_ascend/worker/pcp_utils.py
Outdated
    return dcp_local_seq_lens

def generate_kv_idx(self, scheduler_output, input_batch):
    if not self.pcp_size > 1:
vllm_ascend/worker/pcp_utils.py
Outdated
def generate_kv_idx(self, scheduler_output, input_batch):
    if not self.pcp_size > 1:
        return
    self.cp_kv_recover_idx_for_chunk = [[] for _ in range(self.pcp_size)]
The attribute self.pcp_size is not defined in this class. This will cause an AttributeError. You probably meant to use self.pcp_world_size.
- self.cp_kv_recover_idx_for_chunk = [[] for _ in range(self.pcp_size)]
+ self.cp_kv_recover_idx_for_chunk = [[] for _ in range(self.pcp_world_size)]
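For context, an attribute that was never assigned fails only at the moment the line executes, so this kind of typo passes import and class construction and only surfaces when the method runs. A minimal illustration with a hypothetical stand-in class (not the real PCPManager):

```python
class Manager:
    """Hypothetical stand-in to show the failure mode flagged above."""

    def __init__(self, pcp_world_size):
        self.pcp_world_size = pcp_world_size  # the only attribute defined

    def generate_kv_idx_buggy(self):
        # Pattern from the review: self.pcp_size is never assigned,
        # so this raises AttributeError the first time it is called.
        return [[] for _ in range(self.pcp_size)]

    def generate_kv_idx_fixed(self):
        # Fix: reference the attribute that actually exists.
        return [[] for _ in range(self.pcp_world_size)]
```

Instantiating `Manager(2)` and calling the buggy method raises AttributeError, while the fixed method returns the expected list of two empty lists.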
vllm_ascend/worker/pcp_utils.py
Outdated
is_prefill = input_batch.num_computed_tokens_cpu[
    i] < self.input_batch.num_prompt_tokens[i]
The attribute self.input_batch is not defined in PCPManager. You should use the input_batch argument passed to the method.
- is_prefill = input_batch.num_computed_tokens_cpu[
-     i] < self.input_batch.num_prompt_tokens[i]
+ is_prefill = input_batch.num_computed_tokens_cpu[
+     i] < input_batch.num_prompt_tokens[i]
vllm_ascend/worker/pcp_utils.py
Outdated
kv_req_offset = 0
q_head_chunk_id = self.pcp_world_rank
q_tail_chunk_id = self.pcp_world_size * 2 - 1 - self.pcp_world_rank
for i, seq_len in enumerate(self.query_lens):
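The head/tail indexing in this hunk appears to follow the common load-balancing scheme for causal attention under context parallelism: the sequence is split into 2*P chunks and rank r takes chunk r plus its mirror chunk 2*P-1-r, so every rank sees a comparable amount of causal work. A standalone sketch of that assignment (illustration only, not the code under review):

```python
def chunk_assignment(pcp_world_size):
    """Map each rank to its (head, tail) chunk ids out of 2*P chunks.

    head = rank, tail = 2*P - 1 - rank, mirroring the hunk above.
    """
    return {
        rank: (rank, pcp_world_size * 2 - 1 - rank)
        for rank in range(pcp_world_size)
    }
```

Under causal masking, chunk i attends to chunks 0..i, i.e. i+1 units of work; pairing chunk r with chunk 2*P-1-r gives every rank the same total (r+1) + (2*P-r) = 2*P+1 units, which is the point of the head/tail split.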
vllm_ascend/worker/pcp_utils.py
Outdated
tail_attn_nomask_seqlens = torch.tensor(
    [chunk_seqlens, kv_with_q_tail_nomask_seqlens],
    dtype=torch.int32)
pcp_prefill_mask = self.attn_mask
vllm_ascend/worker/pcp_utils.py
Outdated
@@ -0,0 +1,592 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from typing import List
Rather than creating a separate new pcp_utils file, follow the approach of the GPU commn_utils and create a shared utility file.
OK, I will change it.
Is it possible to refactor the pcp_dcp logic in attention so that the pcp and non-pcp paths share as much common processing as possible, minimizing major branching differences?
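One way to approach this is to make the non-pcp path the pcp_size == 1 degenerate case of a single shared routine, so both paths execute the same code and the "branch" reduces to how the sequence is partitioned. A rough sketch under that assumption; the function and its names are hypothetical, not the vLLM Ascend API:

```python
def partition_tokens(token_ids, pcp_size, pcp_rank):
    """Shared path: split tokens into 2*pcp_size chunks and keep this
    rank's head/tail chunk pair.

    With pcp_size == 1 the two chunks together cover the whole sequence,
    so the non-pcp path falls out as the degenerate case and needs no
    separate branch.
    """
    n_chunks = 2 * pcp_size
    chunk = max(1, (len(token_ids) + n_chunks - 1) // n_chunks)
    chunks = [token_ids[i * chunk:(i + 1) * chunk] for i in range(n_chunks)]
    head, tail = pcp_rank, 2 * pcp_size - 1 - pcp_rank
    return chunks[head] + chunks[tail]
```

The attention kernels would then always operate on a head chunk plus a tail chunk, with pcp_size == 1 simply making that pair the full sequence; only the collective communication would remain conditional on pcp_size > 1.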
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?