Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions tests/e2e/nightly/multi_node/config/Qwen3-235B-A22B-A2.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ deployment:
--trust-remote-code
--gpu-memory-utilization 0.9
--async-scheduling
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Suggested PR Title:

[CI][Misc] Modify nightly test case configuration for Qwen3-235B

Suggested PR Summary:

### What this PR does / why we need it?
This PR updates the nightly test configurations for the Qwen3-235B model by adding specific compilation configurations (`cudagraph_mode`). This ensures that the nightly tests cover different execution modes (FULL and PIECEWISE) for better performance and stability validation on Ascend NPUs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This is a configuration change for nightly tests.
References
  1. The PR title and summary must follow the specified format and be provided in markdown code blocks. (link)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The cudagraph_mode value "FULL_DECODE_ONLY" appears to be incorrect. The CUDAGraphMode enum (referenced in vllm_ascend/compilation/acl_graph.py) typically supports "FULL" or "PIECEWISE". Using an unsupported string will likely cause a configuration error or skip graph capture entirely. Please use "FULL" if full graph capture is intended.

        --compilation-config '{"cudagraph_mode":"FULL"}'

-
server_cmd: >
vllm serve "Qwen/Qwen3-235B-A22B"
Expand All @@ -49,6 +50,7 @@ deployment:
--trust-remote-code
--gpu-memory-utilization 0.9
--async-scheduling
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The cudagraph_mode value "FULL_DECODE_ONLY" should be changed to "FULL" to match the expected CUDAGraphMode enum values.

        --compilation-config '{"cudagraph_mode":"FULL"}'

benchmarks:
perf:
case_type: performance
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ deployment:
--trust-remote-code
--no-enable-prefix-caching
--gpu-memory-utilization 0.9
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This server command includes --enforce-eager (line 32), which explicitly disables graph capturing. Adding --compilation-config with a cudagraph_mode here is contradictory and will have no effect. If the goal is to test ACL graphs for this component, --enforce-eager should be removed. Additionally, "FULL_DECODE_ONLY" should be corrected to "FULL".

        --compilation-config '{"cudagraph_mode":"FULL"}'

--additional-config '{"recompute_scheduler_enable": true,"enable_shared_expert_dp": true}'
--kv-transfer-config
'{"kv_connector": "MooncakeLayerwiseConnector",
Expand Down Expand Up @@ -73,6 +74,7 @@ deployment:
--trust-remote-code
--no-enable-prefix-caching
--gpu-memory-utilization 0.9
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The cudagraph_mode value "FULL_DECODE_ONLY" should be changed to "FULL" to match the expected CUDAGraphMode enum values.

        --compilation-config '{"cudagraph_mode":"FULL"}'

--additional-config '{"torchair_graph_config":{"enabled":true}}'
--kv-transfer-config
'{"kv_connector": "MooncakeLayerwiseConnector",
Expand Down
2 changes: 2 additions & 0 deletions tests/e2e/nightly/multi_node/config/Qwen3-235B-A22B.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ deployment:
--trust-remote-code
--no-enable-prefix-caching
--gpu-memory-utilization 0.9
--compilation-config '{"cudagraph_mode":"PIECEWISE"}'
-
server_cmd: >
vllm serve "Qwen/Qwen3-235B-A22B"
Expand All @@ -50,6 +51,7 @@ deployment:
--trust-remote-code
--no-enable-prefix-caching
--gpu-memory-utilization 0.9
--compilation-config '{"cudagraph_mode":"PIECEWISE"}'
benchmarks:
perf:
case_type: performance
Expand Down
Loading