[ci] feat: add profiling tests to vLLM ci by Gary-cjy · Pull Request #5215 · verl-project/verl

Gary-cjy · 2026-02-06T06:23:38Z

What does this PR do?

This PR integrates NPU profiling capabilities into the vLLM CI pipeline. It enables detailed performance monitoring for both the Actor and Reference components within the actor_rollout_ref module. By adding these profiling tests, we can capture execution traces, analyze hardware utilization (NPU/CPU), and identify performance bottlenecks during the CI process.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

The changes have been tested by running the updated CI script on an Ascend NPU environment.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.

CLAassistant · 2026-02-06T06:23:45Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Gary-cjy
❌ ChenGary13

ChenGary13 does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

gemini-code-assist

Code Review

This pull request adds a shell script for a profiling test. I've found two critical issues in the script that will cause it to fail. One is the use of a placeholder path for saving results, which is not suitable for a CI environment. The other is a shell syntax error in a variable assignment. Both issues need to be fixed for the script to run correctly.

tests/special_npu/run_qwen2_5_05b_grpo.sh

tardis-key · 2026-02-13T01:13:14Z

@mengchengTang

tests/utils/test_check_profiler_output.py

tests/utils/test_check_and_clean_profiler_output.py

tests/special_npu/run_qwen2_5_05b_grpo.sh

tardis-key · 2026-02-27T07:15:41Z

@wucong25 This CI script involves file writing and deletion. Please confirm the file read/write usage in the CI environment.

tardis-key · 2026-03-02T02:03:04Z

lgtm

wuxibin89 · 2026-03-02T02:40:17Z

tests/special_npu/run_qwen2_5_05b_grpo.sh

+    global_profiler.steps=$PROFILE_STEPS \
+    global_profiler.save_path="$SAVE_PATH" $@
+
+python3 "tests/utils/test_check_and_profiler_output.py" --profiler-dir="$SAVE_PATH"


Could you also help to enable profiler in GPU fsdp e2e test?

OK, we will perform the same verification on the GPU.

tardis-key · 2026-03-10T06:12:37Z

In test_check_profiler_output, there is duplicate logic between NPU and GPU; please reuse it.

tardis-key · 2026-03-12T06:47:28Z

tests/utils/test_check_profiler_output.py

+
+        # Call corresponding check logic based on device type
+        if self.device_type == "gpu":
+            return self._check_gpu_profiler()


There is still a large amount of duplicate code in these two functions. Their functionalities are almost identical, differing only in the part where numeric checks are performed. We should extract the common logic.
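
One way to read this suggestion is sketched below: the directory walk is written once, and only the numeric check is looked up per device. This is a minimal illustration, not the PR's code. The function name `check_profiler_dirs`, the `NUMERIC_CHECKS` table, and the thresholds in it are all assumptions made for the example.

```python
import os
from typing import Callable, Dict

# Hypothetical numeric checks; in this sketch they are the only device-specific part.
# The thresholds are placeholders, not values taken from the PR.
NUMERIC_CHECKS: Dict[str, Callable[[int], bool]] = {
    "gpu": lambda file_count: file_count >= 1,
    "npu": lambda file_count: file_count >= 2,
}


def check_profiler_dirs(root: str, device_type: str) -> bool:
    """Shared logic: every subdirectory of root must pass the device's numeric check."""
    numeric_ok = NUMERIC_CHECKS[device_type]
    subdirs = [entry.path for entry in os.scandir(root) if entry.is_dir()]
    if not subdirs:
        # No profiler output directories were produced at all.
        return False
    for subdir in subdirs:
        file_count = sum(1 for entry in os.scandir(subdir) if entry.is_file())
        if not numeric_ok(file_count):
            return False
    return True
```

With this shape, supporting another device type would only mean adding one entry to the table, and the scan itself stays untouched.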

tardis-key · 2026-03-12T06:49:14Z

tests/utils/test_check_profiler_output.py

+                return False
+
+            # Check files in directory
+            for gpu_dir in dirs:


If you expect len(dirs) == 1 (line 51), there is no need to iterate over it, unless you share common logic with NPU.
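
A small sketch of the point being made here, assuming the count check and the directory access can live in one helper. The name `single_dir` is hypothetical and does not come from the PR.

```python
from typing import List, Optional


def single_dir(dirs: List[str]) -> Optional[str]:
    """Return the only directory, or None when the count is not exactly one."""
    if len(dirs) != 1:
        return None
    # Unpacking makes the "exactly one" expectation explicit; no loop is needed.
    (only_dir,) = dirs
    return only_dir
```

The caller can then check the files of the returned directory directly instead of looping over a list that is known to hold one element.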

tardis-key · 2026-03-12T06:52:48Z

tests/utils/test_check_profiler_output.py

+        for stage in self.TARGET_STAGES:
+            # Determine expected directory count for each stage
+            if stage == "*_rollout_*":
+                expected_count = 2


This number came from a specific test scenario and does not generalize. We can directly require len >= 1 during rollout, and len == 1 in ref and actor_update.
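
The proposed rule could look roughly like the following. This is a sketch under assumptions: the helper name `stage_count_ok` is invented for illustration, and rollout stages are detected by the substring "rollout", based on the `"*_rollout_*"` pattern quoted in the diff.

```python
def stage_count_ok(stage: str, found: int) -> bool:
    """Apply the suggested rule: at least one directory for rollout, exactly one otherwise."""
    if "rollout" in stage:
        # The rollout directory count depends on the test scenario, so only require a minimum.
        return found >= 1
    # ref and actor_update are expected to produce exactly one directory.
    return found == 1
```

This removes the hard-coded `expected_count = 2` while still failing when a stage produces no output.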

tardis-key · 2026-03-12T06:54:03Z

tests/utils/test_check_profiler_output.py

+
+            # Print debug information
+            for d in dirs:
+                print(f"[{stage}] Found: {d}")


Unify your output method. Change it to logger.info
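
For reference, the quoted `print` call could be replaced along these lines using the standard `logging` module. The wrapper function `log_found_dirs` is hypothetical and exists only to make the sketch self-contained.

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def log_found_dirs(stage: str, dirs: List[str]) -> None:
    """Report each matched directory through the logger instead of print."""
    for d in dirs:
        # Lazy %-style arguments avoid formatting the message when INFO is disabled.
        logger.info("[%s] Found: %s", stage, d)
```

Messages at INFO level only appear if logging is configured to show them, for example with `logging.basicConfig(level=logging.INFO)` in the script's entry point.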

gemini-code-assist bot reviewed Feb 6, 2026

tests/special_npu/run_qwen2_5_05b_grpo.sh (outdated)

tests/special_npu/run_qwen2_5_05b_grpo.sh

tardis-key marked this pull request as draft February 6, 2026 06:29

tardis-key reviewed Feb 6, 2026

tests/special_npu/run_qwen2_5_05b_grpo.sh (outdated)

tardis-key changed the title from "Add a ci test shell" to "[ci] feat: add profiling tests to vLLM ci" Feb 6, 2026

tardis-key reviewed Feb 9, 2026

tardis-key marked this pull request as ready for review February 9, 2026 02:56

tardis-key reviewed Feb 9, 2026

tests/special_npu/run_qwen2_5_05b_grpo.sh (outdated)

tardis-key reviewed Feb 12, 2026

tests/special_npu/run_qwen2_5_05b_grpo.sh

tardis-key reviewed Feb 13, 2026

tests/utils/test_check_profiler_output.py

mengchengTang reviewed Feb 13, 2026

tests/utils/test_check_and_clean_profiler_output.py (outdated)

tardis-key reviewed Feb 27, 2026

tests/special_npu/run_qwen2_5_05b_grpo.sh (outdated)

tardis-key mentioned this pull request Feb 27, 2026

[perf, trtllm] feat: Add Nsight support for rollout server mode (trtllm) #5391

Open

8 tasks

Gary-cjy requested review from ISEEKYAN and vermouth1992 as code owners February 27, 2026 13:17

wuxibin89 reviewed Mar 2, 2026

Gary-cjy force-pushed the main branch from 4f6ffe0 to 96d1cb1 Compare March 10, 2026 03:28

tardis-key reviewed Mar 12, 2026

tardis-key marked this pull request as draft March 12, 2026 06:54

Gary-cjy and others added 7 commits March 13, 2026 16:49

add ci test shell

6a51ef0

fix ci test shell

1fa46bb

fix a ci test shell

ce3a3ea

update ci test shell

480e56d

add test check and clean profiler output

5119ce3

fix:pre-commit

2bc749e

feat:add proling check for rollout and ref

23aae84

ChenGary13 added 5 commits March 13, 2026 16:49

feat:fix clean check

eea367a

feat:add profiling check in GPU

adc8e34

feat: reuse of NPU/GPU profiler files

6fb5529

feat:fix profiler check and pre-commit

acc1dee

feat:fix profiler check

0fbb353

Gary-cjy force-pushed the main branch from 4e8a9c7 to 0fbb353 Compare March 13, 2026 08:51

tardis-key marked this pull request as ready for review March 13, 2026 09:24

feat:fix gpu/npu profile check

51e36db

wuxibin89 approved these changes Mar 16, 2026

wuxibin89 merged commit 18f1a7a into verl-project:main Mar 16, 2026
22 of 38 checks passed

5 participants
