[ci] feat: add profiling tests to vLLM ci#5215
Conversation
|
ChenGary13 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
There was a problem hiding this comment.
Code Review
This pull request adds a shell script for a profiling test. I've found two critical issues in the script that will cause it to fail. One is the use of a placeholder path for saving results, which is not suitable for a CI environment. The other is a shell syntax error in a variable assignment. Both issues need to be fixed for the script to run correctly.
|
@wucong25 This CI script involves file writing and deletion. Please confirm the file read/write usage in the CI environment. |
|
lgtm |
| global_profiler.steps=$PROFILE_STEPS \ | ||
| global_profiler.save_path="$SAVE_PATH" $@ | ||
|
|
||
| python3 "tests/utils/test_check_and_profiler_output.py" --profiler-dir="$SAVE_PATH" |
There was a problem hiding this comment.
Could you also help to enable profiler in GPU fsdp e2e test?
There was a problem hiding this comment.
ok, We will perform the same verification on the GPU
|
In test_check_profiler_output, there is duplicate logic between NPU and GPU; please reuse it. |
|
|
||
| # Call corresponding check logic based on device type | ||
| if self.device_type == "gpu": | ||
| return self._check_gpu_profiler() |
There was a problem hiding this comment.
There is still a large amount of duplicate code in these two functions. Their functionalities are almost identical, differing only in the part where numeric checks are performed. We should extract the common logic.
| return False | ||
|
|
||
| # Check files in directory | ||
| for gpu_dir in dirs: |
There was a problem hiding this comment.
if u expect len(dirs) == 1(line 51), there is no need to iterate over it. Unless u share common logic with npu.
| for stage in self.TARGET_STAGES: | ||
| # Determine expected directory count for each stage | ||
| if stage == "*_rollout_*": | ||
| expected_count = 2 |
There was a problem hiding this comment.
This number came from a specific test scenario and lacks generalization. We can directly require that len >= 1 during rollout, and len == 1 in ref and actor_update
|
|
||
| # Print debug information | ||
| for d in dirs: | ||
| print(f"[{stage}] Found: {d}") |
There was a problem hiding this comment.
Unify your output method. Change it to logger.info
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.