[perf] fix: profiler bug fix for agent loop #4320
base: main
Conversation
Code Review
This pull request introduces profiling capabilities for the agent loop by adding start_profile and stop_profile methods to AgentLoopManager, vLLMHttpServerBase, and vLLMReplica. The changes correctly propagate the profiling calls down to the workers. My review includes one high-severity suggestion to fix a blocking call (ray.get) within an asynchronous actor in vLLMHttpServerBase, which should be converted to a non-blocking asyncio.gather to prevent blocking the event loop and to maintain consistency with other asynchronous methods in the class.
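The suggested change can be sketched as follows. This is an illustrative stand-in, not verl's actual code: `FakeWorker` and `ServerSketch` are hypothetical names, and the real class would hold Ray actor handles. The point of the review is that replacing a blocking `ray.get(...)` with an awaited `asyncio.gather(...)` keeps the async actor's event loop free:

```python
import asyncio

# Illustrative stand-ins for Ray actor handles; in the real vLLMHttpServerBase
# these would be remote worker refs, and a blocking ray.get(...) would be
# replaced by `await asyncio.gather(...)` as the review suggests.
class FakeWorker:
    async def start_profile(self, **kwargs):
        await asyncio.sleep(0)  # simulate the remote call completing
        return "started"

class ServerSketch:
    def __init__(self, workers):
        self.workers = workers

    async def start_profile(self, **kwargs):
        # Non-blocking fan-out: all workers start profiling concurrently,
        # and the event loop is never blocked waiting on results.
        return await asyncio.gather(
            *[w.start_profile(**kwargs) for w in self.workers]
        )

results = asyncio.run(ServerSketch([FakeWorker(), FakeWorker()]).start_profile())
print(results)
```

This mirrors the pattern already used by the other asynchronous methods the review mentions.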
The overall process control and the actual profiling startup are far apart in the call chain, and the original solution required passing the call through many layers, which is not elegant.
verl/utils/profiler/mstx_profile.py
Outdated
```python
if (not self.discrete or self.async_start) and NPUProfiler._define_count == 0:
    if not self.discrete:
        prof_role = "e2e"
        prof_step = profile_step
```
prof_step and profile_step are confusingly similar. The new variables are also unnecessary; the original role and profile_step already meet the requirements.
The corresponding variables have been deprecated in the new solution.
@FightingZhen @wuxibin89 ready for review. This PR is a bugfix for legacy_workers.
verl/workers/rollout/replica.py
Outdated
```python
async def start_profile(self, **kwargs):
    """Start profiling on all workers."""
    await asyncio.gather(*[worker.start_capture_profile.remote(**kwargs) for worker in self.workers])
```
This is only suitable for vLLM with vLLMAsyncRollout (ModelRunner) colocated with the trainer in the same process, while SGLang uses separate processes. We're working on separating the vLLM process as well (#4280).
For rollout performance profiling, we should reuse the vLLM/SGLang server metrics.
Understood. I completely agree that using native server metrics is the right direction.
That said, this PR is necessary as a bugfix for the currently supported architecture (colocated vLLM) to ensure profiling data is captured correctly right now. We treat this as an interim solution and will align with the native metrics approach after the process separation (#4280) lands.
Is there any way to unify start_e2e_profiler and start_capture_profiler?
```python
if self.device_mesh["infer_tp"].get_local_rank() == 0:
    await self._engine.flush_cache()

async def start_async_rollout_profile(self, **kwargs):
```
What about moving start_async_rollout_profile to DistProfilerExtension?
```python
    """Start an async rollout profiling segment."""
    await self.rollout.start_async_rollout_profile(**kwargs)

@register(dispatch_mode=Dispatch.DIRECT_ROLLOUT_METHOD)
```
Try to encapsulate the rollout profiler methods, and support choosing either the engine's built-in profiler capability or verl's own profiler capability.
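One way to read this suggestion, sketched with hypothetical names (RolloutProfiler, EngineProfiler, VerlProfiler, and make_profiler are illustrative, not existing verl classes): hide the backend choice behind one small interface so callers never touch engine-specific profiling APIs directly.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the reviewer's idea: one interface, two backends.
class RolloutProfiler(ABC):
    @abstractmethod
    def start(self, **kwargs) -> str: ...

    @abstractmethod
    def stop(self) -> str: ...

class EngineProfiler(RolloutProfiler):
    """Would delegate to the inference engine's built-in profiler."""
    def start(self, **kwargs) -> str:
        return "engine:start"

    def stop(self) -> str:
        return "engine:stop"

class VerlProfiler(RolloutProfiler):
    """Would delegate to verl's own profiler utilities."""
    def start(self, **kwargs) -> str:
        return "verl:start"

    def stop(self) -> str:
        return "verl:stop"

def make_profiler(backend: str) -> RolloutProfiler:
    # Config-driven selection; the rest of the rollout code only
    # ever sees the RolloutProfiler interface.
    return EngineProfiler() if backend == "engine" else VerlProfiler()

profiler = make_profiler("engine")
print(profiler.start(step=1), profiler.stop())
```

The same factory could later grow an sglangExtension/vllmExtension split without changing any call sites.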
tardis-key
left a comment
- Do not rename start/stop. The implementation of the second pair of functions depends on the patch implementation.
- Patch vLLM's start_profiler capability rather than unraveling vLLM's complex scheduling mechanism, so that verl's external scheduling logic stays cleaner.
- Retain distinct server profiling capabilities for SGLang and vLLM, and consider implementing a rolloutExtension or sglangExtension/vllmExtension.
Co-authored-by: tardis-key <[email protected]>
What does this PR do?
Updated description: This PR fixes the missing rollout profiling data caused by the Agent Loop architecture shift.
Root cause: Inference requests now bypass the worker's generate_sequences method, so performance data for the rollout phase is missing in Discrete Mode.
Solution: We add explicit profiling control to support the current rollout workers. This is a backward-compatible fix.
- New methods start_capture_profile and stop_capture_profile in DistProfiler and the workers.
- AgentLoopManager now explicitly starts and stops profiling via vLLMReplica around inference requests.
This ensures existing rollout workers can collect performance data correctly.
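The control flow described above can be sketched as follows. FakeReplica and AgentLoopManagerSketch are illustrative stand-ins, not the actual verl classes: the manager brackets inference with explicit profile start/stop calls on every replica, so data is captured even though requests no longer flow through generate_sequences.

```python
import asyncio

class FakeReplica:
    """Stand-in for a vLLM rollout replica; records profiling calls."""
    def __init__(self):
        self.events = []

    async def start_profile(self, **kwargs):
        self.events.append("start")

    async def stop_profile(self):
        self.events.append("stop")

class AgentLoopManagerSketch:
    def __init__(self, replicas):
        self.replicas = replicas

    async def generate_sequences(self):
        # Explicitly start profiling on all replicas before inference...
        await asyncio.gather(*[r.start_profile() for r in self.replicas])
        try:
            await asyncio.sleep(0)  # stand-in for the actual inference requests
        finally:
            # ...and stop it afterwards, even if inference raises.
            await asyncio.gather(*[r.stop_profile() for r in self.replicas])

replicas = [FakeReplica(), FakeReplica()]
asyncio.run(AgentLoopManagerSketch(replicas).generate_sequences())
print([r.events for r in replicas])
```

The try/finally is a deliberate choice in this sketch: a failed generation should still close the profiling segment rather than leave the profiler running.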
Checklist Before Starting
Title format: [{modules}] {type}: {description} (this will be checked by the CI).
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, like [megatron, fsdp, doc].
- {type} is in feat, fix, refactor, chore, test.
- Add [BREAKING] to the beginning of the title for breaking changes, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching.
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always.
- Request CI in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)