Commit 4ce150f
[rollout] fix: remove unexpected concurrency bound at 1000 (verl-project#5402)
### What does this PR do?

Remove the unexpected concurrency bound of 1000 that prevented the rollout engine from reaching `actor_rollout_ref.rollout.max_num_seqs` when it is larger than 1000.

The [Ray doc](https://docs.ray.io/en/latest/ray-core/api/doc/ray.actor.ActorClass.options.html#ray.actor.ActorClass.options:~:text=calls%20is%20unlimited.-,max_concurrency,-%E2%80%93%20The%20max%20number) says:

```
max_concurrency: The max number of concurrent calls to allow for this actor. This only works with direct actor calls. The max concurrency defaults to 1 for threaded execution, and 1000 for asyncio execution. Note that the execution order is not guaranteed when max_concurrency > 1.
```

and the call to `{TRTLLM,vLLM,SGLang}HttpServer.generate` is an async remote call:

https://github.com/verl-project/verl/blob/6f4942b1153b23720e74564e00817526b342198c/verl/experimental/agent_loop/agent_loop.py#L114-L120

So the default value limits request concurrency to 1000. This PR sets `max_concurrency` based on `actor_rollout_ref.rollout.max_num_seqs` so that a higher concurrency configured by the user can be achieved.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: https://github.com/verl-project/verl/pulls?q=is%3Apr+is%3Aopen+rollout+concurrency+
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

Current tests should be enough to ensure it does not break anything.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
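The bottleneck described above can be reproduced without Ray at all: a plain `asyncio.Semaphore` plays the role of `max_concurrency`, capping how many `generate` calls are in flight at once. This is an illustrative sketch only — the function names are made up and nothing here is verl or Ray code.

```python
import asyncio


async def demo(cap: int, n_requests: int) -> int:
    """Simulate an async actor whose method calls are bounded to `cap`
    concurrent executions, analogous to Ray's max_concurrency for
    asyncio actors. Returns the peak number of in-flight calls."""
    sem = asyncio.Semaphore(cap)  # stands in for max_concurrency
    in_flight = 0
    peak = 0

    async def generate() -> None:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # yield so other calls can start
            in_flight -= 1

    await asyncio.gather(*(generate() for _ in range(n_requests)))
    return peak


# 50 requests against a cap of 10: at most 10 ever run at once,
# just as 2000 configured sequences were capped at Ray's default 1000.
peak = asyncio.run(demo(cap=10, n_requests=50))
print(peak)  # 10
```

Raising the cap above the request count (as this PR does via `max_num_seqs`) lets all requests proceed concurrently.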
1 parent de87452 commit 4ce150f

File tree

4 files changed (+15, -1 lines changed)

verl/workers/rollout/replica.py

Lines changed: 12 additions & 1 deletion

```diff
@@ -31,6 +31,11 @@
 logger = logging.getLogger(__file__)
 
 
+# Max number of concurrent calls to the methods of Rollout,
+# excluding calls to generate method.
+CONTROL_METHOD_CONCURRENCY = 16
+
+
 class TokenOutput(BaseModel):
     token_ids: list[int]
     """response token ids"""
@@ -92,7 +97,7 @@ def __init__(
         is_reward_model: bool = False,
     ) -> None:
         self.replica_rank = replica_rank
-        self.config = omega_conf_to_dataclass(config)
+        self.config: RolloutConfig = omega_conf_to_dataclass(config)
         self.model_config: HFModelConfig = model_config
 
         self.world_size = (
@@ -229,6 +234,12 @@ def server_handle(self) -> ActorHandle:
         """Get rollout server handle for Token-in-token-out generation."""
         return self._server_handle
 
+    @property
+    def max_concurrency(self) -> int:
+        # 1000 is Ray's default max_concurrency for async execution.
+        # Add some margin to account for control method call.
+        return max(1000, self.config.max_num_seqs + CONTROL_METHOD_CONCURRENCY)
+
     def rollout_worker_use_gpu(self) -> bool:
         return True
```
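The new `max_concurrency` property reduces to a one-line computation. The sketch below restates it as a standalone function so the behavior is easy to check; only the formula and the `CONTROL_METHOD_CONCURRENCY` constant come from the diff, the function itself is illustrative.

```python
# Head-room for non-generate control-plane calls (value from the diff).
CONTROL_METHOD_CONCURRENCY = 16


def max_concurrency(max_num_seqs: int) -> int:
    """Illustrative restatement of RolloutReplica.max_concurrency.

    Never drop below Ray's asyncio default of 1000; above that, add a
    margin so control method calls are not starved by generate() calls.
    """
    return max(1000, max_num_seqs + CONTROL_METHOD_CONCURRENCY)


print(max_concurrency(256))   # small configs keep Ray's default: 1000
print(max_concurrency(4096))  # large configs get 4096 + 16 = 4112
```

So the change is a no-op for configurations with `max_num_seqs` at or below 1000, and only lifts the ceiling for larger ones.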

verl/workers/rollout/sglang_rollout/async_sglang_server.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -491,6 +491,7 @@ async def launch_servers(self):
             ),
             runtime_env={"env_vars": {f"RAY_EXPERIMENTAL_NOSET_{visible_devices_keyword}": "1"}},
             name=name,
+            max_concurrency=self.max_concurrency,
         ).remote(
             config=self.config,
             model_config=self.model_config,
```

verl/workers/rollout/trtllm_rollout/trtllm_async_server.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -339,6 +339,7 @@ async def launch_servers(self):
             ),
             runtime_env={"env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1"}},
             name=name,
+            max_concurrency=self.max_concurrency,
         ).remote(
             config=self.config,
             model_config=self.model_config,
```

verl/workers/rollout/vllm_rollout/vllm_async_server.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -839,6 +839,7 @@ async def launch_servers(self):
                 }
             },
             name=name,
+            max_concurrency=self.max_concurrency,
         ).remote(
             config=self.config,
             model_config=self.model_config,
```
