cc @qgallouedec: tagging you since it looks like you built rollout_func and environment_factory. We hit this while wiring up a multi-machine agent training setup (environment infra on one machine, TRL + vLLM on a GPU node). Curious whether decoupling the rollout interface is something that fits with the async GRPO direction.
Background
I'm building an agent training environment library where the environment infrastructure (task orchestration, sandboxed code execution, multi-step tool use, rubric evaluation) is heavy enough that it can't reasonably run on the same machine as TRL. Think: Postgres, Inngest, E2B sandboxes, LLM judge calls, all needed to run one episode.
The current rollout_func API requires this to run in-process on the trainer, which means I have to either:

1. run the whole environment stack on the trainer machine, or
2. have rollout_func make outbound HTTP calls to a remote environment server and poll a remote DB for results.

I ended up with option 2 and it works, but the networking is painful: the GPU node drives the loop while my laptop acts as a servant, which is the opposite of what you'd expect. The ephemeral GPU node is the "master" and my persistent environment machine is the "worker."
What I think would help
A submit/poll interface instead of a synchronous callback. InProcessRolloutService wraps the existing rollout_func for backwards compat. HttpRolloutService calls a remote endpoint. The environment pushes completed trajectories to a shared buffer, and the trainer pulls from it.

I think this also lines up with async GRPO: the submit/poll pattern is inherently async, and tagging each trajectory with policy_version gives the trainer what it needs to apply importance sampling to stale batches. The external buffer ends up being both the decoupling mechanism and the async training buffer.
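To make the shape concrete, here is a rough sketch of what the submit/poll interface could look like. This is not TRL's API. The `Trajectory` fields, the `RolloutService` protocol, the simplified `RolloutFunc` signature (prompts in, `(completion, reward)` pairs out), `echo_rollout`, and `drop_stale` are all assumptions made for illustration; only the names `InProcessRolloutService`, `rollout_func`, and `policy_version` come from the proposal. The version tag is used here only for a staleness cutoff, which is simpler than importance sampling; a real correction would also need the behavior policy's log-probs stored alongside each trajectory.

```python
from __future__ import annotations

import uuid
from collections import deque
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Trajectory:
    """One completed episode, tagged with the policy version that produced it."""
    request_id: str
    prompt: str
    completion: str
    reward: float
    policy_version: int


class RolloutService(Protocol):
    """Hypothetical interface: the trainer submits work and pulls results later."""

    def submit(self, prompts: list[str], policy_version: int) -> str: ...

    def poll(self, max_items: int) -> list[Trajectory]: ...


# Assumed, simplified callback shape (not the real rollout_func signature).
RolloutFunc = Callable[[list[str]], list[tuple[str, float]]]


class InProcessRolloutService:
    """Backwards-compat wrapper: runs the callback synchronously inside submit()."""

    def __init__(self, rollout_func: RolloutFunc) -> None:
        self._rollout_func = rollout_func
        self._buffer: deque[Trajectory] = deque()

    def submit(self, prompts: list[str], policy_version: int) -> str:
        request_id = uuid.uuid4().hex
        results = self._rollout_func(prompts)
        for prompt, (completion, reward) in zip(prompts, results):
            self._buffer.append(
                Trajectory(request_id, prompt, completion, reward, policy_version)
            )
        return request_id

    def poll(self, max_items: int) -> list[Trajectory]:
        # Return up to max_items finished trajectories, oldest first.
        count = min(max_items, len(self._buffer))
        return [self._buffer.popleft() for _ in range(count)]


def drop_stale(
    batch: list[Trajectory], current_version: int, max_lag: int
) -> list[Trajectory]:
    """Keep only trajectories generated at most max_lag policy updates ago."""
    return [t for t in batch if current_version - t.policy_version <= max_lag]


def echo_rollout(prompts: list[str]) -> list[tuple[str, float]]:
    """Toy stand-in for a real environment: uppercase completion, length reward."""
    return [(p.upper(), float(len(p))) for p in prompts]


service = InProcessRolloutService(echo_rollout)
service.submit(["fix the bug", "write a test"], policy_version=3)
fresh = drop_stale(service.poll(max_items=8), current_version=4, max_lag=1)
```

An HTTP-backed variant would implement the same two methods against a remote endpoint, so the trainer loop would not need to know which one it is talking to.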
Is this something that aligns with where GRPO is headed? Happy to put together a more concrete proposal if there's interest.