Decouple rollout from trainer process — RolloutService protocol #5509

cm2435 · 2026-04-10T16:14:34Z


Background

I'm building an agent training environment library where the environment infrastructure (task orchestration, sandboxed code execution, multi-step tool use, rubric evaluation) is heavy enough that it can't reasonably run on the same machine as TRL. Think: Postgres, Inngest, E2B sandboxes, LLM judge calls, all needed to run one episode.

The current rollout_func API requires this to run in-process on the trainer, which means I have to either:

1. Cram the entire environment stack onto the GPU node (wasteful, complex), or
2. Have rollout_func make outbound HTTP calls to a remote environment server and poll a remote DB for results

I ended up with option 2 and it works, but the networking is painful — the GPU node drives the loop while my laptop acts as a servant, which is the opposite of what you'd expect. The ephemeral GPU node is the "master" and my persistent environment machine is the "worker."

What I think would help

A submit/poll interface instead of a synchronous callback:

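from __future__ import annotations  # defers the RolloutResult annotation (type not defined in this sketch)
from typing import Protocol

class RolloutService(Protocol):
    def submit(self, prompts: list[str], policy_version: str) -> str: ...  # returns a batch_id to pass to poll()
    def poll(self, batch_id: str) -> RolloutResult | None: ...  # presumably None until the batch is ready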

InProcessRolloutService wraps the existing rollout_func for backwards compat. HttpRolloutService calls a remote endpoint. The environment pushes completed trajectories to a shared buffer, the trainer pulls from it.
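To make the backwards-compat wrapper concrete, here is a minimal sketch of what InProcessRolloutService could look like. The RolloutResult fields and the `Callable[[list[str]], list[str]]` signature are assumptions for illustration only; they are not TRL's actual rollout_func signature or return type.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class RolloutResult:
    # Hypothetical container; the proposal does not define its fields.
    batch_id: str
    policy_version: str
    completions: list[str]


class RolloutService(Protocol):
    def submit(self, prompts: list[str], policy_version: str) -> str: ...
    def poll(self, batch_id: str) -> RolloutResult | None: ...


class InProcessRolloutService:
    """Runs a synchronous callable at submit time and stores the result."""

    def __init__(self, rollout_func: Callable[[list[str]], list[str]]):
        # Stand-in signature, not TRL's actual rollout_func signature.
        self._rollout_func = rollout_func
        self._results: dict[str, RolloutResult] = {}

    def submit(self, prompts: list[str], policy_version: str) -> str:
        batch_id = uuid.uuid4().hex
        # In-process, the rollout finishes before submit returns.
        completions = self._rollout_func(prompts)
        self._results[batch_id] = RolloutResult(batch_id, policy_version, completions)
        return batch_id

    def poll(self, batch_id: str) -> RolloutResult | None:
        # Each batch is handed out once; unknown or consumed ids return None.
        return self._results.pop(batch_id, None)


service: RolloutService = InProcessRolloutService(
    lambda prompts: [p.upper() for p in prompts]
)
batch_id = service.submit(["hello"], policy_version="step-0")
result = service.poll(batch_id)
```

An HTTP-backed variant would keep the same two methods and only change where the work happens, which is the point of putting the protocol between the trainer and the environment.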

I think this also lines up with async GRPO: the submit/poll pattern is inherently async, and policy_version tells the trainer which policy produced each batch, so stale batches can be corrected with importance sampling. The external buffer ends up being both the decoupling mechanism and the async training buffer.
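A rough sketch of how policy_version could feed that correction. It assumes each batch carries per-token log-probabilities from the policy that sampled it, and that policy_version encodes an integer training step; neither is specified above. The weighting shown is a plain clipped importance ratio, not GRPO's exact objective, and the clip range is an arbitrary placeholder.

```python
import math


def staleness(current_version: str, batch_version: str) -> int:
    # Assumption: versions are integer training steps encoded as strings.
    return int(current_version) - int(batch_version)


def importance_weights(logp_current, logp_behaviour, clip=(0.8, 1.2)):
    """Per-token ratio pi_current / pi_behaviour, clipped to a fixed range."""
    lo, hi = clip
    weights = []
    for new, old in zip(logp_current, logp_behaviour):
        ratio = math.exp(new - old)  # exp(log pi_current - log pi_behaviour)
        weights.append(min(max(ratio, lo), hi))
    return weights


# A batch sampled at step 10 and consumed at step 12 is two versions stale.
lag = staleness("12", "10")
weights = importance_weights([-1.0, -0.5], [-1.0, -2.0])
```

A trainer could also use the lag to drop batches older than some threshold rather than reweighting them.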

Is this something that aligns with where GRPO is headed? Happy to put together a more concrete proposal if there's interest.

cm2435 · 2026-04-10T16:17:50Z
Author

cc @qgallouedec: tagging you since it looks like you built rollout_func and environment_factory. We hit this while wiring up a multi-machine agent training setup (environment infra on one machine, TRL+vLLM on a few GPU nodes). Curious if decoupling the rollout interface is something that fits with the async GRPO direction.

4 replies

cm2435 Apr 10, 2026
Author

happy to give (much) more detail on request / flesh out into a proper RFC; didn't want to drop a massive unsolicited design doc haha

qgallouedec Apr 10, 2026
Maintainer

thanks for the discussion; the use of rollout_func is discouraged. You should use environment_factory whenever possible. Is there anything specific you can't do with this abstraction?

cm2435 Apr 14, 2026
Author

Thanks @qgallouedec, appreciate the pointer.

environment_factory works well for single-agent tool-use environments (Wordle / Sudoku / BrowserGym), but our setup is a full multi-agent system: multiple agents in a task DAG with dependency edges, inter-agent communication, durable orchestration (think Postgres + task queues + sandboxed code execution + LLM judge evaluation per episode).

The key issue is that environment_factory has TRL driving the episode loop: it generates, parses tool calls, invokes methods, repeats. In our case the environment needs to drive the loop: it schedules which agents run, propagates DAG state, manages sandbox lifecycles, and evaluates with rubrics. We're building this as part of Ergon (an agent training environment library); happy to share more detail on the architecture if useful.
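As a toy illustration of "the environment drives the loop", the sketch below walks a small task DAG in dependency order and records one trajectory for the whole episode. The agent names, the run_agent callback, and the trajectory record shape are all made up for the example; this is not Ergon's scheduler.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical task DAG: each agent maps to the set of agents it depends on.
dag = {"planner": set(), "coder": {"planner"}, "reviewer": {"coder"}}


def run_episode(dag: dict, run_agent: Callable[[str, dict], str]) -> list:
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    outputs = {}
    trajectory = []
    while sorter.is_active():
        # get_ready() yields agents whose dependencies have all finished.
        for agent in sorter.get_ready():
            upstream = {dep: outputs[dep] for dep in dag.get(agent, ())}
            outputs[agent] = run_agent(agent, upstream)
            trajectory.append(
                {"agent": agent, "inputs": upstream, "output": outputs[agent]}
            )
            sorter.done(agent)
    return trajectory


def fake_agent(name: str, upstream: dict) -> str:
    # Stand-in for a real generation call against the current policy.
    return f"{name}({','.join(sorted(upstream.values()))})"


trajectory = run_episode(dag, fake_agent)
```

Here the scheduler, not the trainer, decides which agent runs next; the finished trajectory is what would get handed back to the trainer in one piece.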

The secondary friction is that our entire stack is async (sandbox calls, DB, LLM judges) and environment_factory expects sync methods. This is workable with asyncio.run() wrappers, but it gets messy when the call is nested in an already-running event loop, where asyncio.run() raises a RuntimeError.
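For reference, this is roughly the shape of wrapper that ends up being needed. It is a sketch under the assumption that the coroutine does not depend on objects bound to the outer loop; run_sync and score_episode are hypothetical names.

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine


def run_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine from sync code, whether or not a loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running here and asyncio.run would raise, so run the
    # coroutine on a fresh loop in a worker thread and block until it is done.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def score_episode(value: int) -> int:
    # Hypothetical async call (e.g. a judge request); sleeps instead of doing I/O.
    await asyncio.sleep(0)
    return value * 2


async def called_from_running_loop() -> int:
    # Simulates a sync environment method being invoked inside an event loop.
    return run_sync(score_episode(5))
```

The nested case blocks the outer loop's thread while the worker runs, which is the messy part: it works, but anything else scheduled on that loop stalls until the call returns.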

So the concrete ask isn't "environment_factory is bad"; it's that there's a class of environments (multi-agent, externally orchestrated, infrastructure-heavy) where the environment needs to submit completed trajectories rather than respond to individual tool calls.

cm2435 Apr 14, 2026
Author

If it'd be useful, I'm happy to draft a short RFC with a minimal reproducer (multi-agent episode where the environment drives the loop) and a proposed abstraction change that would support it (backwards-compatible with the existing environment_factory path). Let me know if that's worth putting together.
