Use cases, pain points, and background
Any environment where the agent produces artifacts that must be executed rather than parsed needs execution isolation. Today this is primarily code (Python, shell commands, git patches, Lean proofs), but the same problem applies to SQL execution against real databases, browser interaction, simulation control, and domains we haven't built environments for yet. Existing environments have independently implemented their own execution isolation with no shared interface, no shared code, and no documentation of security guarantees. The result:
- Duplication: `math_with_code` and `newton_bench` both implement `multiprocessing.Process` + `exec()`/`eval()` workers with slightly different restrictions. `swerl_gen` and `swe_agents` both shell out to Singularity/Apptainer containers through different subprocess wrappers. Each reimplements lifecycle management, timeout handling, and cleanup.
- Inconsistent security properties: There is no documentation of what each approach actually isolates. `code_gen`'s `reliability_guard` explicitly says "NOT a security sandbox." `math_with_code` restricts `__builtins__` but still allows `__import__`. `swerl_gen` runs inside a real Singularity container. A benchmark author has no way to know which isolation level they're getting or which they should choose.
- No composability: You cannot swap a process-based sandbox for a container-based one without rewriting the server. There is no configuration knob; the isolation mechanism is hard-wired into each server's implementation.
- No guidance for new benchmarks: A new environment author who needs execution isolation must read 3-4 existing servers, pick one to cargo-cult from, and hope they picked the right isolation level for their threat model.
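To make the `__import__` point above concrete, here is a minimal illustration of why restricted builtins alone do not confine executed code. This is not the actual worker code from `math_with_code`; the allowed builtins and the `agent_code` string are made up for the example.

```python
# Illustration only: NOT math_with_code's real worker code.
# A restricted builtins table that still exposes __import__.
restricted_globals = {
    "__builtins__": {
        "__import__": __import__,  # still reachable, as noted above
        "abs": abs,
        "range": range,
    }
}

# Hypothetical agent output. open() is not in the restricted builtins,
# yet the code reaches the host through an imported module instead.
agent_code = """
import os
cwd_listing = os.listdir(".")
host_pid = os.getpid()
"""

exec(agent_code, restricted_globals)

# The "restricted" namespace now holds data read from the host.
escaped = isinstance(restricted_globals["cwd_listing"], list)
print("filesystem reachable from restricted exec:", escaped)
```

Anything importable in the worker process is reachable this way, which is why this approach belongs at the weak end of the isolation spectrum.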
Existing execution isolation implementations:
| Approach | Used by | Isolation |
|---|---|---|
| `multiprocessing.Process` + `exec()`/`eval()` with restricted builtins | `math_with_code` | Same-user process, restricted builtins (still allows `__import__`) |
| `multiprocessing.Process` + AST-validated `exec()`/`eval()` | `newton_bench` | Same-user process, static pattern blocking |
| `multiprocessing` + Ray + `reliability_guard` | `code_gen` | Same-user process, explicitly "NOT a security sandbox" |
| `singularity exec` via `subprocess.Popen` | `swerl_gen` | Container (Singularity `.sif`) |
| Apptainer commands via `asyncio.create_subprocess_shell` | `swe_agents` | Container (Apptainer) |
| HTTP client to external sandbox service | `math_formal_lean` (Lean4), `ns_tools` (NeMo-Skills) | Depends on remote service |
| Docker via Aviary library | `aviary` (BixBench) | Container (Docker, managed by Aviary) |
Note: this is distinct from session isolation. Gym already has well-designed session isolation via session_id in SimpleResourcesServer — parallel rollouts don't share state. That's application-level isolation and it works. Execution isolation is the OS-level layer underneath it: preventing agent-generated code from escaping its sandbox, accessing the host filesystem, or leaking state between rollouts. Not all environments need it — environments that do string matching, LLM judging, or API calls don't execute agent output at all.
Note: this is also distinct from cluster orchestration (Kubernetes, SLURM). Kubernetes answers "how do I schedule and manage containers across a cluster." The sandbox protocol answers "how does an environment author say 'I need an isolated place to run this agent's output' without caring whether that runs as a k8s pod, a Singularity container on SLURM, a Modal function, a local subprocess, or a remote service."
Description
Introduce a sandbox protocol that unifies the execution isolation layer across environments that need to execute agent-generated artifacts.
- Define a common interface: `start()`, `run(command) -> (stdout, exit_code)`, `upload()`, `download()`, `stop()`. Every environment that needs execution isolation programs against this interface, regardless of backend.
- Provide backend implementations for the patterns already in use:
  - `ProcessSandbox`: `multiprocessing`-based, for lightweight math/code execution (what `math_with_code` and `newton_bench` do today)
  - `ContainerSandbox`: Singularity/Apptainer/Docker, for full filesystem isolation (what `swerl_gen` and `swe_agents` do today)
  - `RemoteSandbox`: HTTP client to an external sandbox service (what `math_formal_lean` and `ns_tools` do today)
- Document the security spectrum: make explicit what each backend isolates and what it doesn't, so benchmark authors can make informed choices.
- Integrate with session lifecycle: sandbox creation in `seed_session`, teardown in session cleanup. Configuration in YAML (`sandbox_type`, `image`, `timeout`, `network_policy`) rather than ad-hoc per-server fields like `sandbox_host`/`sandbox_port`/`sandbox_timeout`.
- Enable future backends: the protocol should be open to third-party sandbox providers (Modal, Daytona, E2B) without changes to environments that program against the interface. Harbor's config already lists `"docker"`, `"singularity"`, `"daytona"`, `"modal"` as environment types; this would make that extensibility real and general.
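As a starting point for discussion, the interface and backend selection described above could look roughly like the sketch below. Only the five method names and the `sandbox_type` and `timeout` config fields come from this issue. The argument lists, the `Sandbox`, `LocalSubprocessSandbox`, `BACKENDS`, and `create_sandbox` names, the `"local_subprocess"` type value, and the exit code used on timeout are all assumptions. The toy backend provides no real isolation; it only demonstrates the lifecycle.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Protocol, Tuple


class Sandbox(Protocol):
    """Hypothetical shape of the common interface; signatures are assumed."""

    def start(self) -> None: ...
    def run(self, command: str) -> Tuple[str, int]: ...
    def upload(self, local_path: str, remote_path: str) -> None: ...
    def download(self, remote_path: str, local_path: str) -> None: ...
    def stop(self) -> None: ...


class LocalSubprocessSandbox:
    """Toy backend: a scratch directory plus a subprocess with a timeout.

    Same user, same host, so NO real isolation. It exists only to show
    the lifecycle a real backend would implement.
    """

    def __init__(self, timeout: float = 10.0) -> None:
        self.timeout = timeout
        self._tmp = None

    def _root(self) -> Path:
        if self._tmp is None:
            raise RuntimeError("sandbox not started")
        return Path(self._tmp.name)

    def start(self) -> None:
        self._tmp = tempfile.TemporaryDirectory()

    def run(self, command: str) -> Tuple[str, int]:
        try:
            proc = subprocess.run(
                command, shell=True, cwd=self._root(),
                capture_output=True, text=True, timeout=self.timeout,
            )
        except subprocess.TimeoutExpired:
            return "", 124  # assumed convention for "timed out"
        return proc.stdout, proc.returncode

    def upload(self, local_path: str, remote_path: str) -> None:
        shutil.copy(local_path, self._root() / remote_path)

    def download(self, remote_path: str, local_path: str) -> None:
        shutil.copy(self._root() / remote_path, local_path)

    def stop(self) -> None:
        if self._tmp is not None:
            self._tmp.cleanup()
            self._tmp = None


# Hypothetical registry keyed by the sandbox_type config field.
BACKENDS = {"local_subprocess": LocalSubprocessSandbox}


def create_sandbox(config: dict) -> Sandbox:
    """Pick a backend from config so servers never name one directly."""
    backend = BACKENDS[config["sandbox_type"]]
    return backend(timeout=config.get("timeout", 10.0))


# Usage: the calling code depends only on the interface.
sandbox = create_sandbox({"sandbox_type": "local_subprocess", "timeout": 5})
sandbox.start()
try:
    stdout, exit_code = sandbox.run("echo hello")
finally:
    sandbox.stop()
print(stdout.strip(), exit_code)
```

With this shape, swapping a process-based backend for a container-based one would be a config change plus a new `BACKENDS` entry, not a server rewrite. Whether `run` should also return stderr, or be async, is left open here.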
Design
What files should be touched? What logic should be written?
Out of scope
What are some items that this issue could be mistaken to cover that this issue should explicitly NOT cover?
Acceptance Criteria