feat: unified execution sandbox infra

## Use cases, pain points, and background

Any environment where the agent produces artifacts that must be *executed* rather than *parsed* needs execution isolation. Today this is primarily code (Python, shell commands, git patches, Lean proofs), but the same problem applies to SQL execution against real databases, browser interaction, simulation control, and domains we haven't built environments for yet. Existing environments have independently implemented their own execution isolation with no shared interface, no shared code, and no documentation of security guarantees. The result:

**Duplication:** `math_with_code` and `newton_bench` both implement `multiprocessing.Process` + `exec()`/`eval()` workers with slightly different restrictions. `swerl_gen` and `swe_agents` both shell out to Singularity/Apptainer containers through different subprocess wrappers. Each reimplements lifecycle management, timeout handling, and cleanup.

**Inconsistent security properties:** There is no documentation of what each approach actually isolates. `code_gen`'s `reliability_guard` explicitly says "NOT a security sandbox." `math_with_code` restricts `__builtins__` but still allows `__import__`. `swerl_gen` runs inside a real Singularity container. A benchmark author has no way to know which isolation level they're getting or which they should choose.
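To make the `__import__` point concrete, here is a minimal sketch of why a restricted `__builtins__` that still exposes `__import__` is not an isolation boundary. The allowlist below is an assumption for illustration, not the actual one used by `math_with_code`.

```python
import os  # used only to compare against what the "agent" code observes

# Hypothetical allowlist: a few harmless builtins plus __import__.
restricted_globals = {
    "__builtins__": {
        "abs": abs,
        "len": len,
        "__import__": __import__,
    }
}

# open() is not in the allowlist, so a direct call fails with NameError.
try:
    exec("open('/etc/hostname')", dict(restricted_globals))
    open_blocked = False
except NameError:
    open_blocked = True

# An import statement only needs __import__, so the same restricted scope
# can still load the os module and reach the host process and filesystem.
agent_code = "import os\nresult = os.getcwd()"
scope = dict(restricted_globals)
exec(agent_code, scope)
escaped_cwd = scope["result"]

print(open_blocked, escaped_cwd == os.getcwd())
```

The restricted scope blocks the obvious call but not the import path, which is the kind of gap a documented security spectrum would need to spell out.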

**No composability:** You cannot swap a process-based sandbox for a container-based one without rewriting the server. There is no configuration knob — the isolation mechanism is hard-wired into each server's implementation.

**No guidance for new benchmarks:** A new environment author who needs execution isolation must read three or four existing servers, pick one to cargo-cult from, and hope it provides the right isolation level for their threat model.

**Existing execution isolation implementations:**

| Approach | Used by | Isolation |
|----------|---------|-----------|
| `multiprocessing.Process` + `exec()`/`eval()` with restricted builtins | `math_with_code` | Same-user process, restricted builtins (still allows `__import__`) |
| `multiprocessing.Process` + AST-validated `exec()`/`eval()` | `newton_bench` | Same-user process, static pattern blocking |
| `multiprocessing` + Ray + `reliability_guard` | `code_gen` | Same-user process, explicitly "NOT a security sandbox" |
| `singularity exec` via `subprocess.Popen` | `swerl_gen` | Container (Singularity `.sif`) |
| Apptainer commands via `asyncio.create_subprocess_shell` | `swe_agents` | Container (Apptainer) |
| HTTP client to external sandbox service | `math_formal_lean` (Lean4), `ns_tools` (NeMo-Skills) | Depends on remote service |
| Docker via Aviary library | `aviary` (BixBench) | Container (Docker, managed by Aviary) |

**Note: this is distinct from session isolation.** Gym already has well-designed session isolation via `session_id` in `SimpleResourcesServer` — parallel rollouts don't share state. That's application-level isolation and it works. Execution isolation is the OS-level layer *underneath* it: preventing agent-generated code from escaping its sandbox, accessing the host filesystem, or leaking state between rollouts. Not all environments need it — environments that do string matching, LLM judging, or API calls don't execute agent output at all.

**Note: this is also distinct from cluster orchestration (Kubernetes, SLURM).** Kubernetes answers "how do I schedule and manage containers across a cluster." The sandbox protocol answers "how does an environment author say 'I need an isolated place to run this agent's output' without caring whether that runs as a k8s pod, a Singularity container on SLURM, a Modal function, a local subprocess, or a remote service." 
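One way to picture that decoupling is a small registry keyed by backend name, so the environment only names what it wants. This is a sketch under assumptions: `SandboxConfig`, `StubBackend`, `register_backend`, and `make_sandbox` are hypothetical names, the field names mirror the YAML keys proposed in the Description section, and `env.sif` is a placeholder image name.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class SandboxConfig:
    """Hypothetical config object; fields mirror the proposed YAML keys."""
    sandbox_type: str
    image: Optional[str] = None
    timeout: float = 30.0
    network_policy: str = "none"


class StubBackend:
    """Stand-in backend that only records how it was configured."""

    def __init__(self, kind: str, config: SandboxConfig) -> None:
        self.kind = kind
        self.config = config


BackendFactory = Callable[[SandboxConfig], StubBackend]
_BACKENDS: Dict[str, BackendFactory] = {}


def register_backend(name: str, factory: BackendFactory) -> None:
    """Make a backend selectable by name; third parties could call this too."""
    _BACKENDS[name] = factory


def make_sandbox(config: SandboxConfig) -> StubBackend:
    """Look up the backend named in the config and build it."""
    try:
        factory = _BACKENDS[config.sandbox_type]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(
            f"unknown sandbox_type {config.sandbox_type!r}; known: {known}"
        ) from None
    return factory(config)


register_backend("process", lambda cfg: StubBackend("process", cfg))
register_backend("container", lambda cfg: StubBackend("container", cfg))

# The calling code is identical; only the config differs.
local = make_sandbox(SandboxConfig(sandbox_type="process"))
cluster = make_sandbox(SandboxConfig(sandbox_type="container", image="env.sif"))
```

Where the container actually gets scheduled (k8s, SLURM, a hosted service) would stay inside the registered factory and never surface in environment code.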

## Description
Introduce a sandbox protocol that unifies the execution isolation layer across environments that need to execute agent-generated artifacts.

1. **Define a common interface** — `start()`, `run(command) -> (stdout, exit_code)`, `upload()`, `download()`, `stop()`. Every environment that needs execution isolation programs against this interface, regardless of backend.

2. **Provide backend implementations** for the patterns already in use:
   - `ProcessSandbox` — `multiprocessing`-based, for lightweight math/code execution (what `math_with_code` and `newton_bench` do today)
   - `ContainerSandbox` — Singularity/Apptainer/Docker, for full filesystem isolation (what `swerl_gen` and `swe_agents` do today)
   - `RemoteSandbox` — HTTP client to an external sandbox service (what `math_formal_lean` and `ns_tools` do today)

3. **Document the security spectrum** — make explicit what each backend isolates and what it doesn't, so benchmark authors can make informed choices.

4. **Integrate with session lifecycle** — sandbox creation in `seed_session`, teardown in session cleanup. Configuration in YAML (`sandbox_type`, `image`, `timeout`, `network_policy`) rather than ad-hoc per-server fields like `sandbox_host`/`sandbox_port`/`sandbox_timeout`.

5. **Enable future backends** — the protocol should be open to third-party sandbox providers (Modal, Daytona, E2B) without changes to environments that program against the interface. Harbor's config already lists `"docker"`, `"singularity"`, `"daytona"`, `"modal"` as environment types — this would make that extensibility real and general.
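
As a starting point for discussion, the interface in item 1 could be sketched as a `typing.Protocol`. The method names come from the list above; the parameter names for `upload()` and `download()`, the `LocalSubprocessSandbox` class, and the use of exit code 124 on timeout are assumptions. The toy backend runs commands in a scratch directory and provides no real isolation; it exists only to show the lifecycle.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional, Protocol, Tuple


class Sandbox(Protocol):
    """Interface every backend would implement (signatures are a sketch)."""

    def start(self) -> None: ...
    def run(self, command: str) -> Tuple[str, int]: ...
    def upload(self, local_path: str, remote_path: str) -> None: ...
    def download(self, remote_path: str, local_path: str) -> None: ...
    def stop(self) -> None: ...


class LocalSubprocessSandbox:
    """Toy backend: a scratch directory plus subprocess. NOT a security sandbox."""

    def __init__(self, timeout: float = 10.0) -> None:
        self.timeout = timeout
        self._workdir: Optional[Path] = None

    def _root(self) -> Path:
        if self._workdir is None:
            raise RuntimeError("sandbox not started")
        return self._workdir

    def start(self) -> None:
        self._workdir = Path(tempfile.mkdtemp(prefix="sandbox-"))

    def run(self, command: str) -> Tuple[str, int]:
        try:
            proc = subprocess.run(
                command, shell=True, cwd=self._root(),
                capture_output=True, text=True, timeout=self.timeout,
            )
        except subprocess.TimeoutExpired:
            return "", 124  # assumed convention, borrowed from GNU timeout
        return proc.stdout, proc.returncode

    def upload(self, local_path: str, remote_path: str) -> None:
        dest = self._root() / remote_path
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(local_path, dest)

    def download(self, remote_path: str, local_path: str) -> None:
        shutil.copy(self._root() / remote_path, local_path)

    def stop(self) -> None:
        if self._workdir is not None:
            shutil.rmtree(self._workdir, ignore_errors=True)
            self._workdir = None


def score_rollout(sandbox: Sandbox) -> bool:
    """Environment-side code that depends only on the protocol."""
    sandbox.start()
    try:
        stdout, exit_code = sandbox.run("echo hello")
        return exit_code == 0 and stdout.strip() == "hello"
    finally:
        sandbox.stop()


passed = score_rollout(LocalSubprocessSandbox())
```

Because `score_rollout` is typed against `Sandbox` rather than a concrete class, a container-backed or remote implementation could be passed in without touching it, which is the composability property this issue asks for.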

## Design
What files should be touched? What logic should be written?

## Out of scope
What are some items that this issue could be mistaken to cover that this issue should explicitly NOT cover?

## Acceptance Criteria
- [ ] Individual items that need to be finished in order for this issue to be considered completed

