Commit 865da70

feat: autoresearch support (#103)
1 parent 5d52da3 commit 865da70

42 files changed

Lines changed: 3624 additions & 4 deletions

.github/workflows/build-and-push.yml

Lines changed: 6 additions & 1 deletion
@@ -13,9 +13,14 @@ jobs:
       matrix:
         include:
           - image_name: gateway
+            context: src/server
             dockerfile: src/server/Dockerfile.gateway
           - image_name: server
+            context: src/server
             dockerfile: src/server/Dockerfile
+          - image_name: client
+            context: examples
+            dockerfile: examples/autoresearch/Dockerfile
     permissions:
       contents: read
       packages: write
@@ -36,7 +41,7 @@ jobs:
       - name: Build and push
         uses: docker/build-push-action@v5
         with:
-          context: src/server
+          context: ${{ matrix.context }}
           file: ${{ matrix.dockerfile }}
           build-args: |
             VERSION=${{ github.sha }}

.github/workflows/build-pr.yml

Lines changed: 6 additions & 1 deletion
@@ -10,9 +10,14 @@ jobs:
       matrix:
         include:
           - image_name: gateway
+            context: src/server
             dockerfile: src/server/Dockerfile.gateway
           - image_name: server
+            context: src/server
             dockerfile: src/server/Dockerfile
+          - image_name: client
+            context: examples
+            dockerfile: examples/autoresearch/Dockerfile
     permissions:
       contents: read
     steps:
@@ -25,7 +30,7 @@ jobs:
       - name: Build
         uses: docker/build-push-action@v5
         with:
-          context: src/server
+          context: ${{ matrix.context }}
           file: ${{ matrix.dockerfile }}
           build-args: |
             VERSION=${{ github.sha }}

examples/.dockerignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+*
+!README.md
+!pyproject.toml
+!uv.lock
+!autoresearch/
+!autoresearch/**

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ This directory contains examples, demos, and helper scripts for using the Open-R
 
 ### Reinforcement Learning (RL)
 * **[Text-to-SQL RL](rl/text-to-sql):** Runs the Gemma 4 SFT+RL recipe with SQL execution rewards and curve plotting.
+* **[Autoresearch Demo](autoresearch):** Runs code-RL researchers against the same Open-RL gateway using cookbook DeepCoder rewards, Sandbox Fusion, and optional Agent Sandbox CRDs.
 
 ### Tinker Cookbook
 * **[Tinker Cookbook Recipes](tinker-cookbook):** Examples showing how to run [Tinker Cookbook](https://github.com/thinking-machines-lab/tinker-cookbook) recipes with Open-RL.

examples/autoresearch/Dockerfile

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+FROM python:3.12-slim
+
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    DEBIAN_FRONTEND=noninteractive \
+    UV_COMPILE_BYTECODE=1 \
+    UV_LINK_MODE=copy
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    ca-certificates \
+    curl \
+    git \
+    nodejs \
+    npm \
+    && rm -rf /var/lib/apt/lists/*
+
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /uv/bin/uv
+ENV PATH="/uv/bin:$PATH"
+
+RUN npm install -g @google/gemini-cli
+
+WORKDIR /app
+
+COPY pyproject.toml uv.lock README.md ./
+COPY autoresearch ./autoresearch
+
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv sync --frozen
+
+WORKDIR /app/autoresearch
+
+RUN git init \
+    && git config user.email "agent@open-rl.local" \
+    && git config user.name "Autoresearch Agent" \
+    && printf ".venv/\\n__pycache__/\\n*.pyc\\nopen_rl_client.egg-info/\\n" > .gitignore \
+    && git add . \
+    && git commit -m "Initial autoresearch workspace"
+
+CMD ["./run_research_agent.sh"]

examples/autoresearch/README.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
+# Open-RL Autoresearch Demo
+
+This adapts [Karpathy's autoresearch](https://github.com/karpathy/autoresearch)
+to Open-RL: an agent repeatedly edits one allowed target, runs a bounded
+measured attempt, keeps commits that improve the configured metric, and resets
+the rest. The same recipe contract works locally or in Kubernetes; in a cluster,
+each run can live in its own pod and act as a researcher while sharing the same
+storage and Open-RL backend.
+
+## Minimal Recipe Shape
+
+An autoresearch task needs three recipe-owned things:
+
+```text
+<recipe>/
+  program.md          # instructions for the agent
+  autoresearch.toml   # command, editable files, and graphed metric
+  thing_to_edit.py    # often also the command target declared in TOML
+```
+
+`program.md` tells the agent what to edit, what metric matters, what files are
+off-limits, and how to decide keep/reset.
+
+`autoresearch.toml` is the harness contract. It says how to run one attempt,
+which files the agent may edit, and which metric decides whether the attempt
+improved:
+
+```toml
+task = "my_task"
+command = "uv run recipe.py run_dir={run_dir} attempt_timeout_minutes={attempt_timeout_minutes}"
+editable = ["thing_to_edit.py"]
+metric = "accuracy"
+metric_label = "accuracy"
+metric_mode = "max"
+```
+
+The command can be any runnable benchmark or training loop. It just needs to:
+
+- accept the args used in `command`, usually at least `run_dir`
+- write attempt artifacts under `run_dir`
+- exit nonzero on failure
+- log the configured metric to `run_dir/metrics.jsonl`
+
+```python
+ml_logger.log_metrics({"accuracy": 0.73}, step=1)
+```
+
+Use the cookbook `ml_logger` for this; the shared harness treats
+`metrics.jsonl` as the only metric source. The command target does not need a
+special filename; it can be the editable recipe file itself, a fixed runner, or
+`bash run.sh`.
+
+## Included Recipes
+
+Both recipes use the same `program.md` + `autoresearch.toml` contract:
+
+| Recipe | Command Target | Editable | Metric | Guide |
+| --- | --- | --- | --- | --- |
+| Text-SQL | `recipes.text_sql.train` | `train.py` | `accuracy` | [Text-SQL](recipes/text_sql/README.md) |
+| Math-RL | `recipes.math_rl.train` | `config.toml` | `accuracy` | [Math-RL](recipes/math_rl/README.md) |
+
+Use the recipe guides for local one-attempt runs, local UI serving, and
+recipe-specific settings.
+
+## Architecture
+
+Autoresearch runs as a small Kubernetes add-on around the shared Open-RL
+infrastructure. A recipe overlay starts the UI plus one researcher Sandbox per
+researcher. Each Sandbox runs Gemini CLI, edits the recipe, launches attempts,
+and calls the shared Open-RL/Tinker services.
+
+![Autoresearch architecture](arch.png)
+
+## Cluster Run
+
+Create the API secret for agent-backed researcher pods:
+
+```bash
+kubectl create secret generic researcher-agent-secrets \
+  --from-literal=GEMINI_API_KEY="${GEMINI_API_KEY}"
+```
+
+Choose one recipe overlay:
+
+```bash
+# Fast text-SQL, no model server.
+kubectl apply -k examples/autoresearch/recipes/text_sql
+
+# Math-RL add-on. First deploy Open-RL with docs/setup/gke-setup.md,
+# or reuse an existing backend at http://open-rl-gateway-service:8000.
+kubectl apply -k examples/autoresearch/recipes/math_rl
+
+# Convenience one-shot Math-RL stack: Open-RL backend + autoresearch add-on.
+kubectl apply -k examples/autoresearch/recipes/math_rl/gke
+```
+
+Each overlay starts one Sandbox that runs one Gemini CLI researcher. If that
+process exits nonzero or the pod crashes, the run stops; Kubernetes does not
+retry it. The intended recovery is to inspect the UI/logs and start a new run
+explicitly.
+
+Open the UI:
+
+```bash
+kubectl port-forward svc/open-rl-autoresearch-ui 8080:8080
+```
+
+```text
+http://localhost:8080/experiments.html
+```
+
+Use the normal [GKE setup guide](../../docs/setup/gke-setup.md) for cluster,
+GPU, storage, and the Open-RL backend. These overlays add researcher sandboxes and
+the UI on top of that shared backend.
+
+Researcher pods wait for comma-separated `READY_URLS` before the agent starts, so
+early pod startup does not race vLLM, the trainer worker, or the gateway. The
+convenience Math-RL stack sets those URLs for vLLM, trainer, and gateway health.
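
Reviewer sketch (not part of the commit): the `READY_URLS` gate described above amounts to polling each URL until all respond. The launcher script that implements it is not shown on this page, so everything below is an assumption: the helper names, the 2xx success rule, the poll interval, and the deadline are all hypothetical.

```python
import time
import urllib.request
from collections.abc import Callable, Iterable


def parse_ready_urls(raw: str) -> list[str]:
    """Split a comma-separated READY_URLS value, dropping empty entries."""
    return [url.strip() for url in raw.split(",") if url.strip()]


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Treat any 2xx response as healthy; connection and HTTP errors as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False


def wait_until_ready(
    urls: Iterable[str],
    probe: Callable[[str], bool] = http_ok,
    poll_seconds: float = 2.0,
    deadline_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every URL passes `probe`; return False if the deadline passes first."""
    pending = list(urls)
    deadline = time.monotonic() + deadline_seconds
    while pending:
        # Keep only the URLs that are still failing their probe.
        pending = [url for url in pending if not probe(url)]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_seconds)
    return True
```

The `probe` and `sleep` parameters are injected so the loop can be exercised without a network; a pod entrypoint would call `wait_until_ready(parse_ready_urls(os.environ.get("READY_URLS", "")))` before starting the agent.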
+
+## Shared Pieces
+
+```text
+run_research_agent.sh   # launches one timeout-bounded agent
+run_attempt.py          # runs one measured attempt and records UI events
+ui/observer.py          # read-only UI server over recorded events
+k8s/base/               # reusable Sandbox/UI resources
+```
+
+`run_attempt.py` runs recipe code and writes artifacts. The UI reads only
+`LOG_ROOT/*/ui_events.jsonl`; clearing `LOG_ROOT` resets attempts, live rows,
+and per-attempt agent-log cursors.
+
+The launcher passes the recipe-adjacent `program.md` to Gemini as the prompt.
+That program tells the agent to edit only the declared target, commit the
+attempt, run `eval "${RUN_ATTEMPT_COMMAND}"`, record the metric, and reset if
+the metric did not improve.
+
+## Adding A Recipe
+
+Copy one existing recipe directory and update:
+
+- `program.md`
+- `autoresearch.toml`
+- the command target, if you keep one
+- the editable target
+- `kustomization.yaml` settings: `RECIPE`, `LOG_ROOT`, and
+  `ATTEMPT_TIMEOUT_MINUTES`
+- optionally `AGENT_TIMEOUT_MINUTES`, if the researcher pod should stop before
+  Kubernetes cleanup does
+- optionally `READY_URLS`, if attempts need external services to be healthy
+  before the agent starts
+
+The shared wrapper handles logs, diffs, metrics, status, and UI events. Recipe
+code should focus on running the benchmark or training loop and emitting the
+metric.
+
+## Timeouts And Cleanup
+
+`ATTEMPT_TIMEOUT_MINUTES` caps one measured training/eval run. Every attempt gets
+the same value, so scores are comparable.
+
+`AGENT_TIMEOUT_MINUTES` caps the outer Gemini process. One agent can run several
+attempts inside this window: run the default config, edit, commit, run attempt,
+decide keep/reset, then repeat. Setup happens before this clock starts.
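
Reviewer sketch (not part of the commit): one way to enforce a per-attempt cap like `ATTEMPT_TIMEOUT_MINUTES` is `subprocess.run` with a `timeout`. How `run_attempt.py` actually enforces the limit is not visible on this page; the function name and the three status strings below are hypothetical.

```python
import subprocess


def run_capped(argv: list[str], attempt_timeout_minutes: float) -> str:
    """Run one attempt and classify it as "ok", "failed", or "timeout".

    subprocess.run kills the child process when the timeout expires and
    then raises TimeoutExpired, so a hung attempt cannot outlive its cap.
    """
    try:
        completed = subprocess.run(
            argv, timeout=attempt_timeout_minutes * 60, check=False
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if completed.returncode == 0 else "failed"
```

Because every attempt receives the same cap, a "timeout" status means the same thing across attempts. The outer `AGENT_TIMEOUT_MINUTES` limit could wrap the agent process in the same way.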
+
+Clean up a session:
+
+```bash
+OVERLAY=examples/autoresearch/recipes/text_sql \
+  examples/autoresearch/cleanup_research_session.sh
+```
+
+To also clear shared run data:
+
+```bash
+DELETE_ARTIFACTS=1 \
+LOG_ROOT=/mnt/shared/open-rl/autoresearch/text_sql \
+OVERLAY=examples/autoresearch/recipes/text_sql \
+  examples/autoresearch/cleanup_research_session.sh
+```

examples/autoresearch/arch.png

706 KB
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+OVERLAY="${OVERLAY:-examples/autoresearch/recipes/text_sql}"
+NAMESPACE="${NAMESPACE:-default}"
+DELETE_ARTIFACTS="${DELETE_ARTIFACTS:-0}"
+LOG_ROOT="${LOG_ROOT:-}"
+
+kubectl -n "${NAMESPACE}" delete -k "${OVERLAY}" --ignore-not-found=true
+
+if [ "${DELETE_ARTIFACTS}" = "1" ]; then
+  if [ -z "${LOG_ROOT}" ]; then
+    echo "DELETE_ARTIFACTS=1 requires LOG_ROOT" >&2
+    exit 2
+  fi
+  rm -rf "${LOG_ROOT}"
+fi

examples/autoresearch/event_log.py

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+"""Shared JSONL event writer for autoresearch UI state."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import time
+from pathlib import Path
+from typing import Any
+
+UI_EVENTS_FILE = "ui_events.jsonl"
+
+
+def append_ui_events(event_dir: Path, events: list[dict[str, Any]]) -> None:
+    path = event_dir / UI_EVENTS_FILE
+    path.parent.mkdir(parents=True, exist_ok=True)
+    now = time.time()
+    with path.open("a", encoding="utf-8") as f:
+        for event in events:
+            f.write(json.dumps({"time": now, **event}, sort_keys=True) + "\n")
+
+
+def activity_events(args: argparse.Namespace) -> list[dict[str, Any]]:
+    base = {
+        "attempt_timeout_minutes": args.attempt_timeout_minutes,
+        "kind": "activity",
+        "order": 0,
+        "researcher": args.researcher,
+        "status": args.status,
+        "work_id": f"{args.researcher}-activity",
+    }
+    return [
+        {**base, "tab": "agent", "path": args.agent_log},
+        {**base, "tab": "notes", "path": args.notes},
+    ]
+
+
+def main(argv: list[str] | None = None) -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--event-dir", type=Path, required=True)
+    parser.add_argument("--researcher", required=True)
+    parser.add_argument("--status", required=True)
+    parser.add_argument("--attempt-timeout-minutes", type=float, required=True)
+    parser.add_argument("--agent-log", required=True)
+    parser.add_argument("--notes", required=True)
+    args = parser.parse_args(argv)
+    append_ui_events(args.event_dir, activity_events(args))
+
+
+if __name__ == "__main__":
+    main()
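
Reviewer sketch (not part of the commit): since the writer above appends one JSON object per line with a `time` field, a reader over `LOG_ROOT/*/ui_events.jsonl` (the glob the README says the UI uses) could look like this. The real `ui/observer.py` is not shown on this page; `read_ui_events` and `latest_by_work_and_tab` are hypothetical names, and collapsing events by `(work_id, tab)` is an assumption about how the UI might build its rows.

```python
import json
from pathlib import Path
from typing import Any


def read_ui_events(log_root: Path) -> list[dict[str, Any]]:
    """Load every event under log_root/*/ui_events.jsonl, oldest first."""
    events: list[dict[str, Any]] = []
    for path in sorted(log_root.glob("*/ui_events.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    events.append(json.loads(line))
    # The writer stamps each event with time.time(); sort on that field.
    events.sort(key=lambda event: event.get("time", 0.0))
    return events


def latest_by_work_and_tab(
    events: list[dict[str, Any]],
) -> dict[tuple[Any, Any], dict[str, Any]]:
    """Keep only the newest event per (work_id, tab) pair."""
    latest: dict[tuple[Any, Any], dict[str, Any]] = {}
    for event in events:  # input is already sorted oldest to newest
        latest[(event.get("work_id"), event.get("tab"))] = event
    return latest
```

Because the log is append-only, a status change is recorded by appending a new event with the same `work_id` and `tab`, and the reader above surfaces only the latest one.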
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+resources:
+  - resources.yaml
+
+configurations:
+  - kustomizeconfig.yaml
