|
| 1 | +# Open-RL Autoresearch Demo |
| 2 | + |
| 3 | +This adapts [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) |
| 4 | +to Open-RL: an agent repeatedly edits one allowed target, runs a bounded |
| 5 | +measured attempt, keeps commits that improve the configured metric, and resets |
| 6 | +the rest. The same recipe contract works locally or in Kubernetes; in a cluster, |
| 7 | +each run can live in its own pod and act as a researcher while sharing the same |
| 8 | +storage and Open-RL backend. |
| 9 | + |
| 10 | +## Minimal Recipe Shape |
| 11 | + |
| 12 | +An autoresearch task needs three recipe-owned things: |
| 13 | + |
| 14 | +```text |
| 15 | +<recipe>/ |
| 16 | + program.md # instructions for the agent |
| 17 | + autoresearch.toml # command, editable files, and graphed metric |
| 18 | + thing_to_edit.py # often also the command target declared in TOML |
| 19 | +``` |
| 20 | + |
| 21 | +`program.md` tells the agent what to edit, what metric matters, what files are |
| 22 | +off-limits, and how to decide keep/reset. |
| 23 | + |
| 24 | +`autoresearch.toml` is the harness contract. It says how to run one attempt, |
| 25 | +which files the agent may edit, and which metric decides whether the attempt |
| 26 | +improved: |
| 27 | + |
| 28 | +```toml |
| 29 | +task = "my_task" |
| 30 | +command = "uv run recipe.py run_dir={run_dir} attempt_timeout_minutes={attempt_timeout_minutes}" |
| 31 | +editable = ["thing_to_edit.py"] |
| 32 | +metric = "accuracy" |
| 33 | +metric_label = "accuracy" |
| 34 | +metric_mode = "max" |
| 35 | +``` |
| 36 | + |
| 37 | +The command can be any runnable benchmark or training loop. It just needs to: |
| 38 | + |
| 39 | +- accept the args used in `command`, usually at least `run_dir` |
| 40 | +- write attempt artifacts under `run_dir` |
| 41 | +- exit nonzero on failure |
| 42 | +- log the configured metric to `run_dir/metrics.jsonl` |
| 43 | + |
| 44 | +```python |
| 45 | +ml_logger.log_metrics({"accuracy": 0.73}, step=1) |
| 46 | +``` |
| 47 | + |
| 48 | +Use the cookbook `ml_logger` for this; the shared harness treats |
| 49 | +`metrics.jsonl` as the only metric source. The command target does not need a |
| 50 | +special filename; it can be the editable recipe file itself, a fixed runner, or |
| 51 | +`bash run.sh`. |
| 52 | + |
| 53 | +## Included Recipes |
| 54 | + |
| 55 | +Both recipes use the same `program.md` + `autoresearch.toml` contract: |
| 56 | + |
| 57 | +| Recipe | Command Target | Editable | Metric | Guide | |
| 58 | +| --- | --- | --- | --- | --- | |
| 59 | +| Text-SQL | `recipes.text_sql.train` | `train.py` | `accuracy` | [Text-SQL](recipes/text_sql/README.md) | |
| 60 | +| Math-RL | `recipes.math_rl.train` | `config.toml` | `accuracy` | [Math-RL](recipes/math_rl/README.md) | |
| 61 | + |
| 62 | +Use the recipe guides for local one-attempt runs, local UI serving, and |
| 63 | +recipe-specific settings. |
| 64 | + |
| 65 | +## Architecture |
| 66 | + |
| 67 | +Autoresearch runs as a small Kubernetes add-on around the shared Open-RL |
| 68 | +infrastructure. A recipe overlay starts the UI and one Sandbox per researcher. |
| 69 | +Each Sandbox runs Gemini CLI, edits the recipe, launches attempts, and calls the |
| 70 | +shared Open-RL/Tinker services. |
| 71 | + |
| 72 | + |
| 73 | + |
| 74 | +## Cluster Run |
| 75 | + |
| 76 | +Create the API secret for agent-backed researcher pods: |
| 77 | + |
| 78 | +```bash |
| 79 | +kubectl create secret generic researcher-agent-secrets \ |
| 80 | + --from-literal=GEMINI_API_KEY="${GEMINI_API_KEY}" |
| 81 | +``` |
| 82 | + |
| 83 | +Choose one recipe overlay: |
| 84 | + |
| 85 | +```bash |
| 86 | +# Fast Text-SQL recipe; needs no model server. |
| 87 | +kubectl apply -k examples/autoresearch/recipes/text_sql |
| 88 | + |
| 89 | +# Math-RL add-on. First deploy Open-RL with docs/setup/gke-setup.md, |
| 90 | +# or reuse an existing backend at http://open-rl-gateway-service:8000. |
| 91 | +kubectl apply -k examples/autoresearch/recipes/math_rl |
| 92 | + |
| 93 | +# Convenience one-shot Math-RL stack: Open-RL backend + autoresearch add-on. |
| 94 | +kubectl apply -k examples/autoresearch/recipes/math_rl/gke |
| 95 | +``` |
| 96 | + |
| 97 | +Each overlay starts one Sandbox that runs one Gemini CLI researcher. If that |
| 98 | +process exits nonzero or the pod crashes, the run stops; Kubernetes does not |
| 99 | +retry it. The intended recovery is to inspect the UI/logs and start a new run |
| 100 | +explicitly. |
| 101 | + |
| 102 | +Open the UI: |
| 103 | + |
| 104 | +```bash |
| 105 | +kubectl port-forward svc/open-rl-autoresearch-ui 8080:8080 |
| 106 | +``` |
| 107 | + |
| 108 | +```text |
| 109 | +http://localhost:8080/experiments.html |
| 110 | +``` |
| 111 | + |
| 112 | +Use the normal [GKE setup guide](../../docs/setup/gke-setup.md) to set up the |
| 113 | +cluster, GPUs, storage, and the Open-RL backend. These overlays add researcher |
| 114 | +Sandboxes and the UI on top of that shared backend. |
| 115 | + |
| 116 | +Researcher pods wait until every URL in the comma-separated `READY_URLS` list is |
| 117 | +healthy before the agent starts, so early pod startup does not race vLLM, the |
| 118 | +trainer worker, or the gateway. The convenience Math-RL stack points `READY_URLS` at the vLLM, trainer, and gateway health endpoints. |
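
The readiness gate can be pictured roughly as follows. This is a sketch under assumptions rather than the launcher's real logic: the function names, the 2xx-means-healthy rule, the polling interval, and the deadline are all illustrative, and the `/health` paths in the usage line are made up.

```python
import http.client
import time
import urllib.request
from typing import Callable, Iterable


def parse_ready_urls(raw: str) -> list[str]:
    """Split a comma-separated READY_URLS value, dropping empty entries."""
    return [url.strip() for url in raw.split(",") if url.strip()]


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Treat any 2xx response as healthy (an assumption for this sketch)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses.
        return False


def wait_until_ready(
    urls: Iterable[str],
    probe: Callable[[str], bool] = http_ok,
    interval: float = 2.0,
    deadline: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every URL passes `probe`; return False once the deadline passes."""
    pending = list(urls)
    start = time.monotonic()
    while pending:
        # Keep only the URLs that are still unhealthy.
        pending = [url for url in pending if not probe(url)]
        if not pending:
            break
        if time.monotonic() - start >= deadline:
            return False
        sleep(interval)
    return True


# Hypothetical value; only the gateway host and port appear in this README.
urls = parse_ready_urls("http://open-rl-gateway-service:8000/health, http://vllm:8000/health")
print(urls)
```

A launcher would call `wait_until_ready(urls)` and start the agent only when it returns `True`. Passing `probe` and `sleep` as parameters keeps the loop testable without real services.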
| 119 | + |
| 120 | +## Shared Pieces |
| 121 | + |
| 122 | +```text |
| 123 | +run_research_agent.sh # launches one timeout-bounded agent |
| 124 | +run_attempt.py # runs one measured attempt and records UI events |
| 125 | +ui/observer.py # read-only UI server over recorded events |
| 126 | +k8s/base/ # reusable Sandbox/UI resources |
| 127 | +``` |
| 128 | + |
| 129 | +`run_attempt.py` runs recipe code and writes artifacts. The UI reads only |
| 130 | +`LOG_ROOT/*/ui_events.jsonl`; clearing `LOG_ROOT` resets attempts, live rows, |
| 131 | +and per-attempt agent-log cursors. |
| 132 | + |
| 133 | +The launcher passes the recipe-adjacent `program.md` to Gemini as the prompt. |
| 134 | +That program tells the agent to edit only the declared target, commit the |
| 135 | +attempt, run `eval "${RUN_ATTEMPT_COMMAND}"`, record the metric, and reset if |
| 136 | +the metric did not improve. |
| 137 | + |
| 138 | +## Adding A Recipe |
| 139 | + |
| 140 | +Copy one existing recipe directory and update: |
| 141 | + |
| 142 | +- `program.md` |
| 143 | +- `autoresearch.toml` |
| 144 | +- the command target, if you keep one |
| 145 | +- the editable target |
| 146 | +- `kustomization.yaml` settings: `RECIPE`, `LOG_ROOT`, and |
| 147 | + `ATTEMPT_TIMEOUT_MINUTES` |
| 148 | +- optionally `AGENT_TIMEOUT_MINUTES`, if the researcher pod should stop on its |
| 149 | + own before Kubernetes cleanup removes it |
| 150 | +- optionally `READY_URLS`, if attempts need external services to be healthy |
| 151 | + before the agent starts |
| 152 | + |
| 153 | +The shared wrapper handles logs, diffs, metrics, status, and UI events. Recipe |
| 154 | +code should focus on running the benchmark or training loop and emitting the |
| 155 | +metric. |
| 156 | + |
| 157 | +## Timeouts And Cleanup |
| 158 | + |
| 159 | +`ATTEMPT_TIMEOUT_MINUTES` caps one measured training/eval run. Every attempt gets |
| 160 | +the same time budget, so scores are comparable across attempts. |
| 161 | + |
| 162 | +`AGENT_TIMEOUT_MINUTES` caps the outer Gemini process. One agent can run several |
| 163 | +attempts inside this window: run the default config, edit, commit, run attempt, |
| 164 | +decide keep/reset, then repeat. Setup happens before this clock starts. |
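
As a minimal sketch of the per-attempt cap, the snippet below runs one command under a minute-based budget and classifies the outcome. It assumes a plain `subprocess` timeout; the actual wrapper may enforce the limit differently, and `run_bounded` is a hypothetical name.

```python
import subprocess
import sys


def run_bounded(argv: list[str], timeout_minutes: float) -> str:
    """Run one attempt command and return "ok", "failed", or "timeout"."""
    try:
        # subprocess.run kills the child when the timeout expires.
        result = subprocess.run(argv, timeout=timeout_minutes * 60)
    except subprocess.TimeoutExpired:
        return "timeout"
    # A nonzero exit status signals a failed attempt.
    return "ok" if result.returncode == 0 else "failed"


# Stand-in commands; a real attempt would run the recipe's `command`.
status_ok = run_bounded([sys.executable, "-c", "pass"], 1)
status_failed = run_bounded([sys.executable, "-c", "import sys; sys.exit(3)"], 1)
print(status_ok, status_failed)  # ok failed
```

Giving every attempt the same `timeout_minutes` is what keeps the resulting scores comparable.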
| 165 | + |
| 166 | +Clean up a session: |
| 167 | + |
| 168 | +```bash |
| 169 | +OVERLAY=examples/autoresearch/recipes/text_sql \ |
| 170 | + examples/autoresearch/cleanup_research_session.sh |
| 171 | +``` |
| 172 | + |
| 173 | +To also clear shared run data: |
| 174 | + |
| 175 | +```bash |
| 176 | +DELETE_ARTIFACTS=1 \ |
| 177 | +LOG_ROOT=/mnt/shared/open-rl/autoresearch/text_sql \ |
| 178 | +OVERLAY=examples/autoresearch/recipes/text_sql \ |
| 179 | + examples/autoresearch/cleanup_research_session.sh |
| 180 | +``` |