1ShotBench runs the same task prompt through multiple Pi agent workspaces so you can compare how different models behave under the same harness.
The old Codex proxy path has been removed. Implementation agents now run through the pi CLI directly, using a small bench.toml file inside each model workspace.
Install the Python dependencies:

    pip install -r requirements.txt

Create or refresh model workspaces for a task:

    python scripts/create_agent_workspaces.py experiments/frontend

Run a benchmark:

    python -m bench.cli \
      --task-dir experiments/frontend \
      --prompt-file experiments/frontend/PRD.md \
      --models gpt claude gemini

Start the local dashboard:

    uvicorn bench.web:app --port 4010

Open:
http://127.0.0.1:4010
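
Before opening the browser, it can help to confirm that the dashboard is actually listening. The following is a minimal sketch, not part of 1ShotBench: `dashboard_is_up` is a hypothetical helper name, and it only assumes the address and port shown above.

```python
import urllib.error
import urllib.request


def dashboard_is_up(url="http://127.0.0.1:4010", timeout=2.0):
    """Return True if something answers HTTP at `url` without a server error.

    Hypothetical helper for illustration; the default URL is the one the
    uvicorn command above serves on.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as exc:
        # The server responded, but with an error status (4xx or 5xx).
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or nothing listening on that port.
        return False
```

Calling `dashboard_is_up()` returns `False` while uvicorn is still starting or has exited, so it can be polled from a script before running anything that depends on the dashboard.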
Run the Python test suite:

    pytest

The build-and-test workflow runs a Python compile check and pytest. The current tests cover the benchmark harness, workspace creation, sandbox wrapping, deployment helpers, web dashboard endpoints, token backfill, and the LLM/Codex/Pi judge result handling. They do not exercise full end-to-end model runs or live external services.
Task workspaces live under experiments/:

    experiments/
      frontend/
        PRD.md
        features.yaml
        gpt-workspace/
          bench.toml
        claude-workspace/
          bench.toml
        gemini-workspace/
          bench.toml
      evaluator/
        PRD.md
        features.yaml
      nfcorpus-repro/
        PRD.md

Future benchmark tasks can live as sibling directories with the same *-workspace/bench.toml structure.
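
The `*-workspace/bench.toml` convention makes it straightforward to discover which models a task directory is set up for. The sketch below is illustrative only: `find_model_workspaces` and `build_bench_command` are hypothetical helpers, not part of the repository, and deriving the model name by stripping the `-workspace` suffix is an assumption based on the directory names and `--models` values shown above. It also assumes the prompt file is the task's `PRD.md`, as in the example command.

```python
from pathlib import Path

WORKSPACE_SUFFIX = "-workspace"


def find_model_workspaces(task_dir):
    """Map model name -> workspace path for each *-workspace/ dir with a bench.toml.

    Assumption: the model name is the directory name minus "-workspace".
    """
    found = {}
    for config in sorted(Path(task_dir).glob(f"*{WORKSPACE_SUFFIX}/bench.toml")):
        workspace = config.parent
        model = workspace.name[: -len(WORKSPACE_SUFFIX)]
        found[model] = workspace
    return found


def build_bench_command(task_dir, models):
    """Assemble the bench.cli invocation shown in this README as an argv list."""
    task_dir = Path(task_dir)
    return [
        "python", "-m", "bench.cli",
        "--task-dir", task_dir.as_posix(),
        "--prompt-file", (task_dir / "PRD.md").as_posix(),
        "--models", *models,
    ]
```

For example, `build_bench_command("experiments/frontend", find_model_workspaces("experiments/frontend"))` yields an argument list that can be passed to `subprocess.run`. Workspace directories without a `bench.toml` are skipped.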
- Getting Started: requirements, Pi auth, API keys, and task skills.
- Workspaces: workspace layout, bench.toml, task files, isolation, and workspace creation.
- Running Benchmarks: CLI usage, dashboard, run flow, artifacts, and metrics.
- Judging Runs: LLM judge, Codex judge, Pi judge, feature generation, and evaluation artifacts.
- Deployments: generic Render/GHCR deployment flow for benchmark demos.
- NFCorpus Render Runbook: task-specific deployment notes for the NFCorpus reproduction demos.