1ShotBench

1ShotBench runs the same task prompt through multiple Pi agent workspaces so you can compare how different models behave under the same harness.

The old Codex proxy path has been removed. Implementation agents now run through the pi CLI directly, using a small bench.toml file inside each model workspace.

Quick Start

Install the Python dependencies:

pip install -r requirements.txt

Create or refresh model workspaces for a task:

python scripts/create_agent_workspaces.py experiments/frontend

Run a benchmark:

python -m bench.cli \
  --task-dir experiments/frontend \
  --prompt-file experiments/frontend/PRD.md \
  --models gpt claude gemini
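
If you launch runs from a script instead of a shell, the same invocation can be assembled as an argument list. This is a minimal sketch that assumes only the three flags shown in the command above. `build_bench_command` is a hypothetical helper for illustration and is not part of the repository.

```python
import sys
from pathlib import Path
from typing import List, Sequence


def build_bench_command(
    task_dir: str,
    models: Sequence[str],
    prompt_name: str = "PRD.md",
) -> List[str]:
    """Assemble the `python -m bench.cli` argument list shown in Quick Start.

    Hypothetical helper: only --task-dir, --prompt-file, and --models are
    used, because those are the only flags this README documents.
    """
    if not models:
        raise ValueError("at least one model name is required")
    task = Path(task_dir)
    return [
        sys.executable, "-m", "bench.cli",
        "--task-dir", task.as_posix(),
        "--prompt-file", (task / prompt_name).as_posix(),
        "--models", *models,
    ]
```

Passing the result to `subprocess.run(cmd, check=True)` from the repository root should be equivalent to the shell command above.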

Start the local dashboard:

uvicorn bench.web:app --port 4010

Open:

http://127.0.0.1:4010

Development

Run the Python test suite:

pytest

The build-and-test workflow runs a Python compile check and pytest. The current tests cover the benchmark harness, workspace creation, sandbox wrapping, deployment helpers, web dashboard endpoints, token backfill, and result handling for the LLM, Codex, and Pi judges. They do not exercise full end-to-end model runs or live external services.
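
To approximate the compile check locally before pushing, the standard library's `compileall` module can byte-compile a source tree. This is a sketch under an assumption: the exact command the workflow runs is not shown in this README, and `compile_check` is a hypothetical helper.

```python
import compileall


def compile_check(source_dir: str) -> bool:
    """Return True if every .py file under source_dir byte-compiles.

    Assumption: this mirrors the workflow's "Python compile check";
    the real workflow step may use a different command.
    Side effect: writes __pycache__ directories under source_dir.
    """
    # quiet=1 prints only errors; force=True recompiles even if cached.
    return bool(compileall.compile_dir(source_dir, quiet=1, force=True))
```

For example, `compile_check("bench")` would check the `bench` package, assuming that is where the Python sources live.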

Current Layout

Task workspaces live under experiments/:

experiments/
  frontend/
    PRD.md
    features.yaml
    gpt-workspace/
      bench.toml
    claude-workspace/
      bench.toml
    gemini-workspace/
      bench.toml

  evaluator/
    PRD.md
    features.yaml

  nfcorpus-repro/
    PRD.md

Future benchmark tasks can live as sibling directories with the same *-workspace/bench.toml structure.
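
The `*-workspace/bench.toml` convention makes workspaces easy to discover programmatically. This is a minimal sketch, assuming the model name is the directory name with the `-workspace` suffix removed (as in `gpt-workspace`). `find_workspaces` is a hypothetical helper and does not read or validate the contents of `bench.toml`.

```python
from pathlib import Path
from typing import Dict

WORKSPACE_SUFFIX = "-workspace"


def find_workspaces(task_dir: str) -> Dict[str, Path]:
    """Map model name -> bench.toml path for one task directory.

    Hypothetical helper: a directory counts as a workspace only if it
    is named <model>-workspace and contains a bench.toml file.
    """
    found: Dict[str, Path] = {}
    pattern = f"*{WORKSPACE_SUFFIX}/bench.toml"
    for config in sorted(Path(task_dir).glob(pattern)):
        model = config.parent.name[: -len(WORKSPACE_SUFFIX)]
        found[model] = config
    return found
```

With the layout above, `find_workspaces("experiments/frontend")` would be expected to return entries for `claude`, `gemini`, and `gpt`, while a task directory that has no workspaces yet would yield an empty mapping.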

Documentation

Getting Started: requirements, Pi auth, API keys, and task skills.
Workspaces: workspace layout, bench.toml, task files, isolation, and workspace creation.
Running Benchmarks: CLI usage, dashboard, run flow, artifacts, and metrics.
Judging Runs: LLM judge, Codex judge, Pi judge, feature generation, and evaluation artifacts.
Deployments: generic Render/GHCR deployment flow for benchmark demos.
NFCorpus Render Runbook: task-specific deployment notes for the NFCorpus reproduction demos.
