castorini/1shot-bench

1ShotBench

1ShotBench runs the same task prompt through multiple Pi agent workspaces so you can compare how different models behave under the same harness.

The old Codex proxy path has been removed. Implementation agents now run through the pi CLI directly, using a small bench.toml file inside each model workspace.

Quick Start

Install the Python dependencies:

pip install -r requirements.txt

Create or refresh model workspaces for a task:

python scripts/create_agent_workspaces.py experiments/frontend

Run a benchmark:

python -m bench.cli \
  --task-dir experiments/frontend \
  --prompt-file experiments/frontend/PRD.md \
  --models gpt claude gemini
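
When scripting repeated runs, the invocation above can be assembled programmatically. The following is a minimal sketch, assuming it is launched from the repository root. `build_bench_command` is a hypothetical helper, not part of the `bench` package, and it uses only the three flags shown above.

```python
import shlex
import sys


def build_bench_command(task_dir, prompt_file, models):
    """Return the argv list for the bench.cli invocation documented above.

    Hypothetical helper for illustration; only --task-dir, --prompt-file,
    and --models are assumed to exist, because those are the flags the
    README shows.
    """
    if not models:
        raise ValueError("at least one model name is required")
    return [
        sys.executable, "-m", "bench.cli",
        "--task-dir", task_dir,
        "--prompt-file", prompt_file,
        "--models", *models,
    ]


cmd = build_bench_command(
    "experiments/frontend",
    "experiments/frontend/PRD.md",
    ["gpt", "claude", "gemini"],
)
# Print the arguments (without the interpreter path) for inspection.
print(shlex.join(cmd[1:]))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would start the run. Building the arguments as a list avoids shell quoting problems when paths contain spaces.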

Start the local dashboard:

uvicorn bench.web:app --port 4010

Then open the dashboard in a browser:

http://127.0.0.1:4010

Development

Run the Python test suite:

pytest

The build-and-test workflow runs a Python compile check and pytest. The current tests cover the benchmark harness, workspace creation, sandbox wrapping, deployment helpers, web dashboard endpoints, token backfill, and the LLM/Codex/Pi judge result handling. They do not exercise full end-to-end model runs or live external services.
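
The exact compile command the workflow uses is not shown here. As a rough local equivalent, a compile check can be sketched with the standard-library `compileall` module. `compiles_cleanly` is a hypothetical name, and the directory to check is an assumption.

```python
import compileall


def compiles_cleanly(root):
    """Byte-compile every .py file under root; True only if all succeed.

    Illustrative stand-in for the workflow's compile check, not the
    workflow's actual command.
    """
    # quiet=1 suppresses the per-file listing but still prints errors.
    return bool(compileall.compile_dir(root, quiet=1))
```

Calling `compiles_cleanly("bench")` before `pytest` would surface syntax errors without running any tests.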

Current Layout

Task workspaces live under experiments/:

experiments/
  frontend/
    PRD.md
    features.yaml
    gpt-workspace/
      bench.toml
    claude-workspace/
      bench.toml
    gemini-workspace/
      bench.toml

  evaluator/
    PRD.md
    features.yaml

  nfcorpus-repro/
    PRD.md

Future benchmark tasks can live as sibling directories with the same *-workspace/bench.toml structure.
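
Because every task follows the same `*-workspace/bench.toml` pattern, the workspaces for a task can be discovered with a glob. This is a minimal sketch: `find_workspaces` is a hypothetical helper, and treating the directory prefix (for example `gpt` in `gpt-workspace`) as the model name is an assumption based on the layout above.

```python
from pathlib import Path

SUFFIX = "-workspace"


def find_workspaces(task_dir):
    """Map model name -> bench.toml path for each *-workspace directory.

    Hypothetical helper; assumes the model name is the directory name
    with the "-workspace" suffix removed.
    """
    found = {}
    for config in sorted(Path(task_dir).glob(f"*{SUFFIX}/bench.toml")):
        model = config.parent.name[: -len(SUFFIX)]
        found[model] = config
    return found
```

For the layout shown above, `find_workspaces("experiments/frontend")` would return entries for `claude`, `gemini`, and `gpt`. A task directory without workspaces, such as `experiments/evaluator` as listed, would yield an empty mapping.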

Documentation

  • Getting Started: requirements, Pi auth, API keys, and task skills.
  • Workspaces: workspace layout, bench.toml, task files, isolation, and workspace creation.
  • Running Benchmarks: CLI usage, dashboard, run flow, artifacts, and metrics.
  • Judging Runs: LLM judge, Codex judge, Pi judge, feature generation, and evaluation artifacts.
  • Deployments: generic Render/GHCR deployment flow for benchmark demos.
  • NFCorpus Render Runbook: task-specific deployment notes for the NFCorpus reproduction demos.
