Run one prompt against multiple model workspaces:
python -m bench.cli --prompt "Your task prompt" --models gpt claude gemini

Or read the prompt from a file:
python -m bench.cli \
--task-dir experiments/frontend \
--prompt-file experiments/frontend/PRD.md \
--models gpt claude gemini glm kimi minimax

Useful options:
- --task-dir experiments/frontend
- --prompt-file path/to/prompt.txt
- --mode sequential|parallel
- --max-concurrency 2
- --timeout-seconds 1800, or --timeout-seconds 0 for no per-model timeout
- --retries 1
- --warmup
- --label e2e-bench-1
- --no-task-rewrite to skip the default per-model task file rewrite
- --rewrite-timeout-seconds 600
- --deploy render to deploy demos as Render image-backed web services after the benchmark run
Run a non-default task directory with:
python -m bench.cli --task-dir experiments/frontend --prompt "..." --models gpt claude

The CLI flow is:
- It reads the prompt from --prompt or --prompt-file.
- It loads *-workspace/bench.toml files from --task-dir, ONESHOT_BENCH_TASK_DIR, or the default experiments/frontend.
- It runs preflight checks for the selected model keys, the workspace folders, the pi executable, a sandbox backend, and any declared required_skills.
- It refreshes task-local files matching PRD*.md, TASK*.md, task*.md, or prompt*.md into every configured workspace.
- It creates a fresh runs/<run_id>/ directory and writes the exact prompt to prompt.txt.
- For each selected model, it first asks that same model to rewrite the shared task file(s) into a neutral equivalent restatement.
- It validates rewritten files, stores copies under the run artifacts, and replaces the workspace-local task files before implementation starts.
- It starts one implementation Pi subprocess per selected workspace, either sequentially or in parallel with --max-concurrency.
- It writes per-model logs and a combined summary when the run finishes.
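The discovery steps in the flow above can be sketched in a few lines of Python. This is an illustration, not the project's code: the helper names (resolve_task_dir, is_task_file, find_workspaces) are hypothetical, and the precedence of flag over environment variable over default is an assumption, since the text lists the three sources without saying which one wins.

```python
import os
from fnmatch import fnmatchcase
from pathlib import Path

# Patterns and default directory as listed in the CLI flow.
TASK_FILE_PATTERNS = ("PRD*.md", "TASK*.md", "task*.md", "prompt*.md")
DEFAULT_TASK_DIR = "experiments/frontend"


def resolve_task_dir(cli_value=None, env=None):
    """Pick the task directory from the flag, the environment, or the default.

    ASSUMPTION: the flag wins over ONESHOT_BENCH_TASK_DIR, which wins over
    the default. The real CLI may resolve these differently.
    """
    env = os.environ if env is None else env
    return Path(cli_value or env.get("ONESHOT_BENCH_TASK_DIR") or DEFAULT_TASK_DIR)


def is_task_file(name):
    """Return True if a file name matches one of the task-file patterns."""
    return any(fnmatchcase(name, pattern) for pattern in TASK_FILE_PATTERNS)


def find_workspaces(task_dir):
    """List directories under task_dir that match *-workspace and hold a bench.toml."""
    return sorted(p.parent for p in Path(task_dir).glob("*-workspace/bench.toml"))
```

fnmatchcase is used instead of fnmatch so that matching stays case-sensitive on every platform, which keeps TASK*.md and task*.md distinct.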
The task rewrite step keeps the implementation run from using wording that may advantage the model that originally authored the PRD.
Start the GUI server:
uvicorn bench.web:app --port 4010

Open:
http://127.0.0.1:4010
The web server opens on experiments/frontend by default. You can switch between discovered experiment directories in the dashboard, or pin the initial directory from the shell:
ONESHOT_BENCH_TASK_DIR=experiments/evaluator uvicorn bench.web:app --port 4010

The UI can:
- pick configured model workspaces
- run models sequentially or in parallel
- stream stdout/stderr side by side
- show final duration and parsed token metrics
- load recent run history
Artifacts are written to:
runs/<run_id>/
Each run includes:
- prompt.txt: the prompt used for the run
- <model>/stdout.log: readable assistant output and tool markers
- <model>/stderr.log: Pi stderr
- <model>/events.jsonl: raw Pi JSON events
- <model>/task-rewrite.json: status, command metadata, and paths for the per-model task rewrite
- <model>/task-rewrite.stdout.log: raw Pi JSON output from the rewrite step
- <model>/task-rewrite.stderr.log: Pi stderr from the rewrite step
- <model>/rewritten-task-files/: copies of the rewritten task files used by the implementation step
- <model>/result.json: status, timing, command, paths, attempts, and token metrics for that model
- <model>/workspace.sb: generated macOS sandbox profile, when using sandbox-exec
- <model>/workspace.bwrap.json: generated Linux sandbox command metadata, when using bwrap
- <model>/deployment.json: deployment status and preview metadata, when deployment is attempted
- <model>/deployment.stdout.log and <model>/deployment.stderr.log: deployment logs with secrets redacted
- deployments.json: aggregate deployment results
- deploy-staging/: copied workspaces used for Docker builds
- summary.json: full machine-readable benchmark summary
- summary.csv: compact table for spreadsheets
- summary.md: compact Markdown summary
Statuses are process-level: completed means Pi exited with code 0, failed means a non-zero exit, and timeout means the process exceeded --timeout-seconds. --retries reruns only failed or timed-out model subprocesses, and the final artifact files contain the last attempt's logs.
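The status and retry rules can be illustrated with a minimal sketch built on the standard subprocess module. The function names classify and run_with_retries are hypothetical, and the sketch does not reproduce how bench.cli launches Pi or writes logs; it only shows the exit-code mapping and the rule that a completed run is never retried.

```python
import subprocess
import sys


def classify(returncode, timed_out):
    """Map a finished subprocess to one of the three statuses."""
    if timed_out:
        return "timeout"
    return "completed" if returncode == 0 else "failed"


def run_with_retries(command, timeout_seconds=1800, retries=1):
    """Run a command, rerunning it only after a failure or a timeout.

    Returns (status, attempts). A timeout of 0 means no limit, mirroring
    the --timeout-seconds 0 convention.
    """
    timeout = timeout_seconds or None
    attempts = 0
    status = "failed"
    for _ in range(retries + 1):
        attempts += 1
        try:
            result = subprocess.run(command, timeout=timeout, capture_output=True)
            status = classify(result.returncode, timed_out=False)
        except subprocess.TimeoutExpired:
            status = classify(None, timed_out=True)
        if status == "completed":
            break  # successful runs are not retried
    return status, attempts


if __name__ == "__main__":
    # A child process that always exits non-zero uses every attempt.
    failing = [sys.executable, "-c", "raise SystemExit(3)"]
    print(run_with_retries(failing, retries=1))
```

Because each attempt overwrites the previous status, the returned value always describes the last attempt, in line with the note that the final artifacts hold the last attempt's logs.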
--warmup performs a short, unreported pre-run for each selected model before the measured run. Warmup logs are saved as warmup.* files inside each model artifact directory, but the benchmark summary uses the measured run.
1ShotBench records wall-clock duration and process status for every model. New runs execute Pi in JSON event mode and parse final assistant usage fields from message_end events. Raw Pi events are saved per model as events.jsonl, while readable output remains in stdout.log.
Older runs made before JSON event parsing may show zero token and cost fields because they were run with --no-session and text output did not include usage.
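For new runs, token metrics can in principle be recomputed from events.jsonl. The sketch below is a rough illustration under loud assumptions: the event schema is not documented here, so it assumes each line is a JSON object with a "type" key and that message_end events carry a "usage" mapping of numeric counters. The field names in the example data are invented, and summing across events is also an assumption; the benchmark may instead read only the last message_end event.

```python
import json
from collections import Counter


def sum_usage(lines):
    """Sum numeric usage fields across message_end events.

    ASSUMPTION: events look like {"type": "message_end", "usage": {...}}.
    The real Pi event schema may differ.
    """
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(event, dict) or event.get("type") != "message_end":
            continue
        usage = event.get("usage")
        if not isinstance(usage, dict):
            continue
        for key, value in usage.items():
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical events; the "input" and "output" keys are illustrative only.
    sample = [
        '{"type": "message_start"}',
        '{"type": "message_end", "usage": {"input": 10, "output": 5}}',
        '{"type": "message_end", "usage": {"input": 3, "output": 2}}',
    ]
    print(sum_usage(sample))
```

To apply it to a real run, pass an open file handle for runs/<run_id>/<model>/events.jsonl as lines. An older text-mode run has no such events, so the function returns an empty dict.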