1ShotBench includes three web-app evaluation paths:
- bench.llm_judge: Playwright evidence plus an LLM verdict.
- bench.codex_judge: Codex CLI as the evaluator in a disposable copied workspace.
- bench.pi_judge: Pi as the evaluator in a disposable copied workspace.
All judge paths write artifacts under evals/<eval_id>/.
The LLM judge (bench.llm_judge) evaluates agent-built web apps with PRD-derived feature checks, Playwright evidence, and an LLM verdict. The evaluator can either read a checked-in features.yaml or ask the judge model to generate feature checks from the PRD before running the browser layer.
One-time setup for the browser layer:
cd bench/llm_judge && npm install && npx playwright install chromium
pip install -r requirements.txt
Run a full evaluation with a curated feature file:
python3 -m bench.llm_judge \
--project experiments/evaluator/gpt-workspace \
--features experiments/evaluator/features.yaml \
--prd experiments/evaluator/PRD.md \
--label gpt-evaluator
Or generate the feature file from the PRD at evaluation time:
python3 -m bench.llm_judge \
--project experiments/evaluator/gpt-workspace \
--prd experiments/evaluator/PRD.md \
--label gpt-evaluator-generated
Before starting the app, the evaluator builds setup context from the PRD, implementation README files, manifests, and any repo-local skills referenced there. The judge model can propose concrete setup commands from that context. The runner then executes only general allowlisted setup commands, such as package installs, project-local setup scripts, explicit environment assignments, downloads with project-local output paths, and smoke-check commands.
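The allowlisting idea can be sketched as a small filter over proposed commands. This is an illustration only: the names `ALLOWED_PREFIXES`, `FORBIDDEN_CHARS`, and `is_allowed` are hypothetical, and the prefixes shown are examples rather than the runner's actual rules.

```python
import shlex

# Hypothetical allowlist: illustrative prefixes, not the runner's real configuration.
ALLOWED_PREFIXES = [
    ("npm", "install"),
    ("npm", "ci"),
    ("pip", "install"),
    ("python3", "-m", "pip", "install"),
]

# Conservatively reject shell metacharacters that could chain or redirect commands.
FORBIDDEN_CHARS = set(";|&`$<>\n")


def is_allowed(command: str) -> bool:
    """Return True only if the command starts with an allowlisted token prefix."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: treat as unsafe instead of guessing the intent.
        return False
    return any(tuple(tokens[: len(prefix)]) == prefix for prefix in ALLOWED_PREFIXES)
```

A filter like this would accept `pip install -r requirements.txt` and reject `npm install foo; rm -rf /`, because the second command contains a forbidden character even though it starts with an allowed prefix.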
For each feature, the browser layer first runs scripted evidence steps, captures page text, ARIA, screenshots, errors, and visible interactive elements, then optionally asks the judge model for a short follow-up browser plan. The final verdict receives the collected browser evidence plus setup context/results, but it must still judge from evidence rather than assume success.
Useful options:
- --base-url http://127.0.0.1:3000 and --no-start when the app is already running
- --dry-run to skip LLM calls for judging/planning/generation where possible
- --judge-model or env WEB_EVAL_JUDGE_MODEL
- --setup never to skip README setup commands
- --no-agentic-evidence to use only scripted Playwright steps
- --max-generated-features 8 to cap PRD-generated feature checks
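One plausible way a flag-or-environment option such as --judge-model could be resolved is sketched below. The function `resolve_judge_model` is hypothetical, and the precedence shown (flag first, then WEB_EVAL_JUDGE_MODEL) is an assumption; the evaluator's actual behavior may differ.

```python
import argparse


def resolve_judge_model(argv, env):
    """Pick the judge model from --judge-model, falling back to the environment.

    Assumption: the command-line flag wins over WEB_EVAL_JUDGE_MODEL.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--judge-model", default=None)
    # parse_known_args ignores the evaluator's other flags in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    return args.judge_model or env.get("WEB_EVAL_JUDGE_MODEL")
```

In a real entry point this would be called as `resolve_judge_model(sys.argv[1:], os.environ)`.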
The Codex judge uses Codex CLI as the evaluator, runs it in a disposable copied workspace, and gives it a dedicated judging skill stored under judge_skills/web_judge/.
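The disposable-copy idea can be illustrated with the standard library. This is a minimal sketch, not the judge's actual implementation: `copy_to_disposable_workspace` is a hypothetical helper, and the `judge-workspace` directory name is borrowed from the artifact path mentioned in this README.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def copy_to_disposable_workspace(project: Path, root: Optional[Path] = None) -> Path:
    """Copy the project into a fresh directory and return the copy's path."""
    # mkdtemp creates a unique parent directory, so repeated runs never collide.
    base = Path(tempfile.mkdtemp(prefix="judge-", dir=str(root) if root else None))
    dest = base / "judge-workspace"
    # copytree requires that dest does not exist yet, which holds for a new base.
    shutil.copytree(project, dest)
    return dest
```

Because the evaluator works on the copy, anything it installs or writes there leaves the original project untouched.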
Before using the Codex judge, make sure Codex CLI is installed and logged in:
codex login
If you are using a ChatGPT subscription-backed Codex login rather than API keys, this login step is required before bench.codex_judge can run successfully.
Run the Codex judge with:
python3 -m bench.codex_judge \
--project experiments/evaluator/gpt-workspace \
--features experiments/evaluator/features.yaml \
--prd experiments/evaluator/PRD.md \
--label gpt-codex-judge
Or ask Codex to generate the features from the PRD:
python3 -m bench.codex_judge \
--project experiments/evaluator/gpt-workspace \
--prd experiments/evaluator/PRD.md \
--label gpt-codex-judge-generated
Useful options:
- --base-url http://127.0.0.1:3000 and --no-start when the app is already running
- --keep-judge-workspace to preserve the disposable copy at evals/<eval_id>/judge-workspace/
- --judge-workspace-root path/to/root to control where the disposable copy lives instead
- --codex-sandbox workspace-write to override the default sandbox, though this may prevent local servers or Chromium from running
- --model <codex-model> to choose the Codex model
- --search to enable live web search for the judge if you explicitly want that mode
The Pi judge mirrors the Codex judge flow: it copies the coding agent workspace into a disposable judge workspace, gives Pi the dedicated judging skill in judge_skills/web_judge/, and writes the same evals/<eval_id>/ artifacts.
Run the Pi judge with:
python3 -m bench.pi_judge \
--project experiments/evaluator/gpt-workspace \
--features experiments/evaluator/features.yaml \
--prd experiments/evaluator/PRD.md \
--label gpt-pi-judge
Or ask Pi to generate the features from the PRD:
python3 -m bench.pi_judge \
--project experiments/evaluator/gpt-workspace \
--prd experiments/evaluator/PRD.md \
--label gpt-pi-judge-generated
Useful options:
- --base-url http://127.0.0.1:3000 and --no-start when the app is already running
- --keep-judge-workspace to preserve the disposable copy at evals/<eval_id>/judge-workspace/
- --judge-workspace-root path/to/root to control where the disposable copy lives instead
- --provider <pi-provider> and --model <pi-model> to choose the Pi judge backend
- --thinking <level>, --system-prompt, --append-system-prompt, and --tools read,bash,edit,write,grep,find,ls to control Pi CLI options
Artifacts are written to evals/<eval_id>/:
- summary.json, report.md, run.json, setup.json, setup-context.json
- generated-features.yaml when --features is omitted
- evidence/<feature>.json plus raw scripted/agentic browser packets
- judgments/<feature>.json
- screenshots/ and jobs/ from the browser layer
Correctness is passed / total * 100; uncertain counts as not passed.
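As a worked example of the formula, the sketch below computes the score from a list of per-feature verdicts. The verdict strings other than "passed" and "uncertain" (such as "failed") and the 0.0 result for an empty list are assumptions made for illustration.

```python
def correctness(verdicts):
    """Percentage of features judged 'passed'; every other verdict counts as not passed."""
    if not verdicts:
        # Assumption: an evaluation with no features scores 0 rather than raising.
        return 0.0
    passed = sum(1 for verdict in verdicts if verdict == "passed")
    return passed / len(verdicts) * 100


# Two of four features passed; "uncertain" is treated like a failure.
print(correctness(["passed", "failed", "uncertain", "passed"]))  # 50.0
```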
When a web eval result looks wrong, inspect the artifacts before changing code:
- evals/<id>/setup.json
- evals/<id>/setup-context.json
- evals/<id>/run.json
- evals/<id>/evidence/*.json
- evals/<id>/judgments/*.json
- screenshots and app logs if available
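The files listed above can be gathered in one pass for inspection. The helper below is a hypothetical convenience, not part of the bench packages; it makes no assumption about the JSON schema and simply loads whatever exists.

```python
import json
from pathlib import Path


def collect_artifacts(eval_dir):
    """Load the JSON artifacts that exist under an eval directory, keyed by relative path."""
    eval_dir = Path(eval_dir)
    paths = [eval_dir / name for name in ("setup.json", "setup-context.json", "run.json")]
    paths += sorted((eval_dir / "evidence").glob("*.json"))
    paths += sorted((eval_dir / "judgments").glob("*.json"))
    loaded = {}
    for path in paths:
        # Missing files are skipped so a partial run can still be inspected.
        if path.is_file():
            key = path.relative_to(eval_dir).as_posix()
            loaded[key] = json.loads(path.read_text(encoding="utf-8"))
    return loaded
```

Printing `sorted(collect_artifacts("evals/<id>"))` with a real eval id shows at a glance which artifacts a run actually produced.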
The disposable judge workspace may receive dependency installs and temporary runtime artifacts, but the judge should not modify source-like app files to make an implementation pass.
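One way to check this rule after a run is to hash source-like files before and after judging and compare the results. This is a hypothetical guard, not something the judges are documented to do; the suffix set in `SOURCE_SUFFIXES` and the decision to skip `node_modules` are assumptions about what counts as "source-like".

```python
import hashlib
from pathlib import Path

# Assumption: these suffixes approximate "source-like app files".
SOURCE_SUFFIXES = {".js", ".jsx", ".ts", ".tsx", ".py", ".html", ".css"}


def snapshot(root):
    """Map each source-like file (relative path) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Dependency installs are expected in the judge workspace, so skip them.
        if "node_modules" in rel.parts:
            continue
        if path.is_file() and path.suffix in SOURCE_SUFFIXES:
            digests[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_sources(before, after):
    """Return relative paths that were added, removed, or modified between snapshots."""
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))
```

Taking `snapshot()` before the judge runs and again afterwards, then passing both to `changed_sources()`, yields an empty list when only dependencies and runtime artifacts changed.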