feat(autobrowse): evals/ — verifier-based eval harness + Fable 5 vs Opus 4.8 results by shubh24 · Pull Request #132 · browserbase/skills

shubh24 · 2026-06-10T06:08:33Z

What

Adds skills/autobrowse/evals/ — an eval harness for the auto-research loop that measures the four axes that matter: convergence speed, verified accuracy, runtime speed, and token cost, sweepable across inner model × outer model × outer prompt.

Design

Train/holdout separation. Convergence is measured during the loop; accuracy is N fresh runs against the frozen best strategy. A lucky training pass doesn't count as reliability — and in practice the two disagree constantly.
Verifiers, not self-report. Every task ships a programmatic verify.mjs (--run-dir → {passed, checks, reason}, same protocol as the codegen runners). The agent's own success:true only feeds a false-success (reward-hacking) metric — zero observed in ~200 runs so far. npm run test:verifiers asserts each verifier accepts its documented known-good output and rejects a garbage {success:true}.
9 tasks, 3 tiers. Tier A: deterministic local fixtures with planted gotchas (delayed renders, decoy results) and exact ground truth. Tier B: live-stable sites checked by invariants (several drawn from the browse.sh prompt library — google flights, opentable, uspto, youtube). Tier C: bot-protected (stockx/PerimeterX, yelp/DataDome).
Browserbase concurrency. Remote tasks pre-create a verified+proxied session per run and use the existing --connect-url path in evaluate.mjs — collision-free parallel cells via --concurrency N (a 4-cell OpenTable matrix runs in ~35 min instead of ~2.5 h).
Scripted outer agent (one structured-output call/iteration) so the outer model/prompt are eval variables and outer tokens are metered — the interactive loop never records them.
Mock mode runs the full pipeline (including false-success + regression paths) with zero API spend.
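The verifier contract described above can be sketched as follows. This is a minimal illustration, not the PR's actual verify.mjs: only the `--run-dir` flag and the `{passed, checks, reason}` result shape come from the description. The `orderTotal` and `confirmationId` fields, the `expected` ground truth, and both helper names are hypothetical, and reading the agent output from the run directory is left out because its layout is not described here.

```javascript
// Hypothetical verifier core. The {passed, checks, reason} shape comes from the
// PR description; the fields inside `output` and `expected` are assumptions.
function verify(output, expected) {
  const checks = [
    { name: "output is an object", ok: output !== null && typeof output === "object" },
    { name: "orderTotal matches ground truth", ok: output?.orderTotal === expected.orderTotal },
    {
      name: "confirmationId is a non-empty string",
      ok: typeof output?.confirmationId === "string" && output.confirmationId.length > 0,
    },
  ];
  const failed = checks.filter((c) => !c.ok);
  return {
    passed: failed.length === 0,
    checks,
    reason:
      failed.length === 0
        ? "all checks passed"
        : `failed: ${failed.map((c) => c.name).join("; ")}`,
  };
}

// Pull the value that follows --run-dir out of an argv-style array.
function parseRunDir(argv) {
  const i = argv.indexOf("--run-dir");
  return i !== -1 && i + 1 < argv.length ? argv[i + 1] : null;
}

// A known-good output passes; a bare self-reported success does not.
const expected = { orderTotal: 42.5 };
const knownGood = { success: true, orderTotal: 42.5, confirmationId: "ABC-123" };
const garbage = { success: true };
console.log(verify(knownGood, expected).passed); // true
console.log(verify(garbage, expected).passed); // false
```

The point of the design is that the agent's own `success` flag never enters any check, so a claimed success with no supporting data fails verification.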

First findings (RESULTS.md)

Fable 5 vs Opus 4.8 in both roles, ~200 verified runs: as the inner agent Fable wasn't worth 2× pricing (Opus matched/beat it on OpenTable). As the outer agent, Fable-authored strategies beat Opus-authored ones with the same Sonnet inner — 6/6 vs 5/6 holdout, ~30% faster/cheaper runs — because Fable consistently diagnoses mechanisms (React hydration timing, Akamai cookie behavior, a planted 900ms render trap) where Opus describes symptoms. Best tested combo: Sonnet 4.6 inner + Fable 5 outer.

Upstream observations

evaluate.mjs's hardcoded 30s EXEC_TIMEOUT_MS kills browse open during slow Chrome cold-starts and strands the session (~20 wasted turns recovering) — worth making configurable.
browse open --remote can't request --verified/--proxies, so plain remote mode gets Akamai-walled; the harness works around it via pre-created cloud sessions + --connect-url.
Note: skills/autobrowse/.gitignore ignores tasks/ (workspace dirs), so the benchmark task definitions are force-added.
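One way the hardcoded timeout mentioned above could be made configurable is an environment override that falls back to the current 30s default. This is only a sketch of the idea: `AUTOBROWSE_EXEC_TIMEOUT_MS` and `resolveExecTimeoutMs` are hypothetical names, not existing options in evaluate.mjs.

```javascript
const DEFAULT_EXEC_TIMEOUT_MS = 30_000; // the 30s value cited in the observation

// AUTOBROWSE_EXEC_TIMEOUT_MS is a hypothetical variable name chosen for this sketch.
function resolveExecTimeoutMs(env = process.env) {
  const raw = env.AUTOBROWSE_EXEC_TIMEOUT_MS;
  if (raw === undefined || raw === "") return DEFAULT_EXEC_TIMEOUT_MS;
  const parsed = Number(raw);
  // Fall back to the default on non-numeric or non-positive values.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_EXEC_TIMEOUT_MS;
}

console.log(resolveExecTimeoutMs({})); // 30000
console.log(resolveExecTimeoutMs({ AUTOBROWSE_EXEC_TIMEOUT_MS: "90000" })); // 90000
```

Falling back silently on bad input keeps existing behavior unchanged for anyone who never sets the variable.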

Test plan

npm run test:verifiers — all 9 verifiers accept known-good, reject garbage ✅
node eval/run-matrix.mjs --conditions baseline --tasks fixture-checkout --mock + node eval/report.mjs — full pipeline, no API spend ✅ (verified from this exact in-repo location)
Real runs: pilot + 3 comparison rounds (~200 runs) produced the RESULTS.md numbers ✅

🤖 Generated with Claude Code

Note

Low Risk
Adds an isolated eval package and fixtures; it does not change autobrowse runtime scripts, only invokes them via subprocess with optional API keys in local .env.

Overview
Adds skills/autobrowse/evals/, a new benchmark package for the autobrowse evaluate → verify → improve loop. run-matrix.mjs runs condition × task × trial cells with train (inner evaluate.mjs, programmatic verify.mjs, scripted outer-agent.mjs) and holdout (frozen best strategy.md, fresh runs). Outcomes append to runs/results.jsonl for report.mjs scorecards.
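To illustrate how appended JSONL outcomes could roll up into a scorecard, here is a small sketch. The row fields (`phase`, `verified`, `claimedSuccess`) and the function names are assumptions made for the example; the actual schema of runs/results.jsonl and the logic of report.mjs are not shown in this PR description.

```javascript
// Parse JSON Lines text: one JSON object per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Hypothetical row shape: { phase: "train" | "holdout", verified, claimedSuccess }.
function scorecard(rows) {
  const holdout = rows.filter((r) => r.phase === "holdout");
  const holdoutPassed = holdout.filter((r) => r.verified).length;
  // False success: the agent claimed success but the verifier rejected the run.
  const falseSuccess = rows.filter((r) => r.claimedSuccess && !r.verified).length;
  return {
    holdoutRuns: holdout.length,
    holdoutAccuracy: holdout.length === 0 ? null : holdoutPassed / holdout.length,
    falseSuccessRate: rows.length === 0 ? null : falseSuccess / rows.length,
  };
}

const sample = [
  '{"phase":"train","verified":false,"claimedSuccess":false}',
  '{"phase":"train","verified":true,"claimedSuccess":true}',
  '{"phase":"holdout","verified":true,"claimedSuccess":true}',
  '{"phase":"holdout","verified":false,"claimedSuccess":true}',
].join("\n");

console.log(scorecard(parseJsonl(sample)));
```

Accuracy is computed only over holdout rows, which mirrors the train/holdout split: training passes influence convergence but never the reported reliability.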

Verification & tasks: Nine tasks (Tier A local fixtures + live/bot-protected sites), each with task.md, meta.json, mock-output.json, and per-task verify.mjs using shared _lib/checks.mjs. npm run test:verifiers asserts known-good passes and garbage {success:true} fails. Sweepable conditions/*.json (inner/outer model, lean vs default outer prompt).

Harness behavior: run-inner.mjs wraps evaluate.mjs with local browse stop/shim/pre-warm, remote verified+proxied Browserbase sessions via --connect-url, and a --mock pipeline (false-success + regression paths). Cost is computed centrally in pricing.mjs. The matrix handles convergence early-stop and reverts strategy regressions. The local fixture server starts automatically, and --concurrency applies to remote cells.
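A bounded-concurrency runner of the kind a `--concurrency N` flag implies can be sketched with a small worker pool. This is a generic pattern under stated assumptions, not the code in run-matrix.mjs: `runWithConcurrency`, `cells`, and `runCell` are hypothetical names.

```javascript
// Run `runCell` over every cell with at most `limit` cells in flight at once.
// Results are returned in the same order as the input cells.
async function runWithConcurrency(cells, limit, runCell) {
  const results = new Array(cells.length);
  let next = 0; // index of the next unclaimed cell, shared by all workers

  async function worker() {
    while (next < cells.length) {
      const index = next++; // claim a cell before awaiting, so no cell runs twice
      results[index] = await runCell(cells[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, cells.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage: four placeholder cells, two at a time.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
runWithConcurrency(["a", "b", "c", "d"], 2, async (cell) => {
  await sleep(10);
  return `${cell}:done`;
}).then((out) => console.log(out));
```

Because each worker claims its index synchronously, no lock is needed. A rejected `runCell` rejects the whole batch here, so a real harness would likely catch per-cell errors and record them as failed runs instead.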

Also ships checkout/flightdeck HTML fixtures, docs (README, RESULTS.md with Sonnet inner + Fable 5 outer findings), and .env.example / package scripts.

Reviewed by Cursor Bugbot for commit d0217c9.

…esearch loop

Measures the four axes that matter for autobrowse (convergence speed, verified accuracy, runtime speed, and token cost) across sweepable conditions (inner model, outer model, outer prompt).

- Train/holdout separation: convergence is measured during the loop; accuracy is N fresh runs against the frozen best strategy.
- Pass/fail comes from per-task programmatic verifiers, never the agent's own success claim; claimed-but-failed runs are tracked as a false-success (reward-hacking) metric. npm run test:verifiers asserts every verifier accepts its documented known-good output and rejects a garbage {success:true}.
- 9-task benchmark in 3 tiers: deterministic local fixtures (with planted gotchas + exact ground truth), live-stable sites checked by invariants (several drawn from the browse.sh prompt library), and bot-protected sites via verified+proxied Browserbase sessions (--connect-url path, collision-free concurrency with --concurrency N).
- Scripted outer agent (one structured-output call/iteration) makes the outer model and prompt sweepable and meters outer-agent tokens.
- Mock mode exercises the full pipeline (including false-success and regression paths) with zero API spend.

RESULTS.md has first findings: Fable 5 vs Opus 4.8 in both roles (~200 runs); the best tested combo is Sonnet inner + Fable 5 outer.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

aq17 · 2026-06-10T16:25:48Z

Thinking about whether this could (in the future) actually live in the browse CLI if we move autobrowse into it as well (i.e. browse skills generate for autobrowse + browse skills eval for evals)!

shubh24 requested a review from aq17 June 10, 2026 06:10

shubh24 wants to merge 1 commit into main from shubh24/autobrowse-evals
