
feat(autobrowse): evals/ — verifier-based eval harness + Fable 5 vs Opus 4.8 results#132

Open: shubh24 wants to merge 1 commit into main from shubh24/autobrowse-evals


shubh24 commented Jun 10, 2026

Contributor

What

Adds skills/autobrowse/evals/ — an eval harness for the auto-research loop that measures the four axes that matter: convergence speed, verified accuracy, runtime speed, and token cost, sweepable across inner model × outer model × outer prompt.

Design

  • Train/holdout separation. Convergence is measured during the loop; accuracy is N fresh runs against the frozen best strategy. A lucky training pass doesn't count as reliability — and in practice the two disagree constantly.
  • Verifiers, not self-report. Every task ships a programmatic verify.mjs (takes --run-dir, returns {passed, checks, reason}; same protocol as the codegen runners). The agent's own success:true only feeds a false-success (reward-hacking) metric, with zero observed in ~200 runs so far. npm run test:verifiers asserts each verifier accepts its documented known-good output and rejects a garbage {success:true}.
  • 9 tasks, 3 tiers. Tier A: deterministic local fixtures with planted gotchas (delayed renders, decoy results) and exact ground truth. Tier B: live-stable sites checked by invariants (several drawn from the browse.sh prompt library — google flights, opentable, uspto, youtube). Tier C: bot-protected (stockx/PerimeterX, yelp/DataDome).
  • Browserbase concurrency. Remote tasks pre-create a verified+proxied session per run and use the existing --connect-url path in evaluate.mjs — collision-free parallel cells via --concurrency N (a 4-cell OpenTable matrix runs in ~35 min instead of ~2.5 h).
  • Scripted outer agent (one structured-output call/iteration) so the outer model/prompt are eval variables and outer tokens are metered — the interactive loop never records them.
  • Mock mode runs the full pipeline (including false-success + regression paths) with zero API spend.
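
For reviewers who want a concrete picture of the verifier contract, here is a minimal sketch of what the core of a per-task verifier could look like. It is not taken from the PR: the field names (orderTotal, confirmationId), the EXPECTED values, and the isFalseSuccess helper are all hypothetical, and the real verify.mjs files read their input from --run-dir rather than from an in-memory object.

```javascript
// Hypothetical ground truth for an imaginary checkout fixture (illustrative values only).
const EXPECTED = { orderTotal: 42.5, confirmationId: /^[A-Z0-9]{8}$/ };

// Returns the {passed, checks, reason} shape described above.
// Note that output.success is never consulted when deciding pass/fail.
function verify(output) {
  const checks = [
    { name: "has order total", ok: typeof output?.orderTotal === "number" },
    { name: "order total matches ground truth", ok: output?.orderTotal === EXPECTED.orderTotal },
    {
      name: "confirmation id is well-formed",
      ok: EXPECTED.confirmationId.test(String(output?.confirmationId ?? "")),
    },
  ];
  const failed = checks.filter((c) => !c.ok);
  return {
    passed: failed.length === 0,
    checks,
    reason:
      failed.length === 0
        ? "all checks passed"
        : `failed: ${failed.map((c) => c.name).join(", ")}`,
  };
}

// A run counts as a false success when the agent claims success but the verifier disagrees.
function isFalseSuccess(output) {
  return output?.success === true && !verify(output).passed;
}

const knownGood = { success: true, orderTotal: 42.5, confirmationId: "AB12CD34" };
const garbage = { success: true };

console.log(verify(knownGood).passed, verify(garbage).passed, isFalseSuccess(garbage));
// prints: true false true
```

The last three lines mirror what the description says npm run test:verifiers checks: a documented known-good output passes, and a bare {success:true} is rejected and counted as a false success.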

First findings (RESULTS.md)

Fable 5 vs Opus 4.8 in both roles, ~200 verified runs: as the inner agent Fable wasn't worth 2× pricing (Opus matched/beat it on OpenTable). As the outer agent, Fable-authored strategies beat Opus-authored ones with the same Sonnet inner — 6/6 vs 5/6 holdout, ~30% faster/cheaper runs — because Fable consistently diagnoses mechanisms (React hydration timing, Akamai cookie behavior, a planted 900ms render trap) where Opus describes symptoms. Best tested combo: Sonnet 4.6 inner + Fable 5 outer.

Upstream observations

  • evaluate.mjs's hardcoded 30s EXEC_TIMEOUT_MS kills browse open during slow Chrome cold-starts and strands the session (~20 wasted turns recovering) — worth making configurable.
  • browse open --remote can't request --verified/--proxies, so plain remote mode gets Akamai-walled; the harness works around it via pre-created cloud sessions + --connect-url.
  • Note: skills/autobrowse/.gitignore ignores tasks/ (workspace dirs), so the benchmark task definitions are force-added.
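
On the first observation, one possible shape for a configurable timeout is sketched below. This is only an illustration: the environment variable name AUTOBROWSE_EXEC_TIMEOUT_MS and the helper resolveExecTimeoutMs are assumptions, not something evaluate.mjs exposes today.

```javascript
// Default mirrors the hardcoded 30s value mentioned above.
const DEFAULT_EXEC_TIMEOUT_MS = 30_000;

// Hypothetical helper: read the timeout from an env-like object, falling back to the default.
// AUTOBROWSE_EXEC_TIMEOUT_MS is an assumed variable name.
function resolveExecTimeoutMs(env = process.env) {
  const raw = env.AUTOBROWSE_EXEC_TIMEOUT_MS;
  if (raw === undefined || raw === "") return DEFAULT_EXEC_TIMEOUT_MS;
  const parsed = Number(raw);
  // Fall back on NaN, zero, negatives, and non-integers instead of silently disabling the timeout.
  if (!Number.isInteger(parsed) || parsed <= 0) return DEFAULT_EXEC_TIMEOUT_MS;
  return parsed;
}

console.log(resolveExecTimeoutMs({})); // 30000
console.log(resolveExecTimeoutMs({ AUTOBROWSE_EXEC_TIMEOUT_MS: "90000" })); // 90000
console.log(resolveExecTimeoutMs({ AUTOBROWSE_EXEC_TIMEOUT_MS: "abc" })); // 30000
```

Falling back to the default on invalid input keeps a typo in the environment from turning into an unbounded wait.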

Test plan

  • npm run test:verifiers — all 9 verifiers accept known-good, reject garbage ✅
  • node eval/run-matrix.mjs --conditions baseline --tasks fixture-checkout --mock + node eval/report.mjs — full pipeline, no API spend ✅ (verified from this exact in-repo location)
  • Real runs: pilot + 3 comparison rounds (~200 runs) produced the RESULTS.md numbers ✅

🤖 Generated with Claude Code


Note

Low Risk
Adds an isolated eval package and fixtures; it does not change autobrowse runtime scripts, only invokes them via subprocess with optional API keys in local .env.

Overview
Adds skills/autobrowse/evals/, a new benchmark package for the autobrowse evaluate → verify → improve loop. run-matrix.mjs runs condition × task × trial cells with train (inner evaluate.mjs, programmatic verify.mjs, scripted outer-agent.mjs) and holdout (frozen best strategy.md, fresh runs). Outcomes append to runs/results.jsonl for report.mjs scorecards.
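
To illustrate how appended JSONL outcomes might roll up into a scorecard, here is a small sketch that computes holdout accuracy and false-success counts per condition. The record fields (condition, phase, passed, claimedSuccess) and the sample rows are hypothetical; the actual results.jsonl schema and report.mjs logic are not shown in this description.

```javascript
// Hypothetical sample rows standing in for runs/results.jsonl (made-up data, not PR results).
const jsonl = [
  '{"condition":"A","phase":"holdout","passed":true,"claimedSuccess":true}',
  '{"condition":"A","phase":"holdout","passed":false,"claimedSuccess":true}',
  '{"condition":"A","phase":"train","passed":true,"claimedSuccess":true}',
  '{"condition":"B","phase":"holdout","passed":true,"claimedSuccess":true}',
].join("\n");

// Aggregate per condition, counting only holdout runs toward accuracy.
function scorecard(text) {
  const byCondition = {};
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines in an append-only file
    const rec = JSON.parse(line);
    if (rec.phase !== "holdout") continue; // training passes do not count as reliability
    const row = (byCondition[rec.condition] ??= { runs: 0, passed: 0, falseSuccess: 0 });
    row.runs += 1;
    if (rec.passed) row.passed += 1;
    if (rec.claimedSuccess && !rec.passed) row.falseSuccess += 1;
  }
  for (const row of Object.values(byCondition)) {
    row.accuracy = row.passed / row.runs;
  }
  return byCondition;
}

console.log(scorecard(jsonl));
```

With the sample rows, condition A has two holdout runs with one pass and one false success, and condition B has a single passing holdout run; the training row is ignored.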

Verification & tasks: Nine tasks (Tier A local fixtures + live/bot-protected sites), each with task.md, meta.json, mock-output.json, and per-task verify.mjs using shared _lib/checks.mjs. npm run test:verifiers asserts known-good passes and garbage {success:true} fails. Sweepable conditions/*.json (inner/outer model, lean vs default outer prompt).

Harness behavior: run-inner.mjs wraps evaluate.mjs with local browse stop/shim/pre-warm, remote verified+proxied Browserbase sessions via --connect-url, and a --mock pipeline (false-success + regression paths). Cost is computed centrally in pricing.mjs; the matrix handles convergence early-stop and strategy-regression revert. The local fixture server auto-starts, and --concurrency parallelizes remote cells.
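
As a rough illustration of what a --concurrency N limit implies for remote cells, the sketch below runs async jobs with at most a fixed number in flight. It is a generic worker-pool pattern, not the harness's actual scheduler; runPool and makeCell are hypothetical names, and the timer stands in for a real remote session run. It assumes limit is at least 1.

```javascript
// Run async jobs with at most `limit` in flight; results keep the input order.
async function runPool(jobs, limit) {
  const results = new Array(jobs.length);
  let next = 0;
  async function worker() {
    while (next < jobs.length) {
      const i = next++; // no race: nothing else runs between the check and the increment
      results[i] = await jobs[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, jobs.length) }, worker);
  await Promise.all(workers); // rejects if any job throws
  return results;
}

// Track peak concurrency to confirm the limit is respected.
let inFlight = 0;
let peak = 0;
const makeCell = (id) => async () => {
  inFlight += 1;
  peak = Math.max(peak, inFlight);
  await new Promise((resolve) => setTimeout(resolve, 10)); // stand-in for a remote run
  inFlight -= 1;
  return `cell-${id}`;
};

const done = runPool([1, 2, 3, 4].map(makeCell), 2).then((results) => {
  console.log(results, "peak in flight:", peak);
  return results;
});
```

Each worker pulls the next unclaimed index, so a slow cell delays only its own worker while the others continue through the queue.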

Also ships checkout/flightdeck HTML fixtures, docs (README, RESULTS.md with Sonnet inner + Fable 5 outer findings), and .env.example / package scripts.

Reviewed by Cursor Bugbot for commit d0217c9. Bugbot is set up for automated code reviews on this repo.

…esearch loop

Measures the four axes that matter for autobrowse — convergence speed,
verified accuracy, runtime speed, and token cost — across sweepable
conditions (inner model, outer model, outer prompt).

- Train/holdout separation: convergence is measured during the loop;
  accuracy is N fresh runs against the frozen best strategy.
- Pass/fail comes from per-task programmatic verifiers, never the agent's
  own success claim; claimed-but-failed runs are tracked as a
  false-success (reward-hacking) metric. npm run test:verifiers asserts
  every verifier accepts its documented known-good output and rejects a
  garbage {success:true}.
- 9-task benchmark in 3 tiers: deterministic local fixtures (with planted
  gotchas + exact ground truth), live-stable sites checked by invariants
  (several drawn from the browse.sh prompt library), and bot-protected
  sites via verified+proxied Browserbase sessions (--connect-url path,
  collision-free concurrency with --concurrency N).
- Scripted outer agent (one structured-output call/iteration) makes the
  outer model and prompt sweepable and meters outer-agent tokens.
- Mock mode exercises the full pipeline (including false-success and
  regression paths) with zero API spend.

RESULTS.md has first findings: Fable 5 vs Opus 4.8 in both roles
(~200 runs) — best tested combo is Sonnet inner + Fable 5 outer.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

shubh24 requested a review from aq17 on June 10, 2026 06:10

aq17 commented Jun 10, 2026

Contributor

Wondering whether this could eventually live in the browse CLI if we move autobrowse into it as well (i.e. browse skills generate for autobrowse + browse skills eval for evals)!
