feat(autobrowse): evals/ — verifier-based eval harness + Fable 5 vs Opus 4.8 results#132
Open
shubh24 wants to merge 1 commit into
…esearch loop
Measures the four axes that matter for autobrowse — convergence speed,
verified accuracy, runtime speed, and token cost — across sweepable
conditions (inner model, outer model, outer prompt).
- Train/holdout separation: convergence is measured during the loop;
accuracy is N fresh runs against the frozen best strategy.
- Pass/fail comes from per-task programmatic verifiers, never the agent's
own success claim; claimed-but-failed runs are tracked as a
false-success (reward-hacking) metric. npm run test:verifiers asserts
every verifier accepts its documented known-good output and rejects a
garbage {success:true}.
- 9-task benchmark in 3 tiers: deterministic local fixtures (with planted
gotchas + exact ground truth), live-stable sites checked by invariants
(several drawn from the browse.sh prompt library), and bot-protected
sites via verified+proxied Browserbase sessions (--connect-url path,
collision-free concurrency with --concurrency N).
- Scripted outer agent (one structured-output call/iteration) makes the
outer model and prompt sweepable and meters outer-agent tokens.
- Mock mode exercises the full pipeline (including false-success and
regression paths) with zero API spend.
RESULTS.md has first findings: Fable 5 vs Opus 4.8 in both roles
(~200 runs) — best tested combo is Sonnet inner + Fable 5 outer.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Contributor
Thinking about if this (in the future) could actually live in |
What
Adds `skills/autobrowse/evals/`, an eval harness for the auto-research loop that measures the four axes that matter: convergence speed, verified accuracy, runtime speed, and token cost, sweepable across inner model × outer model × outer prompt.
Design
- `verify.mjs` (`--run-dir` → `{passed, checks, reason}`, same protocol as the codegen runners). The agent's own `success:true` only feeds a false-success (reward-hacking) metric; zero observed in ~200 runs so far. `npm run test:verifiers` asserts each verifier accepts its documented known-good output and rejects a garbage `{success:true}`.
- `--connect-url` path in `evaluate.mjs`: collision-free parallel cells via `--concurrency N` (a 4-cell OpenTable matrix runs in ~35 min instead of ~2.5 h).
First findings (RESULTS.md)
Fable 5 vs Opus 4.8 in both roles, ~200 verified runs: as the inner agent Fable wasn't worth 2× pricing (Opus matched/beat it on OpenTable). As the outer agent, Fable-authored strategies beat Opus-authored ones with the same Sonnet inner — 6/6 vs 5/6 holdout, ~30% faster/cheaper runs — because Fable consistently diagnoses mechanisms (React hydration timing, Akamai cookie behavior, a planted 900ms render trap) where Opus describes symptoms. Best tested combo: Sonnet 4.6 inner + Fable 5 outer.
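The holdout comparison above (6/6 vs 5/6) is a per-condition pass rate over verified runs. A minimal sketch of that aggregation follows, assuming each JSONL record carries `condition`, `passed`, and `claimedSuccess` fields. Those field names and the `scorecard` helper are hypothetical; the PR does not document the record layout or how `report.mjs` computes its scorecards.

```javascript
// Sketch only: the record fields (condition, passed, claimedSuccess) are
// assumptions for illustration, not the harness's documented schema.
function scorecard(jsonl) {
  const rows = jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));

  const stats = new Map();
  for (const row of rows) {
    const s = stats.get(row.condition) ?? { runs: 0, passed: 0, falseSuccess: 0 };
    s.runs += 1;
    if (row.passed) s.passed += 1;
    // False success: the agent claimed success but the verifier rejected the run.
    if (row.claimedSuccess && !row.passed) s.falseSuccess += 1;
    stats.set(row.condition, s);
  }

  const out = {};
  for (const [condition, s] of stats) {
    out[condition] = { ...s, passRate: s.passed / s.runs };
  }
  return out;
}

// Made-up sample rows, not results from the PR.
const sample = [
  '{"condition":"outer-a","passed":true,"claimedSuccess":true}',
  '{"condition":"outer-a","passed":false,"claimedSuccess":true}',
  '{"condition":"outer-b","passed":true,"claimedSuccess":true}',
].join("\n");
console.log(scorecard(sample));
```

Keeping `falseSuccess` separate from `passed` matters here: a run that claims success and fails verification lowers the pass rate and is also counted as a reward-hacking signal.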
Upstream observations
- `evaluate.mjs`'s hardcoded 30s `EXEC_TIMEOUT_MS` kills `browse open` during slow Chrome cold-starts and strands the session (~20 wasted turns recovering); worth making configurable.
- `browse open --remote` can't request `--verified`/`--proxies`, so plain remote mode gets Akamai-walled; the harness works around it via pre-created cloud sessions + `--connect-url`.
- `skills/autobrowse/.gitignore` ignores `tasks/` (workspace dirs), so the benchmark task definitions are force-added.
Test plan
- `npm run test:verifiers`: all 9 verifiers accept known-good, reject garbage ✅
- `node eval/run-matrix.mjs --conditions baseline --tasks fixture-checkout --mock` + `node eval/report.mjs`: full pipeline, no API spend ✅ (verified from this exact in-repo location)
🤖 Generated with Claude Code
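The first test-plan item checks that a verifier accepts known-good output and rejects a bare `{success:true}`. A rough sketch of a verifier with the `{passed, checks, reason}` result shape is shown below. The `orderId` and `total` fields, the expected values, and the `isFalseSuccess` helper are hypothetical stand-ins, not the PR's actual fixtures or checks.

```javascript
// Sketch only: field names (orderId, total) and ground-truth values are
// illustrative assumptions, not taken from the PR's task definitions.
function verify(output, expected) {
  const checks = [
    {
      name: "orderId is a non-empty string",
      ok: typeof output?.orderId === "string" && output.orderId.length > 0,
    },
    {
      name: "total matches ground truth",
      ok: output?.total === expected.total,
    },
  ];
  const failed = checks.filter((c) => !c.ok);
  return {
    passed: failed.length === 0,
    checks,
    reason:
      failed.length === 0
        ? "all checks passed"
        : `failed: ${failed.map((c) => c.name).join("; ")}`,
  };
}

// The agent's own claim is never trusted; it only feeds the false-success metric.
function isFalseSuccess(output, verdict) {
  return output?.success === true && !verdict.passed;
}

const expected = { total: 42.5 };
const knownGood = { success: true, orderId: "A-1001", total: 42.5 };
const garbage = { success: true };

console.log(verify(knownGood, expected).passed); // true
console.log(verify(garbage, expected).passed); // false
console.log(isFalseSuccess(garbage, verify(garbage, expected))); // true
```

Note that `verify` never reads `output.success`, which is what makes a garbage `{success:true}` fail.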
Note
Low Risk
Adds an isolated eval package and fixtures; it does not change autobrowse runtime scripts, only invokes them via subprocess with optional API keys in local `.env`.
Overview
Adds `skills/autobrowse/evals/`, a new benchmark package for the autobrowse evaluate → verify → improve loop. `run-matrix.mjs` runs condition × task × trial cells with train (inner `evaluate.mjs`, programmatic `verify.mjs`, scripted `outer-agent.mjs`) and holdout (frozen best `strategy.md`, fresh runs). Outcomes append to `runs/results.jsonl` for `report.mjs` scorecards.
Verification & tasks: Nine tasks (Tier A local fixtures + live/bot-protected sites), each with `task.md`, `meta.json`, `mock-output.json`, and per-task `verify.mjs` using shared `_lib/checks.mjs`. `npm run test:verifiers` asserts known-good passes and garbage `{success:true}` fails. Sweepable `conditions/*.json` (inner/outer model, lean vs default outer prompt).
Harness behavior: `run-inner.mjs` wraps `evaluate.mjs` with local browse stop/shim/pre-warm, remote verified+proxied Browserbase sessions via `--connect-url`, and `--mock` pipeline (false-success + regression paths). Central `pricing.mjs` cost; convergence early-stop and strategy regression revert in the matrix. Local fixture server auto-start; `--concurrency` for remote cells.
Also ships checkout/flightdeck HTML fixtures, docs (README, RESULTS.md with Sonnet inner + Fable 5 outer findings), and `.env.example` / package scripts.
Reviewed by Cursor Bugbot for commit d0217c9.
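The overview mentions convergence early-stop and strategy regression revert without describing their rules. One plausible reading, sketched under loud assumptions: each training iteration yields a verified score, a lower score than the current best triggers a revert to the best strategy, and the loop stops after `patience` iterations without improvement. The `trackBest` helper, the `patience` parameter, and the 0 to 1 score scale are all hypothetical, not the harness's actual logic.

```javascript
// Sketch only: `patience`, the score scale, and the keep/hold/revert labels
// are assumptions; the PR does not specify the real early-stop or revert rules.
function trackBest(scores, patience) {
  let best = { index: -1, score: -Infinity };
  let sinceImprovement = 0;
  const log = [];

  for (let i = 0; i < scores.length; i++) {
    const score = scores[i];
    if (score > best.score) {
      // New strategy is strictly better: it becomes the frozen best.
      best = { index: i, score };
      sinceImprovement = 0;
      log.push("keep");
    } else {
      sinceImprovement += 1;
      // A regression discards the new strategy in favor of the best one.
      log.push(score < best.score ? "revert" : "hold");
    }
    // Early stop: no improvement for `patience` consecutive iterations.
    if (sinceImprovement >= patience) break;
  }
  return { best, log };
}

// Made-up scores for illustration.
console.log(trackBest([0.5, 0.8, 0.6, 0.8, 0.7], 3));
```

Under this reading, the holdout phase would then run against the strategy at `best.index`, regardless of which iteration came last.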