Commit 2109e0d
research(wv590): adaptive max-turns extension + cycle-1 hot-spots rejection record (#91)
* research(wv590-hot-spots-1): adaptive max-turns + dialog/calendar nav guidance
Closes 2 of 7 hypotheses from bench/research/wv590-hot-spots.json,
the queue derived from the 2026-04-28 WebVoyager-590 baseline
(536/590 = 90.8%; 78% of fails on booking + google-flights date
pickers).
Hypothesis #5 — adaptive max-turns (priority 5, parameter-tuning,
expected +2-4pp)
21 of 54 fails were "agent_gave_up_at_max_turns" mid-flow on
booking + google-flights, where the trace shows unambiguously that
the agent was still progressing. The static maxTurns=15 cut them off.
Implementation
- src/run-state.ts: new RunState.lastProgressTurn (init -Infinity)
- src/runner/runner.ts: progress detection in the observe-completed
  emit path (URL change OR snapshot byte delta > 5%)
- src/runner/runner.ts: maxTurns is now `let` not `const`; at the
  cap boundary, if lastProgressTurn ≥ maxTurns - 3, grant a
  one-time +5 extension (capped absolute at 25). Vision-mode
  runs are excluded (already get +5 baseline). Cascading
  extensions blocked via extensionGranted.
- bus emits a recovery-fired event with strategy
  "max-turns-extension" so traces are honest about borrowed
  turns.
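The cap-boundary decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/runner/runner.ts: the TurnBudget shape, the maybeExtend helper, and the constant names are assumptions; only lastProgressTurn, maxTurns, extensionGranted, and the numbers (+5, cap 25, 3-turn lookback) come from the commit message.

```typescript
// Hypothetical sketch of the one-time max-turns extension.
// TurnBudget, maybeExtend and the constant names are illustrative assumptions.
interface TurnBudget {
  maxTurns: number;
  extensionGranted: boolean;
}

const EXTENSION_TURNS = 5; // one-time +5
const ABSOLUTE_CAP = 25; // never exceed 25 turns in total
const PROGRESS_LOOKBACK = 3; // progress must be within the last 3 turns

function maybeExtend(
  budget: TurnBudget,
  lastProgressTurn: number,
  visionMode: boolean,
): TurnBudget {
  // Vision-mode runs already get a +5 baseline; cascading extensions are blocked.
  if (visionMode || budget.extensionGranted) return budget;
  // Require recent progress: lastProgressTurn >= maxTurns - 3.
  // The initial value -Infinity never satisfies this.
  if (lastProgressTurn < budget.maxTurns - PROGRESS_LOOKBACK) return budget;
  return {
    maxTurns: Math.min(budget.maxTurns + EXTENSION_TURNS, ABSOLUTE_CAP),
    extensionGranted: true,
  };
}
```

Under these assumptions, a run with maxTurns=15 whose last progress was at turn 13 would be extended to 20 turns once, while a run whose last progress was at turn 11 (or that never progressed) would stop at 15.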
Anti-overfitting
The 5% byte-delta floor was chosen so decorative animations and
dynamic-id reshuffles don't trip the predicate. The extension
requires recent (≤3 turn) progress, not just a one-shot DOM
change at turn 1, so it doesn't reward stuck loops.
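A minimal sketch of the progress predicate (URL change OR snapshot byte delta > 5%) might look like the following. The Observation shape and the isProgress name are hypothetical, and measuring the delta relative to the previous snapshot's size is an assumption; the commit message only states the 5% floor.

```typescript
// Hypothetical sketch of the progress predicate.
// Observation and isProgress are illustrative names, not the runner's API.
interface Observation {
  url: string;
  snapshotBytes: number;
}

const BYTE_DELTA_FLOOR = 0.05; // 5% floor from the commit message

function isProgress(prev: Observation, next: Observation): boolean {
  // Any URL change counts as progress.
  if (prev.url !== next.url) return true;
  // Avoid dividing by zero when the previous snapshot was empty.
  if (prev.snapshotBytes === 0) return next.snapshotBytes > 0;
  // Assumption: delta is measured relative to the previous snapshot size.
  const delta =
    Math.abs(next.snapshotBytes - prev.snapshotBytes) / prev.snapshotBytes;
  return delta > BYTE_DELTA_FLOOR;
}
```

With this reading, a 4% size change on the same URL (for example a decorative animation) would not count as progress, while a 6% change would.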
Hypothesis #1 — dialog/calendar nav guidance (priority 1, prompt
change, expected +5-7pp)
27/27 google-flights fails involve the date-picker; ~10/15 booking
fails are calendar-month-navigation-stuck. Both share the same
failure mode: agent clicks "next month" once per turn, burning the
turn budget navigating from "April 2026" to "December 2026".
Implementation
src/brain/index.ts HEAVY_PAGE_RULES adds rules 25-27:
25. CALENDAR/DATE-PICKER: chain N "next month" clicks via
nextActions (micro-plan) in ONE turn instead of one click
per turn
26. DIALOG-STATE AWARENESS: complete or dismiss the dialog;
don't waste turns clicking outside it
27. STATE-REGRESSION DETECTION: if search results disappeared
and you're back at the homepage, switch strategy
These rules are added to SYSTEM_PROMPT (the per-turn agent
prompt) — URL_FIRST_RULES is only in the planner prompt and
doesn't reach per-turn decisions.
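To illustrate what rule 25 asks of the agent, here is a rough sketch of building a nextActions micro-plan that chains all "next month" clicks into one turn. The Action shape, the helper names, and the "next month" target label are assumptions for illustration; the commit message only names nextActions.

```typescript
// Hypothetical sketch: chain N "next month" clicks into one micro-plan.
// The Action shape and helper names are illustrative assumptions.
type Action = { action: "click"; target: string };

interface YearMonth {
  year: number;
  month: number; // 1 = January ... 12 = December
}

function monthsBetween(from: YearMonth, to: YearMonth): number {
  return (to.year - from.year) * 12 + (to.month - from.month);
}

function planNextMonthClicks(
  from: YearMonth,
  to: YearMonth,
  target = "next month",
): Action[] {
  // Never emit clicks when the target month is not in the future.
  const clicks = Math.max(0, monthsBetween(from, to));
  return Array.from({ length: clicks }, () => ({
    action: "click" as const,
    target,
  }));
}

// April 2026 -> December 2026 needs 8 clicks, ideally spent in one turn.
const nextActions = planNextMonthClicks(
  { year: 2026, month: 4 },
  { year: 2026, month: 12 },
);
console.log(nextActions.length); // 8
```

At one click per turn, the same navigation would consume 8 of the 15 available turns before any date is selected.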
Tests
tests/run-state.test.ts: +2 tests for lastProgressTurn predicate
behavior (initial -Infinity; lookback-window matching).
Existing 1514 tests unchanged; 1516/1516 total pass.
Boundary check 157/157 files clean.
Next steps (separate cycles)
Run --two-stage screen on the 79 booking+flights cases (~$200)
to validate combined +5-10pp signal.
If it wins, a full WebVoyager-590 re-baseline (~$200) confirms.
Then implement #2 (URL-direct site profiles) as the next bigger
architectural lever — could push toward 96%+.
* research(wv590-hot-spots-1): cycle-1 reject — revert prompt rules; keep adaptive max-turns
Cycle 1 ran the combined treatment (rules 25-27 added to SYSTEM_PROMPT
+ adaptive max-turns runner change) on the 79 hot-spot cases. Result:
  booking         18/40 = 45.0%  vs  25/40 = 62.5% baseline   -17.5pp
  google-flights   2/17 = 11.8%  vs  12/39 = 30.8% baseline   -19.0pp
  combined        20/57 = 35.1%  vs  37/79 = 46.8% baseline   -11.7pp
  (1 cred-fail aborted run at 57/79)
REJECT — both subsets clearly negative beyond run-to-run variance.
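The percentage-point deltas above can be re-derived from the raw counts with a small helper. This is only an arithmetic check; the function names are illustrative. Note that the treatment denominators are smaller than the baseline ones because the run aborted at 57 of 79 cases.

```typescript
// Arithmetic check of the cycle-1 deltas; helper names are illustrative.
interface Tally {
  passed: number;
  total: number;
}

function passRatePct(t: Tally): number {
  return (t.passed * 100) / t.total;
}

// Difference in percentage points, rounded to one decimal place.
function deltaPp(treatment: Tally, baseline: Tally): number {
  const raw = passRatePct(treatment) - passRatePct(baseline);
  return Math.round(raw * 10) / 10;
}

const booking = deltaPp({ passed: 18, total: 40 }, { passed: 25, total: 40 });
const flights = deltaPp({ passed: 2, total: 17 }, { passed: 12, total: 39 });
const combined = deltaPp({ passed: 20, total: 57 }, { passed: 37, total: 79 });

console.log(booking, flights, combined); // -17.5 -19 -11.7
```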
Diagnosis from fail verdicts: agent is bailing CLEANER and EARLIER
than baseline, not grinding through. Verdict patterns like
"Booking.com repeatedly failed to navigate off the homepage" and
"date dialog is open, could not complete" recur. Rule #27 (state-
regression detection) likely gave the agent permission to give up
when the only fallback strategy (URL-direct) is blocked by the
target site.
Reverted
src/brain/index.ts: HEAVY_PAGE_RULES rules 25-27 reverted to
baseline (no calendar/dialog/state-regression guidance). Prompt
rules that tell the agent when to bail without offering a working
alternative are net-negative.
Kept
src/run-state.ts + src/runner/runner.ts adaptive max-turns
extension. Fired 1× in 57 cases (most fails happen before the
cap), so it's INCONCLUSIVE rather than negative. The safety net
is innocuous in the worst case and may help in cycle 2 once we
fix the earlier bail-out.
Annotated bench/research/wv590-hot-spots.json with the cycle-1
results — calendar-month-nav-macro and state-regression-detection
demoted to priority=99 (rejected), max-turns-by-flow-complexity
demoted to priority=50 (inconclusive). Future cycles read these
annotations to avoid re-running rejected hypotheses.
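How a later cycle might consume these annotations can be sketched as below. The Hypothesis shape and the nextCandidates helper are assumptions, since the actual schema of bench/research/wv590-hot-spots.json is not shown here; only the three hypothesis ids and the priority values 99 and 50 come from this commit message.

```typescript
// Hypothetical sketch: skip rejected hypotheses when picking the next cycle.
// The real JSON schema is not shown in the commit; this shape is assumed.
interface Hypothesis {
  id: string;
  priority: number; // lower runs first; 99 marks a rejected hypothesis
}

const REJECTED_PRIORITY = 99;

function nextCandidates(queue: Hypothesis[]): Hypothesis[] {
  return queue
    .filter((h) => h.priority < REJECTED_PRIORITY)
    .sort((a, b) => a.priority - b.priority);
}

const annotated: Hypothesis[] = [
  { id: "calendar-month-nav-macro", priority: 99 },
  { id: "state-regression-detection", priority: 99 },
  { id: "max-turns-by-flow-complexity", priority: 50 },
];

console.log(nextCandidates(annotated).map((h) => h.id));
// [ 'max-turns-by-flow-complexity' ]
```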
Cycle-2 hypothesis (queued, not yet implemented):
The fundamental problem isn't prompt-level guidance — it's that
booking redirects to homepage with no working recovery. Real fix
needs a new "fresh-session retry" action: when state regresses
to homepage, the runner spawns a new browser context and retries
the search via URL-direct on the fresh session. Architectural
change, not prompt change. Pairs with adaptive max-turns since
retries consume turns.
1 parent 9513492 commit 2109e0d
6 files changed
Lines changed: 303 additions & 4 deletions