research(wv590): adaptive max-turns extension + cycle-1 hot-spots rejection record (#91)

drewstone · web-flow · commit 2109e0df96c7 · 2026-04-28T16:40:17.000-06:00
* research(wv590-hot-spots-1): adaptive max-turns + dialog/calendar nav guidance

Closes 2 of 7 hypotheses from bench/research/wv590-hot-spots.json, the queue derived from the 2026-04-28 WebVoyager-590 baseline (536/590 = 90.8%; 78% of fails on booking + google-flights date pickers).

Hypothesis #5: adaptive max-turns (priority 5, parameter-tuning, expected +2-4pp)

21 of 54 fails were "agent_gave_up_at_max_turns" mid-flow on booking + google-flights, where agent-was-progressing reads unambiguously from the trace. Static maxTurns=15 cut them off.

Implementation
  - src/run-state.ts: new RunState.lastProgressTurn (init -Infinity)
  - src/runner/runner.ts: progress detection in the observe-completed emit path (URL change OR snapshot byte delta > 5%)
  - src/runner/runner.ts: maxTurns is now `let` not `const`; at the cap boundary, if lastProgressTurn ≥ maxTurns - 3, grant a one-time +5 extension (capped absolute at 25). Vision-mode runs are excluded (already get +5 baseline). Cascading extensions blocked via extensionGranted.
  - bus emits a recovery-fired event with strategy "max-turns-extension" so traces are honest about borrowed turns.

Anti-overfitting

The 5% byte-delta floor was chosen so decorative animations and dynamic-id reshuffles don't trip the predicate. The extension requires recent (≤3 turn) progress, not just a one-shot DOM change at turn 1, so it doesn't reward stuck loops.

Hypothesis #1: dialog/calendar nav guidance (priority 1, prompt change, expected +5-7pp)

27/27 google-flights fails involve the date-picker; ~10/15 booking fails are calendar-month-navigation-stuck. Both share the same failure mode: agent clicks "next month" once per turn, burning the turn budget navigating from "April 2026" to "December 2026".

Implementation

src/brain/index.ts HEAVY_PAGE_RULES adds rules 25-27:
  25. CALENDAR/DATE-PICKER: chain N "next month" clicks via nextActions (micro-plan) in ONE turn instead of one click per turn
  26. DIALOG-STATE AWARENESS: complete or dismiss the dialog; don't waste turns clicking outside it
  27. STATE-REGRESSION DETECTION: if search results disappeared and you're back at the homepage, switch strategy

These rules are added to SYSTEM_PROMPT (the per-turn agent prompt). URL_FIRST_RULES is only in the planner prompt and doesn't reach per-turn decisions.

Tests

tests/run-state.test.ts: +2 tests for lastProgressTurn predicate behavior (initial -Infinity; lookback-window matching). Existing 1514 tests unchanged; 1516/1516 total pass. Boundary check 157/157 files clean.

Next steps (separate cycles)
  - Run --two-stage screen on the 79 booking+flights cases (~$200) to validate combined +5-10pp signal.
  - If wins, full WebVoyager-590 re-baseline (~$200) confirms.
  - Then implement #2 (URL-direct site profiles) as the next bigger architectural lever; could push toward 96%+.

* research(wv590-hot-spots-1): cycle-1 reject, revert prompt rules; keep adaptive max-turns

Cycle 1 ran the combined treatment (rules 25-27 added to SYSTEM_PROMPT + adaptive max-turns runner change) on the 79 hot-spot cases. Result:

  booking         18/40 = 45.0%  vs  25/40 = 62.5% baseline  -17.5pp
  google-flights   2/17 = 11.8%  vs  12/39 = 30.8% baseline  -19.0pp
  combined        20/57 = 35.1%  vs  37/79 = 46.8% baseline  -11.7pp
  (1 cred-fail aborted run at 57/79)

REJECT: both subsets clearly negative beyond run-to-run variance.

Diagnosis from fail verdicts: agent is bailing CLEANER and EARLIER than baseline, not grinding through. Verdict patterns like "Booking.com repeatedly failed to navigate off the homepage" and "date dialog is open, could not complete" recur. Rule #27 (state-regression detection) likely gave the agent permission to give up when the only fallback strategy (URL-direct) is blocked by the target site.

Reverted

src/brain/index.ts: HEAVY_PAGE_RULES rules 25-27 reverted to baseline (no calendar/dialog/state-regression guidance). Prompt rules that tell the agent when to bail without offering a working alternative are net-negative.

Kept

src/run-state.ts + src/runner/runner.ts adaptive max-turns extension. Fired 1× in 57 cases (most fails happen before the cap), so it's INCONCLUSIVE rather than negative. The safety net is innocuous in the worst case and may help in cycle 2 once we fix the earlier bail-out.

Annotated bench/research/wv590-hot-spots.json with the cycle-1 results: calendar-month-nav-macro and state-regression-detection demoted to priority=99 (rejected), max-turns-by-flow-complexity demoted to priority=50 (inconclusive). Future cycles read these annotations to avoid re-running rejected hypotheses.

Cycle-2 hypothesis (queued, not yet implemented)

The fundamental problem isn't prompt-level guidance; it's that booking redirects to homepage with no working recovery. Real fix needs a new "fresh-session retry" action: when state regresses to homepage, the runner spawns a new browser context and retries the search via URL-direct on the fresh session. Architectural change, not prompt change. Pairs with adaptive max-turns since retries consume turns.
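
For readers skimming without the src/runner/runner.ts hunk (it is not shown on this page), the adaptive max-turns rule described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code from the commit: `TurnBudget`, `showsProgress`, `maybeExtend` and the constant names are hypothetical, and treating "byte delta" as the absolute relative change in snapshot size is an assumption.

```typescript
// Hypothetical sketch of the rule in the commit message; names are not from the repo.
interface TurnBudget {
  maxTurns: number;
  extensionGranted: boolean;
}

const PROGRESS_LOOKBACK = 3; // progress must fall within the last 3 turns before the cap
const EXTENSION_TURNS = 5; // one-time +5 extension
const ABSOLUTE_CAP = 25; // never exceed 25 turns in total
const BYTE_DELTA_FLOOR = 0.05; // snapshot size must move by more than 5%

// Progress means the URL changed OR the snapshot size changed by more than 5%
// relative to the prior turn (assumed here to be an absolute relative delta).
function showsProgress(
  prevUrl: string,
  url: string,
  prevBytes: number,
  bytes: number,
): boolean {
  if (url !== prevUrl) return true;
  if (prevBytes <= 0) return false;
  return Math.abs(bytes - prevBytes) / prevBytes > BYTE_DELTA_FLOOR;
}

// Called at the cap boundary. Returns the budget to continue with.
function maybeExtend(
  budget: TurnBudget,
  lastProgressTurn: number,
  visionMode: boolean,
): TurnBudget {
  // Vision-mode runs are excluded; a second (cascading) extension is blocked.
  if (visionMode || budget.extensionGranted) return budget;
  // -Infinity (no progress ever recorded) always fails this check.
  if (lastProgressTurn < budget.maxTurns - PROGRESS_LOOKBACK) return budget;
  const extended = Math.min(budget.maxTurns + EXTENSION_TURNS, ABSOLUTE_CAP);
  return { maxTurns: extended, extensionGranted: true };
}
```

With the default maxTurns=15, progress recorded at turn 12 or later yields a budget of 20; progress last seen at turn 11, or never (`-Infinity`), leaves the cap at 15.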
diff --git a/.gitignore b/.gitignore
@@ -31,3 +31,7 @@ mm-*.png
 demos/
 bench/design/eval/results/
 .claude/worktrees/
+
+# Transient WebVoyager run-recovery cases (regenerated per session)
+bench/external/webvoyager/cases-resume*.json
+bench/external/webvoyager/cases-hotspots*.json
diff --git a/bench/external/webvoyager/convert-tasks.mjs b/bench/external/webvoyager/convert-tasks.mjs
@@ -36,7 +36,11 @@ const excludeRemoved = hasFlag('exclude-removed') || applyPatches
 const filterSite = getArg('site')
 const maxTasks = Number(getArg('max-tasks', '0'))
 const maxTurns = Number(getArg('max-turns', '15'))
-const timeoutMs = Number(getArg('timeout', '120000'))
+// 300_000 ms (5 min) per case. The prior 120_000 ms limit was the dominant
+// failure cause on long-page sites (Amazon, Booking, Google Flights), which
+// made it hard to tell real capability gaps from config issues. Override
+// with --timeout for cost-sensitive sweeps.
+const timeoutMs = Number(getArg('timeout', '300000'))
 const outFile = getArg('out', path.resolve(__dir, 'cases.json'))
 
 // Load patches
diff --git a/bench/research/wv590-hot-spots.json b/bench/research/wv590-hot-spots.json
@@ -0,0 +1,171 @@
+{
+  "name": "WV590 Hot-Spots \u2014 booking + google-flights",
+  "description": "Close the 42 fails on the two dominant capability hot-spots from the 2026-04-28 WebVoyager-590 baseline (90.8%). Date-picker dialog-stuck pattern accounts for 30+ fails across both sites. Goal: research architectural changes that generalize across sites, not site-specific overfits.",
+  "_baseline": {
+    "date": "2026-04-28",
+    "passRate": 0.908,
+    "n": 590,
+    "model": "gpt-5.4",
+    "route": "router.tangle.tools/v1",
+    "perSite": {
+      "booking": "25/40 = 62.5%",
+      "google-flights": "12/39 = 30.8%",
+      "wolfram-alpha": "41/46 = 89.1%",
+      "huggingface": "33/36 = 91.7%",
+      "8 sites at 100%": "allrecipes amazon apple arxiv bbc-news cambridge coursera google-map"
+    },
+    "failureMix": {
+      "agent_gave_up_at_max_turns": 21,
+      "other": 18,
+      "cost_cap": 9,
+      "unreachable": 4,
+      "wall_clock_timeout": 2
+    }
+  },
+  "defaults": {
+    "casesPath": "./bench/external/webvoyager/cases.json",
+    "casesFilter": "site=booking,google-flights",
+    "model": "gpt-5.4",
+    "provider": "openai",
+    "baseUrl": "https://router.tangle.tools/v1",
+    "benchmarkProfile": "webvoyager",
+    "repetitions": 3,
+    "concurrency": 5,
+    "scenarioConcurrency": 2,
+    "seed": "2026-04-28",
+    "memoryIsolation": "per-run",
+    "modes": "fast-explore",
+    "tokenBudget": 300000,
+    "caseTimeoutMs": 300000
+  },
+  "control": {},
+  "hypotheses": [
+    {
+      "id": "calendar-month-nav-macro",
+      "name": "Calendar month-navigation macro for date pickers",
+      "rationale": "27/27 google-flights fails involve the date-picker dialog. ~10/15 booking fails involve the 'calendar only shows next 2 months' pattern \u2014 agent has to click 'next month' iteratively to reach Dec 2026, burning 6-8 turns per click sequence. Both sites use a similar pattern: open dialog \u2192 calendar shows current month \u2192 user must navigate to target month. A reusable macro 'navigateCalendarToMonth(year, month)' that detects month-name elements and clicks 'next' until target is visible would condense 6-8 turns into 1 macro call. Generalizable to ANY site with month-navigation calendars (Airbnb, Expedia, etc).",
+      "category": "architectural",
+      "expected_impact": "pass rate: +5-7pp overall; +30-50pp on flights+booking",
+      "risk": "Macro selector may be fragile across calendar implementations. Mitigation: detect month-name pattern via regex on visible text, fall back to manual navigation if not found.",
+      "priority": 99,
+      "treatment": {
+        "_note": "New macro in src/macros/ + register in skills/macros/. Detection: dialog containing month-name text. Action: click 'next month' button N times until target visible, click target date.",
+        "scope": "all sites that present month-navigation calendars"
+      },
+      "result": {
+        "cycle": 1,
+        "date": "2026-04-28",
+        "verdict": "reject (combined-with-26-and-27)",
+        "evidence": "Tested as prompt-rule addition (rules 25-27 added together to SYSTEM_PROMPT.HEAVY_PAGE_RULES). 57/79 hot-spot cases run before cred-fail abort. booking 18/40=45% (-17.5pp vs 62.5% baseline); flights 2/17=12% (-19pp vs 30.8% baseline). Combined treatment is unambiguously negative. Cannot isolate which rule caused the regression in this combined test; suspicion is on rule #27 (state-regression detection) which gave the agent permission to bail without offering a working alternative strategy.",
+        "next": "Re-test rule #25 in isolation. The micro-plan instruction may still be valid but needs validation without #27 contaminating the signal."
+      }
+    },
+    {
+      "id": "url-direct-when-params-known",
+      "name": "URL-direct construction when goal specifies all parameters",
+      "rationale": "Google Flights URL `?tfs=...` carries origin+dest+date+passengers. Booking URL `?ss=Paris&checkin=2026-12-01&checkout=2026-12-06&group_adults=2` does the same. When the goal specifies all required params (origin/destination/dates/count) the agent could construct the URL directly and skip the dialog entirely. Currently the brain only does navigate(URL) for the homepage. Adding a 'try URL-direct first, fall back to UI flow on failure' strategy bypasses the date-picker bottleneck completely.",
+      "category": "architectural",
+      "expected_impact": "pass rate: +3-5pp overall; concentrated on flights/booking/airbnb-style sites",
+      "risk": "Hand-coded URL templates per site is overfitting. Mitigation: ship as a 'site profile' system (declarative URL templates) rather than hardcoded conditionals; require goal-extracted params to all be present before attempting; on any non-200 or redirect-to-homepage, fall back to UI flow.",
+      "priority": 2,
+      "treatment": {
+        "_note": "New src/site-profiles/ with URL templates for top-failure sites + brain integration. Profile shape: {domain, urlTemplate, requiredParams, paramExtractors}.",
+        "scope": "domain-pattern-matched (booking.com, google.com/travel/flights, expedia.com, airbnb.com, etc.)"
+      }
+    },
+    {
+      "id": "dialog-aware-attention-focus",
+      "name": "Dialog-aware snapshot focusing",
+      "rationale": "When a modal dialog is open (date picker, occupancy selector), the snapshot still includes the entire underlying page DOM. This dilutes attention and inflates context \u2014 booking/flights dialog turns hit 200-300k tokens with vision on. Detecting an open dialog (role=dialog, aria-modal=true, or visible-blocker pattern) and FILTERING the snapshot to just the dialog subtree forces the agent to focus on the immediate task. Token cost drops + decision quality improves.",
+      "category": "architectural",
+      "expected_impact": "pass rate: +1-3pp overall; cost: -10-15% on dialog-heavy sites",
+      "risk": "Could miss state outside the dialog (toast notifications, page errors). Mitigation: keep a small 'page-context' header (URL, page title, top-level nav state) above the dialog snapshot.",
+      "priority": 3,
+      "treatment": {
+        "_note": "Modify src/observe/ to detect [role=dialog], [aria-modal=true], and z-index>1000 patterns; when detected, scope the DOM snapshot to dialog subtree + a 200-char page header.",
+        "scope": "all sites \u2014 improves dialog interaction broadly"
+      }
+    },
+    {
+      "id": "state-regression-detection",
+      "name": "State-regression detection + URL-recovery",
+      "rationale": "Booking specifically has a 'redirect to homepage with errorc_searchstring_not_found' pattern that wipes search context. The agent currently treats this as 'dom changed slightly' and continues \u2014 but the search state is gone. Detect 'URL changed unexpectedly to root domain' or 'expected entity missing from page' as a state-regression signal and trigger recovery (re-navigate to known-good URL with search params, or restart the search flow).",
+      "category": "bug-fix",
+      "expected_impact": "pass rate: +2-3pp overall (closes booking search-reset cluster)",
+      "risk": "False positives on legitimate redirects (e.g., login walls). Mitigation: only trigger when the regression is to root path AND the agent had progressed past root in a prior turn.",
+      "priority": 99,
+      "treatment": {
+        "_note": "Add to src/runner/recovery.ts: track URL progression, detect 'we were on /search/results then we got bounced to /', emit recovery event with 'state regressed; re-navigate to last-known-good URL'.",
+        "scope": "all sites \u2014 generic regression detection"
+      },
+      "result": {
+        "cycle": 1,
+        "date": "2026-04-28",
+        "verdict": "reject",
+        "evidence": "Implemented as prompt rule #27 in SYSTEM_PROMPT.HEAVY_PAGE_RULES. Combined treatment with #25/#26 produced -17.5pp on booking and -19pp on flights. The rule told the agent to \"switch strategy\" when the search state regressed, but the only fallback (URL-direct) is blocked on booking. Net effect: agent bails earlier than baseline because the rule gave it permission to give up. Reverted from main.",
+        "next": "A real fix needs a NEW recovery action (fresh-session retry, not just URL-direct fallback) before this rule is useful."
+      }
+    },
+    {
+      "id": "max-turns-by-flow-complexity",
+      "name": "Adaptive max-turns based on detected flow complexity",
+      "rationale": "21 fails are 'agent_gave_up_at_max_turns'. Current default is max_turns=15 per case across all sites. Booking flows are 12-20 turns (date pickers + filters + result extraction); single-page lookups are 3-5 turns. Adaptive sizing \u2014 start with 15, allow extension to 25 if the agent has shown verifiable progress in the last 3 turns (URL change AND DOM significantly different) \u2014 prevents premature aborts on hard cases without rewarding spinning agents.",
+      "category": "parameter-tuning",
+      "expected_impact": "pass rate: +2-4pp overall",
+      "risk": "Could waste budget on cases that genuinely can't succeed. Mitigation: extension only granted on verified progress signal; cap at 25 absolute.",
+      "priority": 50,
+      "treatment": {
+        "_note": "src/runner/runner.ts \u2014 track lastProgressTurn; when hitting maxTurns, if lastProgressTurn was within 3 turns ago, grant +5 extension up to 25 total.",
+        "scope": "all sites"
+      },
+      "result": {
+        "cycle": 1,
+        "date": "2026-04-28",
+        "verdict": "inconclusive",
+        "evidence": "Adaptive max-turns extension (15\u2192up-to-25 when last 3 turns showed progress) fired exactly 1 time in 57 cases. Most fails happen well before the cap (turn 9-14 with agent giving up cleanly), so the safety net never engages. Innocuous (no negative impact observed) but also no measurable positive impact. Code change kept on the branch as future-proofing.",
+        "next": "Pair with a hypothesis that addresses the EARLIER bail-out (e.g., a recovery action the agent can take when stuck), then re-test."
+      }
+    },
+    {
+      "id": "vision-only-on-spa-detection",
+      "name": "Vision-only mode on SPA detection",
+      "rationale": "Both booking and flights are React-heavy SPAs where the DOM snapshot is voluminous and unstable but the visual layout is consistent. For these specific sites, vision is more reliable than DOM-vision-hybrid. Detect SPA patterns (heavy use of data-* attributes, React/Vue runtime markers, lots of generated class names) and switch to vision-only observation. Reduces context size + makes the agent reason about what it sees, not what's in the DOM.",
+      "category": "architectural",
+      "expected_impact": "pass rate: +1-3pp on flights/booking; cost: -20% on SPA-heavy",
+      "risk": "Vision is less reliable for hidden/off-screen elements. Mitigation: only activate when DOM-snapshot-bytes > threshold AND turns-without-progress >= 2.",
+      "priority": 6,
+      "treatment": {
+        "_note": "src/observe/ \u2014 detect SPA via DOM heuristics; add vision-only fallback path when activation conditions met.",
+        "scope": "SPA-pattern-matched sites only"
+      }
+    },
+    {
+      "id": "cost-cap-domain-aware",
+      "name": "Domain-aware cost cap (300k \u2192 500k for vision-heavy)",
+      "rationale": "9 fails are cost_cap_exceeded. 4 of them are on flights/booking/espn where pages are inherently large \u00d7 vision-on context. Raising the cap to 500k for known vision-heavy domains trades ~$0.05 extra per case for closing 4-9 fails. Lowest-leverage hypothesis but cheap.",
+      "category": "parameter-tuning",
+      "expected_impact": "pass rate: +1-2pp; cost: +5% on listed domains, 0% elsewhere",
+      "risk": "Encourages runaway loops to burn more before the cap fires. Mitigation: pair with state-regression detection (#4) so loops are caught earlier.",
+      "priority": 7,
+      "treatment": {
+        "_note": "src/run-state.ts \u2014 domain-aware DEFAULT_TOKEN_BUDGET lookup table; default 300k, override 500k for {booking.com, google.com/travel/flights, espn.com}.",
+        "scope": "domain-listed sites only"
+      }
+    }
+  ],
+  "_cycle_results": {
+    "cycle1": {
+      "date": "2026-04-28",
+      "tested": [
+        "calendar-month-nav-macro",
+        "state-regression-detection",
+        "max-turns-by-flow-complexity"
+      ],
+      "verdict": "reject prompt rules; inconclusive on adaptive max-turns",
+      "baseline_combined": "37/79 = 46.8% (booking 25/40 + flights 12/39 from full WV-590 run)",
+      "treatment_combined": "20/57 = 35.1% (-11.7pp before cred-fail abort)",
+      "cost_usd": 32,
+      "lesson": "Adding prompt rules that tell the agent when to bail is dangerous if no better alternative is offered. Rule #27 specifically gave the agent permission to give up on booking without a working fallback. Going forward: prompt rules need to PAIR with concrete actions the agent can take."
+    }
+  }
+}
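
The `url-direct-when-params-known` hypothesis in the JSON above describes a declarative site profile and a "require all params before attempting" guard. A rough sketch of that idea follows, purely as illustration: src/site-profiles/ does not exist in this diff, `SiteProfile` and `buildDirectUrl` are hypothetical names, `paramExtractors` is omitted, the `{name}` placeholder syntax is invented for the example, and the `searchresults.html` path is an assumption (the source only gives the query string).

```typescript
// Hypothetical sketch; nothing here is part of the commit.
interface SiteProfile {
  domain: string;
  urlTemplate: string; // placeholders written as {name} (assumed syntax)
  requiredParams: string[];
}

const bookingProfile: SiteProfile = {
  domain: "booking.com",
  // Path is an assumption; the query parameters come from the hypothesis text.
  urlTemplate:
    "https://www.booking.com/searchresults.html" +
    "?ss={ss}&checkin={checkin}&checkout={checkout}&group_adults={group_adults}",
  requiredParams: ["ss", "checkin", "checkout", "group_adults"],
};

// Returns a URL only when every required parameter was extracted from the goal;
// returns null otherwise so the caller can fall back to the UI flow.
function buildDirectUrl(
  profile: SiteProfile,
  params: Record<string, string | undefined>,
): string | null {
  let url = profile.urlTemplate;
  for (const name of profile.requiredParams) {
    const value = params[name];
    if (value === undefined || value === "") return null;
    url = url.split(`{${name}}`).join(encodeURIComponent(value));
  }
  return url;
}

const direct = buildDirectUrl(bookingProfile, {
  ss: "Paris",
  checkin: "2026-12-01",
  checkout: "2026-12-06",
  group_adults: "2",
});
```

Returning `null` rather than a partially filled URL keeps the fallback decision with the caller, which matches the mitigation in the hypothesis: attempt URL-direct only when all goal-extracted params are present, and fall back to the UI flow otherwise.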
diff --git a/src/run-state.ts b/src/run-state.ts
@@ -54,6 +54,25 @@ export class RunState {
   // Gen 24b: checkpoint replay — save known-good URLs for rollback on wrong-path
   checkpoints: Array<{ url: string; turn: number }> = [];
 
+  /**
+   * Last turn at which the agent showed verifiable progress: URL changed,
+   * snapshot DOM materially changed, or evidence was extracted. Drives the
+   * 2026-04-28 adaptive-max-turns extension: the runner grants up to 5
+   * extra turns past the configured maxTurns if progress occurred within
+   * the last 3 turns, on the theory that an agent still making demonstrable
+   * progress near the cap should not be cut off arbitrarily.
+   *
+   * Initial value `-Infinity` so a brand-new run with zero observations
+   * cannot trigger the extension on its way in.
+   *
+   * Updated by:
+   *   - URL change between consecutive observe-completed events
+   *   - Snapshot byte delta > 5% of prior turn (filters noise from
+   *     decorative animations / dynamic IDs / timestamps)
+   *   - firstSufficientEvidenceTurn being set this turn
+   */
+  lastProgressTurn = -Infinity;
+
   readonly maxTotalErrors: number;
 
   /** Total LLM tokens (input + output + cache) charged to this case so far. */
diff --git a/src/runner/runner.ts b/src/runner/runner.ts
diff --git a/tests/run-state.test.ts b/tests/run-state.test.ts