@@ -52,29 +52,43 @@ detour.
 Happy Paths remembers what worked and intervenes at the moment of failure,
 before the agent wastes steps rediscovering the fix.
 
-## Where it helps (and where it doesn't)
+## Two wins, one thesis
 
-We ran 17 benchmark iterations (~800+ runs across two benchmark suites) to
-find the right intervention designs. Honest findings:
+We ran 17 benchmark iterations (~1,000+ runs across three suites) to find
+what actually works. The thesis: **Happy Paths doesn't make smart models
+smarter at things they already know. It makes undiscoverable things
+discoverable.**
 
-**Where it helps**: projects with undocumented setup steps, internal CLI tools,
-or error messages that point the wrong way. These are the cases where a model
-has no prior training data and can't infer the fix from repo files alone.
+### Win 1: Tool registry eliminates reinvention waste (−100%)
 
-**Where it also helps**: repos where agents waste tokens reinventing existing
-tools. Session mining found 9,012 throwaway scripts (~2.3M wasted tokens)
-across 300 sessions. A simple `AGENTS.md` tool registry eliminated this
-entirely (0 heredocs, 2.8x CLI usage).
+Mining 300 real sessions revealed 9,012 throwaway inline scripts (~2.3M
+wasted tokens). Agents kept rewriting the same Linear API / GCloud boilerplate
+because existing tools weren't discoverable. A 10-line markdown table in
+`AGENTS.md` fixed it completely:
 
-**Where it doesn't help**: well-documented projects, standard toolchain errors,
-or situations where the model can figure out the fix by reading `README.md` and
-exploring the repo. Modern models (gpt-5.3-codex) are surprisingly good at
-`ls → find → read → execute` discovery loops.
+| Metric | Without registry | With registry |
+| --- | --- | --- |
+| Throwaway heredocs (36 runs) | 9 | **0** |
+| CLI tool usage | 59 | **163 (2.8×)** |
+| Wasted tokens | 1,048 | **0** |
+
+Cost: ~200 tokens in the system prompt. Savings: ~1,000+ per session.
+
+### Win 2: Error-time hints save 4–11% on undocumented repos
+
+When a repo has no README and the only way to run tests is an undocumented
+CLI tool, hints at the moment of error give the agent a direct path:
+
+| Repo | What's missing | Δ time |
+| --- | --- | --- |
+| ledgerkit | No README, `./kit` CLI undiscoverable | **−11%** |
+| logparse | No README, `./qa` CLI undiscoverable | **−4%** |
+
+### Where it doesn't help
 
-**What actively hurts**: injecting too many hints, injecting hints too early
-(before the agent has context), or injecting generic "prior failure" warnings.
-More is not better — one precise hint at the right moment beats three hints
-across three errors.
+- **Well-documented repos** (toolhub +10%, monobuild +7%): agent reads README
+- **Standard errors** (git push conflicts +10%, venv setup): model already knows
+- **Too many hints** or **hints injected too early**: adds noise, net-harmful
 
 See [Benchmark results](#benchmark-results) below for the full data.
 