Commit fbb43d8

README: lead with two wins + thesis narrative (#76)
1 parent 6832e74 commit fbb43d8

1 file changed

Lines changed: 32 additions & 18 deletions

README.md

@@ -52,29 +52,43 @@ detour.
 Happy Paths remembers what worked and intervenes at the moment of failure,
 before the agent wastes steps rediscovering the fix.
 
-## Where it helps (and where it doesn't)
+## Two wins, one thesis
 
-We ran 17 benchmark iterations (~800+ runs across two benchmark suites) to
-find the right intervention designs. Honest findings:
+We ran 17 benchmark iterations (~1,000+ runs across three suites) to find
+what actually works. The thesis: **Happy Paths doesn't make smart models
+smarter at things they already know. It makes undiscoverable things
+discoverable.**
 
-**Where it helps**: projects with undocumented setup steps, internal CLI tools,
-or error messages that point the wrong way. These are the cases where a model
-has no prior training data and can't infer the fix from repo files alone.
+### Win 1: Tool registry eliminates reinvention waste (−100%)
 
-**Where it also helps**: repos where agents waste tokens reinventing existing
-tools. Session mining found 9,012 throwaway scripts (~2.3M wasted tokens)
-across 300 sessions. A simple `AGENTS.md` tool registry eliminated this
-entirely (0 heredocs, 2.8x CLI usage).
+Mining 300 real sessions revealed 9,012 throwaway inline scripts (~2.3M
+wasted tokens). Agents kept rewriting the same Linear API / GCloud boilerplate
+because existing tools weren't discoverable. A 10-line markdown table in
+`AGENTS.md` fixed it completely:
 
-**Where it doesn't help**: well-documented projects, standard toolchain errors,
-or situations where the model can figure out the fix by reading `README.md` and
-exploring the repo. Modern models (gpt-5.3-codex) are surprisingly good at
-`ls → find → read → execute` discovery loops.
+| Metric | Without registry | With registry |
+|---|---|---|
+| Throwaway heredocs (36 runs) | 9 | **0** |
+| CLI tool usage | 59 | **163 (2.8×)** |
+| Wasted tokens | 1,048 | **0** |
+
+Cost: ~200 tokens in the system prompt. Savings: ~1,000+ per session.
+
+### Win 2: Error-time hints save 4–11% on undocumented repos
+
+When a repo has no README and the only way to run tests is an undocumented
+CLI tool, hints at the moment of error give the agent a direct path:
+
+| Repo | What's missing | Δ time |
+|---|---|---|
+| ledgerkit | No README, `./kit` CLI undiscoverable | **−11%** |
+| logparse | No README, `./qa` CLI undiscoverable | **−4%** |
+
+### Where it doesn't help
 
-**What actively hurts**: injecting too many hints, injecting hints too early
-(before the agent has context), or injecting generic "prior failure" warnings.
-More is not better — one precise hint at the right moment beats three hints
-across three errors.
+- **Well-documented repos** (toolhub +10%, monobuild +7%): agent reads README
+- **Standard errors** (git push conflicts +10%, venv setup): model already knows
+- **Too many hints** or **hints injected too early**: adds noise, net-harmful
 
 See [Benchmark results](#benchmark-results) below for the full data.
 
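A quick arithmetic check of the Win 1 figures, as a minimal Python sketch. The numbers are copied from the added README lines. The function names are illustrative, and treating the ~200-token registry cost as paid once per session is an assumption, since the README does not say how the savings figure was computed.

```python
# Figures copied from the "Win 1" table and the cost line in the added README text.
CLI_CALLS_WITHOUT_REGISTRY = 59
CLI_CALLS_WITH_REGISTRY = 163
REGISTRY_COST_TOKENS = 200       # "~200 tokens in the system prompt"
SAVINGS_PER_SESSION = 1_000      # lower bound of "~1,000+ per session"


def usage_ratio(before: int, after: int) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


def net_savings(savings: int, cost: int) -> int:
    """Tokens saved per session after paying the registry's prompt cost.

    Assumption (not stated in the README): the cost is paid once per session.
    """
    return savings - cost


ratio = usage_ratio(CLI_CALLS_WITHOUT_REGISTRY, CLI_CALLS_WITH_REGISTRY)
print(f"{ratio:.1f}x")                                          # 2.8x
print(net_savings(SAVINGS_PER_SESSION, REGISTRY_COST_TOKENS))   # 800
```

The 2.8× in the table matches 163 / 59 after rounding. At the stated lower bound, and under the once-per-session assumption, the registry recovers roughly five times its prompt cost.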
