Commit a11477a

README: add reinvention waste benchmark results, update stats (#73)

- New section: reinvention waste benchmark (v3 baseline, v4 registry)
- Updated intro: 17 iterations, 800+ runs across two benchmark suites
- Updated stats: 13 repos, 52 tasks, 24 traps
- Key finding: AGENTS.md tool registry eliminates reinvention waste (0 heredocs, 2.8x CLI usage, -100% token waste)
1 parent 5ccb12f commit a11477a

1 file changed

Lines changed: 30 additions & 3 deletions

README.md

@@ -54,13 +54,18 @@ before the agent wastes steps rediscovering the fix.
 
 ## Where it helps (and where it doesn't)
 
-We ran 10 benchmark iterations (~400 runs) to find the right intervention
-design. Honest findings:
+We ran 17 benchmark iterations (~800+ runs across two benchmark suites) to
+find the right intervention designs. Honest findings:
 
 **Where it helps**: projects with undocumented setup steps, internal CLI tools,
 or error messages that point the wrong way. These are the cases where a model
 has no prior training data and can't infer the fix from repo files alone.
 
+**Where it also helps**: repos where agents waste tokens reinventing existing
+tools. Session mining found 9,012 throwaway scripts (~2.3M wasted tokens)
+across 300 sessions. A simple `AGENTS.md` tool registry eliminated this
+entirely (0 heredocs, 2.8x CLI usage).
+
 **Where it doesn't help**: well-documented projects, standard toolchain errors,
 or situations where the model can figure out the fix by reading `README.md` and
 exploring the repo. Modern models (gpt-5.3-codex) are surprisingly good at
@@ -265,7 +270,7 @@ the kinds of knowledge gaps models can't resolve from training data alone.
 - **Design**: A/B — each task runs OFF (no hints) and ON (hints enabled),
 interleaved, with 3 replicates per variant
 - **Metric**: wall-clock time, error count, and tool-call count per run
-- **Repos**: 10 synthetic Python projects, 40 tasks, 19 unique traps
+- **Repos**: 13 synthetic Python projects, 52 tasks, 24 unique traps
 - **Trap families**: undocumented tooling, misdirecting error messages,
 non-standard test setup, format-before-lint, build target syntax,
 hallucinated tool names
@@ -345,6 +350,28 @@ The 4 git-specific patterns (push conflicts, dirty rebase, worktree confusion)
 and the CI timeout pattern require git/CI infrastructure in the benchmark — a
 future improvement.
 
+### Reinvention waste benchmark (new)
+
+We discovered a second class of waste beyond error recovery: **agents writing
+throwaway scripts for operations that have existing repo tools.** Mining 300
+real Pi sessions revealed 9,012 inline Python heredocs (~2.3M wasted tokens),
+with 55% being Linear API and GCloud boilerplate rewritten every session.
+
+We built a separate benchmark to measure this — 3 synthetic repos
+(issuetracker, opsboard, dataquery) with 12 tasks, 151-191 files each, and
+existing CLI tools (`./track`, `./ops`, `jq`) buried in docs:
+
+| Version | Files/repo | Intervention | Heredocs (36 runs) | CLI usage | Token waste |
+|---|---|---|---|---|---|
+| v3 (baseline) | 151-191 | None | 9 | 59 | 1,048 |
+| v3 + hints | 151-191 | Tool-call hints only | 6 | 67 | 971 (−7%) |
+| **v4 (registry)** | 151-191 | **AGENTS.md tool registry** | **0** | **163 (2.8x)** | **0 (−100%)** |
+
+**The fix isn't an algorithm — it's making tools discoverable.** A 10-line
+markdown table in `AGENTS.md` mapping operations → CLI commands completely
+eliminated throwaway scripts and nearly tripled CLI usage. Cost: ~200 tokens
+in the system prompt. Savings: ~1,000+ tokens per session.
+
 ### What the data teaches
 
 1. **One comprehensive hint > many small hints.** When the agent hits
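
Review note: the derived figures in the new results table (2.8x CLI usage, −7% and −100% token waste) can be re-derived from the raw counts in the same rows. The sketch below does that, assuming the counts are exactly as they appear in the diff. The dictionary keys and the function name `relative_to_baseline` are illustrative and are not part of the repository. The table does not say whether "Token waste" is a total over the 36 runs or a per-run figure, so the sketch only compares each row against the baseline.

```python
# Raw counts copied from the reinvention-waste table in the diff (36 runs per row).
# Key names are illustrative; the README does not define a data format.
ROWS = {
    "v3 (baseline)": {"heredocs": 9, "cli_calls": 59, "token_waste": 1048},
    "v3 + hints": {"heredocs": 6, "cli_calls": 67, "token_waste": 971},
    "v4 (registry)": {"heredocs": 0, "cli_calls": 163, "token_waste": 0},
}


def relative_to_baseline(rows, baseline="v3 (baseline)"):
    """Return CLI-usage ratio and token-waste change (%) of each row vs. the baseline."""
    base = rows[baseline]
    derived = {}
    for name, row in rows.items():
        derived[name] = {
            # How many times more CLI calls than the baseline row.
            "cli_ratio": row["cli_calls"] / base["cli_calls"],
            # Signed percentage change in token waste vs. the baseline row.
            "waste_change_pct": 100.0
            * (row["token_waste"] - base["token_waste"])
            / base["token_waste"],
        }
    return derived


if __name__ == "__main__":
    for name, d in relative_to_baseline(ROWS).items():
        print(
            f"{name}: CLI usage x{d['cli_ratio']:.1f}, "
            f"token waste {d['waste_change_pct']:+.0f}%"
        )
```

Under these assumptions the table's rounded figures are consistent with its raw counts: 163/59 rounds to 2.8, and 971 against 1,048 is a drop of about 7%.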
