@@ -54,13 +54,18 @@ before the agent wastes steps rediscovering the fix.

 ## Where it helps (and where it doesn't)

-We ran 10 benchmark iterations (~400 runs) to find the right intervention
-design. Honest findings:
+We ran 17 benchmark iterations (~800+ runs across two benchmark suites) to
+find the right intervention designs. Honest findings:

 **Where it helps**: projects with undocumented setup steps, internal CLI tools,
 or error messages that point the wrong way. These are the cases where a model
 has no prior training data and can't infer the fix from repo files alone.

+**Where it also helps**: repos where agents waste tokens reinventing existing
+tools. Session mining found 9,012 throwaway scripts (~2.3M wasted tokens)
+across 300 sessions. A simple `AGENTS.md` tool registry eliminated this
+entirely (0 heredocs, 2.8x CLI usage).
+
 **Where it doesn't help**: well-documented projects, standard toolchain errors,
 or situations where the model can figure out the fix by reading `README.md` and
 exploring the repo. Modern models (gpt-5.3-codex) are surprisingly good at
@@ -265,7 +270,7 @@ the kinds of knowledge gaps models can't resolve from training data alone.
 - **Design**: A/B; each task runs OFF (no hints) and ON (hints enabled),
   interleaved, with 3 replicates per variant
 - **Metric**: wall-clock time, error count, and tool-call count per run
-- **Repos**: 10 synthetic Python projects, 40 tasks, 19 unique traps
+- **Repos**: 13 synthetic Python projects, 52 tasks, 24 unique traps
 - **Trap families**: undocumented tooling, misdirecting error messages,
   non-standard test setup, format-before-lint, build target syntax,
   hallucinated tool names
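
The interleaved A/B design in the list above could be scheduled roughly as follows. This is only a sketch: the source does not show how the harness orders runs, so the ordering (OFF then ON within each replicate) and the task IDs are assumptions made for illustration.

```python
from itertools import product

VARIANTS = ("OFF", "ON")  # OFF = no hints, ON = hints enabled
REPLICATES = 3            # replicates per variant, as stated in the design


def interleaved_schedule(tasks, variants=VARIANTS, replicates=REPLICATES):
    """Return (task, replicate, variant) tuples with the variants alternating.

    Each replicate runs every variant back to back, so the OFF and ON runs
    of a task sit next to each other in time rather than in separate batches.
    """
    return list(product(tasks, range(1, replicates + 1), variants))


# Hypothetical task IDs; the real benchmark has 52 tasks.
schedule = interleaved_schedule(["task-a", "task-b"])
for run in schedule[:4]:
    print(run)
```

Keeping the two variants adjacent means slow drift in API latency or machine load affects both arms about equally, which matters when wall-clock time is one of the metrics.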
@@ -345,6 +350,28 @@ The 4 git-specific patterns (push conflicts, dirty rebase, worktree confusion)
 and the CI timeout pattern require git/CI infrastructure in the benchmark, a
 future improvement.

+### Reinvention waste benchmark (new)
+
+We discovered a second class of waste beyond error recovery: **agents writing
+throwaway scripts for operations that have existing repo tools.** Mining 300
+real Pi sessions revealed 9,012 inline Python heredocs (~2.3M wasted tokens),
+with 55% being Linear API and GCloud boilerplate rewritten every session.
+
+We built a separate benchmark to measure this: 3 synthetic repos
+(issuetracker, opsboard, dataquery) with 12 tasks, 151-191 files each, and
+existing CLI tools (`./track`, `./ops`, `jq`) buried in docs:
+
+| Version | Files/repo | Intervention | Heredocs (36 runs) | CLI usage | Token waste |
+|---|---|---|---|---|---|
+| v3 (baseline) | 151-191 | None | 9 | 59 | 1,048 |
+| v3 + hints | 151-191 | Tool-call hints only | 6 | 67 | 971 (−7%) |
+| **v4 (registry)** | 151-191 | **AGENTS.md tool registry** | **0** | **163 (2.8x)** | **0 (−100%)** |
+
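
The relative figures in the table (−7%, 2.8x, −100%) can be recomputed from the raw counts. The snippet below is a small sanity check, not part of the benchmark tooling; the dictionary layout and function name are illustrative.

```python
# Rows copied from the table: (heredocs, CLI calls, token waste) over 36 runs.
RESULTS = {
    "v3 (baseline)": (9, 59, 1048),
    "v3 + hints": (6, 67, 971),
    "v4 (registry)": (0, 163, 0),
}


def relative_to_baseline(results, baseline="v3 (baseline)"):
    """Return (CLI-usage ratio, token-waste change in %) versus the baseline row."""
    _, base_cli, base_waste = results[baseline]
    summary = {}
    for name, (_, cli, waste) in results.items():
        cli_ratio = round(cli / base_cli, 1)
        waste_change_pct = round((waste - base_waste) / base_waste * 100)
        summary[name] = (cli_ratio, waste_change_pct)
    return summary


summary = relative_to_baseline(RESULTS)
for name, (cli_ratio, waste_change_pct) in summary.items():
    print(f"{name}: {cli_ratio}x CLI usage, {waste_change_pct}% token waste")
```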
+**The fix isn't an algorithm; it's making tools discoverable.** A 10-line
+markdown table in `AGENTS.md` mapping operations → CLI commands completely
+eliminated throwaway scripts and nearly tripled CLI usage. Cost: ~200 tokens
+in the system prompt. Savings: ~1,000+ tokens per session.
+
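
To make the idea of a tool registry concrete, here is a minimal sketch that renders an operations-to-commands table suitable for pasting into `AGENTS.md`. The tool names (`./track`, `./ops`, `jq`) come from the benchmark description, but the actual registry contents are not shown in the source, so every operation label and argument placeholder below is hypothetical.

```python
# Hypothetical registry entries: only the tool names are taken from the
# benchmark description; operations and arguments are placeholders.
REGISTRY = [
    ("List or update issues", "./track <subcommand>"),
    ("Inspect deployments", "./ops <subcommand>"),
    ("Query JSON data files", "jq '<filter>' <file.json>"),
]


def render_registry(entries):
    """Render (operation, command) pairs as a markdown table for AGENTS.md."""
    lines = ["| Operation | Command |", "|---|---|"]
    for operation, command in entries:
        lines.append(f"| {operation} | `{command}` |")
    return "\n".join(lines)


table = render_registry(REGISTRY)
print(table)
```

Generating the table from one list keeps the registry short and easy to regenerate when a tool is added, which fits the roughly 10-line size the section describes.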
 ### What the data teaches

 1. **One comprehensive hint > many small hints.** When the agent hits