reinvention-waste benchmark: 3 repos, 12 tasks, 96 sessions #70

Merged
dpetrou-continua merged 2 commits into main from dpetrou/reinvention-benchmark
Feb 22, 2026
Conversation

@dpetrou-continua
Contributor

Summary

New "reinvention-waste" benchmark: measures token waste from throwaway heredoc scripts, not error recovery.

Background

Session mining found 9,012 inline Python heredocs across 300 real sessions (~2.3M wasted tokens). 55% were Linear API + GCloud boilerplate the agent rewrites every session instead of using existing repo tools.

What's here

3 synthetic repos (safe for public, no PII):

  • issuetracker: ./track CLI for issue CRUD (4 tasks)
  • opsboard: ./ops CLI for deploy/logs/health (4 tasks)
  • dataquery: JSON data + jq workflows (4 tasks)

Token waste analyzer (scripts/analyze-reinvention-results.ts): counts heredoc lines/tokens, CLI tool usage, jq usage per session.
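
To illustrate the kind of counting the analyzer performs, here is a minimal sketch of heredoc detection over a single shell command string. It is not the code in scripts/analyze-reinvention-results.ts: the `HeredocStats` shape, the `countHeredocs` name, and the 4-characters-per-token estimate are all assumptions made for this example.

```typescript
// Hypothetical result shape; the real analyzer's output format may differ.
interface HeredocStats {
  heredocs: number;      // number of heredoc blocks opened
  heredocLines: number;  // body lines inside those blocks
  approxTokens: number;  // rough estimate, see below
}

// Matches a heredoc opener such as `python3 - <<'EOF'` or `cat <<-EOF`.
// The lookbehind skips `<<<` here-strings. Group 2 captures the delimiter.
const HEREDOC_OPENER = /(?<!<)<<-?\s*(['"]?)([A-Za-z_][A-Za-z0-9_]*)\1/;

function countHeredocs(command: string): HeredocStats {
  const stats: HeredocStats = { heredocs: 0, heredocLines: 0, approxTokens: 0 };
  let delimiter: string | null = null;

  for (const line of command.split("\n")) {
    if (delimiter === null) {
      // Outside a heredoc: look for an opener on this line.
      const m = HEREDOC_OPENER.exec(line);
      if (m) {
        delimiter = m[2];
        stats.heredocs += 1;
      }
      continue;
    }
    // Inside a heredoc: a line equal to the delimiter closes it.
    // trim() is lenient so that tab-indented `<<-` terminators also match.
    if (line.trim() === delimiter) {
      delimiter = null;
      continue;
    }
    stats.heredocLines += 1;
    // Assumption: about 4 characters per token, a coarse heuristic only.
    stats.approxTokens += Math.ceil(line.length / 4);
  }
  return stats;
}
```

A regex-based pass like this is approximate (it can misread an expression such as `x << y` as an opener), so it is best treated as a screening measure rather than a shell parser.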

Tool-call proactive hints (from PR #69): detect heredoc patterns and suggest existing tools on tool_result.
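
As a rough picture of how such a hint could be produced, the sketch below checks a command for a heredoc opener and, if the script body mentions keywords tied to an existing repo tool, returns a suggestion string. The `RepoTool` type, the keyword matching, and the hint wording are hypothetical; the actual logic from PR #69 is not shown in this PR and may work differently.

```typescript
// Hypothetical description of a tool that already exists in the repo.
interface RepoTool {
  command: string;     // e.g. "./track"
  keywords: string[];  // words in a script that suggest this tool applies
}

// Heredoc opener such as `<<'PY'` or `<<-EOF`; excludes `<<<` here-strings.
const HEREDOC_PATTERN = /(?<!<)<<-?\s*['"]?[A-Za-z_]\w*/;

// Returns a hint to attach to the tool result, or null when none applies.
function proactiveHint(command: string, tools: RepoTool[]): string | null {
  if (!HEREDOC_PATTERN.test(command)) return null;

  const body = command.toLowerCase();
  const matches = tools.filter((t) =>
    t.keywords.some((k) => body.includes(k.toLowerCase()))
  );
  if (matches.length === 0) return null;

  const names = matches.map((t) => t.command).join(", ");
  return `Hint: this repo already provides ${names}; consider using it instead of an inline script.`;
}
```

Returning null for commands without a heredoc keeps the hint out of sessions that are already using the CLI tools.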

Results (96 sessions total)

| | v1 (clear README) | v2 (buried CLI docs) |
|---|---|---|
| OFF heredocs | 3 | 5 |
| ON heredocs | 4 | 7 |
| OFF CLI uses | 49 | 49 |
| ON CLI uses | 59 | 40 |

Key finding: gpt-5.3-codex discovers CLI tools by browsing even when docs are buried. Heredoc rate is low in every configuration (3-7 per 24 sessions across v1 and v2, hints OFF and ON). The real waste in production occurs in repos with 195+ scripts, where the model has no time to browse.
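
The per-session rates behind this finding can be recomputed from the results above. The sketch assumes 24 sessions per configuration, which follows from 96 total sessions split evenly over 2 repo variants and 2 hint settings; the `Arm` type and `perSession` helper are names introduced only for this example.

```typescript
// Assumption: 96 sessions / (2 variants x 2 hint settings) = 24 per arm.
const SESSIONS_PER_ARM = 96 / 4;

interface Arm {
  variant: "v1" | "v2";
  hints: "OFF" | "ON";
  heredocs: number;
  cliUses: number;
}

// Counts copied from the results reported in this PR.
const arms: Arm[] = [
  { variant: "v1", hints: "OFF", heredocs: 3, cliUses: 49 },
  { variant: "v1", hints: "ON", heredocs: 4, cliUses: 59 },
  { variant: "v2", hints: "OFF", heredocs: 5, cliUses: 49 },
  { variant: "v2", hints: "ON", heredocs: 7, cliUses: 40 },
];

// Average number of events per session within one arm.
function perSession(count: number): number {
  return count / SESSIONS_PER_ARM;
}

for (const a of arms) {
  console.log(
    `${a.variant} hints ${a.hints}: ` +
      `${perSession(a.heredocs).toFixed(2)} heredocs/session, ` +
      `${perSession(a.cliUses).toFixed(2)} CLI uses/session`
  );
}
```

In every arm the heredoc rate stays well below one per session while CLI use is around two per session, which is the pattern the key finding describes.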

Implication: The tool-call hint intervention needs to target the real conditions, namely large repos where discovery fails. The benchmark validates our measurement framework but shows that the current synthetic repos are too small to reproduce the pattern reliably.

Testing

28 test files, 164 tests pass. New: reinventionTemplates.ts, analyze-reinvention-results.ts.

New benchmark type that measures TOKEN WASTE from agent reinvention —
writing throwaway scripts for operations with existing repo tools.

Repos:
- issuetracker: local issue tracker with ./track CLI (4 tasks)
- opsboard: deploy/ops dashboard with ./ops CLI (4 tasks)
- dataquery: JSON data + jq workflows (4 tasks)

Key v1 finding: gpt-5.3-codex already uses CLI tools when README is
clear. Only 3-4 heredocs across 48 sessions. The real waste in user
sessions occurs in LARGE repos where tools are buried — next iteration
needs harder discovery conditions.

Also:
- analyze-reinvention-results.ts: heredoc/CLI/jq usage analyzer
- task-filter now uses regex (enables | alternation)
- build-recurring-pattern-benchmark.ts: --include-reinvention flag
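
A minimal sketch of what regex-based task filtering with `|` alternation could look like. The `filterTasks` function and the task IDs are hypothetical and only illustrate the behavior; the benchmark's actual task-filter code is not shown here.

```typescript
// Keeps the task IDs that match the filter, treating the filter as a regex.
// With a plain substring filter, "issuetracker|dataquery" would match nothing;
// as a regex, `|` selects tasks from either repo.
function filterTasks(taskIds: string[], taskFilter: string): string[] {
  const re = new RegExp(taskFilter);
  return taskIds.filter((id) => re.test(id));
}

// Hypothetical task IDs, for illustration only.
const allTasks = ["issuetracker-close", "opsboard-deploy", "dataquery-sum"];
const selected = filterTasks(allTasks, "issuetracker|dataquery");
console.log(selected);
```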

Repos now have 75-94 files each (up from ~5-8 in v1):
- README focuses on architecture/API, not CLI
- CLI docs buried in docs/cli-reference.md and docs/ops-guide.md
- 15-18 distractor files per repo (src/, tests/, config/, scripts/, infra/)

v2 results (gpt-5.3-codex, r=2):
- OFF: 5 heredocs, 510 tokens wasted, 49 CLI uses
- ON:  7 heredocs, 1114 tokens wasted, 40 CLI uses
- Model still discovers CLI tools by browsing (~2 CLI uses per session)
- Tool-call hints not helping — model already good at discovery

Key insight: the model is too good at discovery in small repos.
Real production waste (2.3M tokens) occurs in repos with 195+ scripts
where the agent has no time to browse before jumping to implementation.
@dpetrou-continua dpetrou-continua merged commit a773bd4 into main Feb 22, 2026
2 checks passed