reinvention-waste benchmark: 3 repos, 12 tasks, 96 sessions #70
Merged
New benchmark type that measures TOKEN WASTE from agent reinvention: writing throwaway scripts for operations that existing repo tools already cover.

Repos:
- issuetracker: local issue tracker with ./track CLI (4 tasks)
- opsboard: deploy/ops dashboard with ./ops CLI (4 tasks)
- dataquery: JSON data + jq workflows (4 tasks)

Key v1 finding: gpt-5.3-codex already uses CLI tools when README is clear. Only 3-4 heredocs across 48 sessions. The real waste in user sessions occurs in LARGE repos where tools are buried, so the next iteration needs harder discovery conditions.

Also:
- analyze-reinvention-results.ts: heredoc/CLI/jq usage analyzer
- task-filter now uses regex (enables | alternation)
- build-recurring-pattern-benchmark.ts: --include-reinvention flag
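For readers unfamiliar with the task-filter change, here is a minimal sketch of what regex-based filtering with `|` alternation could look like. It is not the code from this PR: the `Task` shape, the `filterTasks` name, and the task ids are hypothetical and chosen only for illustration.

```typescript
// Hypothetical task shape; the real benchmark's task type is not shown in this PR.
interface Task {
  id: string;
}

// Keep only the tasks whose id matches the filter, interpreted as a regular expression.
// An empty filter keeps every task.
function filterTasks(tasks: Task[], filter: string): Task[] {
  if (filter === "") return tasks;
  const pattern = new RegExp(filter);
  return tasks.filter((task) => pattern.test(task.id));
}

// Illustrative task ids, one per synthetic repo.
const tasks: Task[] = [
  { id: "issuetracker-close-issue" },
  { id: "opsboard-deploy-status" },
  { id: "dataquery-top-customers" },
];

// As a plain substring this filter would match nothing; as a regex it selects two repos.
const selected = filterTasks(tasks, "issuetracker|dataquery");
```

With plain substring matching, selecting tasks from two repos would take two separate runs; alternation lets one filter cover both.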
Repos now have 75-94 files each (up from ~5-8 in v1):
- README focuses on architecture/API, not CLI
- CLI docs buried in docs/cli-reference.md and docs/ops-guide.md
- 15-18 distractor files per repo (src/, tests/, config/, scripts/, infra/)

v2 results (gpt-5.3-codex, r=2):
- OFF: 5 heredocs, 510 tokens wasted, 49 CLI uses
- ON: 7 heredocs, 1114 tokens wasted, 40 CLI uses
- Model still discovers CLI tools by browsing (~2 CLI uses per session)
- Tool-call hints not helping; model already good at discovery

Key insight: the model is too good at discovery in small repos. Real production waste (2.3M tokens) occurs in repos with 195+ scripts where the agent has no time to browse before jumping to implementation.
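To make the heredoc, token, and CLI counts above concrete, the following is a rough sketch of how a per-session analyzer might derive them from the shell commands an agent ran. It is not the contents of analyze-reinvention-results.ts: the `SessionWaste` fields, the `analyzeCommands` name, the input shape (one string per shell command), and the 4-characters-per-token estimate are all assumptions.

```typescript
// Hypothetical per-session summary; field names are illustrative only.
interface SessionWaste {
  heredocCount: number;
  heredocLines: number;
  estimatedTokens: number;
  cliUses: number;
  jqUses: number;
}

// Matches the opening of a shell heredoc such as `python3 - <<'PY'` or `cat <<EOF`.
const HEREDOC_START = /<<-?\s*(['"]?)([A-Za-z_][A-Za-z0-9_]*)\1/;

function analyzeCommands(commands: string[], cliTools: string[]): SessionWaste {
  const summary: SessionWaste = {
    heredocCount: 0,
    heredocLines: 0,
    estimatedTokens: 0,
    cliUses: 0,
    jqUses: 0,
  };
  for (const command of commands) {
    const lines = command.split("\n");
    // Look for the first heredoc in the command and measure its body.
    for (let i = 0; i < lines.length; i++) {
      const match = HEREDOC_START.exec(lines[i]);
      if (!match) continue;
      const delimiter = match[2];
      const rest = lines.slice(i + 1);
      const end = rest.findIndex((line) => line.trim() === delimiter);
      const body = end === -1 ? rest : rest.slice(0, end);
      summary.heredocCount += 1;
      summary.heredocLines += body.length;
      // Assumption: roughly 4 characters per token; a real tokenizer would differ.
      summary.estimatedTokens += Math.ceil(body.join("\n").length / 4);
      break;
    }
    // Count repo CLI and jq usage from the words on the command's first line.
    const words = lines[0].split(/\s+/);
    if (words.some((word) => cliTools.includes(word))) summary.cliUses += 1;
    if (words.includes("jq")) summary.jqUses += 1;
  }
  return summary;
}

// Illustrative session: one CLI call, one throwaway Python heredoc, one jq pipeline.
const waste = analyzeCommands(
  [
    "./track list --status open",
    "python3 - <<'PY'\nimport json\nprint(json.load(open('issues.json')))\nPY",
    "cat data/orders.json | jq '.[] | .total'",
  ],
  ["./track", "./ops"],
);
```

The sketch only counts the first heredoc per command and only inspects the first line for tool names, which keeps it short but would undercount chained commands.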
Summary
New "reinvention-waste" benchmark: measures token waste from throwaway heredoc scripts, not error recovery.
Background
Session mining found 9,012 inline Python heredocs across 300 real sessions (~2.3M wasted tokens). 55% were Linear API + GCloud boilerplate the agent rewrites every session instead of using existing repo tools.
What's here
3 synthetic repos (safe for public, no PII):
- issuetracker: ./track CLI for issue CRUD (4 tasks)
- opsboard: ./ops CLI for deploy/logs/health (4 tasks)
- dataquery: JSON data + jq workflows (4 tasks)
Token waste analyzer (scripts/analyze-reinvention-results.ts): counts heredoc lines/tokens, CLI tool usage, jq usage per session.
Tool-call proactive hints (from PR #69): detect heredoc patterns and suggest existing tools on tool_result.
Results (96 sessions total)
Key finding: gpt-5.3-codex discovers CLI tools by browsing even when docs are buried. Heredoc rate is low in both variants (5-7 per 24 sessions). The real waste in production occurs in repos with 195+ scripts where the model has no time to browse.
Implication: The tool-call hint intervention needs to target the REAL conditions — large repos where discovery fails. The benchmark validates our measurement framework but shows current synthetic repos are too small to reproduce the pattern reliably.
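For context on the hint intervention discussed here, this is a minimal sketch of the general idea: spot an inline heredoc script in a shell command and, if a known repo tool looks relevant, return a hint that could be appended to the tool result. It does not reproduce the implementation from PR #69; the `RepoTool` registry, the keyword matching, the `heredocHint` name, and the hint wording are all assumptions.

```typescript
// Hypothetical registry entry: a repo tool plus keywords suggesting it applies.
interface RepoTool {
  command: string;
  keywords: string[];
}

// Returns a hint to attach to the tool result, or null when no hint applies.
function heredocHint(shellCommand: string, tools: RepoTool[]): string | null {
  // Only inline heredoc scripts are candidates; plain commands get no hint.
  if (!/<<-?\s*['"]?[A-Za-z_]/.test(shellCommand)) return null;
  const lower = shellCommand.toLowerCase();
  const tool = tools.find((t) => t.keywords.some((k) => lower.includes(k)));
  if (!tool) return null;
  return `Hint: this repo already provides ${tool.command}; consider it before writing an inline script.`;
}

// Illustrative registry based on the two CLIs named in this PR.
const repoTools: RepoTool[] = [
  { command: "./track", keywords: ["issue"] },
  { command: "./ops", keywords: ["deploy", "health"] },
];

// A throwaway script that reads issue data triggers a hint; a direct CLI call does not.
const hint = heredocHint(
  "python3 - <<'PY'\nimport json\nprint(len(json.load(open('issues.json'))))\nPY",
  repoTools,
);
const noHint = heredocHint("./ops health", repoTools);
```

In a repo with 195+ scripts, a keyword registry like this would need to be generated rather than written by hand, which is part of why the small synthetic repos are an easy case.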
Testing
28 test files, 164 tests pass. New: reinventionTemplates.ts, analyze-reinvention-results.ts.