reinvention-waste benchmark: 3 repos, 12 tasks, 96 sessions by dpetrou-continua · Pull Request #70 · continua-ai/happy-paths

dpetrou-continua · 2026-02-22T17:25:58Z

Summary

New "reinvention-waste" benchmark: measures token waste from throwaway heredoc scripts, not error recovery.

Background

Session mining found 9,012 inline Python heredocs across 300 real sessions (~2.3M wasted tokens). 55% were Linear API + GCloud boilerplate the agent rewrites every session instead of using existing repo tools.
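For scale, the mining figures above can be turned into per-session and per-heredoc averages. This is a back-of-the-envelope sketch: the inputs are the numbers quoted in this paragraph, the names `MiningTotals` and `summarize` are made up for illustration, and it assumes the 55% share refers to heredoc count rather than tokens.

```typescript
// Figures quoted in the Background paragraph; the token total is approximate.
interface MiningTotals {
  heredocs: number;
  sessions: number;
  wastedTokens: number;
  boilerplateShare: number;
}

const mined: MiningTotals = {
  heredocs: 9012,
  sessions: 300,
  wastedTokens: 2300000,
  boilerplateShare: 0.55,
};

function summarize(t: MiningTotals) {
  return {
    heredocsPerSession: t.heredocs / t.sessions,
    tokensPerHeredoc: t.wastedTokens / t.heredocs,
    tokensPerSession: t.wastedTokens / t.sessions,
    // Assumption: the 55% share is a share of heredoc count, not of tokens.
    boilerplateHeredocs: Math.round(t.heredocs * t.boilerplateShare),
  };
}

const summary = summarize(mined);
console.log(summary);
```

That works out to roughly 30 heredocs per session at about 255 tokens each.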

What's here

3 synthetic repos (safe for public, no PII):

issuetracker — ./track CLI for issue CRUD (4 tasks)
opsboard — ./ops CLI for deploy/logs/health (4 tasks)
dataquery — JSON data + jq workflows (4 tasks)

Token waste analyzer (scripts/analyze-reinvention-results.ts): counts heredoc lines/tokens, CLI tool usage, jq usage per session.
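A minimal sketch of what per-session counting could look like. This is not the code in scripts/analyze-reinvention-results.ts: the `ShellCommand` and `WasteStats` shapes, the heredoc regex, and the 4-characters-per-token estimate are all assumptions made for illustration.

```typescript
// Hypothetical shape of one shell command recorded in a session log.
interface ShellCommand {
  command: string;
}

interface WasteStats {
  heredocCount: number;
  heredocLines: number;
  estimatedTokens: number;
  cliUses: number;
  jqUses: number;
}

// Matches an inline Python heredoc opener such as `python3 - <<'EOF'`.
const HEREDOC_OPEN = /\bpython3?\b[^\n]*<<-?[ \t]*['"]?(\w+)['"]?/;

// Assumption: a rough heuristic of about 4 characters per token.
const CHARS_PER_TOKEN = 4;

// Returns the heredoc body, or null if the command has no Python heredoc.
function heredocBody(command: string): string | null {
  let delimiter: string | null = null;
  const body: string[] = [];
  for (const line of command.split("\n")) {
    if (delimiter === null) {
      const match = HEREDOC_OPEN.exec(line);
      if (match) delimiter = match[1];
    } else if (line.trim() === delimiter) {
      break;
    } else {
      body.push(line);
    }
  }
  return delimiter === null ? null : body.join("\n");
}

function analyzeSession(commands: ShellCommand[], cliTools: string[]): WasteStats {
  const stats: WasteStats = {
    heredocCount: 0,
    heredocLines: 0,
    estimatedTokens: 0,
    cliUses: 0,
    jqUses: 0,
  };
  for (const { command } of commands) {
    const body = heredocBody(command);
    if (body !== null) {
      stats.heredocCount += 1;
      stats.heredocLines += body === "" ? 0 : body.split("\n").length;
      stats.estimatedTokens += Math.ceil(body.length / CHARS_PER_TOKEN);
      continue;
    }
    // A CLI use is a command that starts with one of the repo tools.
    if (cliTools.some((tool) => command.trim().indexOf(tool) === 0)) stats.cliUses += 1;
    if (/\bjq\b/.test(command)) stats.jqUses += 1;
  }
  return stats;
}

const stats = analyzeSession(
  [
    { command: "python3 - <<'EOF'\nimport json\nprint(json.dumps({}))\nEOF" },
    { command: "./track list --status open" },
    { command: "cat data/orders.json | jq '.[] | .id'" },
  ],
  ["./track", "./ops"],
);
console.log(stats);
```

The sample commands and their arguments are invented; only the `./track`, `./ops`, and jq tool names come from the benchmark repos.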

Tool-call proactive hints (from PR #69): detect heredoc patterns and suggest existing tools on tool_result.
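The hint mechanism itself lives in PR #69 and is not shown here. As a rough illustration of the idea only, the sketch below appends a suggestion to a tool result when the command contains a heredoc that looks like it duplicates an existing tool. The rule list, keyword patterns, hint wording, and the `hintForCommand` / `withHint` names are all hypothetical.

```typescript
// Hypothetical rule: a keyword pattern that suggests reinvention, plus the tool to suggest.
interface ToolHintRule {
  pattern: RegExp;
  suggestion: string;
}

// Any shell heredoc opener, e.g. `<<EOF`, `<<'PY'`, `<<-EOF`.
const HEREDOC = /<<-?[ \t]*['"]?\w+['"]?/;

// Illustrative rules only; real rules would be derived from the repo's tools.
const RULES: ToolHintRule[] = [
  { pattern: /issues?\b/i, suggestion: "./track" },
  { pattern: /deploy|logs?|health/i, suggestion: "./ops" },
  { pattern: /json\.load/, suggestion: "jq" },
];

// Returns a hint string if the command is a heredoc that matches a rule, else null.
function hintForCommand(command: string): string | null {
  if (!HEREDOC.test(command)) return null;
  for (const rule of RULES) {
    if (rule.pattern.test(command)) {
      return `Hint: this repo already provides ${rule.suggestion}; consider using it instead of an inline script.`;
    }
  }
  return null;
}

// Appends the hint (if any) to the tool result returned to the model.
function withHint(command: string, toolResult: string): string {
  const hint = hintForCommand(command);
  return hint === null ? toolResult : `${toolResult}\n\n${hint}`;
}

const hinted = withHint(
  "python3 - <<'EOF'\nimport json\nissues = json.load(open('issues.json'))\nEOF",
  "ok",
);
console.log(hinted);
```

Commands without a heredoc pass through unchanged, so the hint only fires on the pattern the benchmark is trying to measure.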

Results (96 sessions total)

Hints / metric	v1 (clear README)	v2 (buried CLI docs)
OFF heredocs	3	5
ON heredocs	4	7
OFF CLI uses	49	49
ON CLI uses	59	40

Key finding: gpt-5.3-codex discovers CLI tools by browsing even when docs are buried. Heredoc counts are low in both variants: 3-4 per 24-session arm in v1 and 5-7 in v2. The real waste in production occurs in repos with 195+ scripts, where the model has no time to browse.
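To put the table on a per-session footing, the counts can be divided by the number of sessions in each cell. The sketch below assumes the 96 sessions split evenly into four cells of 24 (two variants, hints OFF and ON); the `Cell` type and `perSession` helper are illustrative names, not part of the analyzer.

```typescript
interface Cell {
  variant: "v1" | "v2";
  hints: "OFF" | "ON";
  heredocs: number;
  cliUses: number;
}

// Assumption: 96 sessions split evenly across 2 variants x 2 hint settings.
const SESSIONS_PER_CELL = 24;

// Counts copied from the results table.
const cells: Cell[] = [
  { variant: "v1", hints: "OFF", heredocs: 3, cliUses: 49 },
  { variant: "v1", hints: "ON", heredocs: 4, cliUses: 59 },
  { variant: "v2", hints: "OFF", heredocs: 5, cliUses: 49 },
  { variant: "v2", hints: "ON", heredocs: 7, cliUses: 40 },
];

// Per-session rate, rounded to two decimals.
function perSession(count: number): number {
  return Math.round((count / SESSIONS_PER_CELL) * 100) / 100;
}

for (const cell of cells) {
  console.log(
    `${cell.variant} hints ${cell.hints}: ` +
      `${perSession(cell.heredocs)} heredocs/session, ` +
      `${perSession(cell.cliUses)} CLI uses/session`,
  );
}
```

Under that assumption every cell stays below one heredoc per three sessions, while CLI use sits at roughly two calls per session.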

Implication: The tool-call hint intervention needs to target the REAL conditions — large repos where discovery fails. The benchmark validates our measurement framework but shows current synthetic repos are too small to reproduce the pattern reliably.

Testing

28 test files, 164 tests pass. New: reinventionTemplates.ts, analyze-reinvention-results.ts.

New benchmark type that measures TOKEN WASTE from agent reinvention: writing throwaway scripts for operations with existing repo tools.

Repos:
- issuetracker: local issue tracker with ./track CLI (4 tasks)
- opsboard: deploy/ops dashboard with ./ops CLI (4 tasks)
- dataquery: JSON data + jq workflows (4 tasks)

Key v1 finding: gpt-5.3-codex already uses CLI tools when README is clear. Only 3-4 heredocs across 48 sessions. The real waste in user sessions occurs in LARGE repos where tools are buried; the next iteration needs harder discovery conditions.

Also:
- analyze-reinvention-results.ts: heredoc/CLI/jq usage analyzer
- task-filter now uses regex (enables | alternation)
- build-recurring-pattern-benchmark.ts: --include-reinvention flag
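The regex-based task filter mentioned in this commit message can be pictured as follows. This is a sketch of the general technique, not the repo's implementation: `filterTasks` and the task IDs are hypothetical.

```typescript
// Keeps the task IDs that match the filter, treated as a regular expression.
function filterTasks(taskIds: string[], filter: string): string[] {
  const pattern = new RegExp(filter);
  return taskIds.filter((id) => pattern.test(id));
}

// Hypothetical task IDs, one per benchmark repo.
const taskIds = [
  "issuetracker-close-issue",
  "opsboard-check-health",
  "dataquery-top-customers",
];

// `|` alternation selects tasks from two repos in one filter.
const selected = filterTasks(taskIds, "issuetracker|opsboard");
console.log(selected);
```

A plain substring filter could only select one repo at a time; treating the filter as a regex allows one run to cover several.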

Repos now have 75-94 files each (up from ~5-8 in v1):
- README focuses on architecture/API, not CLI
- CLI docs buried in docs/cli-reference.md and docs/ops-guide.md
- 15-18 distractor files per repo (src/, tests/, config/, scripts/, infra/)

v2 results (gpt-5.3-codex, r=2):
- OFF: 5 heredocs, 510 tokens wasted, 49 CLI uses
- ON: 7 heredocs, 1114 tokens wasted, 40 CLI uses
- Model still discovers CLI tools by browsing (~2 CLI uses per session)
- Tool-call hints not helping: model already good at discovery

Key insight: the model is too good at discovery in small repos. Real production waste (2.3M tokens) occurs in repos with 195+ scripts where the agent has no time to browse before jumping to implementation.

dpetrou-continua added 2 commits February 22, 2026 11:50

dpetrou-continua merged commit a773bd4 into main Feb 22, 2026
2 checks passed

dpetrou-continua merged 2 commits into main from dpetrou/reinvention-benchmark
