scripts/paper/blog-post-draft.md (+13 −7)
@@ -79,19 +79,25 @@ where *m* is a model cost multiplier (Haiku = 0.25×, Sonnet = 1.0×, Opus = 5.0×)

## Initial results

-After deploying the Auditor and Optimizer across twelve production workflows in the `gh-aw` project, we downloaded token usage artifacts from runs before and after each optimization to measure actual impact in effective tokens (ET). Seven of the nine implemented optimizations have enough post-fix run history to compare:
+After deploying the Auditor and Optimizer across twelve production workflows in the `gh-aw` project, we downloaded token usage artifacts from runs before and after each optimization to measure actual impact in effective tokens (ET). Nine of the twelve workflows received optimizer-recommended changes. We include results only for workflows with at least four runs in both the pre- and post-optimization periods; three optimized workflows (Daily Syntax Error Quality, Glossary Maintainer, and Test Quality Sentinel) were excluded because token tracking via the API proxy started in early April and the workflows were optimized within days of the first instrumented run, leaving fewer than four baseline data points.
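For reference, a minimal sketch of how ET could be computed from the definition above, assuming ET = *m* × (input tokens + 4 × output tokens); the 4× output weight and the per-model multiplier *m* are stated in the post, but the exact formula is defined earlier, so treat this as an illustration rather than the project's implementation:

```python
# Minimal sketch of the effective-token (ET) metric, assuming
# ET = m * (input_tokens + 4 * output_tokens). The 4x output weight and
# the per-model multiplier m are stated in the post; the exact formula
# may include terms not shown here.

MODEL_MULTIPLIER = {"haiku": 0.25, "sonnet": 1.0, "opus": 5.0}

def effective_tokens(input_tokens: int, output_tokens: int, model: str) -> float:
    """Scale input plus 4x-weighted output by the model cost multiplier."""
    return MODEL_MULTIPLIER[model] * (input_tokens + 4 * output_tokens)

# Example: a Sonnet run with 1M input and 50K output tokens -> 1.2M ET
print(effective_tokens(1_000_000, 50_000, "sonnet"))
```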

-
+The four workflows with sufficient data show a range of outcomes:

-The improvements range from modest (Daily Community Attribution, −28%) to dramatic (Auto-Triage Issues, −81%). The variation reflects the nature of the fix applied: simple toolset pruning saves less than eliminating whole categories of work. It is important to note that the Daily Compiler Quality average is inflated by a post-fix outlier run (7.68M ET). The remaining 5 runs average 2.41M ET (−30%) and the median for all post-fix runs was 2.6M ET (−26% compared to the pre-fix average).
+

-From these results, we highlight three patterns that account for most of the gains.
+Auto-Triage Issues shows a clear, sustained reduction of 44% across 62 post-fix runs. Daily Compiler Quality shows a modest 5% improvement that is difficult to separate from workload variation. Daily Community Attribution shows no meaningful change, and Contribution Check shows a slight increase (+9%), which we examine below.

-**A single misconfigured rule can cause runaway loops.** The most extreme case was Daily Syntax Error Quality at 10.4M ET per run—roughly 6× the project average. The root cause was a one-line misconfiguration: the workflow copied test files to `/tmp/` then called `gh aw compile *`, but the sandbox's bash allowlist only permitted relative-path glob patterns. Every compile attempt was blocked. Unable to use the tool it needed, the agent fell into a 64-turn fallback loop—manually reading source code to reconstruct what the compiler would have told it. One fix to the allowed bash patterns dropped consumption to 6.27M ET (−40%). It's still high because the workflow itself tests many syntax error cases, but the runaway loop is gone.
+**Run frequency matters as much as per-run savings.** Auto-Triage Issues fires on every new issue—averaging 6.5 runs per day with peaks of 15—while the daily quality checks run once per day. Contribution Check fires on every PR at about 4 runs per day. A 44% reduction at 6.5 runs/day compounds quickly: over the observation period, Auto-Triage's optimization saved roughly 3.2M ET in aggregate, dwarfing the other workflows combined. When prioritizing which workflows to optimize, run frequency is at least as important as per-run consumption.
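As a rough consistency check on that aggregate figure, using only the numbers stated above (the implied pre-fix baseline is derived from those numbers, not measured independently):

```python
# Back-of-envelope check on the aggregate savings using only the figures
# stated in the text; the implied per-run baseline is derived, not measured.
post_fix_runs = 62
aggregate_savings_et = 3_200_000  # ~3.2M ET saved over the observation period
reduction = 0.44                  # the 44% per-run reduction

saved_per_run = aggregate_savings_et / post_fix_runs  # ~52K ET per run
implied_baseline = saved_per_run / reduction          # ~117K ET per run pre-fix
print(f"~{saved_per_run:,.0f} ET saved per run; implied pre-fix baseline ~{implied_baseline:,.0f} ET")
```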

-**Unused tools are expensive to carry.** The Glossary Maintainer workflow was spending 4.27M ET per run—and a single tool dominated: `search_repositories`, called **342 times in one run**, accounting for 58% of all tool calls. The tool came in as part of the default toolset but was completely unnecessary for a workflow that only scans local file changes. Removing it dropped average consumption to 2.32M ET (−46%). The Daily Community Attribution workflow showed the inverse: it was configured with eight GitHub MCP tools and made **zero calls to any of them** across an entire run while still spending 6.75M ET.
+This range is itself an important finding. Not every optimization the agent recommends translates into measurable ET reduction, especially over short observation windows on a live repository where workload varies day to day. The workflow with the strongest signal is the one where the optimization eliminated a clearly pathological behavior rather than shaving a few percent off normal operation.

-**Many agent turns are deterministic data-gathering.** Contribution Check and Test Quality Sentinel show the largest proportional gains (−55% and −66%) because their inefficiency was structural: 50–96% of agent turns were spent on reads that required no inference, such as fetching PR diffs, listing changed files, and parsing a repository's `CONTRIBUTING.md`. Moving those reads into pre-agentic `gh` CLI steps before the agent starts eliminated the majority of the LLM work. Test Quality Sentinel went from 1.15M ET to 400K ET; Contribution Check from 3.2M to 1.43M.
+From these results and the excluded workflows, we highlight three patterns.
+
+**Many agent turns are deterministic data-gathering.** Auto-Triage Issues shows the strongest sustained improvement (−44% across 62 post-fix runs) because the optimization eliminated structural inefficiency: many agent turns were spent on reads that required no inference, such as fetching issue metadata and scanning labels. Moving those reads into pre-agentic `gh` CLI steps before the agent starts removed them from the LLM reasoning loop entirely. The same pattern was applied to Contribution Check, where 50–96% of turns were data-gathering. However, the ET data for Contribution Check shows a slight *increase* (+9%). The cause is workload shift, not optimization failure: in the pre-optimization period 41% of runs processed small PRs (ET < 100K) and 39% processed large PRs (ET > 300K), while the post-optimization period—which coincided with a burst of development activity—had only 17% small PRs and 64% large PRs. Output tokens, which carry a 4× weight in the ET formula, rose 29% as the agent reviewed bigger diffs. The optimization likely improved per-turn efficiency, but the shift toward heavier workloads masks that gain in the aggregate numbers.
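To make the pre-agentic pattern concrete, here is a sketch of the idea; the step layout, field choices, and injection mechanism are illustrative assumptions, not gh-aw's actual implementation:

```python
# Illustrative sketch of pre-agentic data-gathering: deterministic reads
# happen via the gh CLI before any LLM call, so the agent never spends
# turns fetching them. The layout is hypothetical; gh-aw's real context
# injection may differ.
import subprocess

def gh(*args: str) -> str:
    """Run a gh CLI command and return its stdout."""
    return subprocess.run(["gh", *args], capture_output=True, text=True, check=True).stdout

issue_number = "123"  # placeholder: supplied by the triggering event in practice
context = {
    "issue": gh("issue", "view", issue_number, "--json", "title,body,labels"),
    "repo_labels": gh("label", "list", "--json", "name,description"),
}
# `context` is prepended to the agent's prompt, so triage starts with the
# data already in hand instead of burning turns on tool calls to fetch it.
```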
+
+**Unused tools are expensive to carry.** Among the excluded workflows, the Glossary Maintainer is an instructive case. A single tool—`search_repositories`—was called **342 times in one run**, accounting for 58% of all tool calls, despite being completely unnecessary for a workflow that only scans local file changes. Removing it from the toolset was the optimizer's recommendation. The Daily Community Attribution workflow illustrates the limits of this approach: it was configured with eight GitHub MCP tools and made **zero calls to any of them** across an entire run, yet removing them did not measurably reduce ET. The tool manifests were a small fraction of this workflow's overall context.
+
+**A single misconfigured rule can cause runaway loops.** Also among the excluded workflows, Daily Syntax Error Quality was the highest-ET workflow in the project before optimization. The root cause was a one-line misconfiguration: the workflow copied test files to `/tmp/` then called `gh aw compile *`, but the sandbox's bash allowlist only permitted relative-path glob patterns. Every compile attempt was blocked. Unable to use the tool it needed, the agent fell into a 64-turn fallback loop—manually reading source code to reconstruct what the compiler would have told it. One fix to the allowed bash patterns eliminated the loop. We lack enough baseline runs to quantify the improvement precisely, but the pathology was clear and the fix was unambiguous.
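A hedged reconstruction of that mismatch, assuming the allowlist treats `*` as a within-segment glob (the sandbox's actual pattern syntax and matcher may differ):

```python
# Hedged reconstruction of the allowlist mismatch; the sandbox's real
# pattern syntax may differ. The point: a relative-path glob in which '*'
# cannot cross '/' never matches arguments that expand to /tmp/ paths.
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    # '*' matches within a single path segment, as in shell path globs
    return re.compile("^" + re.escape(pattern).replace(r"\*", "[^/]*") + "$")

allowlist = [glob_to_regex(p) for p in ["gh aw compile *", "gh aw compile *.md"]]

def allowed(command: str) -> bool:
    return any(p.match(command) for p in allowlist)

print(allowed("gh aw compile triage.md"))        # True: relative path matches
print(allowed("gh aw compile /tmp/case-07.md"))  # False: '*' does not cross '/'
```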