Skip to content

test(engine): token-shape regression harness for cache/cost invariants#411

Merged
mgoldsborough merged 1 commit into
mainfrom
test/token-shape-evals
Jun 10, 2026
Merged

test(engine): token-shape regression harness for cache/cost invariants#411
mgoldsborough merged 1 commit into
mainfrom
test/token-shape-evals

Conversation

@mgoldsborough

Copy link
Copy Markdown
Contributor

Why

Token-usage regressions keep recurring (the #403/#405/#406/#408 cache work), and each fix has landed without a guardrail behind it. The reason they recur: the request shape the engine sends to the provider isn't locked anywhere. Yet the expensive bugs — cache-write thrash, a per-turn hint appended to the system prompt busting its 1h breakpoint and the whole prefix, an anchor sliding off the prior tail — are all properties of that request, which is fully deterministic given a fixed conversation. So they're catchable with a scripted model and zero API calls.

What

One harness, two tiers.

test/helpers/recording-model.ts — a LanguageModelV3 decorator that records the exact post-cache-policy prompt + tools at the provider boundary (what actually goes on the wire).

test/helpers/token-shape.tsderiveShape() (compact per-step fingerprint) + checkInvariants() (named violations). Provider-aware (anthropic vs passthrough).

test/unit/token-shape.test.ts (+ committed golden) — runs the real engine loop through a recording model on a scripted tool-loop and asserts:

The golden is the artifact: promptTokens climbs every step (80 → 153 → 226…) while deltaTokensAfterAnchor stays pinned at 73. Prompt grows, per-turn cache-write flat — the optimization, locked. A regression moves those numbers into a reviewable diff. Regenerate intentional changes with TOKEN_SHAPE_UPDATE=1.

test/eval/token-shape.eval.test.ts (Tier 3) — the same scenario against real Anthropic/OpenAI/Google, reusing the same harness to verify invariants under real model behavior, plus realized cacheRead/cacheWrite health checks and a greppable metrics line for weekly drift tracking. Skips gracefully per provider when keys are absent.

Validation

Notes

  • Unit tier is the per-PR fence; the eval is run-on-a-cron (real tokens, provider nondeterminism).
  • Building it surfaced one real thing: a naive uniform scenario trips the loop supervisor (identical repeated tool calls → tool released), so the scenario varies the query per step like a real agent.

Lock the request shape the engine sends to the provider so cache-cost
regressions surface as a reviewable diff instead of a production discovery.
The expensive token-usage bugs (cache-write thrash, a system append busting
the cached prefix, an anchor sliding off the prior tail) are all properties
of the deterministic request, so they're catchable with a scripted model and
no API calls.

Two tiers sharing one harness (recordingModel + deriveShape/checkInvariants):

- Unit (every PR, no API): run the real engine loop through a recording
  model on a scripted tool-loop; assert system stability, anchor chaining,
  TTL tiering, append-only prefix, and a flat post-anchor write delta; diff a
  committed golden. The golden shows promptTokens climbing while the per-turn
  write delta stays flat — the cache win, locked.
- Eval (weekly/real APIs): same scenario against Anthropic/OpenAI/Google,
  verifying the invariants under real model behavior and reporting realized
  cache-hit rate for drift tracking. Skips when keys are absent.
@mgoldsborough mgoldsborough added the qa-reviewed QA review completed with no critical issues label Jun 10, 2026
@mgoldsborough mgoldsborough merged commit 7b30075 into main Jun 10, 2026
5 checks passed
@mgoldsborough mgoldsborough deleted the test/token-shape-evals branch June 10, 2026 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qa-reviewed QA review completed with no critical issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant