test(engine): token-shape regression harness for cache/cost invariants #411
Merged
Lock the request shape the engine sends to the provider so cache-cost regressions surface as a reviewable diff instead of a production discovery. The expensive token-usage bugs (cache-write thrash, a system append busting the cached prefix, an anchor sliding off the prior tail) are all properties of the deterministic request, so they're catchable with a scripted model and no API calls.

Two tiers sharing one harness (recordingModel + deriveShape/checkInvariants):
- Unit (every PR, no API): run the real engine loop through a recording model on a scripted tool-loop; assert system stability, anchor chaining, TTL tiering, append-only prefix, and a flat post-anchor write delta; diff a committed golden. The golden shows promptTokens climbing while the per-turn write delta stays flat: the cache win, locked.
- Eval (weekly/real APIs): same scenario against Anthropic/OpenAI/Google, verifying the invariants under real model behavior and reporting realized cache-hit rate for drift tracking. Skips when keys are absent.
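As a rough illustration of what "a flat post-anchor write delta" means in practice, here is a minimal sketch. The field names `promptTokens` and `deltaTokensAfterAnchor` and the figures 80, 153, 226 and 73 come from the golden described under What; the `StepFingerprint` interface, both helper functions, and the thrashing run are hypothetical and are not the PR's actual `deriveShape()` output or `checkInvariants()` logic.

```typescript
// Hypothetical per-step fingerprint. Only the two field names are taken from
// the PR description; the overall shape is an assumption for illustration.
interface StepFingerprint {
  promptTokens: number;
  deltaTokensAfterAnchor: number;
}

// True when the prompt grows strictly from one step to the next.
function promptKeepsGrowing(steps: StepFingerprint[]): boolean {
  return steps.every((s, i) => i === 0 || s.promptTokens > steps[i - 1].promptTokens);
}

// True when the post-anchor delta varies by at most `tolerance` tokens
// across the whole run (assumed reading of "stays flat").
function writeDeltaIsFlat(steps: StepFingerprint[], tolerance = 0): boolean {
  if (steps.length === 0) return true;
  const deltas = steps.map((s) => s.deltaTokensAfterAnchor);
  return Math.max(...deltas) - Math.min(...deltas) <= tolerance;
}

// Figures quoted in the PR description: prompt climbs, delta pinned at 73.
const goldenRun: StepFingerprint[] = [
  { promptTokens: 80, deltaTokensAfterAnchor: 73 },
  { promptTokens: 153, deltaTokensAfterAnchor: 73 },
  { promptTokens: 226, deltaTokensAfterAnchor: 73 },
];

// Invented regression: the uncached tail grows with the prompt every step.
const thrashingRun: StepFingerprint[] = [
  { promptTokens: 80, deltaTokensAfterAnchor: 73 },
  { promptTokens: 153, deltaTokensAfterAnchor: 146 },
  { promptTokens: 226, deltaTokensAfterAnchor: 219 },
];

console.log(promptKeepsGrowing(goldenRun), writeDeltaIsFlat(goldenRun));
console.log(promptKeepsGrowing(thrashingRun), writeDeltaIsFlat(thrashingRun));
```

In both runs the prompt grows, so prompt size alone cannot distinguish them; only the per-step delta separates the healthy run from the thrashing one, which is presumably why the golden tracks both columns.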
Why
Token-usage regressions keep recurring (the #403/#405/#406/#408 cache work), and each fix has landed without a guardrail behind it. The reason they recur: the request shape the engine sends to the provider isn't locked anywhere. Yet the expensive bugs — cache-write thrash, a per-turn hint appended to the system prompt busting its 1h breakpoint and the whole prefix, an anchor sliding off the prior tail — are all properties of that request, which is fully deterministic given a fixed conversation. So they're catchable with a scripted model and zero API calls.
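To make the "properties of the request" point concrete, here is a minimal sketch of checking recorded requests offline. It is not the PR's harness: `RecordedCall`, `anchorIndex`, and `checkRecordedCalls` are hypothetical names, and the reading of the anchor rule (the cache anchor of each call sits on the message that was the previous call's tail) is an assumption based on the description in this PR.

```typescript
// Hypothetical, simplified record of one provider request (not the PR's real types).
interface RecordedCall {
  system: string;             // system block exactly as sent
  messages: string[];         // serialized non-system messages, in order
  anchorIndex: number | null; // message carrying the rolling cache anchor; null on the first call
}

interface Violation {
  invariant: "system-stable" | "anchor-chaining" | "prefix-monotonic";
  step: number;
  detail: string;
}

// Compares each call with the one before it; no model or network is involved.
function checkRecordedCalls(calls: RecordedCall[]): Violation[] {
  const violations: Violation[] = [];
  for (let step = 1; step < calls.length; step++) {
    const prev = calls[step - 1];
    const cur = calls[step];

    // The system block must be byte-identical between consecutive calls.
    if (cur.system !== prev.system) {
      violations.push({ invariant: "system-stable", step, detail: "system block changed" });
    }

    // Assumed rule: the anchor sits on the message that ended the previous call.
    const priorTail = prev.messages.length - 1;
    if (cur.anchorIndex !== priorTail) {
      violations.push({
        invariant: "anchor-chaining",
        step,
        detail: `anchor at ${cur.anchorIndex}, expected ${priorTail}`,
      });
    }

    // The earlier conversation must be an unchanged prefix of the new one.
    if (!prev.messages.every((m, i) => cur.messages[i] === m)) {
      violations.push({ invariant: "prefix-monotonic", step, detail: "earlier messages were rewritten" });
    }
  }
  return violations;
}

const system = "You are an agent.";
const healthy: RecordedCall[] = [
  { system, messages: ["user: list files"], anchorIndex: null },
  { system, messages: ["user: list files", "assistant: call ls", "tool: a.txt"], anchorIndex: 0 },
  {
    system,
    messages: ["user: list files", "assistant: call ls", "tool: a.txt", "assistant: call cat", "tool: hello"],
    anchorIndex: 2,
  },
];

// Simulated regression: a per-turn hint appended to the system prompt on the last call.
const regressed: RecordedCall[] = healthy.map((call, i) =>
  i === 2 ? { ...call, system: call.system + " Final step: answer now." } : call,
);

console.log(checkRecordedCalls(healthy));   // no violations
console.log(checkRecordedCalls(regressed)); // one system-stable violation at step 2
```

Because the check only compares consecutive recorded requests, a scripted model that returns fixed tool calls is enough to drive it.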
What
One harness, two tiers.
- `test/helpers/recording-model.ts`: a `LanguageModelV3` decorator that records the exact post-cache-policy `prompt` + `tools` at the provider boundary (what actually goes on the wire).
- `test/helpers/token-shape.ts`: `deriveShape()` (compact per-step fingerprint) + `checkInvariants()` (named violations). Provider-aware (`anthropic` vs `passthrough`).
- `test/unit/token-shape.test.ts` (+ committed golden): runs the real engine loop through a recording model on a scripted tool-loop and asserts:
  - `system-stable`: system block byte-identical every step (guards the system-append regression; see #408, "fix(engine): deliver the final-step hint as a tail message, not a system append")
  - `anchor-chaining`: anchor of call N+1 == prior call's tail (see #405, "test(engine): cover length-continuation anchor in cache-policy (follow-up to #401)")
  - `anti-thrash-bounded`: post-anchor write delta stays flat across the run
  - `ttl` tiering (1h stable prefix / 5m rolling), `prefix-monotonic`, `passthrough-no-cache`

  The golden is the artifact: `promptTokens` climbs every step (80 → 153 → 226…) while `deltaTokensAfterAnchor` stays pinned at 73. The prompt grows while the per-turn cache write stays flat: the optimization, locked. A regression moves those numbers into a reviewable diff. Regenerate after intentional changes with `TOKEN_SHAPE_UPDATE=1`.
- `test/eval/token-shape.eval.test.ts` (Tier 3): the same scenario against real Anthropic/OpenAI/Google, reusing the same harness to verify invariants under real model behavior, plus realized `cacheRead`/`cacheWrite` health checks and a greppable metrics line for weekly drift tracking. Skips gracefully per provider when keys are absent.

Validation
- `bun run test:unit`: 3419 pass (incl. 5 new)
- `tsc` clean
- A deliberately introduced regression fails `anchor-chaining` (10×), the golden, and anti-thrash; reverting returns it to green. The guard isn't vacuous.

Notes