test(engine): token-shape regression harness for cache/cost invariants by mgoldsborough · Pull Request #411 · NimbleBrainInc/nimblebrain

mgoldsborough · 2026-06-10T18:21:31Z

Why

Token-usage regressions keep recurring (the #403/#405/#406/#408 cache work), and each fix has landed without a guardrail behind it. The reason they recur: the request shape the engine sends to the provider isn't locked anywhere. Yet the expensive bugs — cache-write thrash, a per-turn hint appended to the system prompt busting its 1h breakpoint and the whole prefix, an anchor sliding off the prior tail — are all properties of that request, which is fully deterministic given a fixed conversation. So they're catchable with a scripted model and zero API calls.

What

One harness, two tiers.

test/helpers/recording-model.ts — a LanguageModelV3 decorator that records the exact post-cache-policy prompt + tools at the provider boundary (what actually goes on the wire).
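The decorator idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the helper's actual code: `CallOptions`, `MinimalModel`, and this `recordingModel` signature are simplified stand-ins, and the real `LanguageModelV3` interface has more members than shown.

```typescript
// Hypothetical, simplified stand-ins for the provider-boundary types.
interface CallOptions {
  prompt: unknown[];
  tools?: unknown[];
}

interface MinimalModel {
  doGenerate(options: CallOptions): Promise<unknown>;
}

interface Recording {
  calls: CallOptions[];
}

// Wrap a model so every call is snapshotted, then forwarded unchanged.
function recordingModel(inner: MinimalModel): { model: MinimalModel; recording: Recording } {
  const recording: Recording = { calls: [] };
  const model: MinimalModel = {
    async doGenerate(options) {
      // Copy via JSON so later mutation by the caller cannot rewrite the record.
      // (Assumes the options are plain JSON-serializable data.)
      recording.calls.push(JSON.parse(JSON.stringify(options)) as CallOptions);
      return inner.doGenerate(options);
    },
  };
  return { model, recording };
}
```

Because the wrapper sits at the outermost boundary, whatever it records has already passed through any cache policy applied upstream, which is the property the harness depends on.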

test/helpers/token-shape.ts — deriveShape() (compact per-step fingerprint) + checkInvariants() (named violations). Provider-aware (anthropic vs passthrough).

test/unit/token-shape.test.ts (+ committed golden) — runs the real engine loop through a recording model on a scripted tool-loop and asserts:

system-stable: the system block is byte-identical at every step (guards the system-append regression fixed in #408, "fix(engine): deliver the final-step hint as a tail message, not a system append")
anchor-chaining: the anchor of call N+1 equals the prior call's tail (#405, "test(engine): cover length-continuation anchor in cache-policy (follow-up to #401)")
anti-thrash-bounded: the post-anchor write delta stays flat across the run
TTL tiering (1h stable prefix / 5m rolling), prefix-monotonic, passthrough-no-cache
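Three of the invariants above can be sketched as checks over a per-step fingerprint. The `StepShape` fields and this `checkInvariants` signature are assumptions made for illustration. In particular, modeling the anchor as a message index and exempting the first step from the flatness check are guesses about the shape, not the harness's actual representation.

```typescript
// Hypothetical per-step fingerprint; the field names are illustrative only.
interface StepShape {
  system: string;                 // serialized system block sent on this call
  messageCount: number;           // number of prompt messages on this call
  anchorIndex: number;            // index of the message carrying the rolling cache breakpoint
  deltaTokensAfterAnchor: number; // estimated tokens that follow the anchor
}

// Returns named violations; an empty array means every check passed.
function checkInvariants(steps: StepShape[], deltaTolerance = 0): string[] {
  const violations: string[] = [];
  for (let i = 1; i < steps.length; i++) {
    const prev = steps[i - 1];
    const cur = steps[i];
    // system-stable: the system block never changes after the first call.
    if (cur.system !== steps[0].system) {
      violations.push(`system-stable: step ${i}`);
    }
    // anchor-chaining: this call's anchor sits on the previous call's last message.
    if (cur.anchorIndex !== prev.messageCount - 1) {
      violations.push(`anchor-chaining: step ${i}`);
    }
  }
  // anti-thrash-bounded: the post-anchor delta stays flat once a prior tail exists.
  const deltas = steps.slice(1).map((s) => s.deltaTokensAfterAnchor);
  if (deltas.length > 0 && Math.max(...deltas) - Math.min(...deltas) > deltaTolerance) {
    violations.push("anti-thrash-bounded: post-anchor delta is not flat");
  }
  return violations;
}
```

Returning named violations rather than throwing on the first failure lets a single run report every broken invariant at once, which matches the "named violations" wording in the description.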

The golden is the artifact: promptTokens climbs every step (80 → 153 → 226…) while deltaTokensAfterAnchor stays pinned at 73. Prompt grows, per-turn cache-write flat — the optimization, locked. A regression moves those numbers into a reviewable diff. Regenerate intentional changes with TOKEN_SHAPE_UPDATE=1.
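The golden comparison and the update flag can be sketched as a pure function. Only the `promptTokens` and `deltaTokensAfterAnchor` field names and the `TOKEN_SHAPE_UPDATE=1` flag come from the description above; the golden file layout, the `compareWithGolden` name, and the diff line format are assumptions for illustration.

```typescript
// Hypothetical golden shape; only the two field names come from the PR description.
interface GoldenStep {
  promptTokens: number;
  deltaTokensAfterAnchor: number;
}

type GoldenResult =
  | { kind: "match" }
  | { kind: "update"; serialized: string }
  | { kind: "mismatch"; diff: string[] };

function compareWithGolden(
  actual: GoldenStep[],
  goldenJson: string | null,
  env: Record<string, string | undefined>,
): GoldenResult {
  // With the update flag set, hand back the text the caller should write to disk.
  if (env["TOKEN_SHAPE_UPDATE"] === "1") {
    return { kind: "update", serialized: JSON.stringify(actual, null, 2) + "\n" };
  }
  if (goldenJson === null) {
    return { kind: "mismatch", diff: ["golden file is missing"] };
  }
  const golden = JSON.parse(goldenJson) as GoldenStep[];
  const diff: string[] = [];
  if (golden.length !== actual.length) {
    diff.push(`step count: ${golden.length} -> ${actual.length}`);
  }
  const shared = Math.min(golden.length, actual.length);
  const fields: (keyof GoldenStep)[] = ["promptTokens", "deltaTokensAfterAnchor"];
  for (let i = 0; i < shared; i++) {
    for (const field of fields) {
      if (golden[i][field] !== actual[i][field]) {
        diff.push(`step ${i} ${field}: ${golden[i][field]} -> ${actual[i][field]}`);
      }
    }
  }
  return diff.length === 0 ? { kind: "match" } : { kind: "mismatch", diff };
}
```

Keeping file I/O outside the function means the same comparison can be exercised in a unit test with an in-memory golden, and the caller decides whether an `update` result is actually written.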

test/eval/token-shape.eval.test.ts (Tier 3) — the same scenario against real Anthropic/OpenAI/Google, reusing the same harness to verify invariants under real model behavior, plus realized cacheRead/cacheWrite health checks and a greppable metrics line for weekly drift tracking. Skips gracefully per provider when keys are absent.
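Per-provider skipping and the metrics line might look like the sketch below. The environment variable names, the `partitionProviders` and `metricsLine` helpers, and the line format are all assumptions for illustration; the eval's real key lookup and output format are not shown in this description.

```typescript
type Provider = "anthropic" | "openai" | "google";

// Assumed environment variable names; the repository may use different ones.
const KEY_VARS: Record<Provider, string> = {
  anthropic: "ANTHROPIC_API_KEY",
  openai: "OPENAI_API_KEY",
  google: "GOOGLE_GENERATIVE_AI_API_KEY",
};

// Split providers into those that can run and those to skip for lack of a key.
function partitionProviders(env: Record<string, string | undefined>): {
  runnable: Provider[];
  skipped: Provider[];
} {
  const runnable: Provider[] = [];
  const skipped: Provider[] = [];
  for (const provider of Object.keys(KEY_VARS) as Provider[]) {
    const key = env[KEY_VARS[provider]];
    (key !== undefined && key.trim() !== "" ? runnable : skipped).push(provider);
  }
  return { runnable, skipped };
}

// Hypothetical single-line format, so one grep pattern can collect weekly results.
function metricsLine(provider: Provider, cacheRead: number, cacheWrite: number): string {
  return `token-shape-metrics provider=${provider} cacheRead=${cacheRead} cacheWrite=${cacheWrite}`;
}
```

The `skipped` list can drive a conditional skip in the test runner, so a missing key shows up as a skipped case rather than a failure.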

Validation

bun run test:unit — 3419 pass (incl. 5 new)
format / lint / tsc clean
Negative control: removing the rolling anchor (simulating the pre-#405 bug) turns the unit suite red on anchor-chaining (10×), the golden, and anti-thrash; reverting returns it to green. The guard isn't vacuous.

Notes

Unit tier is the per-PR fence; the eval is run-on-a-cron (real tokens, provider nondeterminism).
Building it surfaced one real thing: a naive uniform scenario trips the loop supervisor (identical repeated tool calls → tool released), so the scenario varies the query per step like a real agent.
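The difference between the uniform and varied scenarios can be sketched like this. The tool name, the query strings, and `longestIdenticalRun` are hypothetical; the last is a stand-in for whatever repetition check the loop supervisor performs, not the engine's code.

```typescript
interface ToolCall {
  toolName: string;
  input: { query: string };
}

// Build one scripted tool call per step; `vary` switches between the two scenarios.
function scriptToolCalls(steps: number, vary: boolean): ToolCall[] {
  return Array.from({ length: steps }, (_, i) => ({
    toolName: "search", // hypothetical tool name
    input: { query: vary ? `topic ${i + 1}` : "topic" },
  }));
}

// Hypothetical stand-in for a loop supervisor: longest run of identical consecutive calls.
function longestIdenticalRun(calls: ToolCall[]): number {
  let longest = 0;
  let run = 0;
  let previous: string | null = null;
  for (const call of calls) {
    const key = JSON.stringify(call);
    run = key === previous ? run + 1 : 1;
    longest = Math.max(longest, run);
    previous = key;
  }
  return longest;
}
```

With `vary` set to false, every call serializes identically and the run length equals the step count, which is the pattern a repetition guard would flag. Varying the query keeps each run at length one.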

Lock the request shape the engine sends to the provider so cache-cost regressions surface as a reviewable diff instead of a production discovery. The expensive token-usage bugs (cache-write thrash, a system append busting the cached prefix, an anchor sliding off the prior tail) are all properties of the deterministic request, so they're catchable with a scripted model and no API calls.

Two tiers sharing one harness (recordingModel + deriveShape/checkInvariants):

- Unit (every PR, no API): run the real engine loop through a recording model on a scripted tool-loop; assert system stability, anchor chaining, TTL tiering, append-only prefix, and a flat post-anchor write delta; diff a committed golden. The golden shows promptTokens climbing while the per-turn write delta stays flat: the cache win, locked.
- Eval (weekly/real APIs): same scenario against Anthropic/OpenAI/Google, verifying the invariants under real model behavior and reporting realized cache-hit rate for drift tracking. Skips when keys are absent.

mgoldsborough added the qa-reviewed label (QA review completed with no critical issues) Jun 10, 2026

mgoldsborough merged commit 7b30075 into main Jun 10, 2026
5 checks passed

mgoldsborough deleted the test/token-shape-evals branch June 10, 2026 20:32

mgoldsborough merged 1 commit into main from test/token-shape-evals
