fix(usage): TTL-tier cache writes and price them correctly by mgoldsborough · Pull Request #406 · NimbleBrainInc/nimblebrain

mgoldsborough · 2026-06-09T21:19:58Z

The bug

The engine cached everything at a 1-hour TTL (billed at 2× base input), but the cost model priced cache writes at the catalog's 1.25× 5-minute rate — so every dashboard and the cost script under-reported real cache-write spend by ~1.6×. Verified against the live API:

"cache_creation": { "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 3376 }

Every write is billed at the 1h tier. The audit flagged this; this confirms and fixes it.

Why TTL should be tiered, not all-1h

After the rolling step-anchor (#401), the bulk of cache writes happen within a run — the agent loops seconds apart, and each turn re-reads the prior breakpoint almost immediately, far inside even the 5-minute window. The 1-hour premium (60% over 5-minute) buys nothing there. It only earns its keep on the stable prefix re-read after a between-runs pause (a user stepping away for minutes).

So the engine now tiers TTL by breakpoint stability:

Breakpoint	TTL	Why
system + tools (stable)	1h	One write per run; expensive to rebuild; worth keeping warm across a pause
step-anchor + tail (rolling history)	5m	Re-read within seconds inside a run; lapses cheaply after a pause while system+tools stay warm

This pays the premium only where it earns its keep — a meaningful write-cost cut on long conversations, with no loss of the #401 within-run caching (5-minute is plenty for second-apart reads).

TTL-aware costing (correct regardless of the TTL strategy)

Rather than hardcode a single rate, costing now uses the split Anthropic already returns:

The engine captures the per-call 1-hour write portion from usage.raw.cache_creation.ephemeral_1h_input_tokens onto a new TokenUsage.cacheWrite1hTokens.
cost.ts prices the 1h portion at 2× base and the 5-minute remainder at the catalog cacheWrite rate (which is the 5-minute rate — finally used correctly).
cacheWrite1hTokens is tri-state: absent means "no split reported" (legacy events from before tiering, or non-Anthropic providers), which cost treats as all-1h so historical figures stay accurate. addUsage/emptyUsage preserve the absent state rather than collapsing it to an explicit 0.

This makes the cost model correct for any TTL strategy — today's tiered split, a future all-5m, whatever — with no re-fix.

Tests

New: TTL tiering (system/tools = 1h, anchor/tail = 5m); split pricing (1h at 2×, 5m remainder at catalog rate, result strictly between all-1h and all-5m).
Updated: existing cost tests to the 1h rate; the engine's "last user message" breakpoint test to expect 5m (it's the rolling tail now).
verify:static green; full unit suite (3,392) green.

Follow-up (separate, in the deployments repo)

The tenant-summary.py cost script needs the same TTL-aware treatment (it has its own pricing table) so the /tenant-summary numbers match. Handling that alongside this.

…HANGELOG (QA #406) Address QA on #406: - Add the missing regression guard for the load-bearing seam: an engine test with a mock whose finish usage carries raw.cache_creation, asserting the emitted TokenUsage.cacheWrite1hTokens === ephemeral_1h_input_tokens. Without it, an SDK-shape drift would silently drop the split → mostly-5m writes over-report at 2x, with green tests. - Note in cost.ts that the "absent split → 2x" default is Anthropic-specific (the only provider that caches here / reports cache-write tokens); a future provider reporting cache writes at a different multiplier should gate on it. - Fix the CHANGELOG: the earlier edit replaced the #401 bullet's header and merged the bodies into one run-on bullet. Split into three (TTL tiering, TTL-aware cost incl. the historical re-pricing operator note, and the restored #401 thrashing entry). Kept FIVE_MIN_CACHE_WRITE_MULTIPLIER — it's the live `??` fallback rate (mirrors the cacheRead ?? c.input fallback two lines up), not dead.

The tri-state accumulation assumes all deltas are same-deploy-era (all carry the cacheWrite1h split or none do). True within a run, and addUsage is never used to sum raw usage across the deploy boundary — the usage aggregator sums per-record costs (each priced correctly). A full fix for mixed-era aggregation needs retroactive accounting for an unreachable path; document the assumption instead. Keeping FIVE_MIN_CACHE_WRITE_MULTIPLIER (live ?? fallback, mirrors cacheRead); a synced cacheWrite1h catalog field (vs the hardcoded 2x) is deferred.

The engine cached everything at a 1-hour TTL (billed 2x base input), but the cost model priced writes at the catalog's 1.25x 5-minute rate — under-reporting real cache-write spend by ~1.6x. Verified against the live API: writes report under `cache_creation.ephemeral_1h_input_tokens`. After the rolling step-anchor, the bulk of writes are within-run (the agent loops seconds apart) where the 1h premium buys nothing. So: - Tier TTL by breakpoint stability (model/cache-policy.ts): 1-hour on the stable system+tools block (one write per run, worth keeping warm), 5-minute on the rolling step-anchor + tail (re-read within seconds). - Make costing TTL-aware: the engine captures the per-call 1h write portion from `usage.raw.cache_creation` onto `TokenUsage.cacheWrite1hTokens`; cost.ts prices the 1h portion at 2x base and the 5m remainder at the catalog `cacheWrite` rate. `cacheWrite1hTokens` is tri-state — absent means "no split reported" (legacy events, non-Anthropic), which cost treats as all-1h. Tests: TTL tiering (system/tools=1h, anchor/tail=5m), split pricing, and the load-bearing engine extraction of ephemeral_1h_input_tokens from raw usage. The "absent → 2x" default is Anthropic-specific (only provider that reports cache writes); addUsage's tri-state accumulation assumes same-era deltas (true within a run; cross-deploy aggregation is handled per-record at the cost boundary). A synced cacheWrite1h catalog field (vs hardcoded 2x) is deferred.

mgoldsborough mentioned this pull request Jun 10, 2026

feat(usage): per-conversation cache-hit rate in usage reports #409

Merged

mgoldsborough force-pushed the fix/cache-cost-accuracy branch from be217f0 to 0d9beb5 Compare June 10, 2026 17:36

mgoldsborough merged commit c033e97 into main Jun 10, 2026
5 checks passed

mgoldsborough deleted the fix/cache-cost-accuracy branch June 10, 2026 17:40

mgoldsborough mentioned this pull request Jun 10, 2026

test(engine): token-shape regression harness for cache/cost invariants #411

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(usage): TTL-tier cache writes and price them correctly#406

fix(usage): TTL-tier cache writes and price them correctly#406
mgoldsborough merged 1 commit into
mainfrom
fix/cache-cost-accuracy

mgoldsborough commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mgoldsborough commented Jun 9, 2026

The bug

Why TTL should be tiered, not all-1h

TTL-aware costing (correct regardless of the TTL strategy)

Tests

Follow-up (separate, in the deployments repo)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant