fix(usage): TTL-tier cache writes and price them correctly #406
Merged
mgoldsborough added a commit that referenced this pull request Jun 10, 2026
…HANGELOG (QA #406)

Address QA on #406:
- Add the missing regression guard for the load-bearing seam: an engine test with a mock whose finish usage carries raw.cache_creation, asserting the emitted TokenUsage.cacheWrite1hTokens === ephemeral_1h_input_tokens. Without it, an SDK-shape drift would silently drop the split, so mostly-5m writes would over-report at 2x while the tests stay green.
- Note in cost.ts that the "absent split → 2x" default is Anthropic-specific (the only provider that caches here / reports cache-write tokens); a future provider reporting cache writes at a different multiplier should gate on it.
- Fix the CHANGELOG: the earlier edit replaced the #401 bullet's header and merged the bodies into one run-on bullet. Split into three (TTL tiering, TTL-aware cost incl. the historical re-pricing operator note, and the restored #401 thrashing entry).

Kept FIVE_MIN_CACHE_WRITE_MULTIPLIER: it's the live `??` fallback rate (mirrors the cacheRead ?? c.input fallback two lines up), not dead.
mgoldsborough added a commit that referenced this pull request Jun 10, 2026
The tri-state accumulation assumes all deltas are same-deploy-era (all carry the cacheWrite1h split or none do). True within a run, and addUsage is never used to sum raw usage across the deploy boundary — the usage aggregator sums per-record costs (each priced correctly). A full fix for mixed-era aggregation needs retroactive accounting for an unreachable path; document the assumption instead. Keeping FIVE_MIN_CACHE_WRITE_MULTIPLIER (live ?? fallback, mirrors cacheRead); a synced cacheWrite1h catalog field (vs the hardcoded 2x) is deferred.
The engine cached everything at a 1-hour TTL (billed 2x base input), but the cost model priced writes at the catalog's 1.25x 5-minute rate, under-reporting real cache-write spend by ~1.6x. Verified against the live API: writes report under `cache_creation.ephemeral_1h_input_tokens`.

After the rolling step-anchor, the bulk of writes are within-run (the agent loops seconds apart) where the 1h premium buys nothing. So:
- Tier TTL by breakpoint stability (model/cache-policy.ts): 1-hour on the stable system+tools block (one write per run, worth keeping warm), 5-minute on the rolling step-anchor + tail (re-read within seconds).
- Make costing TTL-aware: the engine captures the per-call 1h write portion from `usage.raw.cache_creation` onto `TokenUsage.cacheWrite1hTokens`; cost.ts prices the 1h portion at 2x base and the 5m remainder at the catalog `cacheWrite` rate. `cacheWrite1hTokens` is tri-state: absent means "no split reported" (legacy events, non-Anthropic), which cost treats as all-1h.

Tests: TTL tiering (system/tools=1h, anchor/tail=5m), split pricing, and the load-bearing engine extraction of ephemeral_1h_input_tokens from raw usage.

The "absent → 2x" default is Anthropic-specific (only provider that reports cache writes); addUsage's tri-state accumulation assumes same-era deltas (true within a run; cross-deploy aggregation is handled per-record at the cost boundary). A synced cacheWrite1h catalog field (vs hardcoded 2x) is deferred.
The bug
The engine cached everything at a 1-hour TTL (billed at 2× base input), but the cost model priced cache writes at the catalog's 1.25× 5-minute rate, so every dashboard and the cost script under-reported real cache-write spend by ~1.6× (2 / 1.25). Verified against the live API: cache writes are reported under `cache_creation.ephemeral_1h_input_tokens`.
Every write is billed at the 1h tier. The audit flagged this; this PR confirms and fixes it.
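As a quick sanity check on the ~1.6× figure, the arithmetic can be worked through directly. This is only an illustration: the base price and token count below are made-up numbers, and the constant names are hypothetical rather than taken from `cost.ts`.

```typescript
// Illustrative numbers only: the base price and token count are invented.
const BASE_INPUT_PER_MTOK = 3.0;   // hypothetical USD per million input tokens
const FIVE_MIN_MULTIPLIER = 1.25;  // catalog cacheWrite rate, relative to base input
const ONE_HOUR_MULTIPLIER = 2.0;   // 1-hour TTL write rate, relative to base input

const cacheWriteTokens = 1_000_000;

// What the provider bills when every write uses the 1-hour TTL.
const billed = (cacheWriteTokens / 1e6) * BASE_INPUT_PER_MTOK * ONE_HOUR_MULTIPLIER;

// What the old cost model reported by pricing the same writes at the 5-minute rate.
const reported = (cacheWriteTokens / 1e6) * BASE_INPUT_PER_MTOK * FIVE_MIN_MULTIPLIER;

// The ratio is independent of the base price and the token count.
const underReportFactor = billed / reported;

console.log({ billed, reported, underReportFactor });
```

The ratio depends only on the two multipliers, so the same factor applies to any model whose catalog 5-minute rate is 1.25× base.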
Why TTL should be tiered, not all-1h
After the rolling step-anchor (#401), the bulk of cache writes happen within a run — the agent loops seconds apart, and each turn re-reads the prior breakpoint almost immediately, far inside even the 5-minute window. The 1-hour premium (60% over 5-minute) buys nothing there. It only earns its keep on the stable prefix re-read after a between-runs pause (a user stepping away for minutes).
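So the engine now tiers TTL by breakpoint stability:
- 1-hour on the stable system+tools block (one write per run, worth keeping warm).
- 5-minute on the rolling step-anchor + tail (re-read within seconds).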
This pays the premium only where it earns its keep: a meaningful write-cost cut on long conversations, with no loss of the #401 within-run caching (5-minute is plenty for seconds-apart reads).
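A minimal sketch of what tiering by breakpoint stability could look like. The real `model/cache-policy.ts` is not shown here, so `BreakpointKind`, `ttlForBreakpoint`, and `cacheControlFor` are hypothetical names, and the `{ type: "ephemeral", ttl }` shape reflects my understanding of Anthropic's prompt-caching `cache_control` field rather than anything quoted in this PR.

```typescript
// Hypothetical types: the actual cache-policy module may model breakpoints differently.
type BreakpointKind = "system" | "tools" | "step-anchor" | "tail";
type CacheTtl = "1h" | "5m";

// Stable prefix blocks are written once per run and re-read after pauses,
// so they get the 1-hour TTL. Rolling breakpoints are re-read within
// seconds, so the cheaper 5-minute TTL is enough.
function ttlForBreakpoint(kind: BreakpointKind): CacheTtl {
  switch (kind) {
    case "system":
    case "tools":
      return "1h";
    case "step-anchor":
    case "tail":
      return "5m";
  }
}

// Assumed request shape for a cache breakpoint marker.
function cacheControlFor(kind: BreakpointKind): { type: "ephemeral"; ttl: CacheTtl } {
  return { type: "ephemeral", ttl: ttlForBreakpoint(kind) };
}

console.log(cacheControlFor("system"), cacheControlFor("tail"));
```

Keeping the decision in one pure function makes the tiering easy to unit-test (system/tools = 1h, anchor/tail = 5m) and easy to change if the TTL strategy shifts later.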
TTL-aware costing (correct regardless of the TTL strategy)
Rather than hardcode a single rate, costing now uses the split Anthropic already returns:
- The engine captures the per-call 1h write portion from `usage.raw.cache_creation.ephemeral_1h_input_tokens` onto a new `TokenUsage.cacheWrite1hTokens`.
- `cost.ts` prices the 1h portion at 2× base and the 5-minute remainder at the catalog `cacheWrite` rate (which is the 5-minute rate, finally used correctly).
- `cacheWrite1hTokens` is tri-state: absent means "no split reported" (legacy events from before tiering, or non-Anthropic providers), which cost treats as all-1h so historical figures stay accurate. `addUsage`/`emptyUsage` preserve the absent state rather than collapsing it to an explicit 0.

This makes the cost model correct for any TTL strategy (today's tiered split, a future all-5m, whatever) with no re-fix.
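The pricing and accumulation rules above can be sketched as follows. This is an assumption-laden illustration, not the code in `cost.ts`: the `ModelCost` shape, the `inputTokens`/`cacheWriteTokens` field names, `ONE_HOUR_CACHE_WRITE_MULTIPLIER`, and the value 1.25 for `FIVE_MIN_CACHE_WRITE_MULTIPLIER` are all guesses made for the example; only `cacheWrite1hTokens`, `cacheWrite`, `addUsage`, and the 2× / catalog-rate split come from the description.

```typescript
interface TokenUsage {
  inputTokens: number;          // hypothetical field name
  cacheWriteTokens: number;     // hypothetical: total cache-write tokens (1h + 5m)
  cacheWrite1hTokens?: number;  // tri-state: undefined means "no split reported"
}

interface ModelCost {           // hypothetical catalog entry, USD per million tokens
  input: number;
  cacheWrite?: number;          // the 5-minute cache-write rate
}

const FIVE_MIN_CACHE_WRITE_MULTIPLIER = 1.25; // assumed value of the fallback rate
const ONE_HOUR_CACHE_WRITE_MULTIPLIER = 2;    // hypothetical name for the hardcoded 2x

function cacheWriteCost(u: TokenUsage, c: ModelCost): number {
  const fiveMinRate = c.cacheWrite ?? c.input * FIVE_MIN_CACHE_WRITE_MULTIPLIER;
  const oneHourRate = c.input * ONE_HOUR_CACHE_WRITE_MULTIPLIER;
  // Absent split: treat every write as 1h. An explicit 0 means all writes were 5m.
  const reported1h = u.cacheWrite1hTokens ?? u.cacheWriteTokens;
  const oneHour = Math.min(reported1h, u.cacheWriteTokens);
  const fiveMin = u.cacheWriteTokens - oneHour;
  return (oneHour * oneHourRate + fiveMin * fiveMinRate) / 1e6;
}

// Sums two usage deltas while keeping "absent" distinct from an explicit 0.
// Like the PR, this assumes both deltas come from the same deploy era:
// mixing a split delta with an unsplit one would count the unsplit side as 0.
function addUsage(a: TokenUsage, b: TokenUsage): TokenUsage {
  const sum: TokenUsage = {
    inputTokens: a.inputTokens + b.inputTokens,
    cacheWriteTokens: a.cacheWriteTokens + b.cacheWriteTokens,
  };
  if (a.cacheWrite1hTokens !== undefined || b.cacheWrite1hTokens !== undefined) {
    sum.cacheWrite1hTokens = (a.cacheWrite1hTokens ?? 0) + (b.cacheWrite1hTokens ?? 0);
  }
  return sum;
}

const catalog: ModelCost = { input: 3, cacheWrite: 3.75 }; // made-up prices
const tiered: TokenUsage = { inputTokens: 0, cacheWriteTokens: 1_000_000, cacheWrite1hTokens: 200_000 };
const legacy: TokenUsage = { inputTokens: 0, cacheWriteTokens: 1_000_000 };
console.log(cacheWriteCost(tiered, catalog), cacheWriteCost(legacy, catalog));
```

With these made-up prices, the same million write tokens cost less under the tiered split than under the legacy all-1h assumption, which is the saving the tiering is after.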
Tests
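The load-bearing extraction guard mentioned in the commit notes might look roughly like the sketch below. It uses plain checks instead of the project's test runner, and `RawUsage`, `extractCacheWrite1hTokens`, and `check` are hypothetical names; only the `raw.cache_creation.ephemeral_1h_input_tokens` path and the `cacheWrite1hTokens` target come from this PR.

```typescript
// Hypothetical shape of the raw provider usage attached to a finish event.
interface RawUsage {
  cache_creation?: {
    ephemeral_1h_input_tokens?: number;
  };
}

// Hypothetical helper standing in for the engine's extraction step.
// Returns undefined (not 0) when the split is missing, so the cost model
// can still tell "no split reported" apart from "zero 1h writes".
function extractCacheWrite1hTokens(raw: RawUsage | undefined): number | undefined {
  const value = raw?.cache_creation?.ephemeral_1h_input_tokens;
  return typeof value === "number" ? value : undefined;
}

function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(`guard failed: ${message}`);
}

// Mock finish usage carrying the split, as the regression guard describes.
const mockFinishRaw: RawUsage = { cache_creation: { ephemeral_1h_input_tokens: 1200 } };

check(extractCacheWrite1hTokens(mockFinishRaw) === 1200, "1h tokens are extracted from raw usage");
check(extractCacheWrite1hTokens({}) === undefined, "missing split stays absent");
check(extractCacheWrite1hTokens(undefined) === undefined, "missing raw usage stays absent");
console.log("extraction guard passed");
```

The point of pinning the raw field path in a test is that a change in the SDK's usage shape fails loudly here instead of silently falling back to all-1h pricing.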
`verify:static` green; full unit suite (3,392) green.
Follow-up (separate, in the deployments repo)
The `tenant-summary.py` cost script needs the same TTL-aware treatment (it has its own pricing table) so the `/tenant-summary` numbers match. Handling that alongside this.