Skip to content

fix(usage): TTL-tier cache writes and price them correctly#406

Merged
mgoldsborough merged 1 commit into
mainfrom
fix/cache-cost-accuracy
Jun 10, 2026
Merged

fix(usage): TTL-tier cache writes and price them correctly#406
mgoldsborough merged 1 commit into
mainfrom
fix/cache-cost-accuracy

Conversation

@mgoldsborough

Copy link
Copy Markdown
Contributor

The bug

The engine cached everything at a 1-hour TTL (billed at 2× base input), but the cost model priced cache writes at the catalog's 1.25× 5-minute rate — so every dashboard and the cost script under-reported real cache-write spend by ~1.6×. Verified against the live API:

"cache_creation": { "ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 3376 }

Every write is billed at the 1h tier. The audit flagged this; this confirms and fixes it.

Why TTL should be tiered, not all-1h

After the rolling step-anchor (#401), the bulk of cache writes happen within a run — the agent loops seconds apart, and each turn re-reads the prior breakpoint almost immediately, far inside even the 5-minute window. The 1-hour premium (60% over 5-minute) buys nothing there. It only earns its keep on the stable prefix re-read after a between-runs pause (a user stepping away for minutes).

So the engine now tiers TTL by breakpoint stability:

Breakpoint TTL Why
system + tools (stable) 1h One write per run; expensive to rebuild; worth keeping warm across a pause
step-anchor + tail (rolling history) 5m Re-read within seconds inside a run; lapses cheaply after a pause while system+tools stay warm

This pays the premium only where it earns its keep — a meaningful write-cost cut on long conversations, with no loss of the #401 within-run caching (5-minute is plenty for second-apart reads).

TTL-aware costing (correct regardless of the TTL strategy)

Rather than hardcode a single rate, costing now uses the split Anthropic already returns:

  • The engine captures the per-call 1-hour write portion from usage.raw.cache_creation.ephemeral_1h_input_tokens onto a new TokenUsage.cacheWrite1hTokens.
  • cost.ts prices the 1h portion at 2× base and the 5-minute remainder at the catalog cacheWrite rate (which is the 5-minute rate — finally used correctly).
  • cacheWrite1hTokens is tri-state: absent means "no split reported" (legacy events from before tiering, or non-Anthropic providers), which cost treats as all-1h so historical figures stay accurate. addUsage/emptyUsage preserve the absent state rather than collapsing it to an explicit 0.

This makes the cost model correct for any TTL strategy — today's tiered split, a future all-5m, whatever — with no re-fix.

Tests

  • New: TTL tiering (system/tools = 1h, anchor/tail = 5m); split pricing (1h at 2×, 5m remainder at catalog rate, result strictly between all-1h and all-5m).
  • Updated: existing cost tests to the 1h rate; the engine's "last user message" breakpoint test to expect 5m (it's the rolling tail now).
  • verify:static green; full unit suite (3,392) green.

Follow-up (separate, in the deployments repo)

The tenant-summary.py cost script needs the same TTL-aware treatment (it has its own pricing table) so the /tenant-summary numbers match. Handling that alongside this.

mgoldsborough added a commit that referenced this pull request Jun 10, 2026
…HANGELOG (QA #406)

Address QA on #406:
- Add the missing regression guard for the load-bearing seam: an engine test
  with a mock whose finish usage carries raw.cache_creation, asserting the
  emitted TokenUsage.cacheWrite1hTokens === ephemeral_1h_input_tokens. Without
  it, an SDK-shape drift would silently drop the split → mostly-5m writes
  over-report at 2x, with green tests.
- Note in cost.ts that the "absent split → 2x" default is Anthropic-specific
  (the only provider that caches here / reports cache-write tokens); a future
  provider reporting cache writes at a different multiplier should gate on it.
- Fix the CHANGELOG: the earlier edit replaced the #401 bullet's header and
  merged the bodies into one run-on bullet. Split into three (TTL tiering,
  TTL-aware cost incl. the historical re-pricing operator note, and the restored
  #401 thrashing entry).

Kept FIVE_MIN_CACHE_WRITE_MULTIPLIER — it's the live `??` fallback rate (mirrors
the cacheRead ?? c.input fallback two lines up), not dead.
mgoldsborough added a commit that referenced this pull request Jun 10, 2026
The tri-state accumulation assumes all deltas are same-deploy-era (all carry the
cacheWrite1h split or none do). True within a run, and addUsage is never used to
sum raw usage across the deploy boundary — the usage aggregator sums per-record
costs (each priced correctly). A full fix for mixed-era aggregation needs
retroactive accounting for an unreachable path; document the assumption instead.

Keeping FIVE_MIN_CACHE_WRITE_MULTIPLIER (live ?? fallback, mirrors cacheRead);
a synced cacheWrite1h catalog field (vs the hardcoded 2x) is deferred.
The engine cached everything at a 1-hour TTL (billed 2x base input), but the
cost model priced writes at the catalog's 1.25x 5-minute rate — under-reporting
real cache-write spend by ~1.6x. Verified against the live API: writes report
under `cache_creation.ephemeral_1h_input_tokens`.

After the rolling step-anchor, the bulk of writes are within-run (the agent
loops seconds apart) where the 1h premium buys nothing. So:
- Tier TTL by breakpoint stability (model/cache-policy.ts): 1-hour on the stable
  system+tools block (one write per run, worth keeping warm), 5-minute on the
  rolling step-anchor + tail (re-read within seconds).
- Make costing TTL-aware: the engine captures the per-call 1h write portion from
  `usage.raw.cache_creation` onto `TokenUsage.cacheWrite1hTokens`; cost.ts prices
  the 1h portion at 2x base and the 5m remainder at the catalog `cacheWrite`
  rate. `cacheWrite1hTokens` is tri-state — absent means "no split reported"
  (legacy events, non-Anthropic), which cost treats as all-1h.

Tests: TTL tiering (system/tools=1h, anchor/tail=5m), split pricing, and the
load-bearing engine extraction of ephemeral_1h_input_tokens from raw usage.
The "absent → 2x" default is Anthropic-specific (only provider that reports
cache writes); addUsage's tri-state accumulation assumes same-era deltas (true
within a run; cross-deploy aggregation is handled per-record at the cost
boundary). A synced cacheWrite1h catalog field (vs hardcoded 2x) is deferred.
@mgoldsborough mgoldsborough force-pushed the fix/cache-cost-accuracy branch from be217f0 to 0d9beb5 Compare June 10, 2026 17:36
@mgoldsborough mgoldsborough merged commit c033e97 into main Jun 10, 2026
5 checks passed
@mgoldsborough mgoldsborough deleted the fix/cache-cost-accuracy branch June 10, 2026 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant