Summary
Follow-up to #8915 / #8916. Once the trailing-breakpoint fix lands, the Bedrock provider still uses only 2 of the 4 available cache breakpoints per request (system prompt + trailing message). The remaining 2 slots can be used to extend cache coverage across turns that add many blocks at once, and to make cache hits more resilient to edge cases in the 20-block lookback window.
This issue proposes two concrete strategies and invites discussion on which (if either) is worth implementing.
Background
Anthropic's prompt caching permits up to 4 `cache_control` breakpoints per request. Each breakpoint writes one cache entry keyed by the hash of the request prefix ending at that block, and reads walk backward up to 20 blocks from the breakpoint to find prior writes. See the Anthropic docs and Bedrock docs.
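For reference, a breakpoint in the Anthropic-native Messages shape is just a `cache_control` marker on a content block. A minimal illustrative sketch (the helper name and text are made up, not Goose code; the Bedrock request shape differs in its details, but the placement logic discussed below is the same):

```rust
use serde_json::{json, Value};

/// Illustrative only: marks the last content block of a user turn as a cache
/// breakpoint. Everything up to and including the marked block is hashed and
/// written as one cache entry; a later request that reproduces that prefix
/// (directly, or via the 20-block lookback) reads it back instead of
/// reprocessing it.
fn user_turn_with_breakpoint(text: &str) -> Value {
    json!({
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": text,
                // The breakpoint itself: a cache_control marker on the block.
                "cache_control": { "type": "ephemeral" }
            }
        ]
    })
}
```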
After #8916, Goose's layout is:
- Breakpoint 1: end of system prompt.
- Breakpoint 2: last visible message (trailing).
- Breakpoints 3, 4: unused.
This is correct and captures the majority of achievable savings. The 20-block lookback from breakpoint 2 finds the previous request's trailing write, so fresh processing is bounded to what was added between requests, provided that amount is under 20 blocks.
When a single turn adds 20+ blocks (e.g., a parallel-tool-use assistant message plus its matching `tool_result` user message, where N tool calls produce roughly 2N blocks, so 10+ parallel calls already overrun the window), the trailing lookback misses and the whole turn is reprocessed fresh. The cache self-heals on the next turn, so the worst case is bounded, but the savings are lost for that request.
Proposal A: Stateless backward-stride anchors
On each request, walk backward from the tail, placing a cache point at block positions `tail`, `tail - 20`, and `tail - 40`. That uses 3 of the 4 breakpoints for messages and keeps the fourth on the system prompt.
Why this works: the lookback from each breakpoint covers 20 blocks. Consecutive requests that grow the conversation by Δ blocks produce breakpoint positions shifted by Δ. If Δ ≤ 20, the trailing anchor's lookback reaches the previous request's trailing write; for larger Δ, one of the deeper anchors does (the `tail - 20` anchor covers 20 < Δ ≤ 40, the `tail - 40` anchor covers 40 < Δ ≤ 60). With 3 stacked 20-block windows, the scheme tolerates a per-request growth of up to ~60 blocks before missing entirely.
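A minimal sketch of the anchor selection, assuming the provider can count content blocks across the visible messages (the helper name and signature are hypothetical, not existing Goose code):

```rust
/// Hypothetical helper: given the total number of content blocks across all
/// visible messages, return the block indices that should carry a cache point
/// under Proposal A. Walks back from the tail in 20-block strides and keeps at
/// most `max_anchors` positions; the system-prompt breakpoint is handled
/// separately and consumes the remaining slot.
fn stride_anchor_positions(total_blocks: usize, max_anchors: usize) -> Vec<usize> {
    const LOOKBACK: usize = 20; // documented lookback window per breakpoint
    let mut anchors = Vec::new();
    let Some(tail) = total_blocks.checked_sub(1) else {
        return anchors; // no messages yet
    };
    for i in 0..max_anchors {
        match tail.checked_sub(i * LOOKBACK) {
            Some(pos) => anchors.push(pos), // tail, tail - 20, tail - 40, ...
            None => break,                  // conversation shorter than the next stride
        }
    }
    anchors
}

// e.g. a 75-block conversation with max_anchors = 3 yields [74, 54, 34]; on the
// next request, any growth Δ ≤ ~60 blocks leaves at least one new anchor whose
// 20-block lookback reaches a position the prior request wrote.
```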
Pros:
- Stateless. Pure function of the current messages.
- Extends resilience against heavy parallel-tool-use turns.
- No extra cost versus fewer breakpoints. Anthropic bills by what content is actually read or written, not by breakpoint count (see docs, "Understanding cache breakpoint costs": "Adding more cache_control breakpoints doesn't increase your costs").
Cons:
- Still lookback-based, so it inherits any edge cases in how the 20-block window is implemented.
Proposal B: Harness-tracked boundary between carried-over and new content
Remember where the trailing breakpoint was placed on the previous request. On the new request, place breakpoints at:
- The last carried-over block (exactly the previous trailing position).
- The new trailing block (the end of newly appended content).
Plus the system-prompt breakpoint. Total: 3 of 4 slots used.
Why this works: the carried-over breakpoint sits on a prefix whose hash is byte-identical to the previous request's trailing write, so the API computes the same hash and finds a direct match rather than a lookback match. The new trailing breakpoint writes a fresh entry, which becomes the carried-over breakpoint on the next request.
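A rough sketch of the per-session state this would take, assuming compaction/truncation events are visible to the provider (all names are hypothetical, not existing Goose code):

```rust
/// Hypothetical per-session state for Proposal B.
#[derive(Default)]
struct CacheBoundary {
    /// Block index where the trailing breakpoint sat on the previous request.
    last_trailing: Option<usize>,
}

impl CacheBoundary {
    /// Returns the message-block indices to mark on this request; the
    /// system-prompt breakpoint is handled separately.
    fn breakpoints_for(&mut self, total_blocks: usize) -> Vec<usize> {
        let mut points = Vec::new();
        let Some(tail) = total_blocks.checked_sub(1) else {
            return points; // no messages yet
        };
        // Re-mark the previous trailing position: its prefix hash is identical
        // to the entry written on the last request, so this is a direct hit.
        if let Some(prev) = self.last_trailing {
            if prev < tail {
                points.push(prev);
            }
        }
        // Fresh trailing breakpoint; becomes `prev` on the next request.
        points.push(tail);
        self.last_trailing = Some(tail);
        points
    }

    /// Must be called on compaction/truncation, since the stored index no
    /// longer names the same byte-identical prefix afterwards.
    fn invalidate(&mut self) {
        self.last_trailing = None;
    }
}
```

The invalidation hook is what makes the compaction concern below concrete: the stored index only means something while the carried-over prefix stays byte-stable.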
Pros:
- Direct hash match on every turn; does not depend on lookback behavior at all.
- Works for arbitrarily large single-turn additions (no 20-block ceiling on Δ).
- Uses fewer breakpoints than Proposal A, leaving slots free for other caching strategies (e.g., tool-definition cache points).
Cons:
- Requires per-session state in the provider: the block index (or equivalent marker) of the last trailing breakpoint.
- Needs careful handling of conversation compaction / truncation, since those events invalidate the stored boundary.
Recommendation
Proposal B is appealing as a closed-form solution: it gives a direct-match cache hit regardless of how many blocks a single turn adds, and it doesn't depend on the lookback mechanism at all.
Proposal A gives up that guarantee but covers essentially every realistic agent workload I'd expect in the near term. Three stacked 20-block windows accommodate per-request deltas up to ~60 blocks, which corresponds to turns with roughly 30 parallel tool calls (each producing one `tool_use` block plus one `tool_result` block). For that ceiling to bind in practice, agent harnesses would need to shift toward much heavier per-turn parallelism than is common today. I'm not predicting that shift; I'm naming the category of change that would make A's guarantees feel insufficient.
Given that, plus A's statelessness (no coordination concern with session management, retries, compaction invalidation, etc.), I lean toward A as the first follow-up. B stays worth considering if the per-session state turns out to be cheap given the rest of `BedrockProvider`, or if measured workloads start bumping into the 60-block ceiling.
Either way, this is a follow-up to #8916 and should only be considered after that lands.