docs: clarify supportsPromptCaching gates marker injection only, not cache token extraction

lewis617 · lewis617 · commit e246f079ba20 · 2026-06-12T16:27:37.000+08:00
Update spec 021 to explicitly distinguish two concerns:
- supportsPromptCaching / WAVE_PROMPT_CACHE_REGEX gates cache_control
  marker injection (messages/tools) — Claude-only
- Cache token extraction from usage applies to ALL models — no gate

Updated: spec.md (FR-001, FR-007, FR-008, edge cases, key entities),
data-model.md (relationships, model detection flow),
quickstart.md (phase 3 integration notes),
contracts/cache-control-api.md (scope note),
research.md (universal caching alternatives)
diff --git a/specs/021-prompt-cache-control/contracts/cache-control-api.md b/specs/021-prompt-cache-control/contracts/cache-control-api.md
@@ -121,7 +121,7 @@ const isClaudeModel: typeof supportsPromptCaching;
 - Example: `WAVE_PROMPT_CACHE_REGEX="claude|qwen"` matches both claude and qwen models
 - Invalid regex patterns fall back to simple "claude" matching
 
-**Note**: While `supportsPromptCaching` controls cache_control marker injection (only for Claude-like models), cache token extraction from usage applies to ALL models. Non-Claude models (Gemini, DeepSeek) return cache data via `prompt_tokens_details` which is an OpenAI-standard field.
+**Scope of `supportsPromptCaching`**: This function gates ONLY cache_control marker injection (adding `cache_control: {type: "ephemeral"}` to messages and tool definitions). It does NOT gate cache token extraction from usage responses — that applies to ALL models, since `prompt_tokens_details` is an OpenAI-standard field returned by Gemini, DeepSeek, and others.
 
 ### Cache Control Application
 
diff --git a/specs/021-prompt-cache-control/data-model.md b/specs/021-prompt-cache-control/data-model.md
@@ -106,9 +106,9 @@ interface ModelCacheConfig {
   - `"claude|qwen3\\.6-plus"` - matches "claude" or exact "qwen3.6-plus"
 
 **Relationships**:
-- Determines cache control application for entire request
+- Determines cache_control marker injection for the request (messages and tools)
+- Does NOT gate cache token extraction from usage — that applies to all models
 - Affects message transformation and tool processing
-- Influences usage tracking structure
 
 **Legacy Support**:
 - `isClaudeModel()` is deprecated alias for `supportsPromptCaching()`
@@ -173,7 +173,10 @@ Request 2+: cache_creation_input_tokens = 0, cache_read_input_tokens > 0 (cache
 ### Model Detection Flow
 
 ```
-ModelName -> isClaudeModel() -> shouldApplyCache -> Message Transformation
+ModelName -> supportsPromptCaching() -> shouldInjectCacheControlMarkers -> Message Transformation
+
+Note: Cache token extraction from usage is NOT gated by supportsPromptCaching.
+All models' usage responses are checked for cache tokens (Claude top-level + prompt_tokens_details).
 ```
 
 ## Data Flow Patterns
diff --git a/specs/021-prompt-cache-control/quickstart.md b/specs/021-prompt-cache-control/quickstart.md
@@ -71,8 +71,12 @@ export interface CacheControl {
 
 **Integration Point**: After line 183 (`openaiMessages` construction)
 
+> **Important**: Cache control has two distinct concerns:
+> 1. **Marker injection** (adding `cache_control: {type: "ephemeral"}` to messages/tools) — gated by `supportsPromptCaching`, only for Claude-like models
+> 2. **Cache token extraction** (reading cache metrics from usage responses) — NOT gated, applies to all models
+
 ```typescript
-// Add before createParams construction (line ~195)
+// Marker injection: gated by supportsPromptCaching
 if (supportsPromptCaching(model || modelConfig.model)) {
   openaiMessages = transformMessagesForClaudeCache(
     openaiMessages,
@@ -86,7 +90,7 @@ if (supportsPromptCaching(model || modelConfig.model)) {
 }
 ```
 
-**Usage Extension**: Modify usage processing to extract cache metrics from all models
+**Usage Extension**: Cache token extraction applies to all models (no gate)
 
 ```typescript
 // Extend usage object with cache metrics (Claude top-level + OpenAI prompt_tokens_details)
diff --git a/specs/021-prompt-cache-control/research.md b/specs/021-prompt-cache-control/research.md
@@ -30,7 +30,8 @@
 
 **Alternatives considered**:
 - Custom cache duration: Not supported by Claude API
-- Universal caching: Rejected due to cost and limited benefit for non-Claude models
+- Universal cache_control marker injection: Rejected due to cost and limited benefit for non-Claude models (only Claude API supports `cache_control: {type: "ephemeral"}` markers)
+- Universal cache token extraction: Accepted — `prompt_tokens_details` is an OpenAI-standard field; extracting cache metrics from all models' usage is zero-cost and provides visibility
 - Image content caching: Not supported by Claude API
 
 ## Implementation Architecture Research
diff --git a/specs/021-prompt-cache-control/spec.md b/specs/021-prompt-cache-control/spec.md
@@ -93,7 +93,7 @@ As a user who switches between permission modes (e.g., default → plan → acce
 
 ### Edge Cases
 
-- **Edge Case 1**: Model name detection MUST be case-insensitive ("Claude-3-Sonnet" and "claude-3-sonnet" both trigger caching)
+- **Edge Case 1**: Model name detection MUST be case-insensitive ("Claude-3-Sonnet" and "claude-3-sonnet" both trigger cache_control marker injection). Cache token extraction from usage applies regardless of model name.
 - **Edge Case 2**: Mixed content messages MUST apply cache_control only to text content parts, preserving images unchanged
 - **Edge Case 3**: Empty conversation history MUST skip user message caching, apply system message caching only
 - **Edge Case 4**: Streaming and non-streaming requests MUST apply identical cache_control transformation logic
@@ -104,14 +104,14 @@ As a user who switches between permission modes (e.g., default → plan → acce
 
 ### Functional Requirements
 
-- **FR-001**: System MUST detect cache-supporting models using the `WAVE_PROMPT_CACHE_REGEX` environment variable (default: "claude"), which allows configurable regex patterns for model matching
+- **FR-001**: System MUST detect cache-supporting models for cache_control marker injection using the `WAVE_PROMPT_CACHE_REGEX` environment variable (default: "claude"), which allows configurable regex patterns for model matching. This gate controls ONLY the injection of `cache_control: {type: "ephemeral"}` markers into messages and tool definitions — it does NOT gate cache token extraction from usage responses, which applies to all models.
 - **FR-002**: System MUST add cache_control markers with type "ephemeral" to the first system message when using Claude models. This ensures core instructions are always cached even if reminders are added later. The system prompt MUST remain constant across plan mode transitions — plan mode instructions are injected as `<system-reminder>` user messages rather than system prompt changes to preserve the cached system prompt prefix. The `<env>` section's `Primary working directory` field MUST use the immutable `originalWorkdir` (set once at session start) rather than the dynamic `workdir` (which tracks `cd` changes), so that CWD changes do not invalidate the cached system prompt.
 - **FR-003**: System MUST create a cache marker when total message count reaches multiples of 20 (20, 40, 60, etc.)
 - **FR-004**: System MUST NOT create cache markers when total message count is below 20 or not a multiple of 20
 - **FR-005**: System MUST maintain cache markers at the most recent multiple-of-20 message position (sliding window)
 - **FR-006**: System MUST include cached messages in the context provided to the AI agent
-- **FR-007**: System MUST not add cache_control markers when using non-Claude models
-- **FR-008**: System MUST extend usage tracking to include cache-related metrics. Cache tokens are extracted from two sources with priority ordering: (1) Claude top-level fields (cache_read_input_tokens, cache_creation_input_tokens, cache_creation object) take priority, (2) OpenAI-standard prompt_tokens_details fields (cached_tokens → cache_read_input_tokens, cache_creation_input_tokens → cache_creation_input_tokens) serve as fallback for non-Claude models that return cache data via prompt_tokens_details
+- **FR-007**: System MUST NOT add cache_control markers when using non-Claude models (as determined by `WAVE_PROMPT_CACHE_REGEX`). However, cache token extraction from usage (FR-008) applies to all models regardless of this gate.
+- **FR-008**: System MUST extend usage tracking to include cache-related metrics for ALL models (not gated by `supportsPromptCaching`). Cache tokens are extracted from two sources with priority ordering: (1) Claude top-level fields (cache_read_input_tokens, cache_creation_input_tokens, cache_creation object) take priority, (2) OpenAI-standard prompt_tokens_details fields (cached_tokens → cache_read_input_tokens, cache_creation_input_tokens → cache_creation_input_tokens) serve as fallback for non-Claude models that return cache data via prompt_tokens_details
 - **FR-009**: System MUST apply cache_control markers identically for both streaming and non-streaming requests during message preparation phase
 - **FR-010**: System MUST maintain backward compatibility with existing message processing logic (except for the cache strategy itself which is a breaking change)
 - **FR-011**: System MUST support caching for different message roles at interval positions, applying cache_control only at block level:
@@ -126,5 +126,5 @@ As a user who switches between permission modes (e.g., default → plan → acce
 - **Cache Marker**: Represents a point in the conversation where messages are preserved for context, containing the message position and associated conversation content
 - **Message Context**: Represents the combination of system prompt, tools, and cached messages that provide context for AI agent responses
 - **Enhanced Usage Metrics**: Extended usage object including cache-related token counts and creation breakdown
-- **Claude Model Detection**: Boolean determination based on case-insensitive model name matching
+- **Claude Model Detection**: Boolean determination based on case-insensitive model name matching. Gates cache_control marker injection only — cache token extraction applies to all models.
 - **Structured Message Content**: Array-based message content format supporting cache_control on individual content parts