System prompt forwarded with per-request x-anthropic-billing-header line — defeats upstream prompt cache #2

@ajcasagrande

Summary

When converting an Anthropic /v1/messages request to a Codex Responses payload, the proxy concatenates the inbound system array verbatim into instructions. Anthropic clients (notably Claude Code) prefix that array with a non-semantic header line containing a hash that changes on every request, so the upstream prompt prefix is unique per call and Codex's prompt cache never hits.

Background

Claude Code prefixes its system prompt with a line of the form:

x-anthropic-billing-header: cc_version=2.1.117.48f; cc_entrypoint=cli; cch=71fea;

The cch=<hash> portion regenerates on every request. The line carries no semantic value to the model — it's a billing/telemetry header. When forwarded into the Codex instructions field unchanged, every turn presents a brand-new prefix to the backend.

Root cause

src/transformers/request.ts, extractSystemPrompt:

function extractSystemPrompt(system: string | ContentBlock[] | undefined): string {
  if (!system) return "";
  if (typeof system === "string") return system;

  return system
    .map((block) => {
      if (block.type === "text" && block.text) return block.text;
      return "";
    })
    .filter(Boolean)
    .join("\n");
}

Every text block is forwarded as-is. There's no filter for the x-anthropic-billing-header: line (or any other non-semantic Anthropic header line). The result is assigned to instructions on the outbound Codex Responses payload.

No prompt_cache_key is set on the outbound payload either, so the upstream has no stable cache routing key to fall back on.

Impact

  • Cost. Long multi-turn Claude Code sessions re-bill the full input context (often a large CLAUDE.md + tool catalog + accumulated message history) on every turn. Typical multiplier vs. correct cache reuse: 5–10×.
  • Latency. Cache-miss prefills are noticeably slower than cache hits, so every turn after the first feels sluggish.
  • Subscription throughput. Per-account rate limits exhaust faster than they should because effective input-token throughput is lower.

Suggested fix

Strip non-semantic Anthropic header lines before joining:

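// Drops any line that starts with the billing header name (case-insensitive,
// ignoring surrounding whitespace) and trims the result, so a block that
// contained only the header collapses to "" and is removed by filter(Boolean).
function stripNonSemanticSystemLines(text: string): string {
  return text
    .split("\n")
    .filter((line) => !line.trim().toLowerCase().startsWith("x-anthropic-billing-header:"))
    .join("\n")
    .trim();
}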

function extractSystemPrompt(system: string | ContentBlock[] | undefined): string {
  if (!system) return "";
  if (typeof system === "string") return stripNonSemanticSystemLines(system);

  return system
    .map((block) => {
      if (block.type === "text" && block.text) return stripNonSemanticSystemLines(block.text);
      return "";
    })
    .filter(Boolean)
    .join("\n\n");
}
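
As a quick sanity check of the suggested fix, the sketch below runs it against two system arrays that differ only in cch. The ContentBlock type is a minimal stand-in for the proxy's real type, and the prompt text and second hash value are placeholders, not real Claude Code content.

```typescript
// Minimal stand-in for the proxy's ContentBlock type (assumption: only the
// fields used by extractSystemPrompt matter here).
type ContentBlock = { type: string; text?: string };

function stripNonSemanticSystemLines(text: string): string {
  return text
    .split("\n")
    .filter((line) => !line.trim().toLowerCase().startsWith("x-anthropic-billing-header:"))
    .join("\n")
    .trim();
}

function extractSystemPrompt(system: string | ContentBlock[] | undefined): string {
  if (!system) return "";
  if (typeof system === "string") return stripNonSemanticSystemLines(system);

  return system
    .map((block) => {
      if (block.type === "text" && block.text) return stripNonSemanticSystemLines(block.text);
      return "";
    })
    .filter(Boolean)
    .join("\n\n");
}

// Hypothetical sample input: identical prompt text, different cch value.
const makeSystem = (cch: string): ContentBlock[] => [
  {
    type: "text",
    text: `x-anthropic-billing-header: cc_version=2.1.117.48f; cc_entrypoint=cli; cch=${cch};\nYou are Claude Code.`,
  },
  { type: "text", text: "Project instructions from CLAUDE.md." },
];

const first = extractSystemPrompt(makeSystem("71fea"));
const second = extractSystemPrompt(makeSystem("0b2c9"));

console.log(first === second); // true
console.log(first.includes("x-anthropic-billing-header")); // false
```

With the header removed, both requests yield the same instructions string, which is the precondition for a prefix cache hit. Note that this version joins blocks with a blank line ("\n\n") where the current code uses "\n".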

Optional second hardening: in the Codex request payload, set prompt_cache_key to a stable session-scoped value (e.g. derived from a conversation identifier). The Responses API uses this field for cache routing across requests with otherwise-equivalent prefixes.
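
One possible shape for that hardening, as a sketch only: deriveCacheKey, withCacheKey and CodexRequestSketch are hypothetical names, and where the proxy obtains a conversation identifier is an assumption this issue does not settle. Hashing keeps the key opaque and fixed-length.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: map a conversation identifier to a stable, opaque key.
// The same identifier always yields the same key.
function deriveCacheKey(conversationId: string): string {
  return createHash("sha256").update(conversationId).digest("hex").slice(0, 32);
}

// Hypothetical, trimmed-down view of the outbound payload; the real type in
// the proxy has more fields.
interface CodexRequestSketch {
  instructions: string;
  prompt_cache_key?: string;
}

// Attach the key only when an identifier is available; otherwise leave the
// payload unchanged rather than inventing a per-request value, which would
// defeat the purpose.
function withCacheKey(
  payload: CodexRequestSketch,
  conversationId: string | undefined,
): CodexRequestSketch {
  if (!conversationId) return payload;
  return { ...payload, prompt_cache_key: deriveCacheKey(conversationId) };
}

const keyed = withCacheKey({ instructions: "..." }, "conversation-123");
console.log(keyed.prompt_cache_key?.length); // 32
```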

Reproduction

  1. Send any Anthropic /v1/messages request whose system array begins with a text block containing x-anthropic-billing-header: cc_version=...; cch=<hash>; as its first line. (Real Claude Code traffic does this automatically.)
  2. Observe the outbound Codex Responses payload — the instructions field starts with the billing-header line.
  3. Send a second request with a different cch=<hash> to simulate the next Claude Code request. The two outbound instructions differ in their first line, so prefix-based caching at the upstream cannot hit between them.
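
The steps above can also be approximated offline, without sending traffic. The sketch below copies the current extractSystemPrompt from the Root cause section and measures how much of the output two requests share. ContentBlock is a minimal stand-in type, sharedPrefixLength is a hypothetical helper, and the prompt body and second hash are placeholders. Character-level comparison is only a rough proxy for token-level prefix matching at the upstream.

```typescript
// Minimal stand-in for the proxy's ContentBlock type (assumption).
type ContentBlock = { type: string; text?: string };

// Current behaviour, copied from the Root cause section.
function extractSystemPrompt(system: string | ContentBlock[] | undefined): string {
  if (!system) return "";
  if (typeof system === "string") return system;

  return system
    .map((block) => {
      if (block.type === "text" && block.text) return block.text;
      return "";
    })
    .filter(Boolean)
    .join("\n");
}

// Hypothetical helper: number of leading characters two strings have in common.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const header = (cch: string): string =>
  `x-anthropic-billing-header: cc_version=2.1.117.48f; cc_entrypoint=cli; cch=${cch};`;

// Placeholder standing in for a long system prompt.
const body = "placeholder system prompt ".repeat(200);

const a = extractSystemPrompt([{ type: "text", text: `${header("71fea")}\n${body}` }]);
const b = extractSystemPrompt([{ type: "text", text: `${header("0b2c9")}\n${body}` }]);

console.log(a.split("\n")[0] === b.split("\n")[0]); // false
console.log(`${sharedPrefixLength(a, b)} of ${a.length} characters shared`);
```

The two outputs diverge inside the first line, at the cch value, so everything after that point falls outside the shared prefix even though it is identical.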
