ai-proxy and ai-proxy-advanced: Gemini driver drops cachedContent field during OpenAI-to-Gemini transformation #14837

@shireen-bean

Description

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

Kong 3.13.x (ai-proxy-advanced plugin)

Current Behavior

When sending a request to the llm/v1/chat route with provider: gemini, the cachedContent field in the request body is silently dropped during the OpenAI-to-Gemini transformation. The request succeeds, but the Vertex AI context cache is not used.

The response shows prompt_tokens reflecting only the user message (e.g. 8 tokens), with no cachedContentTokenCount — confirming the cache was ignored.

Expected Behavior

The cachedContent field should be preserved through the OpenAI-to-Gemini transformation and included in the outgoing request to Vertex AI, so that users can leverage Vertex AI context caching from the OpenAI-compatible /v1/chat/completions route.

The response should include cachedContentTokenCount in usageMetadata, and promptTokenCount should reflect the cached tokens plus the user message.
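For reference, a successful cache hit on the native Vertex AI endpoint surfaces in usageMetadata along these lines (token counts are illustrative, assuming a ~32,768-token cache plus a short user message):

```json
{
  "usageMetadata": {
    "promptTokenCount": 32776,
    "cachedContentTokenCount": 32768,
    "candidatesTokenCount": 12,
    "totalTokenCount": 32788
  }
}
```

This is the shape the OpenAI-compatible route should be able to reflect once cachedContent is forwarded.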

Steps To Reproduce

  1. Create a Vertex AI context cache via the cachedContents API with a large system prompt (32,768+ tokens)
  2. Configure ai-proxy-advanced with provider: gemini and route_type: llm/v1/chat
  3. Send a request to /v1/chat/completions with cachedContent in the body (e.g. via extra_body in the OpenAI SDK):
{
  "model": "gemini-2.0-flash-001",
  "cachedContent": "projects/123456789/locations/us-central1/cachedContents/987654321",
  "messages": [
    {"role": "user", "content": "What company are you an expert on?"}
  ]
}
  4. Observe that the response usage.prompt_tokens only reflects the user message size (e.g. 8), not the cached content. The cache is not being used.

  5. For comparison, send the same request via the native generateContent endpoint (with llm_format: gemini, which bypasses transformation); this works correctly and returns cachedContentTokenCount in the response.
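For step 3, note that the OpenAI Python SDK's extra_body keyword merges arbitrary keys into the top level of the request JSON, which is how cachedContent travels alongside model and messages. A minimal sketch of the resulting payload (no live gateway needed; values are the placeholders from above):

```python
import json

# Build the body the SDK would send for
# client.chat.completions.create(..., extra_body={...}):
# extra_body keys are merged into the top level of the JSON body.
body = {
    "model": "gemini-2.0-flash-001",
    "messages": [
        {"role": "user", "content": "What company are you an expert on?"}
    ],
}
extra_body = {
    "cachedContent": "projects/123456789/locations/us-central1/cachedContents/987654321"
}
body.update(extra_body)

payload = json.dumps(body, indent=2)
print("cachedContent" in body)
```

So the field does reach Kong intact; it is lost inside the driver's transformation, not on the client side.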

Anything else?

Root cause is in kong/llm/drivers/gemini.lua lines 272-383. The to_gemini_chat_openai function builds a new request body from scratch:

local function to_gemini_chat_openai(request_table, model_info, route_type)
  local new_r = {}
  -- only populates: new_r.contents, new_r.systemInstruction,
  -- new_r.generationConfig, new_r.tools, new_r.tool_config
  return new_r, "application/json", nil
end

request_table.cachedContent is never read or assigned to new_r.

Suggested fix — add one line before the return:

new_r.cachedContent = request_table.cachedContent
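The dropped-field behavior and the fix can be modeled in a few lines (Python purely for illustration; the actual driver is Lua, and this toy transform omits most of what the real one does, such as mapping assistant messages to the "model" role):

```python
def to_gemini_chat_openai(request_table: dict) -> dict:
    """Toy model of the Lua transform: it builds a new body from
    scratch, copying only a fixed set of keys, so any field it does
    not explicitly handle is silently dropped."""
    new_r = {}
    if "messages" in request_table:
        new_r["contents"] = [
            {"role": m["role"], "parts": [{"text": m["content"]}]}
            for m in request_table["messages"]
        ]
    # The suggested one-line fix: pass cachedContent through unchanged.
    if request_table.get("cachedContent") is not None:
        new_r["cachedContent"] = request_table["cachedContent"]
    return new_r

req = {
    "messages": [{"role": "user", "content": "hi"}],
    "cachedContent": "projects/123456789/locations/us-central1/cachedContents/987654321",
}
out = to_gemini_chat_openai(req)
print(out["cachedContent"])
```

Without the guarded assignment, `out` would contain only `contents`, which is exactly the observed behavior on the llm/v1/chat route.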

Context caching is a first-class Vertex AI feature for reducing latency and cost on large static prompts. Supporting it on the OpenAI-compatible route would let users access it without switching to the native Gemini API format.
