Is there an existing issue for this?
Kong version ($ kong version)
Kong 3.13.x (ai-proxy-advanced plugin)
Current Behavior
When sending a request to the llm/v1/chat route with provider: gemini, the cachedContent field in the request body is silently dropped during the OpenAI-to-Gemini transformation. The request succeeds, but the Vertex AI context cache is not used.
The response shows prompt_tokens reflecting only the user message (e.g. 8 tokens), with no cachedContentTokenCount — confirming the cache was ignored.
Expected Behavior
The cachedContent field should be preserved through the OpenAI-to-Gemini transformation and included in the outgoing request to Vertex AI, so that users can leverage Vertex AI context caching from the OpenAI-compatible /v1/chat/completions route.
The response should include cachedContentTokenCount in usageMetadata, and promptTokenCount should reflect the cached tokens plus the user message.
Steps To Reproduce
- Create a Vertex AI context cache via the cachedContents API with a large system prompt (32,768+ tokens)
- Configure ai-proxy-advanced with provider: gemini and route_type: llm/v1/chat
- Send a request to /v1/chat/completions with cachedContent in the body (e.g. via extra_body in the OpenAI SDK):
{
"model": "gemini-2.0-flash-001",
"cachedContent": "projects/123456789/locations/us-central1/cachedContents/987654321",
"messages": [
{"role": "user", "content": "What company are you an expert on?"}
]
}
- Observe that the response usage.prompt_tokens only reflects the user message size (e.g. 8), not the cached content. The cache is not being used.
- For comparison, send the same request via the native generateContent endpoint (with llm_format: gemini, which bypasses transformation) — this works correctly and returns cachedContentTokenCount in the response.
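The OpenAI SDK's extra_body simply merges extra keys into the top-level JSON request body, which is how cachedContent reaches the plugin. A minimal sketch of the payload the SDK would send (field names taken from the repro above; the project and cache IDs are placeholders):

```python
import json

# The SDK builds this for chat.completions.create(..., extra_body=...):
# extra_body keys are merged into the top-level JSON request body.
base_body = {
    "model": "gemini-2.0-flash-001",
    "messages": [
        {"role": "user", "content": "What company are you an expert on?"}
    ],
}
extra_body = {
    "cachedContent": "projects/123456789/locations/us-central1/cachedContents/987654321",
}
payload = {**base_body, **extra_body}

print(json.dumps(payload, indent=2))
```

This is the body Kong receives on /v1/chat/completions; the bug is that cachedContent does not survive the transformation to the Gemini request format.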
Anything else?
Root cause is in kong/llm/drivers/gemini.lua lines 272-383. The to_gemini_chat_openai function builds a new request body from scratch:
local function to_gemini_chat_openai(request_table, model_info, route_type)
local new_r = {}
-- only populates: new_r.contents, new_r.systemInstruction,
-- new_r.generationConfig, new_r.tools, new_r.tool_config
return new_r, "application/json", nil
end
request_table.cachedContent is never read or assigned to new_r.
Suggested fix — add one line before the return:
new_r.cachedContent = request_table.cachedContent
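To illustrate the fix in isolation, here is a Python model of the transformation (not the actual Lua driver, which handles many more fields): because the transformer builds a fresh body from scratch, any input field it does not explicitly copy is dropped, and forwarding cachedContent is a plain passthrough.

```python
def to_gemini_chat_openai(request_table: dict) -> dict:
    """Simplified model of the OpenAI-to-Gemini transform: it builds a
    new body from scratch, so input fields it never copies are dropped."""
    new_r = {
        "contents": [
            {"role": "user", "parts": [{"text": m["content"]}]}
            for m in request_table.get("messages", [])
            if m["role"] == "user"
        ],
    }
    # Proposed fix: pass the Vertex AI context-cache reference through.
    if request_table.get("cachedContent"):
        new_r["cachedContent"] = request_table["cachedContent"]
    return new_r

openai_body = {
    "model": "gemini-2.0-flash-001",
    "cachedContent": "projects/123456789/locations/us-central1/cachedContents/987654321",
    "messages": [{"role": "user", "content": "What company are you an expert on?"}],
}
gemini_body = to_gemini_chat_openai(openai_body)
```

Without the guarded assignment, gemini_body has no cachedContent key, which mirrors the current behavior of the driver.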
Context caching is a first-class Vertex AI feature for reducing latency and cost on large static prompts. Supporting it on the OpenAI-compatible route would let users access it without switching to the native Gemini API format.