Add Vertex AI prompt caching support for Claude models #961
Conversation
I'm currently seeing that tests are not passing in the upstream main branch; I believe this is causing unit tests to fail here. If I run them on my local branch, they pass!
Oh, I completely missed this one! I don't personally use Vertex. @lupuletic, if you or anyone has a minute to review, I'm happy to ship it.
Thank you for opening this @aitoroses, and very sorry I missed it!
Tests should be less flaky if you merge in main |
src/api/providers/vertex.ts
Outdated
// 2. Cache the most relevant context (usually at the end of the message)
const isLastTextBlock =
    contentIndex ===
    array.reduce((lastIndex, c, i) => (c.type === "text" ? i : lastIndex), -1)
For each item in the array, we seem to be running a reduce operation on the entire array to find the index of the last text block. This means if there are N items in the array, we're doing N full array traversals, which is O(N²) complexity.
// Current implementation (lines 118-132)
message.content.map((content, contentIndex, array) => {
    // Images and other non-text content are passed through unchanged
    if (content.type === "image") {
        return content as VertexImageBlock
    }
    // We only cache the last text block in each message to:
    // 1. Stay under the 4-block cache limit
    // 2. Cache the most relevant context (usually at the end of the message)
    const isLastTextBlock =
        contentIndex ===
        array.reduce((lastIndex, c, i) => (c.type === "text" ? i : lastIndex), -1)
    return {
        type: "text" as const,
        text: (content as { text: string }).text,
        ...(shouldCache && isLastTextBlock && { cache_control: { type: "ephemeral" } }),
    }
})
We can optimize this by calculating the last text block index once before the map operation, reducing the complexity to O(N):
private formatMessageForCache(message: Anthropic.Messages.MessageParam, shouldCache: boolean): VertexMessage {
    // Assistant messages are kept as-is since they can't be cached
    if (message.role === "assistant") {
        return message as VertexMessage
    }

    // For string content, we convert to array format with optional cache control
    if (typeof message.content === "string") {
        return {
            ...message,
            content: [
                {
                    type: "text" as const,
                    text: message.content,
                    // For string content, we only have one block so it's always the last
                    ...(shouldCache && { cache_control: { type: "ephemeral" } }),
                },
            ],
        }
    }

    // For array content, find the last text block index once before mapping
    const lastTextBlockIndex = message.content.reduce(
        (lastIndex: number, content: Anthropic.Messages.ContentBlock, index: number) =>
            content.type === "text" ? index : lastIndex,
        -1,
    )

    // Then use this pre-calculated index in the map function
    return {
        ...message,
        content: message.content.map((content: Anthropic.Messages.ContentBlock, contentIndex: number) => {
            // Images and other non-text content are passed through unchanged
            if (content.type === "image") {
                return content as VertexImageBlock
            }
            // Check if this is the last text block using our pre-calculated index
            const isLastTextBlock = contentIndex === lastTextBlockIndex
            return {
                type: "text" as const,
                text: (content as Anthropic.Messages.TextBlock).text,
                ...(shouldCache && isLastTextBlock && { cache_control: { type: "ephemeral" } }),
            }
        }),
    }
}
Let me try this!
// Find indices of user messages that we want to cache
// We only cache the last two user messages to stay within the 4-block limit
// (1 block for system + 1 block each for last two user messages = 3 total)
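A minimal sketch of how that selection can be computed up front (the helper name and shapes are assumptions for illustration, not the PR's exact code):

import { Anthropic } from "@anthropic-ai/sdk"

// Hypothetical helper: walk backwards and collect the indices of the
// last two user messages, which are the only ones marked for caching.
function getLastTwoUserMessageIndices(messages: Anthropic.Messages.MessageParam[]): Set<number> {
    const indices = new Set<number>()
    for (let i = messages.length - 1; i >= 0 && indices.size < 2; i--) {
        if (messages[i].role === "user") {
            indices.add(i)
        }
    }
    return indices
}

// Usage: shouldCache is true only for the selected messages, e.g.
// messages.map((msg, i) => formatMessageForCache(msg, cacheIndices.has(i)))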
Was wondering if we can reduce costs further by making use of the 4th cache block, but then I noticed the OpenRouter implementation does the same thing. However, I don't have a good enough understanding to suggest what else would be beneficial to cache, so I'm leaving this here as a question on whether there are additional caching opportunities (it could actually increase costs if we just write and never read).
Hi @lupuletic! I originally tried using one more block, but it gave errors, so it just uses 3 + 1 (system). That's sufficient to cache the big system prompt!
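For reference, a hedged sketch of what caching the system prompt as its own block looks like in the Anthropic Messages API shape (not necessarily the PR's exact code):

const systemPrompt = "...large system prompt..." // placeholder for the real prompt

// The system prompt is sent as a single text block with cache_control,
// consuming one of the four available cache blocks; the remaining blocks
// cover the cached user messages.
const systemBlocks = [
    {
        type: "text" as const,
        text: systemPrompt,
        cache_control: { type: "ephemeral" as const },
    },
]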
- Implemented comprehensive prompt caching strategy for Vertex AI models
- Added support for caching system prompts and user message text blocks
- Enhanced stream processing to handle cache-related usage metrics
- Updated model configurations to enable prompt caching
- Improved type definitions for Vertex AI message handling
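The "enhanced stream processing" item above refers to the cache usage fields an Anthropic-style stream reports. A minimal sketch of surfacing them (assumed chunk and output shapes, not the PR's exact code):

import { Anthropic } from "@anthropic-ai/sdk"

// message_start carries the usage totals, including the cache metrics
// (cache_creation_input_tokens / cache_read_input_tokens) when caching is active.
async function* extractUsage(stream: AsyncIterable<Anthropic.Messages.RawMessageStreamEvent>) {
    for await (const chunk of stream) {
        if (chunk.type === "message_start") {
            const usage = chunk.message.usage
            yield {
                type: "usage" as const,
                inputTokens: usage.input_tokens ?? 0,
                outputTokens: usage.output_tokens ?? 0,
                cacheWriteTokens: usage.cache_creation_input_tokens ?? 0,
                cacheReadTokens: usage.cache_read_input_tokens ?? 0,
            }
        }
    }
}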
Force-pushed from 0135946 to 9b267e9
@@ -435,41 +435,51 @@ export const vertexModels = {
 	contextWindow: 200_000,
 	supportsImages: true,
 	supportsComputerUse: true,
-	supportsPromptCache: false,
+	supportsPromptCache: true,
Can we please enable this for Claude Sonnet 3.7 as well? It's just a few lines above.
Otherwise LGTM, and thanks a lot for implementing this one @aitoroses!
Awesome! Will look today
@@ -441,7 +441,7 @@ export const vertexModels = {
 	contextWindow: 200_000,
 	supportsImages: true,
 	supportsComputerUse: true,
-	supportsPromptCache: false,
+	supportsPromptCache: true,
 	inputPrice: 3.0,
 	outputPrice: 15.0,
It also needs the cache write and read prices (they are the same as Sonnet v2's).
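For illustration, a hypothetical sketch of the completed Sonnet 3.7 entry. The field names are assumed to mirror the existing vertexModels entries, and the cache prices assume parity with Sonnet v2 (writes at 1.25x the input price, reads at 0.1x):

// Hypothetical vertexModels entry fragment; prices assume parity with Sonnet v2.
const claudeSonnet37Entry = {
    contextWindow: 200_000,
    supportsImages: true,
    supportsComputerUse: true,
    supportsPromptCache: true,
    inputPrice: 3.0,
    outputPrice: 15.0,
    cacheWritesPrice: 3.75, // 1.25 x inputPrice
    cacheReadsPrice: 0.3, // 0.1 x inputPrice
}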
Thanks a lot -- I am using Vertex AI quite a bit, and I'm currently on a local installation of RooCode to benefit from the caching cost savings. However, I am sure this will benefit more people, so I'm keen to get this released! If preferred, I am also happy to spin up a new PR with all the new changes needed. Thanks again, all!
Code looks good to me! @lupuletic which changes do you need pulled in? I can also just merge if you want to fast follow.
Yeep, looks good to me too! It's just the cache costs for 3.7 missing, which I added in a follow-up PR here: #1244
LGTM
Thanks all!
Description
This PR adds prompt caching support for the Vertex API integration. The main changes are:
- cache_control fields have been added to user messages and system prompts in src/api/providers/vertex.ts when caching is supported.
- src/shared/api.ts has been updated to enable prompt caching for supported models and includes pricing details for cache writes and reads.
- Tests in src/api/providers/__tests__/vertex.test.ts have been updated to simulate and verify the new caching behavior.

Type of change
How Has This Been Tested?
Updated tests in vertex.test.ts simulate prompt caching scenarios.

Checklist:
Additional context
Caching is only applied to models with supportsPromptCache enabled.

Related Issues
N/A
Reviewers
Important
Add prompt caching support for Claude models in Vertex AI, updating configurations and tests to handle caching behavior.
- Adds cache_control fields to user messages and system prompts in vertex.ts.
- Updates vertexModels in api.ts for supported models.
- Updates vertex.test.ts to simulate and verify caching behavior.

This description was created by a bot for 3ddd4c9. It will automatically update as commits are pushed.