Add Vertex AI prompt caching support for Claude models #961


Merged
3 commits merged into RooCodeInc:main from feat/vertex-prompt-caching on Feb 27, 2025

Conversation

@aitoroses commented Feb 12, 2025

Description

This PR adds prompt caching support for the Vertex API integration. The main changes are:

  • Caching Implementation:
    • Ephemeral cache_control fields are added to user messages and system prompts in src/api/providers/vertex.ts when caching is supported (a rough sketch follows this list).
    • Token usage for cache writes and cache reads is now tracked and emitted during streaming.
  • Configuration Updates:
    • The configuration in src/shared/api.ts has been updated to enable prompt caching for supported models and includes pricing details for cache writes and reads.
  • Testing Enhancements:
    • Unit tests in src/api/providers/__tests__/vertex.test.ts have been updated to simulate and verify the new caching behavior.
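
A hedged sketch of what the first two bullets could look like in TypeScript (not the exact code in vertex.ts, and assuming an @anthropic-ai/sdk version whose TextBlockParam accepts cache_control; cache_creation_input_tokens / cache_read_input_tokens are the usage field names Anthropic documents for the streamed message_start event, while buildSystemBlocks, toUsageChunk, cacheWriteTokens and cacheReadTokens are illustrative names):

import { Anthropic } from "@anthropic-ai/sdk"

// Sketch: when the model supports prompt caching, the system prompt is sent as a
// text block carrying an ephemeral cache_control marker.
function buildSystemBlocks(systemPrompt: string, supportsPromptCache: boolean): Anthropic.Messages.TextBlockParam[] {
	return [
		{
			type: "text" as const,
			text: systemPrompt,
			...(supportsPromptCache && { cache_control: { type: "ephemeral" as const } }),
		},
	]
}

// Sketch: cache write/read token counts are taken from the usage object of the
// streamed message_start event and emitted alongside the regular input/output counts.
function toUsageChunk(usage: Anthropic.Messages.Usage) {
	return {
		type: "usage" as const,
		inputTokens: usage.input_tokens ?? 0,
		outputTokens: usage.output_tokens ?? 0,
		cacheWriteTokens: usage.cache_creation_input_tokens ?? 0,
		cacheReadTokens: usage.cache_read_input_tokens ?? 0,
	}
}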

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Updated unit tests in vertex.test.ts simulate prompt caching scenarios.
  • Verified that token usage for cache writes and reads is correctly output.
  • Ran the full test suite locally with all tests passing.
  • Tested the extension locally
  • Verified the cost on vertex.ai

Checklist:

  • My code follows the patterns of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

Additional context

  • This feature is only active for models with supportsPromptCache enabled.
  • The changes reduce cost and latency by reusing cached prompt prefixes where possible.
  • Updated pricing details for caching are included in the shared API configuration.

[Screenshot: 2025-02-12 15:53:48]
[Screenshot: 2025-02-13 09:10:23]

Related Issues

N/A

Important

Add prompt caching support for Claude models in Vertex AI, updating configurations and tests to handle caching behavior.

  • Caching Implementation:
    • Add cache_control fields to user messages and system prompts in vertex.ts.
    • Track token usage for cache writes and reads during streaming.
  • Configuration Updates:
    • Enable prompt caching in vertexModels in api.ts for supported models.
    • Include pricing details for cache writes and reads.
  • Testing Enhancements:
    • Update vertex.test.ts to simulate and verify caching behavior.

This description was created by Ellipsis for 3ddd4c9. It will automatically update as commits are pushed.

changeset-bot commented Feb 12, 2025

⚠️ No Changeset found

Latest commit: ea38d9e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@aitoroses marked this pull request as ready for review on February 12, 2025 at 18:53
@aitoroses (Author) commented:

I'm currently seeing that tests are not passing on the upstream main branch, and I believe that is what's causing unit tests to fail here.
If I switch to the main branch they fail with this error:

...
issues: [
          {
            code: 'invalid_type',
            expected: 'object',
            received: 'array',
            path: [],
            message: 'Expected object, received array'
          }
        ],
        addIssue: [Function (anonymous)],
        addIssues: [Function (anonymous)],
        errors: [
          {
            code: 'invalid_type',
            expected: 'object',
            received: 'array',
            path: [],
            message: 'Expected object, received array'
          }
        ]
...

If I run them on my local branch they pass!

@lupuletic commented Feb 24, 2025

Is there a plan to get this merged @mrubens / @cte ?

Was looking at creating a separate PR for prompt caching for Vertex AI, but then found that this already exists

Thanks!

@mrubens (Collaborator) commented Feb 24, 2025

> Is there a plan to get this merged @mrubens / @cte ?
>
> Was looking at creating a separate PR for prompt caching for Vertex AI, but then found that this already exists
>
> Thanks!

Oh, I completely missed this one! I don't personally use vertex. @lupuletic if you or anyone has a minute to review I'm happy to ship it.

@mrubens (Collaborator) commented Feb 24, 2025

Thank you for opening this @aitoroses and very sorry I missed it!

@mrubens (Collaborator) commented Feb 24, 2025

Tests should be less flaky if you merge in main

// 2. Cache the most relevant context (usually at the end of the message)
const isLastTextBlock =
	contentIndex ===
	array.reduce((lastIndex, c, i) => (c.type === "text" ? i : lastIndex), -1)
@lupuletic commented Feb 24, 2025

For each item in the array, we seem to be running a reduce operation on the entire array to find the index of the last text block. This means if there are N items in the array, we're doing N full array traversals, which is O(N²) complexity.

// Current implementation (lines 118-132)
message.content.map((content, contentIndex, array) => {
    // Images and other non-text content are passed through unchanged
    if (content.type === "image") {
        return content as VertexImageBlock
    }
    // We only cache the last text block in each message to:
    // 1. Stay under the 4-block cache limit
    // 2. Cache the most relevant context (usually at the end of the message)
    const isLastTextBlock =
        contentIndex ===
        array.reduce((lastIndex, c, i) => (c.type === "text" ? i : lastIndex), -1)
    return {
        type: "text" as const,
        text: (content as { text: string }).text,
        ...(shouldCache && isLastTextBlock && { cache_control: { type: "ephemeral" } }),
    }
})

We can optimize this by calculating the last text block index once before the map operation, reducing the complexity to O(N):

	private formatMessageForCache(message: Anthropic.Messages.MessageParam, shouldCache: boolean): VertexMessage {
		// Assistant messages are kept as-is since they can't be cached
		if (message.role === "assistant") {
			return message as VertexMessage
		}
	
		// For string content, we convert to array format with optional cache control
		if (typeof message.content === "string") {
			return {
				...message,
				content: [
					{
						type: "text" as const,
						text: message.content,
						// For string content, we only have one block so it's always the last
						...(shouldCache && { cache_control: { type: "ephemeral" } }),
					},
				],
			}
		}
	
		// For array content, find the last text block index once before mapping
		const lastTextBlockIndex = message.content.reduce(
			(lastIndex: number, content: Anthropic.Messages.ContentBlock, index: number) => (content.type === "text" ? index : lastIndex),
			-1
		)
	
		// Then use this pre-calculated index in the map function
		return {
			...message,
			content: message.content.map((content: Anthropic.Messages.ContentBlock, contentIndex: number) => {
				// Images and other non-text content are passed through unchanged
				if (content.type === "image") {
					return content as VertexImageBlock
				}
				
				// Check if this is the last text block using our pre-calculated index
				const isLastTextBlock = contentIndex === lastTextBlockIndex
				
				return {
					type: "text" as const,
					text: (content as Anthropic.Messages.TextBlock).text,
					...(shouldCache && isLastTextBlock && { cache_control: { type: "ephemeral" } }),
				}
			}),
		}
	}

@aitoroses (Author) replied:

Let me try this!


thanks -- that seems to be working as expected!
[screenshot]


// Find indices of user messages that we want to cache
// We only cache the last two user messages to stay within the 4-block limit
// (1 block for system + 1 block each for last two user messages = 3 total)

@lupuletic commented:

Was wondering if we can reduce costs further by making use of the 4th cache block, but then I noticed the OpenRouter implementation does the same thing.

However, I don't have a good enough understanding to suggest what else would be beneficial to cache, so I'm leaving this here more as a question on whether there are additional caching opportunities or not (it could indeed increase costs if we only write and never read).

@aitoroses (Author) replied:

Hi @lupuletic! I originally tried using one more, but it gave errors, so it just considers 3 + 1 (system). That's sufficient to cache the big system prompt!
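
For context, a minimal sketch of the "last two user messages" selection described in the quoted comment above (the helper name and return shape are illustrative, not the PR's actual code):

import { Anthropic } from "@anthropic-ai/sdk"

// Collect the indices of all user messages, then keep only the last two so that,
// together with the cached system prompt block, the request stays within the
// 4-block cache limit.
function getLastTwoUserMessageIndices(messages: Anthropic.Messages.MessageParam[]): number[] {
	const userMsgIndices = messages.reduce<number[]>(
		(acc, msg, index) => (msg.role === "user" ? [...acc, index] : acc),
		[],
	)
	return userMsgIndices.slice(-2)
}

Only the messages at those indices would then get cache_control: { type: "ephemeral" } when the request is built.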

- Implemented comprehensive prompt caching strategy for Vertex AI models
- Added support for caching system prompts and user message text blocks
- Enhanced stream processing to handle cache-related usage metrics
- Updated model configurations to enable prompt caching
- Improved type definitions for Vertex AI message handling
@aitoroses force-pushed the feat/vertex-prompt-caching branch from 0135946 to 9b267e9 on February 25, 2025 at 14:30
@dosubot added the size:L label (This PR changes 100-499 lines, ignoring generated files) on Feb 25, 2025
@@ -435,41 +435,51 @@ export const vertexModels = {
 	contextWindow: 200_000,
 	supportsImages: true,
 	supportsComputerUse: true,
-	supportsPromptCache: false,
+	supportsPromptCache: true,

@lupuletic commented:

Can we please enable this for Claude Sonnet 3.7 as well? It's just a few lines above:
[screenshot]

Otherwise, LGTM, and thanks a lot for implementing this one @aitoroses!

@aitoroses (Author) replied:

I've just added caching for 3.7! @mrubens / @cte

@mrubens (Collaborator) commented Feb 27, 2025

Awesome! Will look today

@@ -441,7 +441,7 @@ export const vertexModels = {
 	contextWindow: 200_000,
 	supportsImages: true,
 	supportsComputerUse: true,
-	supportsPromptCache: false,
+	supportsPromptCache: true,
 	inputPrice: 3.0,
 	outputPrice: 15.0,

@lupuletic commented:

It also needs the cache writes and cache reads prices (they are the same as for Sonnet v2).
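
For reference, a hedged sketch of what the 3.7 entry could look like once cache pricing is added (the cacheWritesPrice / cacheReadsPrice field names and the 3.75 / 0.30 per-million-token figures mirror the Sonnet v2 entry, as suggested above; confirm against src/shared/api.ts before relying on them):

// Illustrative object only; the real entry lives in the vertexModels map.
const claude37SonnetVertexEntry = {
	contextWindow: 200_000,
	supportsImages: true,
	supportsComputerUse: true,
	supportsPromptCache: true,
	inputPrice: 3.0,
	outputPrice: 15.0,
	cacheWritesPrice: 3.75,
	cacheReadsPrice: 0.3,
} as const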

@lupuletic commented:

> Awesome! Will look today

Thanks a lot -- I'm using Vertex AI quite a bit, and I'm currently running a local installation of RooCode to benefit from the caching cost savings. I'm sure this will benefit more people, though, so I'm keen to get this released!

If preferred, I am also happy to spin up a new PR with all the new changes needed

Thanks again all!

@mrubens (Collaborator) left a comment:

Code looks good to me! @lupuletic which changes do you need pulled in? I can also just merge if you want to fast follow.

@dosubot added the lgtm label (This PR has been approved by a maintainer) on Feb 27, 2025
@lupuletic commented:

> Code looks good to me! @lupuletic which changes do you need pulled in? I can also just merge if you want to fast follow.

Yep, looks good to me too! It's just the cache costs for 3.7 missing, which I added in a follow-up PR here: #1244

@lupuletic left a comment:

LGTM

@cte merged commit 02c955e into RooCodeInc:main on Feb 27, 2025
11 checks passed
@cte (Collaborator) commented Feb 27, 2025

Thanks all!

refactorthis pushed a commit to refactorthis/Roo-Code that referenced this pull request Mar 2, 2025
Labels
lgtm (This PR has been approved by a maintainer), size:L (This PR changes 100-499 lines, ignoring generated files)

4 participants