Fix: Support Chat Template Tokenization with vLLM Parameters in Prefix Cache Router #2002
penfree wants to merge 4 commits into vllm-project:main from
Conversation
…x cache router

This commit fixes two issues that caused prefix cache mismatches between Gateway and vLLM Pods:

1. ChatMessage Content serialization: Changed Content field from string to json.RawMessage to preserve structured multimodal content (images, audio) without double JSON encoding.
2. Chat template tokenization: Updated kvSyncPrefixCacheRouter to use TokenizeWithOptions with proper chat template parameters for /v1/chat/completions endpoints, matching vLLM's tokenization behavior.

Changes:
- Add ChatCompletionRequest struct with vLLM-specific parameters (add_generation_prompt, add_special_tokens, return_token_strs)
- Update ChatMessage.Content to json.RawMessage for multimodal support
- Add buildTokenizeInputFromChatRequest helper for converting requests
- Implement chat template tokenization in Route method with fallback
- Export IntToByteArray for external use
- Add comprehensive tests for ChatMessage serialization

Benefits:
- Correct prefix cache matching in kv-sync mode
- Support for multimodal chat content (images, audio)
- Backward compatible with existing endpoints
- Proper handling of vLLM chat template parameters

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change-Id: I8144268509d4311bc7483bd7e7a7f5088507e429
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: I37280f0bdfba67e5fc03f7dc842059863ae4bc78
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Gateway's ability to interact with vLLM Pods, particularly for chat completion requests and multimodal content. By resolving issues related to double JSON serialization and the lack of chat template support, it ensures consistent tokenization between the Gateway and vLLM, leading to a higher prefix cache hit rate and improved performance in kv-sync mode. The changes also introduce support for vLLM-specific chat parameters and robust error handling.

Highlights
Code Review
This pull request effectively addresses prefix cache mismatches by integrating chat template tokenization and correcting a JSON serialization issue with multimodal content. The changes are well-structured and include appropriate tests. My review includes a few suggestions to enhance code efficiency, readability, and maintainability by refactoring an inefficient JSON handling pattern, simplifying a complex conditional block, and removing a redundant function alias.
This commit includes several refactoring improvements based on code review:

1. Extract chat tokenization logic into separate method
   - Add tokenizeChatRequest() method to reduce nesting in Route()
   - Eliminate "pyramid of doom" pattern for better readability
   - All error handling and logging encapsulated in the helper

2. Simplify message content extraction
   - Directly marshal msg.GetContent() instead of double marshal/unmarshal
   - Remove unnecessary intermediate map creation
   - More efficient and clearer code

3. Remove redundant intToByteArray alias
   - Update all callers to use exported IntToByteArray directly
   - Simplify utils.go by removing the internal alias

Benefits:
- Improved code readability and maintainability
- Better performance (fewer marshal/unmarshal operations)
- Cleaner separation of concerns

Testing: All tokenizer tests pass, including vLLM serialization tests.

Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: Id2b356d946db35b25d6c20493510de85e2b893d9
- Add error checking for json.Unmarshal calls to fix errcheck lint errors
- Replace interface{} with any for Go 1.18+ compatibility
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: Ic742890ad4675fdd1a920d36aacb748e356d1735
Problem
There were two critical issues causing prefix cache mismatches between Gateway and vLLM Pods in kv-sync mode:
1. Double JSON Serialization Issue: `ChatMessage.Content` was defined as `string`, causing structured multimodal content (images, audio) to be double-encoded when sent to the vLLM tokenization API. This resulted in the Gateway tokenizing differently than vLLM Pods.

2. Missing Chat Template Support: The `kvSyncPrefixCacheRouter` used simple text tokenization (`TokenizeInputText`) for all requests, instead of applying chat templates for `/v1/chat/completions` endpoints. This caused prefix cache misses because the Gateway wasn't tokenizing messages the same way vLLM does with its chat template.

Example of the Issue
Before this fix, a chat message like `{"role": "user", "content": [{"type": "text", "text": "Hello"}, {"type": "image_url", ...}]}` had its content forwarded to the tokenizer as the double-encoded string `"[{\"type\":\"text\",\"text\":\"Hello\"}...]"`. After this fix, the content is preserved as the structured array `[{"type": "text", "text": "Hello"}...]`, matching what vLLM expects.
Solution
1. Define vLLM-Compatible Request Structure
File: `pkg/types/request.go` (NEW)

Created `ChatCompletionRequest`, which embeds `openai.ChatCompletionNewParams` and adds vLLM-specific parameters:

- `add_generation_prompt` - Controls whether to add the generation prompt to the chat template (default: true)
- `add_special_tokens` - Controls whether to add special tokens on top of the chat template (default: false)
- `return_token_strs` - Returns token strings for debugging (default: false)

These parameters match vLLM's chat completion protocol.
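A rough sketch of the struct; the embedded `openai.ChatCompletionNewParams` and the JSON tags come from the description above, while the Go field names and pointer types are assumptions:

```go
package types

import (
	"github.com/openai/openai-go"
)

// ChatCompletionRequest wraps the standard OpenAI chat completion request and
// carries the extra chat-template parameters that vLLM accepts.
type ChatCompletionRequest struct {
	openai.ChatCompletionNewParams

	// AddGenerationPrompt controls whether the generation prompt is appended
	// by the chat template (vLLM default: true).
	AddGenerationPrompt *bool `json:"add_generation_prompt,omitempty"`

	// AddSpecialTokens controls whether special tokens are added on top of
	// the chat template output (vLLM default: false).
	AddSpecialTokens *bool `json:"add_special_tokens,omitempty"`

	// ReturnTokenStrs asks vLLM to return token strings for debugging
	// (vLLM default: false).
	ReturnTokenStrs *bool `json:"return_token_strs,omitempty"`
}
```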
2. Fix ChatMessage Content Type
File: `pkg/utils/tokenizer/types.go`

Changed `ChatMessage.Content` from `string` to `json.RawMessage`. This preserves the original JSON structure without double-encoding, supporting:

- plain string content: `"Hello world"`
- structured multimodal content: `[{"type": "text", ...}, {"type": "image_url", ...}]`
3. Implement Chat Template Tokenization in Router

File: `pkg/plugins/gateway/algorithms/prefix_cache.go`

Added a `buildTokenizeInputFromChatRequest()` helper that converts chat completion requests into tokenizer input.

Updated `kvSyncPrefixCacheRouter.Route()` to:

- detect the endpoint (`/v1/chat/completions` vs `/v1/completions`)
- parse the request body into `ChatCompletionRequest`
- build a `TokenizeInput` with chat template parameters
- call `TokenizeWithOptions` with the `ChatInput` type
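A simplified sketch of this control flow, including the fallback behavior described in the commit message; the type and function shapes here are stand-ins, not the actual `prefix_cache.go` code:

```go
package sketch

import (
	"encoding/json"
	"errors"
)

// Hypothetical stand-ins for the PR's real types; the JSON tags come from the
// description above, the Go shapes are assumptions for this sketch.
type chatCompletionRequest struct {
	Messages            []chatMessage `json:"messages"`
	AddGenerationPrompt *bool         `json:"add_generation_prompt,omitempty"`
	AddSpecialTokens    *bool         `json:"add_special_tokens,omitempty"`
}

type chatMessage struct {
	Role    string          `json:"role"`
	Content json.RawMessage `json:"content"`
}

// tokenizeForPrefixCache sketches the routing decision: /v1/chat/completions
// requests go through chat-template tokenization; anything else, or any
// failure on the chat path, falls back to the existing plain-text path so the
// router keeps working.
func tokenizeForPrefixCache(path string, body []byte) []int {
	if path == "/v1/chat/completions" {
		var req chatCompletionRequest
		if err := json.Unmarshal(body, &req); err == nil {
			if tokens, err := tokenizeWithChatTemplate(req); err == nil {
				return tokens
			}
		}
	}
	return tokenizePlainText(body)
}

// Placeholders for the remote tokenizer calls made by the real router.
func tokenizeWithChatTemplate(req chatCompletionRequest) ([]int, error) {
	return nil, errors.New("not implemented in this sketch")
}

func tokenizePlainText(body []byte) []int { return nil }
```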
4. Export Helper Functions

File: `pkg/utils/tokenizer/utils.go`

Exported the `IntToByteArray` function for converting token IDs to the byte array format needed by the prefix cache indexer.
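For reference, one plausible shape for such a helper; the 4-byte little-endian layout is an assumption, and the actual `IntToByteArray` in `utils.go` may use a different width or byte order:

```go
package tokenizer

import "encoding/binary"

// IntToByteArray (sketch) flattens token IDs into a byte slice so they can be
// hashed by the prefix cache indexer.
func IntToByteArray(tokens []int) []byte {
	buf := make([]byte, 0, len(tokens)*4)
	for _, t := range tokens {
		buf = binary.LittleEndian.AppendUint32(buf, uint32(t))
	}
	return buf
}
```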
Testing

Unit Tests Added
- `types_test.go`: Tests for ChatMessage serialization/deserialization
- `serialization_test.go`: Tests for sonic/JSON compatibility, verifying that `sonic.Marshal` correctly handles `json.RawMessage`
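An illustrative sonic-compatibility test in the spirit of `serialization_test.go` (not the PR's actual test; it assumes the `ChatMessage` shape sketched above):

```go
package tokenizer

import (
	"encoding/json"
	"testing"

	"github.com/bytedance/sonic"
)

// Checks that sonic round-trips a json.RawMessage Content the same way the
// standard library does, so Gateway and vLLM see identical message bytes.
func TestSonicPreservesRawContent(t *testing.T) {
	msg := ChatMessage{
		Role:    "user",
		Content: json.RawMessage(`[{"type":"text","text":"Hello"}]`),
	}

	got, err := sonic.Marshal(msg)
	if err != nil {
		t.Fatalf("sonic.Marshal: %v", err)
	}
	want, _ := json.Marshal(msg)
	if string(got) != string(want) {
		t.Fatalf("sonic output %s differs from encoding/json output %s", got, want)
	}
}
```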
Compatibility
Supported Engines
Backward Compatibility
- `/v1/completions` endpoints work unchanged
- `ExtendedTokenizer` interface remains compatible

Benefits
Improved Prefix Cache Hit Rate: Gateway now tokenizes chat messages the same way as vLLM Pods, significantly improving cache hits in kv-sync mode
Multimodal Content Support: Properly handles images, audio, and other structured content in chat messages
vLLM Parameter Compatibility: Gateway can now parse and respect vLLM-specific chat template parameters
Better Debugging: Logs detailed tokenization information at V(4) level
Robust Error Handling: Multiple fallback layers ensure the router continues working even if chat tokenization fails
Files Changed
- `pkg/types/request.go`
- `pkg/utils/tokenizer/types.go`
- `pkg/utils/tokenizer/utils.go`
- `pkg/plugins/gateway/algorithms/prefix_cache.go`
- `pkg/utils/tokenizer/types_test.go`
- `pkg/utils/tokenizer/serialization_test.go`

Total: 6 files changed, 274 insertions(+), 12 deletions(-)
Verification Steps
To verify this fix works correctly:
Compare token IDs from Gateway tokenization with vLLM tokenization API:
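For example, a small Go program (Pod address and model name are placeholders) that asks a vLLM Pod's `/tokenize` endpoint for the token IDs of a chat request; compare its output with the token IDs the Gateway logs at V(4) for the same request:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder address; point this at the vLLM Pod you want to compare against.
	const podTokenizeURL = "http://localhost:8000/tokenize"

	body := []byte(`{
	  "model": "my-model",
	  "messages": [{"role": "user", "content": "Hello"}],
	  "add_generation_prompt": true,
	  "add_special_tokens": false
	}`)

	resp, err := http.Post(podTokenizeURL, "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	// vLLM returns the token IDs it computed, e.g. {"count":...,"tokens":[...]}.
	fmt.Println(string(out))
}
```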
Related Issues
This fix addresses the root cause of prefix cache mismatches in kv-sync mode when:
Breaking Changes
None - This is a backward-compatible enhancement. All existing functionality remains unchanged.