Skip to content

Fix: Support Chat Template Tokenization with vLLM Parameters in Prefix Cache Router#2002

Open
penfree wants to merge 4 commits intovllm-project:mainfrom
penfree:fix/chat-tokenization
Open

Fix: Support Chat Template Tokenization with vLLM Parameters in Prefix Cache Router#2002
penfree wants to merge 4 commits intovllm-project:mainfrom
penfree:fix/chat-tokenization

Conversation

@penfree
Copy link
Contributor

@penfree penfree commented Mar 13, 2026

Problem

There were two critical issues causing prefix cache mismatches between Gateway and vLLM Pods in kv-sync mode:

  1. Double JSON Serialization Issue: ChatMessage.Content was defined as string, causing structured multimodal content (images, audio) to be double-encoded when sent to vLLM tokenization API. This resulted in Gateway tokenizing differently than vLLM Pods.

  2. Missing Chat Template Support: The kvSyncPrefixCacheRouter used simple text tokenization (TokenizeInputText) for all requests, instead of applying chat templates for /v1/chat/completions endpoints. This caused prefix cache misses because Gateway wasn't tokenizing messages the same way vLLM does with its chat template.

Example of the Issue

Before this fix:

  • Client sends: {"role": "user", "content": [{"type": "text", "text": "Hello"}, {"type": "image_url", ...}]}
  • Gateway tokenizes: "[{\"type\":\"text\",\"text\":\"Hello\"}...]" (double-encoded string)
  • vLLM Pod tokenizes: [{"type": "text", "text": "Hello"}...] (structured array)
  • Result: Prefix cache miss

After this fix:

  • Both Gateway and vLLM Pod tokenize the same way with chat template
  • Result: Prefix cache hit

Solution

1. Define vLLM-Compatible Request Structure

File: pkg/types/request.go (NEW)

Created ChatCompletionRequest that embeds openai.ChatCompletionNewParams and adds vLLM-specific parameters:

  • add_generation_prompt - Controls whether to add generation prompt to chat template (default: true)
  • add_special_tokens - Controls whether to add special tokens on top of chat template (default: false)
  • return_token_strs - Returns token strings for debugging (default: false)

These parameters match vLLM's chat completion protocol.

2. Fix ChatMessage Content Type

File: pkg/utils/tokenizer/types.go

Changed ChatMessage.Content from string to json.RawMessage:

type ChatMessage struct {
    Role    string          `json:"role"`
    Content json.RawMessage `json:"content"` // Now supports string or structured array
}

This preserves the original JSON structure without double-encoding, supporting:

  • Simple text: "Hello world"
  • Multimodal arrays: [{"type": "text", ...}, {"type": "image_url", ...}]

3. Implement Chat Template Tokenization in Router

File: pkg/plugins/gateway/algorithms/prefix_cache.go

Added buildTokenizeInputFromChatRequest() helper that:

  1. Converts OpenAI message format to tokenizer format
  2. Preserves multimodal content structure
  3. Extracts vLLM-specific parameters with correct defaults

Updated kvSyncPrefixCacheRouter.Route() to:

  1. Check endpoint type (/v1/chat/completions vs /v1/completions)
  2. For chat endpoints:
    • Parse request as ChatCompletionRequest
    • Build TokenizeInput with chat template parameters
    • Use TokenizeWithOptions with ChatInput type
  3. For other endpoints: Keep existing text tokenization
  4. Implement comprehensive fallback mechanism

4. Export Helper Functions

File: pkg/utils/tokenizer/utils.go

Exported IntToByteArray function for converting token IDs to byte array format needed by prefix cache indexer.

Testing

Unit Tests Added

  1. types_test.go: Tests for ChatMessage serialization/deserialization

    • Simple text content
    • Multimodal array content
    • Round-trip preservation
  2. serialization_test.go: Tests for sonic/JSON compatibility

    • Verifies sonic.Marshal correctly handles json.RawMessage
    • Validates vLLM request format matches expected structure
    • Confirms multimodal content is not double-encoded

Test Results

$ go test ./pkg/utils/tokenizer/...
ok      github.com/vllm-project/aibrix/pkg/utils/tokenizer    4.536s

$ go test -v ./pkg/utils/tokenizer -run TestVLLMRequestSerialization
Serialized request:
{"messages":[{"role":"user","content":[{"type":"text","text":"test"},
{"type":"image_url","image_url":{"url":"http://example.com/img.jpg"}}]}],
"add_special_tokens":false,"add_generation_prompt":true}
--- PASS: TestVLLMRequestSerialization (0.00s)

Compatibility

Supported Engines

  • vLLM: Fully compatible, enhanced with chat template support
  • SGLang: Fully compatible (doesn't use tokenization API, no impact)

Backward Compatibility

  • ✅ Existing /v1/completions endpoints work unchanged
  • ✅ Fallback to text tokenization if chat template fails
  • ✅ Compatible with tokenizers that don't support ExtendedTokenizer
  • ✅ Default parameter values match vLLM behavior

Benefits

  1. Improved Prefix Cache Hit Rate: Gateway now tokenizes chat messages the same way as vLLM Pods, significantly improving cache hits in kv-sync mode

  2. Multimodal Content Support: Properly handles images, audio, and other structured content in chat messages

  3. vLLM Parameter Compatibility: Gateway can now parse and respect vLLM-specific chat template parameters

  4. Better Debugging: Logs detailed tokenization information at V(4) level

  5. Robust Error Handling: Multiple fallback layers ensure the router continues working even if chat tokenization fails

Files Changed

File Status Changes
pkg/types/request.go NEW +45
pkg/utils/tokenizer/types.go Modified ChatMessage.Content → json.RawMessage
pkg/utils/tokenizer/utils.go Modified Export IntToByteArray
pkg/plugins/gateway/algorithms/prefix_cache.go Modified +112
pkg/utils/tokenizer/types_test.go NEW +106
pkg/utils/tokenizer/serialization_test.go NEW +135

Total: 6 files changed, 274 insertions(+), 12 deletions(-)

Verification Steps

To verify this fix works correctly:

  1. Test simple chat completion:
curl -X POST http://gateway/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "test-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
  1. Test multimodal content:
curl -X POST http://gateway/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "test-model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://..."}}
      ]
    }]
  }'
  1. Test with vLLM parameters:
curl -X POST http://gateway/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "test-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "add_generation_prompt": true,
    "add_special_tokens": false
  }'
  1. Verify tokenization consistency:
    Compare token IDs from Gateway tokenization with vLLM tokenization API:
# Gateway logs should show:
# tokenized using chat template: message_count=1, token_count=X, add_generation_prompt=true

Related Issues

This fix addresses the root cause of prefix cache mismatches in kv-sync mode when:

  • Using chat completion endpoints
  • Sending multimodal content (images, audio)
  • vLLM chat template parameters differ from defaults

Breaking Changes

None - This is a backward-compatible enhancement. All existing functionality remains unchanged.

pengfei913 and others added 2 commits March 13, 2026 17:04
…x cache router

This commit fixes two issues that caused prefix cache mismatches between
Gateway and vLLM Pods:

1. ChatMessage Content serialization: Changed Content field from string to
   json.RawMessage to preserve structured multimodal content (images, audio)
   without double JSON encoding.

2. Chat template tokenization: Updated kvSyncPrefixCacheRouter to use
   TokenizeWithOptions with proper chat template parameters for
   /v1/chat/completions endpoints, matching vLLM's tokenization behavior.

Changes:
- Add ChatCompletionRequest struct with vLLM-specific parameters
  (add_generation_prompt, add_special_tokens, return_token_strs)
- Update ChatMessage.Content to json.RawMessage for multimodal support
- Add buildTokenizeInputFromChatRequest helper for converting requests
- Implement chat template tokenization in Route method with fallback
- Export IntToByteArray for external use
- Add comprehensive tests for ChatMessage serialization

Benefits:
- Correct prefix cache matching in kv-sync mode
- Support for multimodal chat content (images, audio)
- Backward compatible with existing endpoints
- Proper handling of vLLM chat template parameters

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change-Id: I8144268509d4311bc7483bd7e7a7f5088507e429
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: I37280f0bdfba67e5fc03f7dc842059863ae4bc78
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Gateway's ability to interact with vLLM Pods, particularly for chat completion requests and multimodal content. By resolving issues related to double JSON serialization and the lack of chat template support, it ensures consistent tokenization between the Gateway and vLLM, leading to a higher prefix cache hit rate and improved performance in kv-sync mode. The changes also introduce support for vLLM-specific chat parameters and robust error handling.

Highlights

  • Resolved Prefix Cache Mismatches: Addressed critical issues causing prefix cache mismatches between Gateway and vLLM Pods in kv-sync mode by ensuring consistent tokenization.
  • Improved Multimodal Content Handling: Fixed double JSON serialization of ChatMessage.Content by changing its type to json.RawMessage, which correctly preserves structured multimodal content.
  • Enabled Chat Template Tokenization: Implemented chat template support in kvSyncPrefixCacheRouter for /v1/chat/completions endpoints, aligning Gateway's tokenization with vLLM's behavior.
  • Introduced vLLM-Specific Parameters: Defined a new ChatCompletionRequest struct to support vLLM-specific parameters like add_generation_prompt and add_special_tokens.
  • Enhanced Tokenizer Interface: Exported the IntToByteArray function and made the ExtendedTokenizer interface public to support advanced tokenization features.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • pkg/plugins/gateway/algorithms/prefix_cache.go
    • Imported encoding/json.
    • Added buildTokenizeInputFromChatRequest function to convert ChatCompletionRequest to TokenizeInput while preserving multimodal content and vLLM parameters.
    • Modified the Route method to conditionally apply chat template tokenization for /v1/chat/completions endpoints using ExtendedTokenizer and TokenizeWithOptions, with a fallback to text tokenization.
  • pkg/types/request.go
    • Added a new file defining the ChatCompletionRequest struct, which extends openai.ChatCompletionNewParams with vLLM-specific fields like AddGenerationPrompt, AddSpecialTokens, and ReturnTokenStrings.
  • pkg/utils/tokenizer/interfaces.go
    • Renamed the extendedTokenizer interface to ExtendedTokenizer (making it public) to allow external use.
    • Updated remoteTokenizer to embed the new ExtendedTokenizer.
  • pkg/utils/tokenizer/serialization_test.go
    • Added a new test file containing TestSonicSerializationWithRawMessage to verify sonic.Marshal correctly handles json.RawMessage for ChatMessage content.
    • Added TestVLLMRequestSerialization to confirm the vLLM request format matches expectations, especially for multimodal content.
  • pkg/utils/tokenizer/types.go
    • Imported encoding/json.
    • Changed the Content field of the ChatMessage struct from string to json.RawMessage to support both simple text and structured multimodal content without double-encoding.
  • pkg/utils/tokenizer/types_test.go
    • Added a new test file with TestChatMessageSerialization to verify correct serialization and deserialization of ChatMessage with both simple text and multimodal content.
    • Added TestChatMessageContentPreservation to ensure the structure of multimodal content is preserved during JSON round-trips.
  • pkg/utils/tokenizer/utils.go
    • Renamed intToByteArray to IntToByteArray (exported) and added an internal alias intToByteArray for backward compatibility.
Activity
  • Unit tests were added for ChatMessage serialization/deserialization, covering simple text and multimodal array content, including round-trip preservation.
  • Serialization tests were introduced to verify sonic.Marshal compatibility with json.RawMessage and validate the vLLM request format.
  • Verification steps were provided, including curl commands to test simple chat completion, multimodal content, and vLLM parameters, along with instructions to check Gateway logs for tokenization consistency.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses prefix cache mismatches by integrating chat template tokenization and correcting a JSON serialization issue with multimodal content. The changes are well-structured and include appropriate tests. My review includes a few suggestions to enhance code efficiency, readability, and maintainability by refactoring an inefficient JSON handling pattern, simplifying a complex conditional block, and removing a redundant function alias.

This commit includes several refactoring improvements based on code review:

1. Extract chat tokenization logic into separate method
   - Add tokenizeChatRequest() method to reduce nesting in Route()
   - Eliminate "pyramid of doom" pattern for better readability
   - All error handling and logging encapsulated in the helper

2. Simplify message content extraction
   - Directly marshal msg.GetContent() instead of double marshal/unmarshal
   - Remove unnecessary intermediate map creation
   - More efficient and clearer code

3. Remove redundant intToByteArray alias
   - Update all callers to use exported IntToByteArray directly
   - Simplify utils.go by removing the internal alias

Benefits:
- Improved code readability and maintainability
- Better performance (fewer marshal/unmarshal operations)
- Cleaner separation of concerns

Testing: All tokenizer tests pass, including vLLM serialization tests.
Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: Id2b356d946db35b25d6c20493510de85e2b893d9
- Add error checking for json.Unmarshal calls to fix errcheck lint errors
- Replace interface{} with any for Go 1.18+ compatibility

Signed-off-by: penfree <pengfei.qiu@gmail.com>
Change-Id: Ic742890ad4675fdd1a920d36aacb748e356d1735
@penfree penfree marked this pull request as ready for review March 14, 2026 02:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants