[model-gateway] Optimize L1 cache insertion with incremental hashing and tokenization #16259
base: main
Conversation
Summary of Changes
Hello @ppraneth, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the performance of the L1 cache within the model gateway, particularly for long, multi-turn conversations. The previous implementation suffered from a quadratic scaling issue where re-hashing and re-tokenizing entire prefixes led to disproportionate time increases. By introducing incremental hashing and tokenization techniques, the system now scales linearly, drastically improving insertion efficiency and preventing performance degradation as conversation lengths grow, while maintaining acceptable lookup speeds.
Highlights
Code Review
This pull request introduces a significant performance optimization for the L1 cache by refactoring the insertion logic to use incremental hashing and tokenization. The move from a quadratic to a linear scaling complexity for cache insertion is a fantastic improvement, as demonstrated by the comprehensive benchmarks. The code is well-structured and the changes are clearly explained. However, I've found a critical issue in the implementation of the incremental hashing in both longest_prefix_match and insert_at_boundaries. The blake3::Hasher::finalize() method consumes the hasher, which will cause a compile error due to use-after-move in the loops. I've provided suggestions to fix this by cloning the hasher before finalizing. Once this is addressed, this will be an excellent contribution.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@slin1237 Can you review this PR?
Motivation
The current L1 cache implementation suffers from a performance bottleneck when handling long, multi-turn conversations. The original design re-hashed and re-tokenized the entire prefix for every special token boundary found in the prompt.
This resulted in quadratic O(N²) scaling behavior: as a conversation grows longer, the time spent inserting into the cache increases disproportionately to the length of the text. This PR refactors the logic to achieve linear O(N) scaling, ensuring the gateway remains efficient for workloads with large context windows or frequent turn-based updates.
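For intuition on the scaling claim, a rough back-of-envelope (illustrative numbers, not measurements from this PR): with $B$ boundaries of roughly $s$ bytes each, re-hashing every prefix from scratch touches about

$$s \cdot (1 + 2 + \dots + B) = s \cdot \frac{B(B+1)}{2} \ \text{bytes},$$

whereas feeding only the per-boundary deltas touches $s \cdot B$ bytes. At 100 turns of about 1 KB each, that is roughly 5 MB of hashing for a single insertion versus about 100 KB.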
Modifications
The following changes were made to the L1 cache logic in src/tokenizer/cache/l1.rs:
- Incremental hashing with blake3::Hasher: instead of re-hashing the full prefix at every special-token boundary, the hasher is updated incrementally with only the "delta" text between boundaries, allowing all prefix hashes to be computed in a single linear pass (see the sketch below).
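To make the incremental-hashing idea concrete, here is a minimal, self-contained sketch. It is not the actual l1.rs code: the boundary offsets are assumed to be already known, and the function name is hypothetical. It only shows how a single running blake3::Hasher can yield one hash per prefix in a single linear pass.

```rust
// Illustrative only; depends on the `blake3` crate ([dependencies] blake3 = "1").
use blake3::Hasher;

/// Compute a hash for every boundary-delimited prefix of `text` in one linear
/// pass, feeding the hasher only the "delta" slice between boundaries instead
/// of re-hashing each prefix from scratch (the source of the old O(N^2) cost).
/// `boundaries` are assumed to be valid UTF-8 byte offsets in ascending order.
fn prefix_hashes_at_boundaries(text: &str, boundaries: &[usize]) -> Vec<blake3::Hash> {
    let mut hasher = Hasher::new();
    let mut prev = 0;
    let mut hashes = Vec::with_capacity(boundaries.len());
    for &end in boundaries {
        // Feed only the new text since the previous boundary.
        hasher.update(text[prev..end].as_bytes());
        prev = end;
        // Finalize a snapshot of the running state; the original hasher keeps
        // accumulating deltas on the next iteration (clone-before-finalize,
        // as suggested in the review above).
        hashes.push(hasher.clone().finalize());
    }
    hashes
}

fn main() {
    let text = "<|user|>hi<|assistant|>hello<|user|>how are you?";
    // Pretend these byte offsets are the special-token boundaries found by the tokenizer.
    let boundaries = [10, 28, text.len()];
    for (i, hash) in prefix_hashes_at_boundaries(text, &boundaries).iter().enumerate() {
        println!("prefix ending at byte {} -> {}", boundaries[i], hash.to_hex());
    }
}
```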
Accuracy Tests
Benchmarking and Profiling
Benchmarks performed with Criterion on a simulated ChatML prompt show a massive improvement in insertion scalability, accompanied by a minor and acceptable tradeoff in lookup latency.
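The benchmark harness itself is not reproduced in this description, so the following is only a hedged sketch of how a Criterion scaling benchmark over a simulated ChatML prompt might be structured. The prompt builder, benchmark names, and the blake3 stand-in workload are illustrative placeholders, not the benchmarks shipped with this PR.

```rust
// Illustrative Criterion harness (e.g. benches/l1_insert.rs); requires `criterion`
// and `blake3` as dev-dependencies and `harness = false` for the bench target.
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

/// Build a multi-turn ChatML-style prompt so cost can be measured as the
/// conversation grows.
fn simulated_chatml_prompt(turns: usize) -> String {
    (0..turns)
        .map(|i| {
            format!(
                "<|im_start|>user\nmessage {i}<|im_end|>\n<|im_start|>assistant\nreply {i}<|im_end|>\n"
            )
        })
        .collect()
}

fn bench_insert_scaling(c: &mut Criterion) {
    let mut group = c.benchmark_group("l1_cache_insert");
    for turns in [8usize, 32, 128] {
        let prompt = simulated_chatml_prompt(turns);
        group.bench_with_input(BenchmarkId::from_parameter(turns), &prompt, |b, p| {
            // Placeholder workload standing in for the real cache-insertion path.
            b.iter(|| std::hint::black_box(blake3::hash(p.as_bytes())))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_insert_scaling);
criterion_main!(benches);
```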
Insertion Performance (Scaling Fix)
The optimization successfully eliminated the scaling bottleneck: insertion time now grows roughly linearly with prompt length instead of quadratically as the conversation grows.
Lookup Tradeoff
There is a noted regression in longest_prefix_match latency: to support incremental hashing, the lookup now performs a full forward pass that pre-computes hashes for all boundaries before checking the cache shards (see the sketch below).
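To illustrate where that extra cost comes from, here is a hypothetical sketch of the lookup shape described above: all boundary hashes are pre-computed in one forward pass, then the shards are probed from the longest prefix down. The function name, the shard_contains closure, and the return value are placeholders, not the real l1.rs API.

```rust
use blake3::Hasher;

/// Hypothetical lookup sketch: pre-compute the hash of every boundary prefix
/// in a single forward pass, then probe the cache from longest to shortest.
/// `shard_contains` stands in for the real shard probe.
fn longest_prefix_match_sketch(
    text: &str,
    boundaries: &[usize],
    shard_contains: impl Fn(&blake3::Hash) -> bool,
) -> Option<usize> {
    // Forward pass: this is the extra up-front work the lookup now pays.
    let mut hasher = Hasher::new();
    let mut prev = 0;
    let mut hashes = Vec::with_capacity(boundaries.len());
    for &end in boundaries {
        hasher.update(text[prev..end].as_bytes());
        prev = end;
        hashes.push(hasher.clone().finalize());
    }
    // Probe from the longest prefix down; return its byte length on a hit.
    for (i, hash) in hashes.iter().enumerate().rev() {
        if shard_contains(hash) {
            return Some(boundaries[i]);
        }
    }
    None
}

fn main() {
    let text = "<|user|>hi<|assistant|>hello";
    // Dummy shard probe: pretend nothing is cached yet.
    let hit = longest_prefix_match_sketch(text, &[10, 28], |_h| false);
    println!("longest cached prefix: {:?}", hit);
}
```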
Overall result: while lookup is slightly slower, it remains sub-millisecond and is a negligible cost compared to model inference steps. The ~20x performance gain during insertion ensures the system no longer faces a "performance cliff" as chat turn counts grow.
Checklist
Review Process
Use the CI commands (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.