
Conversation

Contributor

@ppraneth ppraneth commented Jan 1, 2026

Motivation

The current L1 cache implementation suffers from a performance bottleneck when handling long, multi-turn conversations. The original design re-hashed and re-tokenized the entire prefix for every special token boundary found in the prompt.

This resulted in quadratic O(N²) scaling behavior: as a conversation grows longer, the time spent inserting into the cache grows with the square of the prompt length rather than in proportion to it. This PR refactors the logic to achieve linear O(N) scaling, ensuring the gateway remains efficient for workloads with large context windows or frequent turn-based updates.

Modifications

The following changes were made to the L1 cache logic in src/tokenizer/cache/l1.rs:

  • Incremental Hashing: Full-string hashing at every boundary was replaced with a forward pass using blake3::Hasher. The hasher is updated incrementally with only the "delta" text between boundaries, allowing all prefix hashes to be computed in a single linear pass (see the sketch after this list).
  • Incremental Tokenization: The insertion logic was updated to tokenize only the new text segment between two safe boundaries. These tokens are appended to a running list, leveraging the fact that special tokens are atomic split points.
  • BOS Handling: Logic was adjusted to ensure special tokens (such as the BOS token) are only added during the tokenization of the very first segment. This prevents the cache from incorrectly duplicating BOS tokens in every cached prefix entry.
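
The sketch below illustrates the shape of this single forward pass. It is not the actual code in src/tokenizer/cache/l1.rs: the `boundaries` slice of byte offsets and the `encode` closure are placeholders for the real boundary detection and tokenizer calls.

```rust
use blake3::Hasher;

/// Sketch only: computes a (prefix_hash, prefix_tokens) entry for every safe
/// boundary of `prompt` in a single linear pass.
fn insert_prefixes(
    prompt: &str,
    boundaries: &[usize],                    // byte offsets of the safe split points
    encode: impl Fn(&str, bool) -> Vec<u32>, // (text, add_special_tokens) -> token ids
) -> Vec<(blake3::Hash, Vec<u32>)> {
    let mut hasher = Hasher::new();
    let mut tokens: Vec<u32> = Vec::new();
    let mut prev = 0;
    let mut entries = Vec::with_capacity(boundaries.len());

    for (i, &end) in boundaries.iter().enumerate() {
        let delta = &prompt[prev..end];
        // Feed only the new text; the hasher already holds everything before `prev`.
        hasher.update(delta.as_bytes());
        // Tokenize only the delta; special tokens such as BOS are added for the
        // first segment only, so cached prefixes never duplicate them.
        tokens.extend(encode(delta, i == 0));
        // Snapshot the running hash for this prefix, leaving the hasher free to
        // accept the next delta.
        let prefix_hash = hasher.clone().finalize();
        entries.push((prefix_hash, tokens.clone()));
        prev = end;
    }
    entries
}
```

Each entry can then be inserted into its cache shard keyed by the prefix hash; no text before the previous boundary is hashed or tokenized more than once.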

Accuracy Tests

Benchmarking and Profiling

Benchmarks performed with Criterion on a simulated ChatML prompt show a massive improvement in insertion scalability, accompanied by a minor and acceptable tradeoff in lookup latency.

Insertion Performance (Scaling Fix)

The optimization successfully eliminated the scaling bottleneck:

| Turns | Baseline (Original) | Optimized | Speedup |
|---|---|---|---|
| 2 Turns | 60.37 µs | 58.13 µs | ~1.04x |
| 10 Turns | 327.91 µs | 91.95 µs | ~3.5x |
| 50 Turns | 5.82 ms | 283.43 µs | ~20.5x |
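
For context on the harness style, here is a sketch of how a Criterion benchmark over increasing turn counts can be structured. It is not the benchmark that produced the numbers above: `chatml_prompt` is a stand-in prompt builder, and the `black_box(p.len())` call marks where the real cache-insertion call would go.

```rust
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

// Build a simulated ChatML conversation with the given number of turns.
fn chatml_prompt(turns: usize) -> String {
    (0..turns)
        .map(|i| {
            format!(
                "<|im_start|>user\nquestion {i}<|im_end|>\n<|im_start|>assistant\nanswer {i}<|im_end|>\n"
            )
        })
        .collect()
}

fn bench_insert(c: &mut Criterion) {
    let mut group = c.benchmark_group("l1_cache_insert");
    for turns in [2usize, 10, 50] {
        let prompt = chatml_prompt(turns);
        group.bench_with_input(BenchmarkId::from_parameter(turns), &prompt, |b, p| {
            // Replace the placeholder with the cache-insertion call under test;
            // black_box prevents the compiler from optimizing the work away.
            b.iter(|| black_box(p.len()));
        });
    }
    group.finish();
}

criterion_group!(benches, bench_insert);
criterion_main!(benches);
```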

Lookup Tradeoff

There is a noted regression in longest_prefix_match latency. To support incremental hashing, the lookup now performs a full forward pass to pre-compute hashes for all boundaries before checking the cache shards.

  • Lookup (50 turns): Latency increased from 20.25 µs to 94.69 µs.
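
Conceptually, the updated lookup first walks the prompt once to pre-compute every boundary hash, then probes the cache from the longest prefix down. The sketch below is illustrative only; `shards` is a plain HashMap standing in for the real sharded cache structure.

```rust
use std::collections::HashMap;

use blake3::Hasher;

// Illustrative only: `shards` stands in for the cache's sharded hash -> tokens map.
fn longest_prefix_match(
    prompt: &str,
    boundaries: &[usize],
    shards: &HashMap<[u8; 32], Vec<u32>>,
) -> Option<(usize, Vec<u32>)> {
    // Forward pass: pre-compute the hash of every boundary prefix in O(N) total work.
    let mut hasher = Hasher::new();
    let mut prefix_hashes = Vec::with_capacity(boundaries.len());
    let mut prev = 0;
    for &end in boundaries {
        hasher.update(prompt[prev..end].as_bytes());
        prefix_hashes.push((end, *hasher.clone().finalize().as_bytes()));
        prev = end;
    }

    // Probe the cache starting from the longest prefix and return the first hit.
    for &(end, key) in prefix_hashes.iter().rev() {
        if let Some(tokens) = shards.get(&key) {
            return Some((end, tokens.clone()));
        }
    }
    None
}
```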

Overall Result: While lookup is measurably slower (roughly 4.7x at 50 turns), it remains sub-millisecond and is a negligible cost compared to model inference steps. The ~20x performance gain during insertion ensures the system no longer faces a "performance cliff" as chat turn counts grow.

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the dependencies and model-gateway labels Jan 1, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @ppraneth, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of the L1 cache within the model gateway, particularly for long, multi-turn conversations. The previous implementation suffered from a quadratic scaling issue where re-hashing and re-tokenizing entire prefixes led to disproportionate time increases. By introducing incremental hashing and tokenization techniques, the system now scales linearly, drastically improving insertion efficiency and preventing performance degradation as conversation lengths grow, while maintaining acceptable lookup speeds.

Highlights

  • Optimized L1 Cache Insertion: Refactored the L1 cache insertion logic to achieve linear O(N) scaling, addressing a previous quadratic O(N²) bottleneck in multi-turn conversations.
  • Incremental Hashing: Implemented incremental hashing using blake3::Hasher to compute prefix hashes efficiently by updating with only the "delta" text between special token boundaries.
  • Incremental Tokenization: Modified the insertion process to tokenize only new text segments between boundaries, appending them to a running token list, and ensuring special tokens like BOS are handled correctly only for the initial segment.
  • Significant Performance Gains: Achieved substantial speedups in insertion performance, up to ~20.5x for 50-turn conversations, with an acceptable and sub-millisecond tradeoff in lookup latency.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance optimization for the L1 cache by refactoring the insertion logic to use incremental hashing and tokenization. The move from a quadratic to a linear scaling complexity for cache insertion is a fantastic improvement, as demonstrated by the comprehensive benchmarks. The code is well-structured and the changes are clearly explained. However, I've found a critical issue in the implementation of the incremental hashing in both longest_prefix_match and insert_at_boundaries. The blake3::Hasher::finalize() method consumes the hasher, which will cause a compile error due to use-after-move in the loops. I've provided suggestions to fix this by cloning the hasher before finalizing. Once this is addressed, this will be an excellent contribution.

ppraneth and others added 2 commits January 1, 2026 14:52
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Contributor Author

ppraneth commented Jan 1, 2026

@slin1237 Can you review this PR?

