[model-gateway] Optimize L1 cache insertion with incremental hashing and tokenization #16259
base: main
Conversation
Summary of Changes
Hello @ppraneth, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the performance of the L1 cache within the model gateway, particularly for long, multi-turn conversations. The previous implementation suffered from a quadratic scaling issue where re-hashing and re-tokenizing entire prefixes led to disproportionate time increases. By introducing incremental hashing and tokenization techniques, the system now scales linearly, drastically improving insertion efficiency and preventing performance degradation as conversation lengths grow, while maintaining acceptable lookup speeds.
Highlights
Code Review
This pull request introduces a significant performance optimization for the L1 cache by refactoring the insertion logic to use incremental hashing and tokenization. The move from a quadratic to a linear scaling complexity for cache insertion is a fantastic improvement, as demonstrated by the comprehensive benchmarks. The code is well-structured and the changes are clearly explained. However, I've found a critical issue in the implementation of the incremental hashing in both longest_prefix_match and insert_at_boundaries. The blake3::Hasher::finalize() method consumes the hasher, which will cause a compile error due to use-after-move in the loops. I've provided suggestions to fix this by cloning the hasher before finalizing. Once this is addressed, this will be an excellent contribution.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@slin1237 Can you review this PR?
Motivation
The current L1 cache implementation suffers from a performance bottleneck when handling long, multi-turn conversations. The original design re-hashed and re-tokenized the entire prefix for every special token boundary found in the prompt.
This resulted in quadratic O(N²) scaling behavior: as a conversation grows longer, the time spent inserting into the cache increases disproportionately to the length of the text. This PR refactors the logic to achieve linear O(N) scaling, ensuring the gateway remains efficient for workloads with large context windows or frequent turn-based updates.
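For intuition on the scaling claim, a rough back-of-envelope (illustrative numbers, not measurements from this PR): with $B$ boundaries of roughly $s$ bytes each, re-hashing every prefix from scratch touches about

$$s \cdot (1 + 2 + \dots + B) = s \cdot \frac{B(B+1)}{2} \ \text{bytes},$$

whereas feeding only the per-boundary deltas touches $s \cdot B$ bytes. At 100 turns of about 1 KB each, that is roughly 5 MB of hashing for a single insertion versus about 100 KB.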
Modifications
The following changes were made to the L1 cache logic in src/tokenizer/cache/l1.rs:
- Incremental hashing with blake3::Hasher: instead of re-hashing the full prefix at every special-token boundary, the hasher is updated incrementally with only the "delta" text between boundaries, allowing all prefix hashes to be computed in a single linear pass (see the sketch below).
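To make the incremental-hashing idea concrete, here is a minimal, self-contained sketch. It is not the actual l1.rs code: the boundary offsets are assumed to be already known, and the function name is hypothetical. It only shows how a single running blake3::Hasher can yield one hash per prefix in a single linear pass.

```rust
// Illustrative only; depends on the `blake3` crate ([dependencies] blake3 = "1").
use blake3::Hasher;

/// Compute a hash for every boundary-delimited prefix of `text` in one linear
/// pass, feeding the hasher only the "delta" slice between boundaries instead
/// of re-hashing each prefix from scratch (the source of the old O(N^2) cost).
/// `boundaries` are assumed to be valid UTF-8 byte offsets in ascending order.
fn prefix_hashes_at_boundaries(text: &str, boundaries: &[usize]) -> Vec<blake3::Hash> {
    let mut hasher = Hasher::new();
    let mut prev = 0;
    let mut hashes = Vec::with_capacity(boundaries.len());
    for &end in boundaries {
        // Feed only the new text since the previous boundary.
        hasher.update(text[prev..end].as_bytes());
        prev = end;
        // Finalize a snapshot of the running state; the original hasher keeps
        // accumulating deltas on the next iteration (clone-before-finalize,
        // as suggested in the review above).
        hashes.push(hasher.clone().finalize());
    }
    hashes
}

fn main() {
    let text = "<|user|>hi<|assistant|>hello<|user|>how are you?";
    // Pretend these byte offsets are the special-token boundaries found by the tokenizer.
    let boundaries = [10, 28, text.len()];
    for (i, hash) in prefix_hashes_at_boundaries(text, &boundaries).iter().enumerate() {
        println!("prefix ending at byte {} -> {}", boundaries[i], hash.to_hex());
    }
}
```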
Accuracy Tests
Benchmarking and Profiling
Benchmarks performed with Criterion on a simulated ChatML prompt show a massive improvement in insertion scalability, accompanied by a minor and acceptable tradeoff in lookup latency.
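The benchmark harness itself is not reproduced in this description, so the following is only a hedged sketch of how a Criterion scaling benchmark over a simulated ChatML prompt might be structured. The prompt builder, benchmark names, and the blake3 stand-in workload are illustrative placeholders, not the benchmarks shipped with this PR.

```rust
// Illustrative Criterion harness (e.g. benches/l1_insert.rs); requires `criterion`
// and `blake3` as dev-dependencies and `harness = false` for the bench target.
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

/// Build a multi-turn ChatML-style prompt so cost can be measured as the
/// conversation grows.
fn simulated_chatml_prompt(turns: usize) -> String {
    (0..turns)
        .map(|i| {
            format!(
                "<|im_start|>user\nmessage {i}<|im_end|>\n<|im_start|>assistant\nreply {i}<|im_end|>\n"
            )
        })
        .collect()
}

fn bench_insert_scaling(c: &mut Criterion) {
    let mut group = c.benchmark_group("l1_cache_insert");
    for turns in [8usize, 32, 128] {
        let prompt = simulated_chatml_prompt(turns);
        group.bench_with_input(BenchmarkId::from_parameter(turns), &prompt, |b, p| {
            // Placeholder workload standing in for the real cache-insertion path.
            b.iter(|| std::hint::black_box(blake3::hash(p.as_bytes())))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_insert_scaling);
criterion_main!(benches);
```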
Insertion Performance (Scaling Fix)
The optimization successfully eliminated the scaling bottleneck: insertion time now grows roughly linearly with prompt length instead of quadratically as the conversation grows.
Lookup Tradeoff
There is a noted regression in longest_prefix_match latency: to support incremental hashing, the lookup now performs a full forward pass that pre-computes hashes for all boundaries before checking the cache shards (see the sketch below).
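To illustrate where that extra cost comes from, here is a hypothetical sketch of the lookup shape described above: all boundary hashes are pre-computed in one forward pass, then the shards are probed from the longest prefix down. The function name, the shard_contains closure, and the return value are placeholders, not the real l1.rs API.

```rust
use blake3::Hasher;

/// Hypothetical lookup sketch: pre-compute the hash of every boundary prefix
/// in a single forward pass, then probe the cache from longest to shortest.
/// `shard_contains` stands in for the real shard probe.
fn longest_prefix_match_sketch(
    text: &str,
    boundaries: &[usize],
    shard_contains: impl Fn(&blake3::Hash) -> bool,
) -> Option<usize> {
    // Forward pass: this is the extra up-front work the lookup now pays.
    let mut hasher = Hasher::new();
    let mut prev = 0;
    let mut hashes = Vec::with_capacity(boundaries.len());
    for &end in boundaries {
        hasher.update(text[prev..end].as_bytes());
        prev = end;
        hashes.push(hasher.clone().finalize());
    }
    // Probe from the longest prefix down; return its byte length on a hit.
    for (i, hash) in hashes.iter().enumerate().rev() {
        if shard_contains(hash) {
            return Some(boundaries[i]);
        }
    }
    None
}

fn main() {
    let text = "<|user|>hi<|assistant|>hello";
    // Dummy shard probe: pretend nothing is cached yet.
    let hit = longest_prefix_match_sketch(text, &[10, 28], |_h| false);
    println!("longest cached prefix: {:?}", hit);
}
```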
Overall result: while lookup is slightly slower, it remains sub-millisecond and is a negligible cost compared to model inference steps. The ~20x performance gain during insertion ensures the system no longer faces a "performance cliff" as chat turn counts grow.
Checklist
Review Process
Use the CI commands (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.