Add SHA256-CBOR hashing algorithm for token processor with extra keys… #587
Open
leipanhz wants to merge 2 commits into
Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.
Branch updated from 850b062 to 6d44c16
… support

Add configurable hashing algorithm in token processor with FNV64a as the default and SHA256-CBOR as an alternative for vLLM compatibility. Add support for extra keys (multimodal features) in block hash computation. Include comprehensive unit tests for SHA256 hashing and extra keys functionality.

Changes:
- Add HashAlgorithm configuration (FNV64a default, SHA256-CBOR for vLLM)
- Implement SHA256-CBOR hashing matching vLLM engine-key computation
- Extend BlockExtraFeatures for multimodal content support

Signed-off-by: Lei Pan <leipan@ibm.com>
Co-authored-by: Maroon Ayoub <maroon.ayoub@ibm.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
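The configurable hashing algorithm described in this commit can be sketched in Go as follows. This is a minimal sketch, assuming a string-typed setting that falls back to FNV64a when left empty; the names `HashFNV64a`, `HashSHA256CBOR`, `TokenProcessorConfig`, `withDefaults`, `validate` and `fnv64aBlockHash`, and the little-endian byte layout fed to FNV-1a, are all hypothetical and may not match the PR's actual code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// HashAlgorithm names the block-hash function (hypothetical identifiers).
type HashAlgorithm string

const (
	HashFNV64a     HashAlgorithm = "fnv64a"      // default
	HashSHA256CBOR HashAlgorithm = "sha256_cbor" // vLLM-compatible alternative
)

// TokenProcessorConfig is a hypothetical stand-in for the token processor config.
type TokenProcessorConfig struct {
	BlockSize     int
	HashAlgorithm HashAlgorithm
}

// withDefaults returns a copy with FNV64a selected when no algorithm is set.
func (c TokenProcessorConfig) withDefaults() TokenProcessorConfig {
	if c.HashAlgorithm == "" {
		c.HashAlgorithm = HashFNV64a
	}
	return c
}

// validate rejects algorithm names the processor does not implement.
func validate(alg HashAlgorithm) error {
	switch alg {
	case HashFNV64a, HashSHA256CBOR:
		return nil
	default:
		return fmt.Errorf("unsupported hash algorithm %q", alg)
	}
}

// fnv64aBlockHash chains a parent hash and one block's token IDs through
// FNV-1a (64-bit). The little-endian byte layout is an assumption.
func fnv64aBlockHash(parent uint64, tokens []uint32) uint64 {
	h := fnv.New64a()
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], parent)
	h.Write(buf[:])
	for _, t := range tokens {
		binary.LittleEndian.PutUint32(buf[:4], t)
		h.Write(buf[:4])
	}
	return h.Sum64()
}

func main() {
	cfg := TokenProcessorConfig{BlockSize: 16}.withDefaults()
	if err := validate(cfg.HashAlgorithm); err != nil {
		panic(err)
	}
	fmt.Println("algorithm:", cfg.HashAlgorithm)
	fmt.Printf("block hash: %x\n", fnv64aBlockHash(0, []uint32{1, 2, 3}))
	fmt.Println(validate("md5"))
}
```

Defaulting an empty setting to FNV64a leaves existing deployments unchanged, while rejecting unknown names up front avoids silently computing keys that can never match the ones vLLM produces.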
Branch updated from 0bfdfca to 93bed83
Signed-off-by: Lei Pan <leipan@ibm.com>
Branch updated from 93bed83 to 461c512
Add KV-cache file prefetch plugin for inference requests (experimental feature)
Part 1: changes in KV-Cache (current PR)
Part 2: changes in llm-d-router
PR Description:
Introduces a new experimental feature that proactively prefetches KV-cache blocks across storage tiers before the GPU pod processes an inference request. The plugin extends the precise prefix cache scorer with engine-key calculation, which it uses to determine the storage locations (file names) of the KV-cache blocks a request will need, and then arranges for those blocks to be promoted to a closer storage tier to reduce inference latency. The current implementation targets a shared file system with transparent access to a remote storage tier, such as IBM Storage Scale configured to offload cold data to remote object storage. The plugin uses a pool of concurrent workers to prefetch a configurable number of files in parallel from remote storage to the shared file system. A future version could extend this, for example, to prefetch KV-cache blocks from the file system into CPU memory on the worker node to which the request is routed.
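The worker-pool design described above can be sketched with goroutines and a shared job channel. This is a minimal sketch under stated assumptions: `prefetchFile`, `prefetchAll` and `makeTempBlocks` are hypothetical helpers, and promoting a file by simply reading it to the end relies on the file system's transparent recall, which is an assumption about the storage setup rather than the plugin's documented mechanism.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

// prefetchResult records the outcome of prefetching one file.
type prefetchResult struct {
	Path string
	Err  error
}

// prefetchFile reads a file to EOF and discards the bytes. On a tiered file
// system with transparent recall, reading the data is ASSUMED to be enough
// to bring it back to the local tier; the PR may use a different mechanism.
func prefetchFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(io.Discard, f)
	return err
}

// prefetchAll fans paths out to a fixed pool of worker goroutines and
// blocks until every path has been attempted.
func prefetchAll(paths []string, workers int) []prefetchResult {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	// Buffered so workers never block when reporting a result.
	results := make(chan prefetchResult, len(paths))
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				results <- prefetchResult{Path: p, Err: prefetchFile(p)}
			}
		}()
	}
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	close(results)
	out := make([]prefetchResult, 0, len(paths))
	for r := range results {
		out = append(out, r)
	}
	return out
}

// makeTempBlocks writes n small stand-in "KV block" files for the demo and
// returns their paths plus a cleanup function.
func makeTempBlocks(n int) ([]string, func(), error) {
	dir, err := os.MkdirTemp("", "kv-prefetch")
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() { os.RemoveAll(dir) }
	paths := make([]string, 0, n)
	for i := 0; i < n; i++ {
		p := filepath.Join(dir, fmt.Sprintf("block-%d.bin", i))
		if err := os.WriteFile(p, []byte("kv"), 0o644); err != nil {
			cleanup()
			return nil, nil, err
		}
		paths = append(paths, p)
	}
	return paths, cleanup, nil
}

func main() {
	paths, cleanup, err := makeTempBlocks(5)
	if err != nil {
		panic(err)
	}
	defer cleanup()
	for _, r := range prefetchAll(paths, 3) {
		fmt.Println(filepath.Base(r.Path), "err:", r.Err)
	}
}
```

A pre-request hook might instead hand the paths to the pool without waiting for completion, so that routing is not delayed; the blocking variant here only keeps the example easy to check.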
For this to work correctly, the plugin must be configured to generate engine keys with the same hash algorithm that vLLM uses when offloading KV-cache blocks to storage. To support this, the PR adds SHA256-CBOR to the token processor as a configurable alternative hashing algorithm for vLLM compatibility (FNV64a remains the default). The SHA256-CBOR implementation supports extra keys (multimodal features) in block hash computation. In addition, the feature relies on logic derived from the llm-d-fs-connector to generate KV file names, so it currently works only with the llm-d-fs-connector.
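Chained SHA256-CBOR block hashing with extra keys can be sketched as below. It assumes the scheme mirrors vLLM's `sha256_cbor` helper: each full block is hashed as SHA-256 over the canonical CBOR encoding of the tuple `(parent_hash, token_ids, extra_keys)`, with Python `None` encoded as CBOR null, and each block's hash becomes the next block's parent. The hand-rolled CBOR encoder, the all-zero seed (vLLM derives its own initial parent hash), and modelling extra keys as a list of strings such as multimodal content hashes are simplifying assumptions; `encodeBlockKey` and `chainHashes` are hypothetical names, and byte-for-byte compatibility with any particular vLLM version is not guaranteed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// appendHead appends a CBOR initial byte and argument using the shortest
// form, as canonical CBOR requires (RFC 8949).
func appendHead(buf []byte, major byte, n uint64) []byte {
	m := major << 5
	switch {
	case n < 24:
		return append(buf, m|byte(n))
	case n <= 0xff:
		return append(buf, m|24, byte(n))
	case n <= 0xffff:
		return append(buf, m|25, byte(n>>8), byte(n))
	case n <= 0xffffffff:
		return append(buf, m|26, byte(n>>24), byte(n>>16), byte(n>>8), byte(n))
	default:
		return append(buf, m|27, byte(n>>56), byte(n>>48), byte(n>>40), byte(n>>32),
			byte(n>>24), byte(n>>16), byte(n>>8), byte(n))
	}
}

// encodeBlockKey encodes [parent, tokens, extraKeys] as a 3-element CBOR array.
// A nil extraKeys slice is encoded as CBOR null (Python None).
func encodeBlockKey(parent []byte, tokens []uint32, extraKeys []string) []byte {
	buf := appendHead(nil, 4, 3)                  // array(3)
	buf = appendHead(buf, 2, uint64(len(parent))) // byte string header
	buf = append(buf, parent...)
	buf = appendHead(buf, 4, uint64(len(tokens))) // array of unsigned ints
	for _, t := range tokens {
		buf = appendHead(buf, 0, uint64(t))
	}
	if extraKeys == nil {
		return append(buf, 0xf6) // null
	}
	buf = appendHead(buf, 4, uint64(len(extraKeys)))
	for _, k := range extraKeys {
		buf = appendHead(buf, 3, uint64(len(k))) // text string header
		buf = append(buf, k...)
	}
	return buf
}

// chainHashes hashes each full block of blockSize tokens; extra maps a block
// index to that block's extra keys. A trailing partial block is skipped.
func chainHashes(seed []byte, tokens []uint32, blockSize int, extra map[int][]string) [][32]byte {
	if blockSize <= 0 {
		return nil
	}
	var out [][32]byte
	parent := seed
	for i := 0; (i+1)*blockSize <= len(tokens); i++ {
		block := tokens[i*blockSize : (i+1)*blockSize]
		h := sha256.Sum256(encodeBlockKey(parent, block, extra[i]))
		out = append(out, h)
		parent = h[:]
	}
	return out
}

func main() {
	seed := make([]byte, 32) // hypothetical all-zero seed
	tokens := []uint32{101, 102, 103, 104, 105, 106, 107, 108}
	text := chainHashes(seed, tokens, 4, nil)
	mm := chainHashes(seed, tokens, 4, map[int][]string{1: {"img-hash-abc"}})
	for i := range text {
		fmt.Printf("block %d: text-only %s, with extra keys %s\n",
			i, hex.EncodeToString(text[i][:8]), hex.EncodeToString(mm[i][:8]))
	}
}
```

Because every hash folds in its parent, an extra key attached to one block changes that block's hash and every later one, but leaves earlier blocks untouched, which is what lets text-only prefixes still match across multimodal requests.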
Changes include:
New Prefetch Plugin (prefetch_prerequest_experimental.go):
Precise Prefix Cache Scorer Enhancement (precise_prefix_cache.go):
Add SHA256-CBOR hashing algorithm for token processor with extra keys support