feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling #427
Open
kapiljain1989 wants to merge 1 commit into llm-d:main from
Conversation
Contributor
@kapiljain1989 can you help rebase?

Author
@gyliu513 ,
Force-pushed b781efc to 2adfdc3
Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.
Force-pushed 2adfdc3 to 9949b80
Signed-off-by: Kapil Jain <kapiljain1989@gmail.com>
# Conflicts:
#   pkg/kvcache/indexer.go
Force-pushed 9949b80 to 1972927
Overview
This PR adds support for Hybrid Model Architecture (HMA) with partial KV cache eviction, enabling efficient KV cache management for models with mixed attention mechanisms (e.g., DeepSeek-R1 style models with full-attention and sliding-window-attention layers).
Issue 336
New Features

Partial KV Cache Eviction
Added the ability to evict specific attention groups while preserving others in the KV cache.
- New field in BlockRemovedEvent:
- Event processing:
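The eviction path in the bullets above can be sketched roughly as follows. All type, field, and function names below are guesses from this description, not the PR's actual identifiers; check the diff for the real ones:

```go
package main

import "fmt"

// BlockRemovedEvent, extended with an optional attention-group field so the
// indexer can evict one group's blocks while preserving the others.
// (Hypothetical field names for illustration only.)
type BlockRemovedEvent struct {
	BlockHashes    []uint64
	AttentionGroup *int // nil = remove from all groups (legacy behavior)
}

// index maps attention-group ID -> set of cached block hashes for one pod.
type index map[int]map[uint64]bool

// apply removes the event's blocks from one group if specified, else from all.
func (idx index) apply(ev BlockRemovedEvent) {
	for gid, blocks := range idx {
		if ev.AttentionGroup != nil && gid != *ev.AttentionGroup {
			continue // partial eviction: leave other groups intact
		}
		for _, h := range ev.BlockHashes {
			delete(blocks, h)
		}
	}
}

func main() {
	idx := index{
		0: {1: true, 2: true}, // full-attention group
		1: {1: true, 2: true}, // sliding-window group
	}
	swa := 1
	idx.apply(BlockRemovedEvent{BlockHashes: []uint64{1}, AttentionGroup: &swa})
	fmt.Println(len(idx[0]), len(idx[1])) // prints 2 1: group 0 untouched
}
```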
HybridModel Scoring Strategy
A new scoring strategy that considers attention-group availability when selecting pods for KV cache reuse.
- Features:
- Example:
```
// Request with 8K tokens; the model has a 4K sliding window
// Useful groups: [0 (full-attn), 1 (SWA)]
// Pod A has both groups:  score × 1.0
// Pod B has only group 0: score × 0.65
// Pod C missing group 0:  score × 0.0 (excluded)
```
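The scoring rules in the example above can be sketched like this. The function names and the specific 0.65 penalty factor are taken from the example; everything else (signatures, types) is an illustrative guess, not the PR's real API:

```go
package main

import "fmt"

// AttentionGroup describes one group of layers. WindowSize == 0 stands in
// for full attention; >0 is a sliding-window size. (Hypothetical names.)
type AttentionGroup struct {
	ID         int
	WindowSize int
}

// usefulGroups returns the groups whose cached KV matters for a prompt of
// promptLen tokens: full-attention groups always, SWA groups only when the
// prompt exceeds their window.
func usefulGroups(groups []AttentionGroup, promptLen int) []AttentionGroup {
	var useful []AttentionGroup
	for _, g := range groups {
		if g.WindowSize == 0 || promptLen > g.WindowSize {
			useful = append(useful, g)
		}
	}
	return useful
}

// hybridMultiplier scales a pod's prefix-match score by group availability:
// 1.0 when all useful groups are cached, 0.65 per missing SWA group, and
// 0.0 (pod excluded) when a full-attention group is missing.
func hybridMultiplier(useful []AttentionGroup, podGroups map[int]bool) float64 {
	m := 1.0
	for _, g := range useful {
		if podGroups[g.ID] {
			continue
		}
		if g.WindowSize == 0 {
			return 0.0 // missing full-attention cache: unusable
		}
		m *= 0.65 // missing SWA cache: partial-recompute penalty
	}
	return m
}

func main() {
	groups := []AttentionGroup{{ID: 0, WindowSize: 0}, {ID: 1, WindowSize: 4096}}
	useful := usefulGroups(groups, 8192) // 8K tokens > 4K window: both useful
	fmt.Println(hybridMultiplier(useful, map[int]bool{0: true, 1: true})) // 1
	fmt.Println(hybridMultiplier(useful, map[int]bool{0: true}))          // 0.65
	fmt.Println(hybridMultiplier(useful, map[int]bool{1: true}))          // 0
}
```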
Model Configuration Registry
Per-model configuration defining the attention-group architecture.
- Configuration:
- Example config:
```yaml
modelConfigs:
  deepseek-r1:
    attentionGroups:
      0:
        windowSize: null   # Full-attention group (always required)
      1:
        windowSize: 4096   # SWA group (useful when tokens > 4096)
```
- PodEntry extended:
- Cost calculation updated:
- Extended event parsing:
Configuration
Enable HybridModel Scoring
Fallback to LongestPrefix
If no model configs are provided, or the model is not in the registry, the scorer automatically falls back to standard LongestPrefixMatch scoring.
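The fallback rule can be sketched as a small selector. The scorer names come from this description; the interface and constructors are stand-ins, not the PR's real API:

```go
package main

import "fmt"

// Scorer is a stand-in for the pod-scoring interface. (Illustrative only.)
type Scorer interface{ Name() string }

type hybridScorer struct{}

func (hybridScorer) Name() string { return "HybridModel" }

type longestPrefixScorer struct{}

func (longestPrefixScorer) Name() string { return "LongestPrefixMatch" }

// ModelConfig stands in for a registry entry; a nil window size marks a
// full-attention group.
type ModelConfig struct {
	AttentionGroups map[int]*int
}

// selectScorer applies the fallback rule: use HybridModel scoring only when
// the model is in the registry with at least one attention group; otherwise
// fall back to standard LongestPrefixMatch scoring.
func selectScorer(model string, registry map[string]ModelConfig) Scorer {
	if cfg, ok := registry[model]; ok && len(cfg.AttentionGroups) > 0 {
		return hybridScorer{}
	}
	return longestPrefixScorer{}
}

func main() {
	window := 4096
	registry := map[string]ModelConfig{
		"deepseek-r1": {AttentionGroups: map[int]*int{0: nil, 1: &window}},
	}
	fmt.Println(selectScorer("deepseek-r1", registry).Name()) // HybridModel
	fmt.Println(selectScorer("llama-3", registry).Name())     // LongestPrefixMatch
}
```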
Backward Compatibility
Test Coverage
New Tests Added