feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling #427

Open
kapiljain1989 wants to merge 1 commit into llm-d:main from kapiljain1989:hma-prefix-routing

Conversation

@kapiljain1989 kapiljain1989 commented Mar 16, 2026

Overview

This PR adds support for Hybrid Model Architecture (HMA) with partial KV cache eviction, enabling efficient KV cache management for models with mixed attention mechanisms (e.g., DeepSeek-R1 style models with full-attention and sliding-window-attention layers).
Addresses Issue #336.

New Features

  1. Partial Eviction Support

Added ability to evict specific attention groups while preserving others in the KV cache.

New Field in BlockRemovedEvent:

  type BlockRemovedEvent struct {
      BlockHashes   []uint64
      DeviceTier    string
      EvictedGroups []int  // NEW: nil/empty = full eviction, non-empty = partial eviction
  }

Event Processing:

  • Detects partial vs. full eviction based on EvictedGroups field
  • For partial evictions: Updates block metadata to track which groups remain cached
  • For full evictions: Removes entry completely (maintains existing behavior)
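The branch described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code; `applyEviction`, `blockMeta`, and the in-memory `index` map are hypothetical names.

```go
package main

import "fmt"

// BlockRemovedEvent mirrors the extended event from the PR description.
type BlockRemovedEvent struct {
	BlockHashes   []uint64
	DeviceTier    string
	EvictedGroups []int // nil/empty = full eviction, non-empty = partial eviction
}

// blockMeta is a hypothetical per-block record tracking which attention
// groups are still cached.
type blockMeta struct {
	cachedGroups map[int]bool
}

// applyEviction removes the block entirely on a full eviction, or clears
// only the evicted attention groups on a partial one.
func applyEviction(index map[uint64]*blockMeta, ev BlockRemovedEvent) {
	for _, h := range ev.BlockHashes {
		meta, ok := index[h]
		if !ok {
			continue
		}
		if len(ev.EvictedGroups) == 0 {
			delete(index, h) // full eviction: existing behavior
			continue
		}
		for _, g := range ev.EvictedGroups {
			delete(meta.cachedGroups, g) // partial: drop only these groups
		}
		if len(meta.cachedGroups) == 0 {
			delete(index, h) // nothing left cached for this block
		}
	}
}

func main() {
	index := map[uint64]*blockMeta{
		42: {cachedGroups: map[int]bool{0: true, 1: true}},
	}
	applyEviction(index, BlockRemovedEvent{BlockHashes: []uint64{42}, EvictedGroups: []int{1}})
	fmt.Println(index[42].cachedGroups) // group 1 dropped, group 0 remains
}
```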
  2. HybridModelScorer

New scoring strategy that considers attention group availability when selecting pods for KV cache reuse.

Features:

  • Computes which attention groups are needed based on request tokens
  • Validates that pods have required groups cached
  • Applies coverage multipliers:
    • 0.0: Missing Group 0 (full-attention) - pod excluded
    • 1.0: All useful groups cached - full hit
    • 0.3 + 0.7 × ratio: Partial hit - scaled by coverage
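The multiplier rule above can be sketched like this; `coverageMultiplier` is a hypothetical helper, not the PR's actual function.

```go
package main

import "fmt"

// coverageMultiplier applies the rule described above: group 0
// (full-attention) is mandatory, a full match scores 1.0, and a partial
// match scales as 0.3 + 0.7*ratio.
func coverageMultiplier(useful, cached []int) float64 {
	cachedSet := make(map[int]bool, len(cached))
	for _, g := range cached {
		cachedSet[g] = true
	}
	if !cachedSet[0] {
		return 0.0 // missing full-attention group: pod excluded
	}
	hits := 0
	for _, g := range useful {
		if cachedSet[g] {
			hits++
		}
	}
	if hits == len(useful) {
		return 1.0 // all useful groups cached: full hit
	}
	return 0.3 + 0.7*float64(hits)/float64(len(useful))
}

func main() {
	fmt.Println(coverageMultiplier([]int{0, 1}, []int{0, 1})) // full hit: 1.0
	fmt.Println(coverageMultiplier([]int{0, 1}, []int{0}))    // partial: ~0.65 (0.3 + 0.7*0.5)
	fmt.Println(coverageMultiplier([]int{0, 1}, []int{1}))    // excluded: 0.0
}
```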

Example:

  // Request with 8K tokens; model has a 4K sliding window
  // Useful groups: 0 (full-attn), 1 (SWA)
  // Pod A has both groups:  score × 1.0
  // Pod B has only group 0: score × 0.65
  // Pod C missing group 0:  score × 0.0 (excluded)

  3. Model Registry

Per-model configuration defining attention group architecture.

Configuration:

  type ModelConfig struct {
      AttentionGroups map[int]*AttentionGroupConfig
  }
  type AttentionGroupConfig struct {
      WindowSize *int  // nil = full-attention, value = sliding-window size
  }
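Based on this config, deciding which groups matter for a given request could look like the sketch below; `usefulGroups` and `promptTokens` are illustrative names, not the PR's actual identifiers.

```go
package main

import "fmt"

// AttentionGroupConfig mirrors the registry type from the PR description.
type AttentionGroupConfig struct {
	WindowSize *int // nil = full-attention, value = sliding-window size
}

// usefulGroups returns the attention groups worth matching for a request:
// full-attention groups are always useful; an SWA group only helps once the
// prompt exceeds its sliding window.
func usefulGroups(groups map[int]*AttentionGroupConfig, promptTokens int) []int {
	var useful []int
	for id, cfg := range groups {
		if cfg.WindowSize == nil || promptTokens > *cfg.WindowSize {
			useful = append(useful, id)
		}
	}
	return useful
}

func main() {
	w := 4096
	cfg := map[int]*AttentionGroupConfig{
		0: {},               // full-attention, always required
		1: {WindowSize: &w}, // SWA with a 4K window
	}
	fmt.Println(len(usefulGroups(cfg, 8000))) // 8K prompt: both groups useful
	fmt.Println(len(usefulGroups(cfg, 2000))) // 2K prompt: only group 0
}
```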

Example Config:

  modelConfigs:
    deepseek-r1:
      attentionGroups:
        0:
          windowSize: null  # Full-attention group (always required)
        1:
          windowSize: 4096  # SWA group (useful when tokens > 4096)

  4. Storage Layer Enhancements

PodEntry Extended:

  type PodEntry struct {
      PodIdentifier string
      DeviceTier    string
      EvictedGroups []int  // NEW: Tracks which groups were evicted
  }

Cost Calculation Updated:

  • Includes EvictedGroups slice in memory cost estimation
  • Maintains accurate cache accounting
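A minimal sketch of what "includes the EvictedGroups slice in the cost estimate" could mean; `entryCost` is a hypothetical helper, and the byte accounting here is illustrative, not the PR's exact formula.

```go
package main

import (
	"fmt"
	"unsafe"
)

// PodEntry mirrors the extended storage type from the PR description.
type PodEntry struct {
	PodIdentifier string
	DeviceTier    string
	EvictedGroups []int // tracks which groups were evicted
}

// entryCost estimates the in-memory footprint of one entry, now counting
// the EvictedGroups backing array in addition to the struct and strings.
func entryCost(e PodEntry) int {
	cost := int(unsafe.Sizeof(e)) // struct header (string + slice headers)
	cost += len(e.PodIdentifier)  // string payload bytes
	cost += len(e.DeviceTier)
	cost += cap(e.EvictedGroups) * int(unsafe.Sizeof(int(0))) // NEW: slice backing array
	return cost
}

func main() {
	base := entryCost(PodEntry{PodIdentifier: "pod-a", DeviceTier: "gpu"})
	with := entryCost(PodEntry{PodIdentifier: "pod-a", DeviceTier: "gpu", EvictedGroups: []int{1, 2}})
	fmt.Println(with - base) // two tracked groups add 2*sizeof(int) bytes
}
```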
  5. vLLM Adapter Integration

Extended Event Parsing:

  type msgpackVLLMBlockRemovedEvent struct {
      Tag           string
      BlockHashes   []any
      Medium        *string
      EvictedGroups []int   // NEW: Parsed from vLLM events
  }

Configuration

Enable HybridModel Scoring

  scoringStrategy: "HybridModel"
  backendConfigs:
    - name: "gpu"
      weight: 1.0
      modelConfigs:
        deepseek-r1:
          attentionGroups:
            0: {}  # Full-attention
            1: { windowSize: 4096 }  # SWA

Fallback to LongestPrefix

If no model configs provided or model not in registry, automatically falls back to standard LongestPrefixMatch scoring.
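The fallback decision could be dispatched as sketched below; `Registry` and `strategyFor` are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// ModelConfig and Registry are simplified stand-ins for the PR's model
// registry, kept only to show the fallback logic.
type ModelConfig struct {
	AttentionGroups map[int]struct{}
}

type Registry struct {
	models map[string]*ModelConfig
}

// strategyFor picks HybridModel scoring only when the model is registered
// with an HMA config; everything else falls back to LongestPrefixMatch.
func (r *Registry) strategyFor(model string) string {
	if r == nil || len(r.models) == 0 {
		return "LongestPrefixMatch" // no model configs provided at all
	}
	if _, ok := r.models[model]; ok {
		return "HybridModel"
	}
	return "LongestPrefixMatch" // model not in registry
}

func main() {
	reg := &Registry{models: map[string]*ModelConfig{"deepseek-r1": {}}}
	fmt.Println(reg.strategyFor("deepseek-r1")) // HybridModel
	fmt.Println(reg.strategyFor("llama-3"))     // LongestPrefixMatch
}
```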

Backward Compatibility

  • EvictedGroups = nil treated as full eviction (existing behavior)
  • Models without HMA config use LongestPrefixMatch scoring
  • vLLM events without EvictedGroups field supported

Test Coverage

New Tests Added

  • TestDecodeVLLMEvent_BlockRemoved: Partial eviction event parsing
  • TestHybridModelScorer_FullHit: All groups cached
  • TestHybridModelScorer_PartialHit: Some groups cached
  • TestHybridModelScorer_MissingGroup0: Missing required group
  • TestHybridModelScorer_SWANotUseful: Request below window size
  • TestHybridModelScorer_MultipleUsefulGroups: Complex scenarios
  • TestHybridModelScorer_NoModelConfigs: Fallback behavior
  • TestHybridModelScorer_FallbackToLongestPrefix: Unknown model

gyliu513 (Contributor) commented Apr 2, 2026

@kapiljain1989 can you help rebase?

@kapiljain1989 kapiljain1989 changed the title feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling [WIP]feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling Apr 2, 2026
kapiljain1989 (Author) commented:

@gyliu513, there are some changes on the vLLM side that I need to incorporate; I will take care of it.

@github-actions github-actions bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 7, 2026

github-actions bot commented Apr 7, 2026

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

Signed-off-by: Kapil Jain <kapiljain1989@gmail.com>

# Conflicts:
#	pkg/kvcache/indexer.go
@kapiljain1989 kapiljain1989 changed the title [WIP]feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling Apr 7, 2026