feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling #427

Open
kapiljain1989 wants to merge 1 commit into llm-d:main from kapiljain1989:hma-prefix-routing

Conversation

@kapiljain1989 kapiljain1989 commented Mar 16, 2026

Overview

This PR adds support for Hybrid Model Architecture (HMA) with partial KV cache eviction, enabling efficient KV cache management for models with mixed attention mechanisms (e.g., DeepSeek-R1 style models with full-attention and sliding-window-attention layers).
Addresses Issue #336.

New Features

  1. Partial Eviction Support

Added ability to evict specific attention groups while preserving others in the KV cache.

New Field in BlockRemovedEvent:

  type BlockRemovedEvent struct {
      BlockHashes   []uint64
      DeviceTier    string
      EvictedGroups []int  // NEW: nil/empty = full eviction, non-empty = partial eviction
  }

Event Processing:

  • Detects partial vs. full eviction based on EvictedGroups field
  • For partial evictions: Updates block metadata to track which groups remain cached
  • For full evictions: Removes entry completely (maintains existing behavior)
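The branch described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code; `applyEviction`, `blockMeta`, and the in-memory `index` map are hypothetical names.

```go
package main

import "fmt"

// BlockRemovedEvent mirrors the extended event from the PR description.
type BlockRemovedEvent struct {
	BlockHashes   []uint64
	DeviceTier    string
	EvictedGroups []int // nil/empty = full eviction, non-empty = partial eviction
}

// blockMeta is a hypothetical per-block record tracking which attention
// groups are still cached.
type blockMeta struct {
	cachedGroups map[int]bool
}

// applyEviction removes the block entirely on a full eviction, or clears
// only the evicted attention groups on a partial one.
func applyEviction(index map[uint64]*blockMeta, ev BlockRemovedEvent) {
	for _, h := range ev.BlockHashes {
		meta, ok := index[h]
		if !ok {
			continue
		}
		if len(ev.EvictedGroups) == 0 {
			delete(index, h) // full eviction: existing behavior
			continue
		}
		for _, g := range ev.EvictedGroups {
			delete(meta.cachedGroups, g) // partial: drop only these groups
		}
		if len(meta.cachedGroups) == 0 {
			delete(index, h) // nothing left cached for this block
		}
	}
}

func main() {
	index := map[uint64]*blockMeta{
		42: {cachedGroups: map[int]bool{0: true, 1: true}},
	}
	applyEviction(index, BlockRemovedEvent{BlockHashes: []uint64{42}, EvictedGroups: []int{1}})
	fmt.Println(index[42].cachedGroups) // group 1 dropped, group 0 remains
}
```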
  2. HybridModelScorer

New scoring strategy that considers attention group availability when selecting pods for KV cache reuse.

Features:

  • Computes which attention groups are needed based on request tokens
  • Validates that pods have required groups cached
  • Applies coverage multipliers:
    • 0.0: Missing Group 0 (full-attention) - pod excluded
    • 1.0: All useful groups cached - full hit
    • 0.3 + 0.7 × ratio: Partial hit - scaled by coverage
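The multiplier rule above can be sketched like this; `coverageMultiplier` is a hypothetical helper, not the PR's actual function.

```go
package main

import "fmt"

// coverageMultiplier applies the rule described above: group 0
// (full-attention) is mandatory, a full match scores 1.0, and a partial
// match scales as 0.3 + 0.7*ratio.
func coverageMultiplier(useful, cached []int) float64 {
	cachedSet := make(map[int]bool, len(cached))
	for _, g := range cached {
		cachedSet[g] = true
	}
	if !cachedSet[0] {
		return 0.0 // missing full-attention group: pod excluded
	}
	hits := 0
	for _, g := range useful {
		if cachedSet[g] {
			hits++
		}
	}
	if hits == len(useful) {
		return 1.0 // all useful groups cached: full hit
	}
	return 0.3 + 0.7*float64(hits)/float64(len(useful))
}

func main() {
	fmt.Println(coverageMultiplier([]int{0, 1}, []int{0, 1})) // full hit: 1.0
	fmt.Println(coverageMultiplier([]int{0, 1}, []int{0}))    // partial: ~0.65 (0.3 + 0.7*0.5)
	fmt.Println(coverageMultiplier([]int{0, 1}, []int{1}))    // excluded: 0.0
}
```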

Example:

  // Request with 8K tokens; model has a 4K sliding window
  // Useful groups: 0 (full-attn), 1 (SWA)
  // Pod A has both groups:  score × 1.0
  // Pod B has only group 0: score × 0.65
  // Pod C missing group 0:  score × 0.0 (excluded)

  3. Model Registry

Per-model configuration defining attention group architecture.

Configuration:

  type ModelConfig struct {
      AttentionGroups map[int]*AttentionGroupConfig
  }
  type AttentionGroupConfig struct {
      WindowSize *int  // nil = full-attention, value = sliding-window size
  }
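Based on this config, deciding which groups matter for a given request could look like the sketch below; `usefulGroups` and `promptTokens` are illustrative names, not the PR's actual identifiers.

```go
package main

import "fmt"

// AttentionGroupConfig mirrors the registry type from the PR description.
type AttentionGroupConfig struct {
	WindowSize *int // nil = full-attention, value = sliding-window size
}

// usefulGroups returns the attention groups worth matching for a request:
// full-attention groups are always useful; an SWA group only helps once the
// prompt exceeds its sliding window.
func usefulGroups(groups map[int]*AttentionGroupConfig, promptTokens int) []int {
	var useful []int
	for id, cfg := range groups {
		if cfg.WindowSize == nil || promptTokens > *cfg.WindowSize {
			useful = append(useful, id)
		}
	}
	return useful
}

func main() {
	w := 4096
	cfg := map[int]*AttentionGroupConfig{
		0: {},               // full-attention, always required
		1: {WindowSize: &w}, // SWA with a 4K window
	}
	fmt.Println(len(usefulGroups(cfg, 8000))) // 8K prompt: both groups useful
	fmt.Println(len(usefulGroups(cfg, 2000))) // 2K prompt: only group 0
}
```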

Example Config:

  modelConfigs:
    deepseek-r1:
      attentionGroups:
        0:
          windowSize: null  # Full-attention group (always required)
        1:
          windowSize: 4096  # SWA group (useful when tokens > 4096)

  4. Storage Layer Enhancements

PodEntry Extended:

  type PodEntry struct {
      PodIdentifier string
      DeviceTier    string
      EvictedGroups []int  // NEW: Tracks which groups were evicted
  }

Cost Calculation Updated:

  • Includes EvictedGroups slice in memory cost estimation
  • Maintains accurate cache accounting
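A minimal sketch of what "includes the EvictedGroups slice in the cost estimate" could mean; `entryCost` is a hypothetical helper, and the byte accounting here is illustrative, not the PR's exact formula.

```go
package main

import (
	"fmt"
	"unsafe"
)

// PodEntry mirrors the extended storage type from the PR description.
type PodEntry struct {
	PodIdentifier string
	DeviceTier    string
	EvictedGroups []int // tracks which groups were evicted
}

// entryCost estimates the in-memory footprint of one entry, now counting
// the EvictedGroups backing array in addition to the struct and strings.
func entryCost(e PodEntry) int {
	cost := int(unsafe.Sizeof(e)) // struct header (string + slice headers)
	cost += len(e.PodIdentifier)  // string payload bytes
	cost += len(e.DeviceTier)
	cost += cap(e.EvictedGroups) * int(unsafe.Sizeof(int(0))) // NEW: slice backing array
	return cost
}

func main() {
	base := entryCost(PodEntry{PodIdentifier: "pod-a", DeviceTier: "gpu"})
	with := entryCost(PodEntry{PodIdentifier: "pod-a", DeviceTier: "gpu", EvictedGroups: []int{1, 2}})
	fmt.Println(with - base) // two tracked groups add 2*sizeof(int) bytes
}
```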
  5. vLLM Adapter Integration

Extended Event Parsing:

  type msgpackVLLMBlockRemovedEvent struct {
      Tag           string
      BlockHashes   []any
      Medium        *string
      EvictedGroups []int   // NEW: Parsed from vLLM events
  }

Configuration

Enable HybridModel Scoring

  scoringStrategy: "HybridModel"
  backendConfigs:
    - name: "gpu"
      weight: 1.0
      modelConfigs:
        deepseek-r1:
          attentionGroups:
            0: {}  # Full-attention
            1: { windowSize: 4096 }  # SWA

Fallback to LongestPrefix

If no model configs provided or model not in registry, automatically falls back to standard LongestPrefixMatch scoring.
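The fallback decision could be dispatched as sketched below; `Registry` and `strategyFor` are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// ModelConfig and Registry are simplified stand-ins for the PR's model
// registry, kept only to show the fallback logic.
type ModelConfig struct {
	AttentionGroups map[int]struct{}
}

type Registry struct {
	models map[string]*ModelConfig
}

// strategyFor picks HybridModel scoring only when the model is registered
// with an HMA config; everything else falls back to LongestPrefixMatch.
func (r *Registry) strategyFor(model string) string {
	if r == nil || len(r.models) == 0 {
		return "LongestPrefixMatch" // no model configs provided at all
	}
	if _, ok := r.models[model]; ok {
		return "HybridModel"
	}
	return "LongestPrefixMatch" // model not in registry
}

func main() {
	reg := &Registry{models: map[string]*ModelConfig{"deepseek-r1": {}}}
	fmt.Println(reg.strategyFor("deepseek-r1")) // HybridModel
	fmt.Println(reg.strategyFor("llama-3"))     // LongestPrefixMatch
}
```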

Backward Compatibility

  • EvictedGroups = nil treated as full eviction (existing behavior)
  • Models without HMA config use LongestPrefixMatch scoring
  • vLLM events without EvictedGroups field supported

Test Coverage

New Tests Added

  • TestDecodeVLLMEvent_BlockRemoved: Partial eviction event parsing
  • TestHybridModelScorer_FullHit: All groups cached
  • TestHybridModelScorer_PartialHit: Some groups cached
  • TestHybridModelScorer_MissingGroup0: Missing required group
  • TestHybridModelScorer_SWANotUseful: Request below window size
  • TestHybridModelScorer_MultipleUsefulGroups: Complex scenarios
  • TestHybridModelScorer_NoModelConfigs: Fallback behavior
  • TestHybridModelScorer_FallbackToLongestPrefix: Unknown model

gyliu513 (Contributor) commented Apr 2, 2026

@kapiljain1989 can you help rebase?

@kapiljain1989 kapiljain1989 changed the title feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling [WIP]feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling Apr 2, 2026
kapiljain1989 (Author) commented:

@gyliu513, there are some changes on the vLLM side that I need to incorporate; I will take care of it.

@github-actions github-actions bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 7, 2026

github-actions bot commented Apr 7, 2026

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

Signed-off-by: Kapil Jain <kapiljain1989@gmail.com>

# Conflicts:
#	pkg/kvcache/indexer.go
@kapiljain1989 kapiljain1989 changed the title [WIP]feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling feat: Add Hybrid Model Architecture (HMA) Support in Prefix-Cache Aware Scheduling Apr 7, 2026