Support HMA (Hybrid Memory Attention) in Prefix-Cache Aware Scheduling #336

@vMaroon

Description

Overview

Prefix-cache-aware routing currently assumes uniform attention across all layers: a block is either cached or not. With HMA (hybrid models that mix full-attention and sliding-window attention layers), this assumption breaks down:

  • vLLM can evict SWA blocks independently of full-attention blocks, creating partial hits: the full-attention prefix is cached but the SWA blocks for that range are gone
  • A partial hit is not a miss. Only the last N blocks (where N = window size in blocks) need SWA state; earlier blocks don't. The indexer must distinguish between full hits, partial hits (full-attention present, SWA evicted outside the window), and misses
  • The BlockStored/BlockRemoved event format and the indexer's scoring logic are unaware of per-group block identity, so the scheduler cannot reason about this today
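
The distinction above can be sketched as a small classifier. This is illustrative only, assuming the indexer knows, per block of a matched prefix, whether its SWA state is still cached; the names (`HitType`, `classify_prefix_hit`) are hypothetical, not existing APIs:

```python
from enum import Enum

class HitType(Enum):
    FULL = "full"        # all layer groups cached for the matched prefix
    PARTIAL = "partial"  # full-attention cached; SWA evicted only outside the window
    MISS = "miss"

def classify_prefix_hit(num_matched_blocks: int,
                        swa_cached: list,
                        window_blocks: int) -> HitType:
    """Classify a prefix match of `num_matched_blocks` full-attention blocks.

    `swa_cached[i]` is True if the SWA state for block i is still cached.
    Only the trailing `window_blocks` blocks of the prefix need SWA state.
    """
    if num_matched_blocks == 0:
        return HitType.MISS
    tail_start = max(0, num_matched_blocks - window_blocks)
    # SWA state missing inside the window: the prefix is not usable as-is.
    if not all(swa_cached[tail_start:num_matched_blocks]):
        return HitType.MISS
    if all(swa_cached[:num_matched_blocks]):
        return HitType.FULL
    return HitType.PARTIAL
```

Under this framing, a PARTIAL result still qualifies as a routable hit, since the evicted SWA blocks fall outside the window and are never read.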

Proposed Direction

  • Keep BlockStored as-is: a stored block implies all layer groups are cached (full-attention + SWA)
  • Extend BlockRemoved to indicate the eviction type: full (all layers, equivalent to today's behavior) vs. partial (SWA layers only, full-attention retained). This avoids per-group tagging on every block hash
  • The indexer/scorer needs the model's SWA window size to determine whether a partial hit qualifies as a routable hit
  • Per-layer heterogeneous window sizes don't exist in production models today; defer support, but keep the format extensible
  • Upstream dependency: vLLM KVEvents emission needs to be HMA-aware (ref)
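
A minimal sketch of the extended removal event and how the indexer would apply it. All names here (`BlockRemovedEvent`, `EvictionType`, `apply_block_removed`, the per-hash index shape) are assumptions for illustration, not the actual vLLM KVEvents schema:

```python
from dataclasses import dataclass
from enum import Enum

class EvictionType(Enum):
    FULL = "full"        # all layer groups evicted (today's semantics)
    PARTIAL = "partial"  # SWA layers only; full-attention blocks retained

@dataclass
class BlockRemovedEvent:
    block_hashes: list                          # hashes of the affected blocks
    eviction: EvictionType = EvictionType.FULL  # default preserves current behavior

def apply_block_removed(index: dict, event: BlockRemovedEvent) -> None:
    """Update an indexer-side block table for a removal event.

    `index` maps block hash -> {"swa_cached": bool}. A full eviction drops
    the entry; a partial one only clears the SWA flag, so the block still
    counts toward full-attention prefix matching.
    """
    for h in event.block_hashes:
        if h not in index:
            continue
        if event.eviction is EvictionType.FULL:
            del index[h]
        else:
            index[h]["swa_cached"] = False
```

Defaulting `eviction` to FULL keeps the wire format backward compatible: consumers that ignore the new field see exactly today's behavior.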

This was identified by @orozery.
