Overview
Prefix-cache-aware routing currently assumes uniform attention across all layers: a block is either cached or not. With HMA (hybrid models mixing full-attention and sliding-window layers), this assumption breaks down:
- vLLM can evict SWA blocks independently of full-attention blocks, creating partial hits: full-attention prefix is cached but SWA blocks for that range are gone
- A partial hit is not a miss. The last N blocks (where N = window size in blocks) need SWA state, but earlier blocks don't. The indexer must distinguish between full hits, partial hits (full-attention present, SWA evicted outside the window), and misses
- The `BlockStored`/`BlockRemoved` event format and the indexer's scoring logic are unaware of per-group block identity, so the scheduler cannot reason about this today
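The three-way distinction above can be sketched as a small classifier. This is a hypothetical illustration, not the actual indexer code: `classify_prefix_hit`, the block-index sets, and the `HitType` enum are all assumed names, and block indices `0..num_prefix_blocks-1` stand in for the matched prefix.

```python
from enum import Enum


class HitType(Enum):
    MISS = "miss"
    PARTIAL = "partial"  # full-attention cached; SWA evicted only outside the window
    FULL = "full"


def classify_prefix_hit(num_prefix_blocks: int,
                        full_attn_cached: set[int],
                        swa_cached: set[int],
                        window_blocks: int) -> HitType:
    """Hypothetical sketch: decide whether a matched prefix is a full hit,
    a routable partial hit, or a miss, given per-group cached block sets."""
    prefix = range(num_prefix_blocks)
    # The full-attention prefix must be intact for any kind of hit.
    if not all(b in full_attn_cached for b in prefix):
        return HitType.MISS
    if all(b in swa_cached for b in prefix):
        return HitType.FULL
    # Only the last `window_blocks` blocks need SWA state; eviction of SWA
    # blocks before the window does not invalidate the prefix.
    tail = range(max(0, num_prefix_blocks - window_blocks), num_prefix_blocks)
    if all(b in swa_cached for b in tail):
        return HitType.PARTIAL
    return HitType.MISS
```

For example, with an 8-block prefix and a 2-block window, losing SWA state for blocks 0-5 still yields a routable partial hit, while losing block 7's SWA state does not.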
Proposed Direction
- Keep `BlockStored` as-is: a stored block implies all layer groups are cached (full-attention + SWA)
- Extend `BlockRemoved` to indicate eviction type: full (all layers, equivalent to today's behavior) vs partial (SWA layers only, full-attention retained). This avoids per-group tagging on every block hash
- The indexer/scorer needs the model's SWA window size to determine whether a partial hit qualifies as a routable hit
- Per-layer heterogeneous window sizes don't exist in production models today — defer support, but keep the format extensible
- Upstream dependency: vLLM KVEvents emission needs to be HMA-aware (ref)
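A minimal sketch of the extended event and its indexer-side handling, assuming the direction above. The field names and `EvictionType` enum are illustrative, not vLLM's actual KVEvents schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class EvictionType(Enum):
    FULL = "full"        # all layer groups evicted (today's behavior)
    PARTIAL = "partial"  # SWA layers only; full-attention state retained


@dataclass
class BlockRemoved:
    # Hypothetical event shape for illustration only.
    block_hashes: list[int] = field(default_factory=list)
    eviction: EvictionType = EvictionType.FULL


def apply_event(full_attn_cached: set[int],
                swa_cached: set[int],
                ev: BlockRemoved) -> None:
    """Indexer-side bookkeeping: a partial eviction drops only SWA state,
    so the full-attention prefix remains scorable."""
    for h in ev.block_hashes:
        swa_cached.discard(h)
        if ev.eviction is EvictionType.FULL:
            full_attn_cached.discard(h)
```

Because the eviction type is carried once per event rather than per block hash, the wire format stays close to today's, while leaving room to extend it later (e.g. per-layer-group removal) if heterogeneous window sizes ever appear.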
This was identified by @orozery.