Overview
Prefix-cache-aware routing currently assumes uniform attention across all layers: a block is either cached or not. With HMA (hybrid models mixing full-attention and sliding-window layers), this assumption breaks down:
- vLLM can evict SWA blocks independently of full-attention blocks, creating partial hits: full-attention prefix is cached but SWA blocks for that range are gone
- A partial hit is not a miss. The last N blocks (where N = window size in blocks) need SWA state, but earlier blocks don't. The indexer must distinguish between full hits, partial hits (full-attention present, SWA evicted outside the window), and misses
- The `BlockStored`/`BlockRemoved` event format and the indexer's scoring logic are unaware of per-group block identity, so the scheduler cannot reason about this today
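The three-way distinction above can be sketched as a small classifier. This is a hypothetical illustration, not the actual indexer code: `classify_prefix_hit`, the block-index sets, and the `HitType` enum are all assumed names, and block indices `0..num_prefix_blocks-1` stand in for the matched prefix.

```python
from enum import Enum


class HitType(Enum):
    MISS = "miss"
    PARTIAL = "partial"  # full-attention cached; SWA evicted only outside the window
    FULL = "full"


def classify_prefix_hit(num_prefix_blocks: int,
                        full_attn_cached: set[int],
                        swa_cached: set[int],
                        window_blocks: int) -> HitType:
    """Hypothetical sketch: decide whether a matched prefix is a full hit,
    a routable partial hit, or a miss, given per-group cached block sets."""
    prefix = range(num_prefix_blocks)
    # The full-attention prefix must be intact for any kind of hit.
    if not all(b in full_attn_cached for b in prefix):
        return HitType.MISS
    if all(b in swa_cached for b in prefix):
        return HitType.FULL
    # Only the last `window_blocks` blocks need SWA state; eviction of SWA
    # blocks before the window does not invalidate the prefix.
    tail = range(max(0, num_prefix_blocks - window_blocks), num_prefix_blocks)
    if all(b in swa_cached for b in tail):
        return HitType.PARTIAL
    return HitType.MISS
```

For example, with an 8-block prefix and a 2-block window, losing SWA state for blocks 0-5 still yields a routable partial hit, while losing block 7's SWA state does not.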
Proposed Direction
- Keep `BlockStored` as-is: a stored block implies all layer groups are cached (full-attention + SWA)
- Extend `BlockRemoved` to indicate eviction type: full (all layers, equivalent to today's behavior) vs partial (SWA layers only, full-attention retained). This avoids per-group tagging on every block hash
- The indexer/scorer needs the model's SWA window size to determine whether a partial hit qualifies as a routable hit
- Per-layer heterogeneous window sizes don't exist in production models today — defer support, but keep the format extensible
- Upstream dependency: vLLM KVEvents emission needs to be HMA-aware (ref)
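A minimal sketch of the extended event and its indexer-side handling, assuming the direction above. The field names and `EvictionType` enum are illustrative, not vLLM's actual KVEvents schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class EvictionType(Enum):
    FULL = "full"        # all layer groups evicted (today's behavior)
    PARTIAL = "partial"  # SWA layers only; full-attention state retained


@dataclass
class BlockRemoved:
    # Hypothetical event shape for illustration only.
    block_hashes: list[int] = field(default_factory=list)
    eviction: EvictionType = EvictionType.FULL


def apply_event(full_attn_cached: set[int],
                swa_cached: set[int],
                ev: BlockRemoved) -> None:
    """Indexer-side bookkeeping: a partial eviction drops only SWA state,
    so the full-attention prefix remains scorable."""
    for h in ev.block_hashes:
        swa_cached.discard(h)
        if ev.eviction is EvictionType.FULL:
            full_attn_cached.discard(h)
```

Because the eviction type is carried once per event rather than per block hash, the wire format stays close to today's, while leaving room to extend it later (e.g. per-layer-group removal) if heterogeneous window sizes ever appear.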
This was identified by @orozery.