Speculative Indexing for the KV-Cache Aware Routing #353

@vMaroon

Summary

Details are in the Google doc.

TL;DR: Feed routing decisions into the KV-cache index as speculative block entries, stored alongside engine-confirmed blocks. A side TTL cache expires entries that are never confirmed, removing them through Index.Evict() callbacks.
Block annotations distinguish speculative entries from confirmed ones, and the same mechanism generalizes to future use cases (HMA full-attention vs SWA). The PrecisePrefixCacheScorer gains PrepareDataPlugin and PreRequest hooks, which aligns it with the gateway plugin framework. Functionally, this unifies the IGW approximate scorer and the llm-d precise scorer.
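
To make the flow concrete, here is a minimal Python sketch of the idea, not the actual llm-d implementation. Everything in it is an assumption made for illustration: the `Index`, `Annotation`, and `SpeculativeTTLCache` classes, their method names, and the explicit `now` argument (used instead of a real clock so the behavior is deterministic). Only the concepts come from the summary above: speculative vs confirmed annotations, a side TTL cache, and eviction through the index.

```python
from enum import Enum


class Annotation(Enum):
    # Hypothetical block annotations; the issue only names the two states.
    SPECULATIVE = "speculative"
    CONFIRMED = "confirmed"


class Index:
    """Hypothetical stand-in for the KV-cache block index."""

    def __init__(self):
        self.entries = {}  # (block_hash, pod) -> Annotation

    def add(self, block_hash, pod, annotation):
        key = (block_hash, pod)
        # Never downgrade an engine-confirmed block to speculative.
        if (self.entries.get(key) is Annotation.CONFIRMED
                and annotation is Annotation.SPECULATIVE):
            return
        self.entries[key] = annotation

    def evict(self, block_hash, pod):
        self.entries.pop((block_hash, pod), None)

    def lookup(self, block_hash):
        return {pod: ann for (h, pod), ann in self.entries.items()
                if h == block_hash}


class SpeculativeTTLCache:
    """Side cache that tracks deadlines for unconfirmed entries only."""

    def __init__(self, index, ttl_seconds):
        self.index = index
        self.ttl = ttl_seconds
        self.deadlines = {}  # (block_hash, pod) -> expiry time

    def record_routing_decision(self, block_hashes, pod, now):
        # Called when the router picks `pod` for a request (assumed hook).
        for h in block_hashes:
            key = (h, pod)
            if self.index.entries.get(key) is Annotation.CONFIRMED:
                continue  # already confirmed, nothing to speculate about
            self.index.add(h, pod, Annotation.SPECULATIVE)
            self.deadlines[key] = now + self.ttl

    def confirm(self, block_hash, pod):
        # Called when the engine reports the block as actually cached.
        self.deadlines.pop((block_hash, pod), None)
        self.index.add(block_hash, pod, Annotation.CONFIRMED)

    def expire(self, now):
        # Evict every speculative entry whose deadline has passed.
        expired = [k for k, d in self.deadlines.items() if d <= now]
        for key in expired:
            del self.deadlines[key]
            self.index.evict(*key)
        return expired


index = Index()
cache = SpeculativeTTLCache(index, ttl_seconds=2.0)
cache.record_routing_decision(["b1", "b2"], "pod-a", now=0.0)
before_expiry = index.lookup("b2")   # speculative entry is visible to scoring
cache.confirm("b1", "pod-a")         # engine confirms only b1
expired = cache.expire(now=5.0)      # b2 was never confirmed, so it is evicted
```

In this sketch a scorer can see the speculative entry for `b2` immediately after the routing decision, while only `b1` survives past the TTL because the engine confirmed it. Keeping the deadlines in a side structure means the index itself needs no notion of time, only an eviction entry point.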

Labels

help wanted