feat: speculative indexing for PrecisePrefixCacheScorer by bongwoobak · Pull Request #659 · llm-d/llm-d-inference-scheduler

bongwoobak · 2026-02-27T16:28:16Z

Part of llm-d/llm-d-kv-cache#353

Summary

Implement PrepareDataPlugin interface for pre-computed block keys and prefix cache match info
Implement PreRequest hook to inject speculative entries into Index after routing decision
Add TTL-based speculative cache with automatic eviction on expiry
Fix PodIdentifier format to ip:port to match the KV event topic format. Speculative entries previously used the IP only, causing lookup mismatches with confirmed entries from KV events (which use the kv@ip:port@model topic format)

Details

Part 2 of Speculative Indexing (llm-d/llm-d-kv-cache#353).
Depends on the kv-cache changes in llm-d/llm-d-kv-cache#369

PrepareDataPlugin (PrepareRequestData)

Computes block keys via Indexer.ComputeBlockKeys()
Looks up KV-cache index to find which pods have matching blocks
Stores PrefixCacheMatchInfo on each endpoint for downstream consumers
Saves block keys + scores to PluginState for reuse by Score/PreRequest
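
A minimal sketch of the lookup step described above, not the code in this PR. The `BlockKey`, `Index`, `Prepared`, and `Prepare` names are hypothetical stand-ins: the real block keys come from Indexer.ComputeBlockKeys() and the real index lives in llm-d-kv-cache. It assumes the score is the number of leading blocks a pod already holds.

```go
package main

import "fmt"

// BlockKey is a stand-in for the hashed KV-block key type (assumption).
type BlockKey uint64

// Index is a hypothetical in-memory index: block key -> set of pod identifiers.
type Index map[BlockKey]map[string]bool

// Prepared is what a PrepareRequestData step could save for Score and PreRequest.
type Prepared struct {
	BlockKeys []BlockKey
	Scores    map[string]int // pod identifier -> number of leading blocks already cached
}

// Prepare walks the block keys in order and counts, per pod, the unbroken run
// of matching blocks starting at the first block.
func Prepare(idx Index, keys []BlockKey) Prepared {
	scores := map[string]int{}
	active := map[string]bool{} // pods whose prefix match is still unbroken
	for i, k := range keys {
		pods := idx[k]
		if i == 0 {
			for p := range pods {
				active[p] = true
			}
		} else {
			for p := range active {
				if !pods[p] {
					delete(active, p)
				}
			}
		}
		if len(active) == 0 {
			break
		}
		for p := range active {
			scores[p] = i + 1
		}
	}
	return Prepared{BlockKeys: keys, Scores: scores}
}

func main() {
	idx := Index{
		1: {"10.0.0.1:8000": true, "10.0.0.1:8001": true},
		2: {"10.0.0.1:8000": true},
		3: {"10.0.0.1:8000": true, "10.0.0.1:8001": true},
	}
	p := Prepare(idx, []BlockKey{1, 2, 3})
	fmt.Println(p.Scores["10.0.0.1:8000"], p.Scores["10.0.0.1:8001"]) // prints: 3 1
}
```

The second pod stops at 1 because it is missing block 2, even though it holds block 3: only an unbroken prefix counts.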

Score() Optimization

Reuses pre-computed scores from PluginState when PrepareRequestData ran first
Falls back to full computation for backward compatibility (no PrepareRequestData)
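
The reuse-or-fallback behavior can be sketched as follows. This is an illustration under assumptions, not the plugin's actual API: `PluginState`, `Save`, and `ScoreFor` are hypothetical names, and keying the state by a request ID string is assumed.

```go
package main

import "fmt"

// Scores maps a pod identifier to its prefix-cache score.
type Scores map[string]float64

// PluginState is a hypothetical per-request store keyed by request ID.
type PluginState struct {
	byRequest map[string]Scores
}

func NewPluginState() *PluginState {
	return &PluginState{byRequest: map[string]Scores{}}
}

// Save records scores computed during PrepareRequestData.
func (s *PluginState) Save(requestID string, sc Scores) {
	s.byRequest[requestID] = sc
}

// ScoreFor returns saved scores when present; otherwise it runs compute.
// The boolean reports whether the pre-computed path was taken.
func (s *PluginState) ScoreFor(requestID string, compute func() Scores) (Scores, bool) {
	if sc, ok := s.byRequest[requestID]; ok {
		return sc, true
	}
	return compute(), false
}

func main() {
	state := NewPluginState()
	state.Save("req-1", Scores{"10.0.0.1:8000": 0.75})
	full := func() Scores { return Scores{"10.0.0.1:8000": 0.5} }

	_, reused := state.ScoreFor("req-1", full)
	fmt.Println(reused) // true: PrepareRequestData ran first
	_, reused = state.ScoreFor("req-2", full)
	fmt.Println(reused) // false: falls back to full computation
}
```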

PreRequest (Speculative Entry Injection)

Adds speculative PodEntry (Annotation: "speculative") to Index immediately after routing
Handles both decode and prefill endpoints for P/D disaggregation
Uses ip:port format to match KV event topic (kv@${POD_IP}:${PORT}@${MODEL})
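
A sketch of how the speculative entries for the selected endpoints could be built. The `Endpoint` and `PodEntry` structs here are local stand-ins with assumed field names, not the real types. `net.JoinHostPort` is a standard library call. Note that it wraps IPv6 addresses in brackets, which may or may not match what the KV event publisher emits.

```go
package main

import (
	"fmt"
	"net"
)

// Endpoint is a stand-in for the endpoint metadata (Address and Port fields).
type Endpoint struct {
	Address string
	Port    string
}

// PodEntry is a stand-in for the index entry type (assumed fields).
type PodEntry struct {
	PodIdentifier string
	Annotation    string
}

const speculative = "speculative"

// SpeculativeEntries builds one speculative entry per selected endpoint,
// for example the decode endpoint plus the prefill endpoint under P/D disaggregation.
func SpeculativeEntries(eps ...Endpoint) []PodEntry {
	entries := make([]PodEntry, 0, len(eps))
	for _, ep := range eps {
		entries = append(entries, PodEntry{
			PodIdentifier: net.JoinHostPort(ep.Address, ep.Port), // ip:port
			Annotation:    speculative,
		})
	}
	return entries
}

func main() {
	decode := Endpoint{Address: "10.0.0.1", Port: "8000"}
	prefill := Endpoint{Address: "10.0.0.2", Port: "8000"}
	for _, e := range SpeculativeEntries(decode, prefill) {
		fmt.Printf("%s (%s)\n", e.PodIdentifier, e.Annotation)
	}
}
```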

TTL Cache for Speculative Entries

Tracks speculative entries per request (default TTL: 30s)
OnEviction callback auto-removes speculative entries from Index on expiry
Confirmed entries from KV events are unaffected (different Annotation value)

PodIdentifier ip:port Format Fix

The scheduler constructs PodIdentifier from endpoint metadata's Address and Port fields as ip:port
Previously only Address (IP) was used, which didn't match the PodIdentifier from KV events (kv@ip:port@model topic → PodIdentifier ip:port)
This mismatch meant speculative entries and confirmed entries had different PodIdentifiers for the same pod, so the LongestPrefixScorer treated them as different pods and scored them separately
Now both speculative and confirmed entries use the same ip:port format, enabling proper deduplication and score merging
This is especially critical for data parallelism (DP) deployments where multiple DP ranks run on the same pod IP but with different ports (e.g., 10.0.0.1:8000 for rank 0, 10.0.0.1:8001 for rank 1). Without the port, all DP ranks would be indistinguishable and collapse into a single PodIdentifier
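
To make the DP case concrete, here is a hedged sketch of extracting the pod identifier from a topic in the kv@ip:port@model format described above. `PodIdentifierFromTopic` is a hypothetical helper, not a function from the PR, and `my-model` is a placeholder model name.

```go
package main

import (
	"fmt"
	"strings"
)

// PodIdentifierFromTopic extracts the ip:port segment from a topic of the
// form kv@ip:port@model. Splitting into at most three parts keeps any "@"
// inside the model name intact.
func PodIdentifierFromTopic(topic string) (string, error) {
	parts := strings.SplitN(topic, "@", 3)
	if len(parts) != 3 || parts[0] != "kv" || parts[1] == "" {
		return "", fmt.Errorf("unexpected KV event topic %q", topic)
	}
	return parts[1], nil
}

func main() {
	// Two DP ranks on the same pod IP differ only by port.
	topics := []string{"kv@10.0.0.1:8000@my-model", "kv@10.0.0.1:8001@my-model"}
	for _, t := range topics {
		id, err := PodIdentifierFromTopic(t)
		fmt.Println(id, err)
	}
}
```

With IP-only identifiers both topics would map to 10.0.0.1, whereas the ip:port form keeps rank 0 and rank 1 distinct.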

Data Flow

PrepareRequestData --> compute blockKeys, lookup index, store scores/matchInfo
Score --> reuse pre-computed scores (or fallback to direct computation)
PreRequest --> inject speculative entries for selected pod(s), register in TTL cache
[KV event arrives] --> confirmed entry added (Annotation: "")
[TTL expires] --> speculative entry evicted, confirmed entry remains

Test plan

PrepareRequestData: block key computation + PluginState storage
Score with PrepareRequestData: pre-computed data reuse
PreRequest: speculative entries added to Index for selected endpoints
TTL eviction: automatic cleanup after expiry, confirmed entries preserved
E2E flow: PrepareData --> Score --> PreRequest --> KV event --> TTL eviction
Backward compatibility: Score without PrepareRequestData works as before
E2E cluster test with PD disaggregation (DP8): back-to-back requests with same prefix --> cache hit on 2nd request

Breaking Change: PodIdentifier Format

PodIdentifier now requires ip:port format (e.g., 10.0.0.1:8000) to match KV event topic format (kv@ip:port@model)
Previously IP-only format (10.0.0.1) was used, which caused mismatches with confirmed entries from KV events
vLLM KV events already publish topics in kv@${POD_IP}:${PORT}@${MODEL} format, so this aligns the scheduler with the engine behavior
Action required: Ensure all KV event publishers include the port in the topic string
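
One way a deployment could check for the new format before relying on it, sketched under assumptions: `ValidatePodIdentifier` is a hypothetical helper and is not part of this PR. It relies only on the standard library's `net.SplitHostPort`.

```go
package main

import (
	"fmt"
	"net"
)

// ValidatePodIdentifier reports an error when id is not in host:port form.
func ValidatePodIdentifier(id string) error {
	host, port, err := net.SplitHostPort(id)
	if err != nil {
		return fmt.Errorf("pod identifier %q is not ip:port: %w", id, err)
	}
	if host == "" || port == "" {
		return fmt.Errorf("pod identifier %q has an empty host or port", id)
	}
	return nil
}

func main() {
	for _, id := range []string{"10.0.0.1:8000", "10.0.0.1"} {
		fmt.Printf("%s -> %v\n", id, ValidatePodIdentifier(id))
	}
}
```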

bongwoobak added 3 commits February 27, 2026 20:18

feat: speculative indexing for PrecisePrefixCacheScorer

2c65474

fix: use ip:port format for PodIdentifier to match KV event topics

c65f38d

fix: split Address/Port in test endpoints to match production ip:port format

7452ca0

github-project-automation bot added this to llm-d-inference-scheduler Feb 27, 2026

github-actions bot requested review from nilig and shmuelk February 27, 2026 16:28

make SpeculativeIndexing optional

17abea8

bongwoobak wants to merge 4 commits into llm-d:main from moreh-dev:feature/speculative-indexing
