feat: speculative indexing for PrecisePrefixCacheScorer #659

Open
bongwoobak wants to merge 4 commits into llm-d:main from moreh-dev:feature/speculative-indexing

bongwoobak commented Feb 27, 2026

Part of llm-d/llm-d-kv-cache#353

Summary

  • Implement PrepareDataPlugin interface for pre-computed block keys and prefix cache match info
  • Implement PreRequest hook to inject speculative entries into Index after routing decision
  • Add TTL-based speculative cache with automatic eviction on expiry
  • Fix PodIdentifier format to ip:port to match the KV event topic format. Speculative entries previously used an IP-only identifier, which caused lookup mismatches with confirmed entries from KV events (these use the kv@ip:port@model topic format)

Details

Part 2 of Speculative Indexing (llm-d/llm-d-kv-cache#353).
Depends on the kv-cache changes in llm-d/llm-d-kv-cache#369.

PrepareDataPlugin (PrepareRequestData)

  • Computes block keys via Indexer.ComputeBlockKeys()
  • Looks up KV-cache index to find which pods have matching blocks
  • Stores PrefixCacheMatchInfo on each endpoint for downstream consumers
  • Saves block keys + scores to PluginState for reuse by Score/PreRequest
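
To illustrate the lookup step, here is a minimal sketch of counting, per pod, how many consecutive blocks from the start of the request are already indexed. The `BlockKey`, `Index`, and `PrefixMatches` names are hypothetical stand-ins and not the llm-d-kv-cache API; the longest-consecutive-prefix rule is an assumption about how the match info is derived.

```go
package main

// BlockKey is a hypothetical stand-in for a computed block key.
type BlockKey uint64

// Index is a hypothetical in-memory index: block key -> set of pod identifiers.
type Index map[BlockKey]map[string]struct{}

// PrefixMatches returns, for each pod, the number of consecutive blocks
// (starting at keys[0]) that the index records for that pod.
func PrefixMatches(idx Index, keys []BlockKey) map[string]int {
	matches := make(map[string]int)
	if len(keys) == 0 {
		return matches
	}
	// Only pods holding the first block can have a non-empty prefix match.
	active := make(map[string]struct{})
	for pod := range idx[keys[0]] {
		active[pod] = struct{}{}
		matches[pod] = 1
	}
	for _, k := range keys[1:] {
		holders := idx[k]
		for pod := range active {
			if _, ok := holders[pod]; ok {
				matches[pod]++
			} else {
				delete(active, pod) // the prefix is broken for this pod
			}
		}
		if len(active) == 0 {
			break
		}
	}
	return matches
}
```

In this sketch a pod that holds blocks 1 and 3 but not 2 gets a match length of 1, because only the unbroken prefix counts.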

Score() Optimization

  • Reuses pre-computed scores from PluginState when PrepareRequestData ran first
  • Falls back to full computation when PrepareRequestData has not run, preserving backward compatibility
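
The reuse-or-fallback behavior could look roughly like the sketch below. `PluginState` here is a simplified, hypothetical per-request store (keyed by request ID), and `Score` takes the fallback computation as a function argument; neither reflects the actual scheduler types.

```go
package main

// PluginState is a hypothetical per-request store of pre-computed scores.
type PluginState struct {
	scores map[string]map[string]float64 // requestID -> pod -> score
}

func NewPluginState() *PluginState {
	return &PluginState{scores: make(map[string]map[string]float64)}
}

// Put records scores computed during the prepare step.
func (s *PluginState) Put(requestID string, scores map[string]float64) {
	s.scores[requestID] = scores
}

// Get returns the stored scores for a request, if any.
func (s *PluginState) Get(requestID string) (map[string]float64, bool) {
	v, ok := s.scores[requestID]
	return v, ok
}

// Score reuses stored scores when the prepare step ran for this request and
// otherwise falls back to computing them directly.
func Score(s *PluginState, requestID string, compute func() map[string]float64) (scores map[string]float64, reused bool) {
	if cached, ok := s.Get(requestID); ok {
		return cached, true
	}
	return compute(), false
}
```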

PreRequest (Speculative Entry Injection)

  • Adds speculative PodEntry (Annotation: "speculative") to Index immediately after routing
  • Handles both decode and prefill endpoints for P/D disaggregation
  • Uses ip:port format to match KV event topic (kv@${POD_IP}:${PORT}@${MODEL})

TTL Cache for Speculative Entries

  • Tracks speculative entries per request (default TTL: 30s)
  • OnEviction callback auto-removes speculative entries from Index on expiry
  • Confirmed entries from KV events are unaffected (different Annotation value)

PodIdentifier ip:port Format Fix

  • The scheduler constructs PodIdentifier from endpoint metadata's Address and Port fields as ip:port
  • Previously only Address (IP) was used, which didn't match the PodIdentifier from KV events (kv@ip:port@model topic → PodIdentifier ip:port)
  • This mismatch meant speculative entries and confirmed entries had different PodIdentifiers for the same pod, so the LongestPrefixScorer treated them as different pods and scored them separately
  • Now both speculative and confirmed entries use the same ip:port format, enabling proper deduplication and score merging
  • This is especially critical for data parallelism (DP) deployments where multiple DP ranks run on the same pod IP but with different ports (e.g., 10.0.0.1:8000 for rank 0, 10.0.0.1:8001 for rank 1). Without the port, all DP ranks would be indistinguishable and collapse into a single PodIdentifier
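
The two ways of arriving at the same identifier can be sketched as follows. The function names are hypothetical; only the ip:port identifier and the kv@ip:port@model topic layout come from this PR. Note that `net.JoinHostPort` wraps IPv6 addresses in brackets, which is an assumption about how such addresses should be rendered.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// PodIdentifier builds the ip:port identifier from endpoint metadata.
func PodIdentifier(address string, port int) string {
	return net.JoinHostPort(address, strconv.Itoa(port))
}

// PodIdentifierFromTopic extracts the ip:port identifier and the model name
// from a topic of the form kv@ip:port@model.
func PodIdentifierFromTopic(topic string) (podID, model string, err error) {
	parts := strings.SplitN(topic, "@", 3)
	if len(parts) != 3 || parts[0] != "kv" {
		return "", "", fmt.Errorf("unexpected topic format: %q", topic)
	}
	// Reject IP-only topics, which would not match scheduler-side identifiers.
	if _, _, err := net.SplitHostPort(parts[1]); err != nil {
		return "", "", fmt.Errorf("topic %q lacks ip:port: %w", topic, err)
	}
	return parts[1], parts[2], nil
}
```

With this, `PodIdentifier("10.0.0.1", 8001)` and the identifier parsed from `kv@10.0.0.1:8001@some-model` are the same string, while two DP ranks on the same IP stay distinct.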

Data Flow

  1. PrepareRequestData --> compute blockKeys, lookup index, store scores/matchInfo
  2. Score --> reuse pre-computed scores (or fallback to direct computation)
  3. PreRequest --> inject speculative entries for selected pod(s), register in TTL cache
  4. [KV event arrives] --> confirmed entry added (Annotation: "")
  5. [TTL expires] --> speculative entry evicted, confirmed entry remains
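
Steps 3 to 5 can be illustrated with a toy index in which speculative and confirmed entries for the same pod coexist and are removed independently. The `Index` type and its methods are hypothetical, and the `PodEntry` field names are assumptions based on the description above rather than the real struct.

```go
package main

import "sort"

// PodEntry is an index entry tagged with an Annotation:
// "speculative" for injected entries, "" for entries confirmed by KV events.
type PodEntry struct {
	PodIdentifier string
	Annotation    string
}

// Index is a hypothetical block-key index holding a set of entries per key.
type Index struct {
	blocks map[uint64]map[PodEntry]struct{}
}

func NewIndex() *Index {
	return &Index{blocks: make(map[uint64]map[PodEntry]struct{})}
}

// Add records entry e for every block key.
func (idx *Index) Add(keys []uint64, e PodEntry) {
	for _, k := range keys {
		if idx.blocks[k] == nil {
			idx.blocks[k] = make(map[PodEntry]struct{})
		}
		idx.blocks[k][e] = struct{}{}
	}
}

// Evict removes exactly entry e (same identifier and annotation) from each key.
func (idx *Index) Evict(keys []uint64, e PodEntry) {
	for _, k := range keys {
		delete(idx.blocks[k], e)
		if len(idx.blocks[k]) == 0 {
			delete(idx.blocks, k)
		}
	}
}

// Pods returns the distinct pod identifiers holding a block, sorted, so that
// speculative and confirmed entries for one pod count once.
func (idx *Index) Pods(key uint64) []string {
	seen := make(map[string]struct{})
	for e := range idx.blocks[key] {
		seen[e.PodIdentifier] = struct{}{}
	}
	pods := make([]string, 0, len(seen))
	for p := range seen {
		pods = append(pods, p)
	}
	sort.Strings(pods)
	return pods
}
```

Because eviction matches on the full entry, expiring the speculative entry leaves the confirmed one in place, which is the behavior step 5 describes.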

Test plan

  • PrepareRequestData: block key computation + PluginState storage
  • Score with PrepareRequestData: pre-computed data reuse
  • PreRequest: speculative entries added to Index for selected endpoints
  • TTL eviction: automatic cleanup after expiry, confirmed entries preserved
  • E2E flow: PrepareData --> Score --> PreRequest --> KV event --> TTL eviction
  • Backward compatibility: Score without PrepareRequestData works as before
  • E2E cluster test with P/D disaggregation (DP8): back-to-back requests with same prefix --> cache hit on 2nd request

Breaking Change: PodIdentifier Format

  • PodIdentifier now requires ip:port format (e.g., 10.0.0.1:8000) to match KV event topic format (kv@ip:port@model)
  • Previously IP-only format (10.0.0.1) was used, which caused mismatches with confirmed entries from KV events
  • vLLM KV events already publish topics in kv@${POD_IP}:${PORT}@${MODEL} format, so this aligns the scheduler with the engine behavior
  • Action required: Ensure all KV event publishers include the port in the topic string
