What
Add the vllm:prefix_cache_hits and vllm:prefix_cache_queries Prometheus counters to match vLLM's prefix cache observability surface.
Both are token-granularity counters: `queries` increments by the total number of prompt tokens on each request, and `hits` increments by the number of those tokens found already cached. This matches vLLM v1's semantics in kv_cache_manager.py, which calls `prefix_cache_stats.record(num_tokens=request.num_tokens, num_hits=num_new_computed_tokens)`.
Why
The simulator currently tracks KV cache utilization (vllm:kv_cache_usage_perc) but has no metric for cache effectiveness. When benchmarking prefix-cache-aware scorer strategies (e.g., precise-prefix-cache-scorer vs prefix-cache-scorer), there's no way to measure whether routing decisions actually result in higher cache reuse without scraping these counters.
Both counters are needed: hits alone is uninterpretable without the queries denominator. Together they give a rolling hit rate via `rate(vllm:prefix_cache_hits[5m]) / rate(vllm:prefix_cache_queries[5m])`.
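As a worked illustration of the ratio, here is a minimal Go sketch (the `hitRate` helper is hypothetical, not part of the simulator); note the zero-denominator guard for the case where no queries have been recorded yet:

```go
package main

import "fmt"

// hitRate returns hits/queries as a fraction in [0, 1].
// Returns 0 when no queries have been recorded, analogous to a
// PromQL ratio over an empty range producing no sample.
func hitRate(hits, queries float64) float64 {
	if queries == 0 {
		return 0
	}
	return hits / queries
}

func main() {
	fmt.Println(hitRate(750, 1000)) // 0.75
}
```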
Implementation notes
The data is already computed in pkg/kv-cache/kv_cache.go:OnRequestStart():
- `len(tokens)` → maps to the `queries` increment
- `nBlocksAlreadyInCache * blockSize` → maps to the `hits` increment
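The mapping above can be sketched as a pure function; this is illustrative only, and `blockSize` / `nBlocksAlreadyInCache` here are stand-ins for the simulator's internal names, not the exact implementation in kv_cache.go:

```go
package main

import "fmt"

// blockSize is an illustrative KV block size; the real value comes
// from the simulator's configuration.
const blockSize = 16

// cacheCounterIncrements derives the two per-request counter deltas:
// queries gets the full prompt length in tokens, hits gets the number
// of tokens covered by blocks already present in the cache.
func cacheCounterIncrements(tokens []int, nBlocksAlreadyInCache int) (queries, hits int) {
	queries = len(tokens)                    // → vllm:prefix_cache_queries
	hits = nBlocksAlreadyInCache * blockSize // → vllm:prefix_cache_hits
	return queries, hits
}

func main() {
	tokens := make([]int, 48) // a 48-token prompt
	q, h := cacheCounterIncrements(tokens, 2)
	fmt.Println(q, h) // 48 32
}
```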
Wiring follows the existing pattern: add a channel + async updater goroutine in metrics.go, same as kvCacheUsageChan → kvCacheUsageUpdater().
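The channel-plus-updater pattern can be sketched as below. This is a hedged approximation: the event type and updater names are hypothetical, and plain atomic int64s stand in for the real Prometheus counters so the sketch is self-contained:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// prefixCacheEvent carries one request's counter increments, analogous
// to the values sent on the existing kvCacheUsageChan.
type prefixCacheEvent struct {
	queries int64 // total prompt tokens
	hits    int64 // tokens already cached
}

var (
	prefixCacheQueries atomic.Int64 // stand-in for vllm:prefix_cache_queries
	prefixCacheHits    atomic.Int64 // stand-in for vllm:prefix_cache_hits
)

// prefixCacheUpdater drains the channel and applies increments, so the
// request path never blocks on metrics bookkeeping. It signals done
// once the channel is closed and fully drained.
func prefixCacheUpdater(ch <-chan prefixCacheEvent, done chan<- struct{}) {
	for ev := range ch {
		prefixCacheQueries.Add(ev.queries)
		prefixCacheHits.Add(ev.hits)
	}
	close(done)
}

func main() {
	ch := make(chan prefixCacheEvent, 8)
	done := make(chan struct{})
	go prefixCacheUpdater(ch, done)

	ch <- prefixCacheEvent{queries: 48, hits: 32}
	ch <- prefixCacheEvent{queries: 16, hits: 16}
	close(ch)
	<-done

	fmt.Println(prefixCacheQueries.Load(), prefixCacheHits.Load()) // 64 48
}
```

In the real wiring the updater would call `Add` on Prometheus counters instead of atomics; the structural point is that increments flow through a buffered channel to a single goroutine.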
Only applies when --enable-kvcache is set — counters stay at zero otherwise (matching vLLM behavior when prefix caching is disabled).
Ref
- Discussed in Block-Level KV Cache Tracking for Prefix-Aware Scorer Validation #347 (comment by @mayabar)
- vLLM source: vllm/v1/metrics/loggers.py:509-516, vllm/v1/metrics/stats.py:115-142