Add vllm:prefix_cache_hits and vllm:prefix_cache_queries Prometheus counters #356

@InfraWhisperer

Description

What

Add the vllm:prefix_cache_hits and vllm:prefix_cache_queries Prometheus counters to match vLLM's prefix cache observability surface.

Both are token-granularity counters: queries increments by the total number of prompt tokens on each request, and hits increments by the number of those tokens already found in the cache. This matches vLLM v1's semantics in kv_cache_manager.py, which calls prefix_cache_stats.record(num_tokens=request.num_tokens, num_hits=num_new_computed_tokens).
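A minimal sketch of the token-granularity semantics, using an illustrative Go type (the names here are not the simulator's actual types; they just mirror the record() call described above):

```go
package main

import "fmt"

// prefixCacheStats models the two token-granularity counters; the type and
// field names are illustrative, not the simulator's actual code.
type prefixCacheStats struct {
	queries uint64 // total prompt tokens looked up against the prefix cache
	hits    uint64 // prompt tokens that were already cached
}

// record mirrors the vLLM v1 semantics: queries grows by the full prompt
// length, hits by the length of the already-cached prefix.
func (s *prefixCacheStats) record(numTokens, numHits uint64) {
	s.queries += numTokens
	s.hits += numHits
}

func main() {
	var s prefixCacheStats
	s.record(100, 0)  // cold request: no cached prefix
	s.record(100, 64) // warm request: 64-token prefix already cached
	fmt.Println(s.queries, s.hits) // 200 64
}
```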

Why

The simulator currently tracks KV cache utilization (vllm:kv_cache_usage_perc) but has no metric for cache effectiveness. When benchmarking prefix-cache-aware scorer strategies (e.g., precise-prefix-cache-scorer vs prefix-cache-scorer), there's no way to measure whether routing decisions actually result in higher cache reuse without scraping these counters.

Both counters are needed: hits alone is uninterpretable without the queries denominator. Together they give a rolling hit rate via rate(vllm:prefix_cache_hits[5m]) / rate(vllm:prefix_cache_queries[5m]).

Implementation notes

The data is already computed in pkg/kv-cache/kv_cache.go:OnRequestStart():

  • len(tokens) → maps to queries increment
  • nBlocksAlreadyInCache * blockSize → maps to hits increment
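The mapping above can be sketched as a small helper (illustrative only; the parameter names follow the description of OnRequestStart(), but the surrounding code is assumed):

```go
package main

import "fmt"

// prefixCacheIncrements shows how the values already computed in
// OnRequestStart() would map onto the two counter increments. The names
// (tokens, nBlocksAlreadyInCache, blockSize) follow the notes above; this
// is a sketch, not the simulator's actual function.
func prefixCacheIncrements(tokens []int, nBlocksAlreadyInCache, blockSize int) (queriesInc, hitsInc int) {
	queriesInc = len(tokens)                    // every prompt token counts as a query
	hitsInc = nBlocksAlreadyInCache * blockSize // cached blocks, converted to tokens
	return
}

func main() {
	tokens := make([]int, 100) // a 100-token prompt
	q, h := prefixCacheIncrements(tokens, 4, 16)
	fmt.Println(q, h) // 100 64
}
```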

Wiring follows the existing pattern: add a channel + async updater goroutine in metrics.go, same as kvCacheUsageChan and kvCacheUsageUpdater().
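A minimal sketch of that channel + updater pattern, assuming a hypothetical prefixCacheEvent type (plain ints stand in for the Prometheus counters to keep it self-contained):

```go
package main

import (
	"fmt"
	"sync"
)

// prefixCacheEvent carries one request's increments from the KV-cache code
// to the metrics updater; the type and field names are illustrative.
type prefixCacheEvent struct {
	queries int
	hits    int
}

// runUpdater mirrors the kvCacheUsageUpdater pattern: a buffered channel
// feeds an async goroutine that applies increments to the counters.
func runUpdater(events []prefixCacheEvent) (queries, hits int) {
	ch := make(chan prefixCacheEvent, 16)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for ev := range ch {
			queries += ev.queries
			hits += ev.hits
		}
	}()
	for _, ev := range events {
		ch <- ev // sender side: OnRequestStart would push here
	}
	close(ch)
	wg.Wait()
	return
}

func main() {
	q, h := runUpdater([]prefixCacheEvent{
		{queries: 100, hits: 0},  // cold request
		{queries: 100, hits: 64}, // warm request with a cached prefix
	})
	fmt.Println(q, h) // 200 64
}
```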

Only applies when --enable-kvcache is set — counters stay at zero otherwise (matching vLLM behavior when prefix caching is disabled).

Ref
