
Question: How does WVA behave when KV cache offloading (e.g., LMCache / tiered caching) is enabled? #927

@hyunnnchoi


Hi, I've been reading through the WVA codebase and had a question about how it interacts with tiered KV caching setups.

Context

The V2 Saturation Analyzer computes capacity using:

  • k1 = TotalKvCapacityTokens × KvCacheThreshold, where TotalKvCapacityTokens = num_gpu_blocks × block_size (from vllm:cache_config_info)
  • TokensInUse = KvCacheUsage × TotalKvCapacityTokens (from vllm:kv_cache_usage_perc)
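To make the question concrete, here is a minimal sketch of the capacity math as I understand it from reading the analyzer. The function and variable names are mine, not the actual WVA identifiers, and the threshold/usage values are made-up examples:

```python
# Hypothetical sketch of the V2 Saturation Analyzer's KV capacity check.
# Names mirror the metrics described above, not the real WVA source.

def kv_saturated(num_gpu_blocks: int, block_size: int,
                 kv_cache_usage_perc: float, kv_cache_threshold: float) -> bool:
    """Return True if the replica looks saturated on the KV signal alone."""
    # TotalKvCapacityTokens, derived from vllm:cache_config_info
    total_kv_capacity_tokens = num_gpu_blocks * block_size
    # k1: the token budget allowed before the replica counts as saturated
    k1 = total_kv_capacity_tokens * kv_cache_threshold
    # TokensInUse, derived from vllm:kv_cache_usage_perc
    tokens_in_use = kv_cache_usage_perc * total_kv_capacity_tokens
    return tokens_in_use >= k1

# Without offloading: 60% GPU usage vs a 0.9 threshold -> not saturated.
print(kv_saturated(1000, 16, 0.60, 0.9))  # False

# With LMCache-style offloading, the GPU pool idles near-full (~97%),
# so the same check flags saturation even if lower tiers have headroom.
print(kv_saturated(1000, 16, 0.97, 0.9))  # True
```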

Both of these metrics appear to reflect GPU HBM block pool only.

Question

When KV cache offloading is active (e.g., LMCache, tiered prefix caching with host/remote tiers), the GPU KV block pool essentially becomes a Tier-1 hot cache that is intentionally kept near-full at all times — cold blocks get evicted to lower tiers rather than discarded.

In this scenario, wouldn't kv_cache_usage_perc report ~95-100% continuously, causing WVA to perceive the replica as always saturated and potentially trigger unnecessary scale-ups — even when the system still has headroom via lower tiers and latency is well within SLO?

Curious if this has been considered or if there are recommended config adjustments (e.g., kvCacheThreshold: 1.0 to effectively ignore the KV signal) for tiered caching environments.
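For clarity, the workaround I have in mind would look roughly like this; the surrounding structure and field path are guesses on my part, only `kvCacheThreshold` itself comes from the docs I've read:

```yaml
# Hypothetical config fragment -- the actual field location in WVA's
# variant-autoscaling config may differ.
saturation:
  kvCacheThreshold: 1.0  # never treat the GPU KV pool as saturated
```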

Related: #834 (Throughput Analyzer) seems like it could naturally address this by shifting from cache-utilization signals to throughput-based signals.
