
Question: How does WVA behave when KV cache offloading (e.g., LMCache / tiered caching) is enabled? #927

@hyunnnchoi


Hi, I've been reading through the WVA codebase and had a question about how it interacts with tiered KV caching setups.

Context

The V2 Saturation Analyzer computes capacity using:

  • k1 = TotalKvCapacityTokens × KvCacheThreshold, where TotalKvCapacityTokens = num_gpu_blocks × block_size (from vllm:cache_config_info)
  • TokensInUse = KvCacheUsage × TotalKvCapacityTokens (from vllm:kv_cache_usage_perc)
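To make the question concrete, here is a minimal sketch of the capacity math as I understand it from reading the analyzer. The function and variable names are mine, not the actual WVA identifiers, and the threshold/usage values are made-up examples:

```python
# Hypothetical sketch of the V2 Saturation Analyzer's KV capacity check.
# Names mirror the metrics described above, not the real WVA source.

def kv_saturated(num_gpu_blocks: int, block_size: int,
                 kv_cache_usage_perc: float, kv_cache_threshold: float) -> bool:
    """Return True if the replica looks saturated on the KV signal alone."""
    # TotalKvCapacityTokens, derived from vllm:cache_config_info
    total_kv_capacity_tokens = num_gpu_blocks * block_size
    # k1: the token budget allowed before the replica counts as saturated
    k1 = total_kv_capacity_tokens * kv_cache_threshold
    # TokensInUse, derived from vllm:kv_cache_usage_perc
    tokens_in_use = kv_cache_usage_perc * total_kv_capacity_tokens
    return tokens_in_use >= k1

# Without offloading: 60% GPU usage vs a 0.9 threshold -> not saturated.
print(kv_saturated(1000, 16, 0.60, 0.9))  # False

# With LMCache-style offloading, the GPU pool idles near-full (~97%),
# so the same check flags saturation even if lower tiers have headroom.
print(kv_saturated(1000, 16, 0.97, 0.9))  # True
```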

Both of these metrics appear to reflect GPU HBM block pool only.

Question

When KV cache offloading is active (e.g., LMCache, tiered prefix caching with host/remote tiers), the GPU KV block pool essentially becomes a Tier-1 hot cache that is intentionally kept near-full at all times — cold blocks get evicted to lower tiers rather than discarded.

In this scenario, wouldn't kv_cache_usage_perc report ~95-100% continuously, causing WVA to perceive the replica as always saturated and potentially trigger unnecessary scale-ups — even when the system still has headroom via lower tiers and latency is well within SLO?

Curious if this has been considered or if there are recommended config adjustments (e.g., kvCacheThreshold: 1.0 to effectively ignore the KV signal) for tiered caching environments.
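For clarity, the workaround I have in mind would look roughly like this; the surrounding structure and field path are guesses on my part, only `kvCacheThreshold` itself comes from the docs I've read:

```yaml
# Hypothetical config fragment -- the actual field location in WVA's
# variant-autoscaling config may differ.
saturation:
  kvCacheThreshold: 1.0  # never treat the GPU KV pool as saturated
```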

Related: #834 (Throughput Analyzer) seems like it could naturally address this by shifting from cache-utilization signals to throughput-based signals.
