[metrics] Add SWA prefix-cache truncation counter#29528
[metrics] Add SWA prefix-cache truncation counter#29528brucechanglongxu wants to merge 1 commit into
Conversation
For hybrid sliding-window-attention models, a request can match a long logical prefix in the SWA radix tree while the usable hit is truncated to the last point still covered by sliding_window_size of live (non-tombstoned) SWA KV. Only the aggregate cache-hit rate was observable, so hit-rate drops could not be attributed to SWA eviction. SWARadixCache._match_prefix_helper already computes both quantities: it walks the full logical match into `value` but only accepts best_value_len entries as a usable hit. Surface the dropped remainder as a Prometheus counter (sglang:swa_prefix_truncated_tokens_total), recorded at the match boundary and mirroring the existing evicted_tokens_total metric. Metrics-only; the hot path early-returns when no truncation occurs.
There was a problem hiding this comment.
Code Review
This pull request introduces a new metric to track SWA prefix truncation, which counts the prefix tokens matched in the radix tree but dropped from the cache hit because the sliding-window attention KV had already been evicted. This is implemented across BasePrefixCache, SWARadixCache, and MetricsCollector, along with a corresponding unit test. A review comment identifies a bug in the new unit test where assigning a mock metrics collector before calling tree.evict will cause an AttributeError due to missing eviction metrics methods on the mock.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| spy = _SpyCollector() | ||
| tree.metrics_collector = spy | ||
|
|
||
| for token_ids in ( | ||
| [1, 2, 3], | ||
| [1, 2, 3, 4, 5, 6, 7], | ||
| [10, 11, 12], | ||
| [1, 2, 3, 4, 5, 60, 70], | ||
| ): | ||
| kv_indices = allocator.alloc(len(token_ids)) | ||
| key = RadixKey(array("q", token_ids)) | ||
| tree.insert(InsertParams(key=key, value=kv_indices[: len(key)])) | ||
|
|
||
| # Evict SWA capacity (tombstoning the shared [1..5] prefix) while keeping | ||
| # the full-attention KV alive. | ||
| tree.evict(EvictParams(num_tokens=1, swa_num_tokens=0)) | ||
| tree.evict(EvictParams(num_tokens=0, swa_num_tokens=1)) | ||
| tree.evict(EvictParams(num_tokens=1, swa_num_tokens=2)) | ||
|
|
||
| # [1, 2, 3, 4, 5] still matches in the tree, but the sliding window is no | ||
| # longer covered by live SWA KV, so the usable hit is truncated to zero. | ||
| result = tree.match_prefix( | ||
| MatchPrefixParams(key=RadixKey(array("q", [1, 2, 3, 4, 5]))) | ||
| ) |
There was a problem hiding this comment.
The newly added unit test will crash with an AttributeError during the tree.evict calls. This is because tree.metrics_collector is set to spy (an instance of _SpyCollector), which does not implement the eviction metrics methods (observe_eviction_duration and increment_eviction_num_tokens) called by tree.evict via update_eviction_metrics.
To fix this, we should only assign tree.metrics_collector = spy after the eviction steps are completed, right before calling tree.match_prefix.
for token_ids in (
[1, 2, 3],
[1, 2, 3, 4, 5, 6, 7],
[10, 11, 12],
[1, 2, 3, 4, 5, 60, 70],
):
kv_indices = allocator.alloc(len(token_ids))
key = RadixKey(array("q", token_ids))
tree.insert(InsertParams(key=key, value=kv_indices[: len(key)]))
# Evict SWA capacity (tombstoning the shared [1..5] prefix) while keeping
# the full-attention KV alive.
tree.evict(EvictParams(num_tokens=1, swa_num_tokens=0))
tree.evict(EvictParams(num_tokens=0, swa_num_tokens=1))
tree.evict(EvictParams(num_tokens=1, swa_num_tokens=2))
spy = _SpyCollector()
tree.metrics_collector = spy
# [1, 2, 3, 4, 5] still matches in the tree, but the sliding window is no
# longer covered by live SWA KV, so the usable hit is truncated to zero.
result = tree.match_prefix(
MatchPrefixParams(key=RadixKey(array("q", [1, 2, 3, 4, 5])))
)
Motivation
For hybrid sliding-window-attention (SWA) models, the prefix-cache hit rate can degrade silently when SWA KV gets evicted. A request may match a long logical prefix in the SWA radix tree, but the usable hit is truncated back to the last point still covered by
sliding_window_sizeof live (non-tombstoned) SWA KV. Right now only the aggregate cache-hit rate is observable, so when the hit rate drops there's no way to tell whether it's SWA eviction or just genuinely non-shared prefixes.SWARadixCache._match_prefix_helperalready computes both quantities. It walks the full logical match intovaluebut only accepts the firstbest_value_lenentries as a usable hit. The truncated remainder was just never surfaced anywhere.Modifications
observability/metrics_collector.py: add aRadixCacheMetricsCollectorcountersglang:swa_prefix_truncated_tokens_totalandincrement_swa_prefix_truncated_num_tokens, following the existingsglang:evicted_tokens_totalmetric.mem_cache/base_prefix_cache.py: addupdate_swa_prefix_truncation_metrics, followingupdate_eviction_metrics.mem_cache/swa_radix_cache.py: record the truncated token count at thematch_prefixboundary. The helper early-returns when no collector is attached or when the match wasn't truncated (best_value_len == len(value)), so the common path is untouched.test/registered/unit/mem_cache/test_swa_unittest.py: add a unit test that evicts SWA capacity to tombstone a shared prefix and asserts the counter records the dropped tokens.This is observability only, so there's no change to model outputs or KV correctness. A natural follow-up would be extending the same signal to the unified radix cache SWA path.
Accuracy Tests
N/A, this is a metrics-only change with no model output or KV correctness impact.
Speed Tests and Profiling
No measurable impact. The recording helper early-returns when no collector is attached or when the match isn't truncated (the common case), so it adds at most one integer comparison to the match path.
Checklist