[metrics] Add SWA prefix-cache truncation counter by brucechanglongxu · Pull Request #29528 · sgl-project/sglang

brucechanglongxu · 2026-06-27T20:01:00Z

Motivation

For hybrid sliding-window-attention (SWA) models, the prefix-cache hit rate can degrade silently when SWA KV gets evicted. A request may match a long logical prefix in the SWA radix tree, but the usable hit is truncated back to the last point still covered by sliding_window_size of live (non-tombstoned) SWA KV. Right now only the aggregate cache-hit rate is observable, so when the hit rate drops there's no way to tell whether it's SWA eviction or just genuinely non-shared prefixes.

SWARadixCache._match_prefix_helper already computes both quantities. It walks the full logical match into value but only accepts the first best_value_len entries as a usable hit. The truncated remainder was just never surfaced anywhere.

Modifications

observability/metrics_collector.py: add a RadixCacheMetricsCollector counter sglang:swa_prefix_truncated_tokens_total and increment_swa_prefix_truncated_num_tokens, following the existing sglang:evicted_tokens_total metric.
mem_cache/base_prefix_cache.py: add update_swa_prefix_truncation_metrics, following update_eviction_metrics.
mem_cache/swa_radix_cache.py: record the truncated token count at the match_prefix boundary. The helper early-returns when no collector is attached or when the match wasn't truncated (best_value_len == len(value)), so the common path is untouched.
test/registered/unit/mem_cache/test_swa_unittest.py: add a unit test that evicts SWA capacity to tombstone a shared prefix and asserts the counter records the dropped tokens.

This is observability only, so there's no change to model outputs or KV correctness. A natural follow-up would be extending the same signal to the unified radix cache SWA path.

Accuracy Tests

N/A, this is a metrics-only change with no model output or KV correctness impact.

Speed Tests and Profiling

No measurable impact. The recording helper early-returns when no collector is attached or when the match isn't truncated (the common case), so it adds at most one integer comparison to the match path.

Checklist

Format your code according to the pre-commit configuration (black, isort, ruff, codespell all pass).
Add unit tests.
Update documentation (not applicable).
Provide accuracy and speed benchmark results (not applicable, metrics-only).
Follow the SGLang code style guidance.

For hybrid sliding-window-attention models, a request can match a long logical prefix in the SWA radix tree while the usable hit is truncated to the last point still covered by sliding_window_size of live (non-tombstoned) SWA KV. Only the aggregate cache-hit rate was observable, so hit-rate drops could not be attributed to SWA eviction. SWARadixCache._match_prefix_helper already computes both quantities: it walks the full logical match into `value` but only accepts best_value_len entries as a usable hit. Surface the dropped remainder as a Prometheus counter (sglang:swa_prefix_truncated_tokens_total), recorded at the match boundary and mirroring the existing evicted_tokens_total metric. Metrics-only; the hot path early-returns when no truncation occurs.

gemini-code-assist

Code Review

This pull request introduces a new metric to track SWA prefix truncation, which counts the prefix tokens matched in the radix tree but dropped from the cache hit because the sliding-window attention KV had already been evicted. This is implemented across BasePrefixCache, SWARadixCache, and MetricsCollector, along with a corresponding unit test. A review comment identifies a bug in the new unit test where assigning a mock metrics collector before calling tree.evict will cause an AttributeError due to missing eviction metrics methods on the mock.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-27T20:02:18Z

+        spy = _SpyCollector()
+        tree.metrics_collector = spy
+
+        for token_ids in (
+            [1, 2, 3],
+            [1, 2, 3, 4, 5, 6, 7],
+            [10, 11, 12],
+            [1, 2, 3, 4, 5, 60, 70],
+        ):
+            kv_indices = allocator.alloc(len(token_ids))
+            key = RadixKey(array("q", token_ids))
+            tree.insert(InsertParams(key=key, value=kv_indices[: len(key)]))
+
+        # Evict SWA capacity (tombstoning the shared [1..5] prefix) while keeping
+        # the full-attention KV alive.
+        tree.evict(EvictParams(num_tokens=1, swa_num_tokens=0))
+        tree.evict(EvictParams(num_tokens=0, swa_num_tokens=1))
+        tree.evict(EvictParams(num_tokens=1, swa_num_tokens=2))
+
+        # [1, 2, 3, 4, 5] still matches in the tree, but the sliding window is no
+        # longer covered by live SWA KV, so the usable hit is truncated to zero.
+        result = tree.match_prefix(
+            MatchPrefixParams(key=RadixKey(array("q", [1, 2, 3, 4, 5])))
+        )


The newly added unit test will crash with an AttributeError during the tree.evict calls. This is because tree.metrics_collector is set to spy (an instance of _SpyCollector), which does not implement the eviction metrics methods (observe_eviction_duration and increment_eviction_num_tokens) called by tree.evict via update_eviction_metrics.

To fix this, we should only assign tree.metrics_collector = spy after the eviction steps are completed, right before calling tree.match_prefix.

for token_ids in ( [1, 2, 3], [1, 2, 3, 4, 5, 6, 7], [10, 11, 12], [1, 2, 3, 4, 5, 60, 70], ): kv_indices = allocator.alloc(len(token_ids)) key = RadixKey(array("q", token_ids)) tree.insert(InsertParams(key=key, value=kv_indices[: len(key)])) # Evict SWA capacity (tombstoning the shared [1..5] prefix) while keeping # the full-attention KV alive. tree.evict(EvictParams(num_tokens=1, swa_num_tokens=0)) tree.evict(EvictParams(num_tokens=0, swa_num_tokens=1)) tree.evict(EvictParams(num_tokens=1, swa_num_tokens=2)) spy = _SpyCollector() tree.metrics_collector = spy # [1, 2, 3, 4, 5] still matches in the tree, but the sliding window is no # longer covered by live SWA KV, so the usable hit is truncated to zero. result = tree.match_prefix( MatchPrefixParams(key=RadixKey(array("q", [1, 2, 3, 4, 5]))) )

brucechanglongxu requested review from Ying1123, alphabetc1, fzyzcjy, hanming-lu, hnyls2002, hzh0425, ispobock, merrymercy, sufeng-buaa, xiezhq-hermann and yizhang2077 as code owners June 27, 2026 20:01

gemini-code-assist Bot reviewed Jun 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[metrics] Add SWA prefix-cache truncation counter#29528

[metrics] Add SWA prefix-cache truncation counter#29528
brucechanglongxu wants to merge 1 commit into
sgl-project:mainfrom
brucechanglongxu:swa-prefix-truncation-metric

brucechanglongxu commented Jun 27, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

brucechanglongxu commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

brucechanglongxu commented Jun 27, 2026 •

edited

Loading