Skip to content

[metrics] Add SWA prefix-cache truncation counter#29528

Open
brucechanglongxu wants to merge 1 commit into
sgl-project:mainfrom
brucechanglongxu:swa-prefix-truncation-metric
Open

[metrics] Add SWA prefix-cache truncation counter#29528
brucechanglongxu wants to merge 1 commit into
sgl-project:mainfrom
brucechanglongxu:swa-prefix-truncation-metric

Conversation

@brucechanglongxu

@brucechanglongxu brucechanglongxu commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Motivation

For hybrid sliding-window-attention (SWA) models, the prefix-cache hit rate can degrade silently when SWA KV gets evicted. A request may match a long logical prefix in the SWA radix tree, but the usable hit is truncated back to the last point still covered by sliding_window_size of live (non-tombstoned) SWA KV. Right now only the aggregate cache-hit rate is observable, so when the hit rate drops there's no way to tell whether it's SWA eviction or just genuinely non-shared prefixes.

SWARadixCache._match_prefix_helper already computes both quantities. It walks the full logical match into value but only accepts the first best_value_len entries as a usable hit. The truncated remainder was just never surfaced anywhere.

Modifications

  • observability/metrics_collector.py: add a RadixCacheMetricsCollector counter sglang:swa_prefix_truncated_tokens_total and increment_swa_prefix_truncated_num_tokens, following the existing sglang:evicted_tokens_total metric.
  • mem_cache/base_prefix_cache.py: add update_swa_prefix_truncation_metrics, following update_eviction_metrics.
  • mem_cache/swa_radix_cache.py: record the truncated token count at the match_prefix boundary. The helper early-returns when no collector is attached or when the match wasn't truncated (best_value_len == len(value)), so the common path is untouched.
  • test/registered/unit/mem_cache/test_swa_unittest.py: add a unit test that evicts SWA capacity to tombstone a shared prefix and asserts the counter records the dropped tokens.

This is observability only, so there's no change to model outputs or KV correctness. A natural follow-up would be extending the same signal to the unified radix cache SWA path.

Accuracy Tests

N/A, this is a metrics-only change with no model output or KV correctness impact.

Speed Tests and Profiling

No measurable impact. The recording helper early-returns when no collector is attached or when the match isn't truncated (the common case), so it adds at most one integer comparison to the match path.

Checklist

  • Format your code according to the pre-commit configuration (black, isort, ruff, codespell all pass).
  • Add unit tests.
  • Update documentation (not applicable).
  • Provide accuracy and speed benchmark results (not applicable, metrics-only).
  • Follow the SGLang code style guidance.

For hybrid sliding-window-attention models, a request can match a long
logical prefix in the SWA radix tree while the usable hit is truncated to
the last point still covered by sliding_window_size of live (non-tombstoned)
SWA KV. Only the aggregate cache-hit rate was observable, so hit-rate drops
could not be attributed to SWA eviction.

SWARadixCache._match_prefix_helper already computes both quantities: it walks
the full logical match into `value` but only accepts best_value_len entries as
a usable hit. Surface the dropped remainder as a Prometheus counter
(sglang:swa_prefix_truncated_tokens_total), recorded at the match boundary and
mirroring the existing evicted_tokens_total metric. Metrics-only; the hot path
early-returns when no truncation occurs.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new metric to track SWA prefix truncation, which counts the prefix tokens matched in the radix tree but dropped from the cache hit because the sliding-window attention KV had already been evicted. This is implemented across BasePrefixCache, SWARadixCache, and MetricsCollector, along with a corresponding unit test. A review comment identifies a bug in the new unit test where assigning a mock metrics collector before calling tree.evict will cause an AttributeError due to missing eviction metrics methods on the mock.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +455 to +478
spy = _SpyCollector()
tree.metrics_collector = spy

for token_ids in (
[1, 2, 3],
[1, 2, 3, 4, 5, 6, 7],
[10, 11, 12],
[1, 2, 3, 4, 5, 60, 70],
):
kv_indices = allocator.alloc(len(token_ids))
key = RadixKey(array("q", token_ids))
tree.insert(InsertParams(key=key, value=kv_indices[: len(key)]))

# Evict SWA capacity (tombstoning the shared [1..5] prefix) while keeping
# the full-attention KV alive.
tree.evict(EvictParams(num_tokens=1, swa_num_tokens=0))
tree.evict(EvictParams(num_tokens=0, swa_num_tokens=1))
tree.evict(EvictParams(num_tokens=1, swa_num_tokens=2))

# [1, 2, 3, 4, 5] still matches in the tree, but the sliding window is no
# longer covered by live SWA KV, so the usable hit is truncated to zero.
result = tree.match_prefix(
MatchPrefixParams(key=RadixKey(array("q", [1, 2, 3, 4, 5])))
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The newly added unit test will crash with an AttributeError during the tree.evict calls. This is because tree.metrics_collector is set to spy (an instance of _SpyCollector), which does not implement the eviction metrics methods (observe_eviction_duration and increment_eviction_num_tokens) called by tree.evict via update_eviction_metrics.

To fix this, we should only assign tree.metrics_collector = spy after the eviction steps are completed, right before calling tree.match_prefix.

        for token_ids in (
            [1, 2, 3],
            [1, 2, 3, 4, 5, 6, 7],
            [10, 11, 12],
            [1, 2, 3, 4, 5, 60, 70],
        ):
            kv_indices = allocator.alloc(len(token_ids))
            key = RadixKey(array("q", token_ids))
            tree.insert(InsertParams(key=key, value=kv_indices[: len(key)]))

        # Evict SWA capacity (tombstoning the shared [1..5] prefix) while keeping
        # the full-attention KV alive.
        tree.evict(EvictParams(num_tokens=1, swa_num_tokens=0))
        tree.evict(EvictParams(num_tokens=0, swa_num_tokens=1))
        tree.evict(EvictParams(num_tokens=1, swa_num_tokens=2))

        spy = _SpyCollector()
        tree.metrics_collector = spy

        # [1, 2, 3, 4, 5] still matches in the tree, but the sliding window is no
        # longer covered by live SWA KV, so the usable hit is truncated to zero.
        result = tree.match_prefix(
            MatchPrefixParams(key=RadixKey(array("q", [1, 2, 3, 4, 5])))
        )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant