Skip to content

[Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts"#36812

Open
markmc wants to merge 1 commit intovllm-project:mainfrom
markmc:prompt-token-stats-negative-inc
Open

[Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts"#36812
markmc wants to merge 1 commit intovllm-project:mainfrom
markmc:prompt-token-stats-negative-inc

Conversation

@markmc
Copy link
Member

@markmc markmc commented Mar 11, 2026

Since num_computed_tokens, num_cached_tokens, and num_external_computed_tokens accounting seems quite brittle currently - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors:

ValueError: Counters can only be incremented by non-negative amounts

The invariant check enforces:

prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0

with the additional nuance that when all tokens are cached, the scheduler forces recomputation of the last token, so the:

num_external_computed_tokens <= num_cached_tokens + recomputed

When the invariant is violated, we log a a warning once with diagnostic details, and discard suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing with a simple assertion).

Related to issues #36533, #36755 and PRs #36638, #36752, #36757.

@markmc
Copy link
Member Author

markmc commented Mar 11, 2026

/cc @ZhanqiuHu

@mergify mergify bot added the v1 label Mar 11, 2026
@markmc
Copy link
Member Author

markmc commented Mar 11, 2026

To be clear, I don't like this, I see it as purely a temporary "stop the bleeding" band-aid that is also backportable to v0.16.0 where this first showed up

We still need to solidify the num_computed_tokens, num_cached_tokens, and num_external_computed_tokens accounting ... hopefully in a way that easier to maintain

@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 11, 2026
@markmc markmc moved this from Backlog to Ready in Metrics & Tracing Mar 11, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a defensive check to prevent crashes from metrics accounting bugs, specifically when Prometheus counters are incremented by negative values. The changes involve adding an invariant check for token counts and logging a warning when the invariant is violated. I've found a critical issue with the invariant check itself, as it doesn't fully prevent negative values in all cases. I've also identified an issue with the diagnostic logging that would make debugging harder. My review includes suggestions to address both of these points.

… non-negative amounts"

Since `num_computed_tokens`, `num_cached_tokens`, and
`num_external_computed_tokens` accounting seems quite brittle currently -
with preemption reset bugs and P/D disaggregation accounting issues -
add a defensive check to detect and prevent instances of Prometheus
counter errors:

```
ValueError: Counters can only be incremented by non-negative amounts
```

The invariant check enforces:

```
prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0
```

with the additional nuance that when all tokens are cached, the scheduler
forces recomputation of the last token, so the:

```
num_external_computed_tokens <= num_cached_tokens + recomputed
```

When the invariant is violated, we log a a warning once with diagnostic
details, and discard suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and
future-proof, at which point we can remove this check (perhaps replacing
with a simple assertion).

Related to issues vllm-project#36533, vllm-project#36755 and PRs vllm-project#36638, vllm-project#36752, vllm-project#36757.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the prompt-token-stats-negative-inc branch from 6b2de38 to 04a4886 Compare March 11, 2026 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Ready

Development

Successfully merging this pull request may close these issues.

1 participant