[Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts"#36812
[Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts"#36812markmc wants to merge 1 commit intovllm-project:mainfrom
Conversation
|
/cc @ZhanqiuHu |
|
To be clear, I don't like this, I see it as purely a temporary "stop the bleeding" band-aid that is also backportable to v0.16.0 where this first showed up We still need to solidify the |
There was a problem hiding this comment.
Code Review
This pull request introduces a defensive check to prevent crashes from metrics accounting bugs, specifically when Prometheus counters are incremented by negative values. The changes involve adding an invariant check for token counts and logging a warning when the invariant is violated. I've found a critical issue with the invariant check itself, as it doesn't fully prevent negative values in all cases. I've also identified an issue with the diagnostic logging that would make debugging harder. My review includes suggestions to address both of these points.
… non-negative amounts" Since `num_computed_tokens`, `num_cached_tokens`, and `num_external_computed_tokens` accounting seems quite brittle currently - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors: ``` ValueError: Counters can only be incremented by non-negative amounts ``` The invariant check enforces: ``` prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0 ``` with the additional nuance that when all tokens are cached, the scheduler forces recomputation of the last token, so the: ``` num_external_computed_tokens <= num_cached_tokens + recomputed ``` When the invariant is violated, we log a a warning once with diagnostic details, and discard suspect cache metrics. Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing with a simple assertion). Related to issues vllm-project#36533, vllm-project#36755 and PRs vllm-project#36638, vllm-project#36752, vllm-project#36757. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
6b2de38 to
04a4886
Compare
Since
num_computed_tokens,num_cached_tokens, andnum_external_computed_tokensaccounting seems quite brittle currently - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors:The invariant check enforces:
with the additional nuance that when all tokens are cached, the scheduler forces recomputation of the last token, so the:
When the invariant is violated, we log a a warning once with diagnostic details, and discard suspect cache metrics.
Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing with a simple assertion).
Related to issues #36533, #36755 and PRs #36638, #36752, #36757.