-
Notifications
You must be signed in to change notification settings - Fork 6.8k
observability: add tokenizer event-loop-lag metric #29527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
Kangyan-Zhou
wants to merge
1
commit into
sgl-project:main
Choose a base branch
from
Kangyan-Zhou:engine-loop-lag-metric
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+181
−0
Draft
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
120 changes: 120 additions & 0 deletions
120
test/registered/unit/observability/test_event_loop_lag.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,120 @@ | ||
| """Pure-CPU unit tests for the tokenizer event-loop-lag metric. | ||
|
|
||
| Covers the two halves of ``sglang:event_loop_lag_seconds``: | ||
|
|
||
| * ``TokenizerMetricsCollector`` registers the histogram and | ||
| ``observe_event_loop_lag`` routes the value through the collector's labels. | ||
| * ``TokenizerManager.watch_event_loop_lag`` records a lag sample roughly equal | ||
| to how long the loop was blocked -- the real failure mode this metric exists | ||
| to surface (a starved loop that cannot promptly stamp ``received_time``). | ||
| """ | ||
|
|
||
| from sglang.test.ci.ci_register import register_cpu_ci | ||
|
|
||
| register_cpu_ci(est_time=5, suite="base-a-test-cpu") | ||
|
|
||
| import asyncio | ||
| import time | ||
| import unittest | ||
|
|
||
|
|
||
| class _RecordingCollector: | ||
| def __init__(self): | ||
| self.lags = [] | ||
|
|
||
| def observe_event_loop_lag(self, lag): | ||
| self.lags.append(lag) | ||
|
|
||
|
|
||
| class _FakeCounter: | ||
| def __init__(self, name, documentation, labelnames): | ||
| pass | ||
|
|
||
|
|
||
| class _FakeHistogram: | ||
| by_name = {} | ||
|
|
||
| def __init__(self, name, documentation, labelnames, buckets): | ||
| self.name = name | ||
| self.buckets = buckets | ||
| self.observed = [] | ||
| self.last_labels = None | ||
| _FakeHistogram.by_name[name] = self | ||
|
|
||
| def labels(self, **kwargs): | ||
| self.last_labels = kwargs | ||
| return self | ||
|
|
||
| def observe(self, value): | ||
| self.observed.append(value) | ||
|
|
||
|
|
||
| class _StubServerArgs: | ||
| """Minimal ServerArgs stand-in for constructing the collector. | ||
|
|
||
| Only the two bucket-rule attributes are read in ``__init__``; ``None`` makes | ||
| ``generate_buckets`` fall back to its defaults. | ||
| """ | ||
|
|
||
| prompt_tokens_buckets = None | ||
| generation_tokens_buckets = None | ||
|
|
||
|
|
||
| class TestEventLoopLagMetricWiring(unittest.TestCase): | ||
| def test_metric_registered_and_routes_through_labels(self): | ||
| from sglang.srt.observability.metrics_collector import ( | ||
| TokenizerMetricsCollector, | ||
| ) | ||
|
|
||
| _FakeHistogram.by_name = {} | ||
|
|
||
| class _DICollector(TokenizerMetricsCollector): | ||
| _counter_cls = _FakeCounter | ||
| _histogram_cls = _FakeHistogram | ||
|
|
||
| labels = {"model_name": "m", "engine_type": "unified"} | ||
| collector = _DICollector(server_args=_StubServerArgs(), labels=labels) | ||
|
|
||
| self.assertIn("sglang:event_loop_lag_seconds", _FakeHistogram.by_name) | ||
| hist = _FakeHistogram.by_name["sglang:event_loop_lag_seconds"] | ||
| # Sub-millisecond floor through multi-second ceiling. | ||
| self.assertEqual(hist.buckets[0], 0.0005) | ||
| self.assertEqual(hist.buckets[-1], 10.0) | ||
|
|
||
| collector.observe_event_loop_lag(0.7) | ||
| self.assertEqual(hist.observed, [0.7]) | ||
| # Process-wide metric carries the collector's own labels. | ||
| self.assertEqual(hist.last_labels, labels) | ||
|
|
||
|
|
||
| class TestWatchEventLoopLag(unittest.TestCase): | ||
| def test_blocked_loop_records_lag(self): | ||
| from sglang.srt.managers.tokenizer_manager import TokenizerManager | ||
|
|
||
| # Bypass the heavy __init__; the monitor only touches metrics_collector. | ||
| tm = TokenizerManager.__new__(TokenizerManager) | ||
| collector = _RecordingCollector() | ||
| tm.metrics_collector = collector | ||
|
|
||
| async def scenario(): | ||
| task = asyncio.create_task(tm.watch_event_loop_lag(interval=0.02)) | ||
| await asyncio.sleep(0.06) # a few healthy ticks (lag ~0) | ||
| time.sleep(0.4) # synchronously block the loop | ||
| await asyncio.sleep(0.06) # let the overdue tick land | ||
| task.cancel() | ||
| try: | ||
| await task | ||
| except asyncio.CancelledError: | ||
| pass | ||
|
|
||
| asyncio.run(scenario()) | ||
|
|
||
| self.assertGreaterEqual(len(collector.lags), 2) | ||
| self.assertTrue( | ||
| any(lag >= 0.3 for lag in collector.lags), | ||
| f"expected a >=0.3s lag sample from the 0.4s block, got {collector.lags}", | ||
| ) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| unittest.main() |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: Potential Measurement Inaccuracy and Process Crash Risk
next_tickis calculated asnow + intervalbeforeself.metrics_collector.observe_event_loop_lagis called. Any CPU time spent insideobserve_event_loop_lag(or spent running other tasks on the event loop before the nextawait asyncio.sleepis reached) will be incorrectly counted as "scheduling lag" in the subsequent iteration. Measuring the scheduled wakeup time immediately before callingasyncio.sleepavoids this accumulation of overhead.print_exception_wrapper, any unhandled exception raised during metrics observation (e.g., ifself.metrics_collectorisNoneor Prometheus client fails) will trigger the wrapper's exception handler, which logs the error and terminates the entire server process. Wrapping the observation call in atry-exceptblock prevents metrics failures from crashing the production server.Suggested Improvement
We can simplify the logic, eliminate the stateful
next_tickvariable, and make the metric collection robust against unexpected errors.