observability: add tokenizer event-loop-lag metric#29527
Conversation
Add `sglang:event_loop_lag_seconds`, a histogram of the tokenizer process's asyncio event-loop scheduling lag. Each TokenizerManager (or each TokenizerWorker in multi-tokenizer mode) stamps `received_time` on the same loop it serves, so sustained lag is delay that accrues *before* `received_time` -- the slice of client-observed TTFT that engine TTFT (`received_time` -> first token) cannot see. A lightweight monitor coroutine (`watch_event_loop_lag`), started only when metrics are enabled, asks to sleep a fixed interval and records the overrun: a healthy loop wakes on time (lag ~0), a loop blocked by synchronous work or GIL contention wakes late. This surfaces the router-vs-engine TTFT gap on the engine side. The MultiTokenizerRouter forwarding loop does not stamp `received_time`, so it is intentionally not monitored. Adds a pure-CPU unit test covering metric wiring and that a blocked loop records a proportional lag sample. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01FnAPbVdPu5JtbmiX9tPUEn
There was a problem hiding this comment.
Code Review
This pull request introduces a new metric, sglang:event_loop_lag_seconds, to monitor the asyncio event loop scheduling lag in the tokenizer process, along with corresponding unit tests. The reviewer pointed out a potential measurement inaccuracy in the lag calculation and a process crash risk if metrics observation fails, suggesting a robust try-except block and simplified loop logic.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| async def watch_event_loop_lag(self, interval: float = 0.1): | ||
| """Sample this process's asyncio event-loop scheduling lag as a metric. | ||
|
|
||
| Each tick asks to sleep ``interval`` seconds. A healthy loop wakes on | ||
| time (lag ~0); a loop blocked by synchronous work or GIL contention wakes | ||
| late, and the overrun is how long it was unresponsive -- and thus unable | ||
| to stamp incoming requests' received_time. | ||
| """ | ||
| loop = asyncio.get_running_loop() | ||
| next_tick = loop.time() + interval | ||
| while True: | ||
| await asyncio.sleep(interval) | ||
| now = loop.time() | ||
| lag = now - next_tick | ||
| self.metrics_collector.observe_event_loop_lag(lag if lag > 0.0 else 0.0) | ||
| next_tick = now + interval |
There was a problem hiding this comment.
Issue: Potential Measurement Inaccuracy and Process Crash Risk
- Measurement Inaccuracy: In the current implementation,
next_tickis calculated asnow + intervalbeforeself.metrics_collector.observe_event_loop_lagis called. Any CPU time spent insideobserve_event_loop_lag(or spent running other tasks on the event loop before the nextawait asyncio.sleepis reached) will be incorrectly counted as "scheduling lag" in the subsequent iteration. Measuring the scheduled wakeup time immediately before callingasyncio.sleepavoids this accumulation of overhead. - Process Crash Risk: Since this background task is wrapped in
print_exception_wrapper, any unhandled exception raised during metrics observation (e.g., ifself.metrics_collectorisNoneor Prometheus client fails) will trigger the wrapper's exception handler, which logs the error and terminates the entire server process. Wrapping the observation call in atry-exceptblock prevents metrics failures from crashing the production server.
Suggested Improvement
We can simplify the logic, eliminate the stateful next_tick variable, and make the metric collection robust against unexpected errors.
| async def watch_event_loop_lag(self, interval: float = 0.1): | |
| """Sample this process's asyncio event-loop scheduling lag as a metric. | |
| Each tick asks to sleep ``interval`` seconds. A healthy loop wakes on | |
| time (lag ~0); a loop blocked by synchronous work or GIL contention wakes | |
| late, and the overrun is how long it was unresponsive -- and thus unable | |
| to stamp incoming requests' received_time. | |
| """ | |
| loop = asyncio.get_running_loop() | |
| next_tick = loop.time() + interval | |
| while True: | |
| await asyncio.sleep(interval) | |
| now = loop.time() | |
| lag = now - next_tick | |
| self.metrics_collector.observe_event_loop_lag(lag if lag > 0.0 else 0.0) | |
| next_tick = now + interval | |
| async def watch_event_loop_lag(self, interval: float = 0.1): | |
| """Sample this process's asyncio event-loop scheduling lag as a metric. | |
| Each tick asks to sleep ``interval`` seconds. A healthy loop wakes on | |
| time (lag ~0); a loop blocked by synchronous work or GIL contention wakes | |
| late, and the overrun is how long it was unresponsive -- and thus unable | |
| to stamp incoming requests' received_time. | |
| """ | |
| loop = asyncio.get_running_loop() | |
| while True: | |
| scheduled_wakeup = loop.time() + interval | |
| await asyncio.sleep(interval) | |
| lag = loop.time() - scheduled_wakeup | |
| try: | |
| self.metrics_collector.observe_event_loop_lag(max(0.0, lag)) | |
| except Exception as e: | |
| logger.warning("Failed to observe event loop lag: %s", e) |
Motivation
When comparing TTFT between the experimental
sgl-routerand the engine, the router consistently reports higher TTFT than the engine does for the same traffic. Part of that gap is delay that accrues before the engine starts its TTFT clock: the engine measures TTFT asreceived_time -> first token, butreceived_timeis stamped inside theTokenizerManagerprocess. If that process's asyncio event loop is starved (synchronous work, GIL contention, huge-prompt HF encoding), a request sits unstamped — invisible to engine-side TTFT but very visible to the client and the router.This PR surfaces that hidden slice as a first-class metric so the router-vs-engine TTFT gap can be attributed on the engine side.
What this adds
sglang:event_loop_lag_seconds— a histogram of the tokenizer process's asyncio event-loop scheduling lag, with sub-millisecond floor (0.0005s) through a multi-second ceiling (10s). Healthy lag is sub-millisecond; a starved loop lands at0.1s+.TokenizerManager.watch_event_loop_lag, started only whenenable_metricsis set. Each tick asks to sleep a fixed interval and records the overrun (a healthy loop wakes on time → lag ~0; a blocked loop wakes late → the overrun is how long it was unresponsive).Each
TokenizerManager— or eachTokenizerWorkerin multi-tokenizer mode — stampsreceived_timeon the same loop it serves, so monitoring that loop directly measures the pre-received_timedelay. TheMultiTokenizerRouterforwarding loop does not stampreceived_time, so it is intentionally not monitored.Test
Adds
test/registered/unit/observability/test_event_loop_lag.py(pure-CPU, registered tobase-a-test-cpu):observe_event_loop_lagroutes through the collector's labels;Both tests pass:
🤖 Generated with Claude Code
CI States
Latest PR Test (Base): ❌ Run #28299976283
Latest PR Test (Extra): ❌ Run #28299976219