observability: add tokenizer event-loop-lag metric #29527

Draft
Kangyan-Zhou wants to merge 1 commit into sgl-project:main from Kangyan-Zhou:engine-loop-lag-metric

Kangyan-Zhou commented Jun 27, 2026

Collaborator

Motivation

When comparing TTFT between the experimental sgl-router and the engine, the router consistently reports higher TTFT than the engine does for the same traffic. Part of that gap is delay that accrues before the engine starts its TTFT clock: the engine measures TTFT as received_time -> first token, but received_time is stamped inside the TokenizerManager process. If that process's asyncio event loop is starved (synchronous work, GIL contention, huge-prompt HF encoding), a request sits unstamped — invisible to engine-side TTFT but very visible to the client and the router.

This PR surfaces that hidden slice as a first-class metric so the router-vs-engine TTFT gap can be attributed on the engine side.

What this adds

  • sglang:event_loop_lag_seconds — a histogram of the tokenizer process's asyncio event-loop scheduling lag, with buckets ranging from a sub-millisecond floor (0.0005 s) to a multi-second ceiling (10 s). Healthy lag is sub-millisecond; a starved loop lands at 0.1 s or more.
  • A lightweight monitor coroutine TokenizerManager.watch_event_loop_lag, started only when enable_metrics is set. Each tick asks to sleep a fixed interval and records the overrun (a healthy loop wakes on time → lag ~0; a blocked loop wakes late → the overrun is how long it was unresponsive).
  • The metric carries the collector's own (process-wide) labels rather than per-request labels.

Each TokenizerManager — or each TokenizerWorker in multi-tokenizer mode — stamps received_time on the same loop it serves, so monitoring that loop directly measures the pre-received_time delay. The MultiTokenizerRouter forwarding loop does not stamp received_time, so it is intentionally not monitored.
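To make the bucket range concrete, here is a minimal sketch of how lag samples would fall into cumulative Prometheus-style `le` buckets. Only the floor (0.0005 s) and ceiling (10 s) come from this PR; the intermediate boundaries and the `bucket_for` helper are assumptions for illustration, not the collector's actual configuration.

```python
from bisect import bisect_left

# Hypothetical boundaries: only 0.0005 and 10.0 are stated in the PR description.
BUCKETS = (0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0)


def bucket_for(lag_seconds: float) -> str:
    """Return the upper bound (the ``le`` label) of the smallest bucket holding the sample."""
    # bisect_left finds the first boundary that is >= the sample.
    idx = bisect_left(BUCKETS, lag_seconds)
    return str(BUCKETS[idx]) if idx < len(BUCKETS) else "+Inf"


for lag in (0.0002, 0.25, 30.0):
    print(f"lag={lag}s -> le={bucket_for(lag)}")
```

Under these assumed boundaries a healthy sample lands in the lowest bucket, a starved-loop sample lands at or above the 0.1 s boundary, and anything beyond the ceiling is only counted in `+Inf`.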

Test

Adds test/registered/unit/observability/test_event_loop_lag.py (pure-CPU, registered to base-a-test-cpu):

  • metric is registered and observe_event_loop_lag routes through the collector's labels;
  • a synchronously-blocked loop produces a lag sample proportional to the block duration.

Both tests pass:

test_event_loop_lag.py::TestEventLoopLagMetricWiring::test_metric_registered_and_routes_through_labels PASSED
test_event_loop_lag.py::TestWatchEventLoopLag::test_blocked_loop_records_lag PASSED
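The blocked-loop check can be sketched in isolation as follows. This is not the test file from the PR: `watch_lag` and `max_lag_under_block` are hypothetical stand-ins that use only the standard library and collect samples in a list instead of a metrics collector.

```python
import asyncio
import contextlib
import time


async def watch_lag(observe, interval):
    """Hypothetical stand-in for the PR's monitor: record each sleep's overrun."""
    loop = asyncio.get_running_loop()
    while True:
        scheduled_wakeup = loop.time() + interval
        await asyncio.sleep(interval)
        observe(max(0.0, loop.time() - scheduled_wakeup))


async def max_lag_under_block(block_seconds, interval=0.05):
    """Block the loop synchronously and return the largest lag sample seen."""
    samples = []
    monitor = asyncio.create_task(watch_lag(samples.append, interval))
    await asyncio.sleep(0)              # let the monitor begin its first sleep
    time.sleep(block_seconds)           # synchronous work: the loop cannot run
    await asyncio.sleep(interval * 2)   # give the monitor time to wake and record
    monitor.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await monitor
    return max(samples)


short_lag = asyncio.run(max_lag_under_block(0.2))
long_lag = asyncio.run(max_lag_under_block(0.4))
print(f"0.2s block -> {short_lag:.3f}s lag; 0.4s block -> {long_lag:.3f}s lag")
```

In this sketch the recorded lag is roughly the block duration minus the sleep interval, because the monitor's timer was already pending when the block began; doubling the block therefore raises the lag by about the same amount.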

🤖 Generated with Claude Code


CI States

Latest PR Test (Base): ❌ Run #28299976283
Latest PR Test (Extra): ❌ Run #28299976219

Add `sglang:event_loop_lag_seconds`, a histogram of the tokenizer
process's asyncio event-loop scheduling lag. Each TokenizerManager (or
each TokenizerWorker in multi-tokenizer mode) stamps `received_time` on
the same loop it serves, so sustained lag is delay that accrues *before*
`received_time` -- the slice of client-observed TTFT that engine TTFT
(`received_time` -> first token) cannot see.

A lightweight monitor coroutine (`watch_event_loop_lag`), started only
when metrics are enabled, asks to sleep a fixed interval and records the
overrun: a healthy loop wakes on time (lag ~0), a loop blocked by
synchronous work or GIL contention wakes late. This surfaces the
router-vs-engine TTFT gap on the engine side.

The MultiTokenizerRouter forwarding loop does not stamp `received_time`,
so it is intentionally not monitored.

Adds a pure-CPU unit test covering metric wiring and that a blocked loop
records a proportional lag sample.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FnAPbVdPu5JtbmiX9tPUEn

gemini-code-assist (bot) left a comment

Contributor


Code Review

This pull request introduces a new metric, sglang:event_loop_lag_seconds, to monitor the asyncio event loop scheduling lag in the tokenizer process, along with corresponding unit tests. The reviewer pointed out a potential measurement inaccuracy in the lag calculation and a process crash risk if metrics observation fails, suggesting a robust try-except block and simplified loop logic.


Comment on lines +1857 to +1872
async def watch_event_loop_lag(self, interval: float = 0.1):
    """Sample this process's asyncio event-loop scheduling lag as a metric.

    Each tick asks to sleep ``interval`` seconds. A healthy loop wakes on
    time (lag ~0); a loop blocked by synchronous work or GIL contention wakes
    late, and the overrun is how long it was unresponsive -- and thus unable
    to stamp incoming requests' received_time.
    """
    loop = asyncio.get_running_loop()
    next_tick = loop.time() + interval
    while True:
        await asyncio.sleep(interval)
        now = loop.time()
        lag = now - next_tick
        self.metrics_collector.observe_event_loop_lag(lag if lag > 0.0 else 0.0)
        next_tick = now + interval

Priority: high

Issue: Potential Measurement Inaccuracy and Process Crash Risk

  1. Measurement Inaccuracy: In the current implementation, next_tick is set to now + interval, where now is captured before self.metrics_collector.observe_event_loop_lag runs. Any time spent inside observe_event_loop_lag (or in other tasks that run on the event loop before the next await asyncio.sleep is reached) is therefore counted as "scheduling lag" in the following iteration. Computing the scheduled wakeup time immediately before calling asyncio.sleep avoids this carried-over overhead.
  2. Process Crash Risk: Since this background task is wrapped in print_exception_wrapper, any unhandled exception raised during metrics observation (e.g., if self.metrics_collector is None or Prometheus client fails) will trigger the wrapper's exception handler, which logs the error and terminates the entire server process. Wrapping the observation call in a try-except block prevents metrics failures from crashing the production server.

Suggested Improvement

We can simplify the logic, eliminate the stateful next_tick variable, and make the metric collection robust against unexpected errors.

Suggested change
- async def watch_event_loop_lag(self, interval: float = 0.1):
-     """Sample this process's asyncio event-loop scheduling lag as a metric.
-     Each tick asks to sleep ``interval`` seconds. A healthy loop wakes on
-     time (lag ~0); a loop blocked by synchronous work or GIL contention wakes
-     late, and the overrun is how long it was unresponsive -- and thus unable
-     to stamp incoming requests' received_time.
-     """
-     loop = asyncio.get_running_loop()
-     next_tick = loop.time() + interval
-     while True:
-         await asyncio.sleep(interval)
-         now = loop.time()
-         lag = now - next_tick
-         self.metrics_collector.observe_event_loop_lag(lag if lag > 0.0 else 0.0)
-         next_tick = now + interval
+ async def watch_event_loop_lag(self, interval: float = 0.1):
+     """Sample this process's asyncio event-loop scheduling lag as a metric.
+     Each tick asks to sleep ``interval`` seconds. A healthy loop wakes on
+     time (lag ~0); a loop blocked by synchronous work or GIL contention wakes
+     late, and the overrun is how long it was unresponsive -- and thus unable
+     to stamp incoming requests' received_time.
+     """
+     loop = asyncio.get_running_loop()
+     while True:
+         scheduled_wakeup = loop.time() + interval
+         await asyncio.sleep(interval)
+         lag = loop.time() - scheduled_wakeup
+         try:
+             self.metrics_collector.observe_event_loop_lag(max(0.0, lag))
+         except Exception as e:
+             logger.warning("Failed to observe event loop lag: %s", e)
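The effect of the two bookkeeping styles can be reproduced with a small standalone sketch. It is an illustration under assumptions, not the PR's code: `original_style` and `suggested_style` are hypothetical free functions, and a deliberately slow `time.sleep` stands in for an expensive `observe_event_loop_lag` call so the difference is easy to see.

```python
import asyncio
import time

INTERVAL = 0.05      # sleep interval requested on each tick
OBSERVE_COST = 0.1   # simulated cost of a slow observe call (assumption)


async def original_style(ticks):
    """next_tick is derived from `now`, which is read before the observe call."""
    loop = asyncio.get_running_loop()
    lags = []
    next_tick = loop.time() + INTERVAL
    for _ in range(ticks):
        await asyncio.sleep(INTERVAL)
        now = loop.time()
        lags.append(max(0.0, now - next_tick))
        time.sleep(OBSERVE_COST)          # stands in for a slow observe call
        next_tick = now + INTERVAL        # does not account for OBSERVE_COST
    return lags


async def suggested_style(ticks):
    """The scheduled wakeup is computed immediately before each sleep."""
    loop = asyncio.get_running_loop()
    lags = []
    for _ in range(ticks):
        scheduled_wakeup = loop.time() + INTERVAL
        await asyncio.sleep(INTERVAL)
        lags.append(max(0.0, loop.time() - scheduled_wakeup))
        time.sleep(OBSERVE_COST)          # same simulated cost, not carried over
    return lags


original = asyncio.run(original_style(3))
suggested = asyncio.run(suggested_style(3))
print("original :", [round(x, 3) for x in original])
print("suggested:", [round(x, 3) for x in suggested])
```

With these numbers, every tick after the first in `original_style` reports roughly `OBSERVE_COST` as lag, while `suggested_style` stays near zero. Whether time spent in the observe call should count as loop lag is a judgment call for the PR; the sketch only shows where the two variants differ.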
