
[Serve] Throttle serve_deployment_replica_healthy gauge recording on the controller#60823

Open
abrarsheikh wants to merge 2 commits into master from 60680-abrar-guage

Conversation


@abrarsheikh abrarsheikh commented Feb 7, 2026

Why

Profiling the Serve controller under load shows Gauge.set() for the serve_deployment_replica_healthy metric consuming ~5.9% of CPU. The call is made for every running replica on every control-loop iteration, even when the value hasn't changed. At 128+ replicas this is O(num_replicas) Cython FFI calls per loop — pure waste in steady state.


What

Add a time-based cache (_health_gauge_cache) to DeploymentState that tracks the last-reported (value, timestamp) per replica. _set_health_gauge() skips the Gauge.set() call when the value is unchanged and the entry is younger than RAY_SERVE_HEALTH_GAUGE_REPORT_INTERVAL_S (default 10s). The gauge is still re-recorded periodically so it remains visible across OpenCensus export windows, and is always recorded immediately on state transitions (healthy ↔ unhealthy). Cache entries are cleaned up when replicas are fully stopped.

The interval is configurable via the RAY_SERVE_HEALTH_GAUGE_REPORT_INTERVAL_S env var and set to 0.1 in CI to stay within the fast test metrics export cadence.
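The caching described above can be sketched as follows. This is a minimal illustration, not the PR's actual diff: the class name `HealthGaugeThrottle`, the `gauge_set` callable, and the method names are hypothetical, while the cache shape (`replica_id -> (value, timestamp)`), the skip condition, and the cleanup-on-stop behavior follow the description. It uses `time.time()`, as the hunk under review does.

```python
import time

# Default mirrors RAY_SERVE_HEALTH_GAUGE_REPORT_INTERVAL_S (10s per the PR).
HEALTH_GAUGE_REPORT_INTERVAL_S = 10.0


class HealthGaugeThrottle:
    """Throttles per-replica gauge recording (illustrative sketch)."""

    def __init__(self, gauge_set, interval_s=HEALTH_GAUGE_REPORT_INTERVAL_S):
        self._gauge_set = gauge_set  # the underlying Gauge.set() callable
        self._interval_s = interval_s
        # replica_id -> (last_reported_value, last_report_timestamp)
        self._cache = {}

    def set(self, replica_id, value):
        """Record the gauge; returns True if Gauge.set() was actually called."""
        now = time.time()
        cached = self._cache.get(replica_id)
        if cached is not None:
            last_value, last_time = cached
            # Skip the FFI call only when the value is unchanged AND the
            # cached entry is still fresh. A value change (healthy <->
            # unhealthy transition) is always recorded immediately, and a
            # stale entry is re-recorded so the metric stays visible
            # across export windows.
            if value == last_value and now - last_time < self._interval_s:
                return False
        self._gauge_set(replica_id, value)
        self._cache[replica_id] = (value, now)
        return True

    def on_replica_stopped(self, replica_id):
        # Drop the entry when the replica is fully stopped so the cache
        # does not grow unboundedly.
        self._cache.pop(replica_id, None)
```

In steady state this reduces per-loop work from O(num_replicas) `Gauge.set()` calls to O(num_replicas) dictionary lookups, with at most one real recording per replica per interval.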

Test plan

  • New unit test test_health_gauge_caching — verifies cache hits, re-recording after TTL expiry, the value-change bypass, and cleanup on replica stop
  • test_replica_metrics_fields integration test passes (metric still appears in Prometheus)

Related to #60680

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested a review from a team as a code owner February 7, 2026 04:17
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Feb 7, 2026

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization for the Serve controller by throttling the recording of the serve_deployment_replica_healthy metric. This is achieved by adding a time-based cache to DeploymentState, which avoids redundant Gauge.set() calls for replicas with unchanged health status. The implementation is clean, and the new logic is well-tested with unit tests covering various scenarios like cache hits, TTL expiration, and cache cleanup. My only suggestion is to use time.monotonic() instead of time.time() for measuring time intervals to make the caching logic robust against system clock changes.

every control-loop iteration while still refreshing the metric often
enough for Prometheus export.
"""
now = time.time()
Severity: medium

time.time() is not guaranteed to be monotonic. If the system clock is adjusted backwards, now - cached[1] could be negative, which could lead to the gauge not being updated for a long time, even past the intended interval. It's recommended to use time.monotonic() for measuring time durations to avoid this issue.

Suggested change
now = time.time()
now = time.monotonic()
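To illustrate the suggestion, here is a minimal sketch of the interval check written against `time.monotonic()`; the helper name `should_record` is hypothetical, not from the PR.

```python
import time


def should_record(last_record_time: float, interval_s: float) -> bool:
    # time.monotonic() is immune to wall-clock adjustments (NTP steps,
    # manual clock changes), so the elapsed delta can never be negative
    # and the throttle can never get stuck waiting out a backwards jump,
    # which is the failure mode time.time() allows.
    return time.monotonic() - last_record_time >= interval_s


last = time.monotonic()
assert should_record(last, 0.0)         # interval of 0: always due
assert not should_record(last, 3600.0)  # still well inside the interval
```

The tradeoff is that monotonic timestamps are only meaningful within one process, which is fine here since the cache lives on a single controller.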
