
[Serve] Throttle serve_deployment_replica_healthy gauge recording on the controller#60823

Open
abrarsheikh wants to merge 2 commits into master from 60680-abrar-guage

Conversation


@abrarsheikh abrarsheikh commented Feb 7, 2026

Why

Profiling the Serve controller under load shows Gauge.set() for the serve_deployment_replica_healthy metric consuming ~5.9% of CPU. The call is made for every running replica on every control-loop iteration, even when the value hasn't changed. At 128+ replicas this is O(num_replicas) Cython FFI calls per loop — pure waste in steady state.


What

Add a time-based cache (_health_gauge_cache) to DeploymentState that tracks the last-reported (value, timestamp) per replica. _set_health_gauge() skips the Gauge.set() call when the value is unchanged and the entry is younger than RAY_SERVE_HEALTH_GAUGE_REPORT_INTERVAL_S (default 10s). The gauge is still re-recorded periodically so it remains visible across OpenCensus export windows, and is always recorded immediately on state transitions (healthy ↔ unhealthy). Cache entries are cleaned up when replicas are fully stopped.

The interval is configurable via the RAY_SERVE_HEALTH_GAUGE_REPORT_INTERVAL_S env var and set to 0.1 in CI to stay within the fast test metrics export cadence.
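The caching described above can be sketched as follows. This is a minimal illustration, not the PR's actual diff: the class name `HealthGaugeThrottle`, the `gauge_set` callable, and the method names are hypothetical, while the cache shape (`replica_id -> (value, timestamp)`), the skip condition, and the cleanup-on-stop behavior follow the description. It uses `time.time()`, as the hunk under review does.

```python
import time

# Default mirrors RAY_SERVE_HEALTH_GAUGE_REPORT_INTERVAL_S (10s per the PR).
HEALTH_GAUGE_REPORT_INTERVAL_S = 10.0


class HealthGaugeThrottle:
    """Throttles per-replica gauge recording (illustrative sketch)."""

    def __init__(self, gauge_set, interval_s=HEALTH_GAUGE_REPORT_INTERVAL_S):
        self._gauge_set = gauge_set  # the underlying Gauge.set() callable
        self._interval_s = interval_s
        # replica_id -> (last_reported_value, last_report_timestamp)
        self._cache = {}

    def set(self, replica_id, value):
        """Record the gauge; returns True if Gauge.set() was actually called."""
        now = time.time()
        cached = self._cache.get(replica_id)
        if cached is not None:
            last_value, last_time = cached
            # Skip the FFI call only when the value is unchanged AND the
            # cached entry is still fresh. A value change (healthy <->
            # unhealthy transition) is always recorded immediately, and a
            # stale entry is re-recorded so the metric stays visible
            # across export windows.
            if value == last_value and now - last_time < self._interval_s:
                return False
        self._gauge_set(replica_id, value)
        self._cache[replica_id] = (value, now)
        return True

    def on_replica_stopped(self, replica_id):
        # Drop the entry when the replica is fully stopped so the cache
        # does not grow unboundedly.
        self._cache.pop(replica_id, None)
```

In steady state this reduces per-loop work from O(num_replicas) `Gauge.set()` calls to O(num_replicas) dictionary lookups, with at most one real recording per replica per interval.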

Test plan

  • New unit test test_health_gauge_caching — verifies cache hits, re-recording after TTL expiry, the value-change bypass, and cleanup on replica stop
  • test_replica_metrics_fields integration test passes (metric still appears in Prometheus)

Related to #60680

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested a review from a team as a code owner February 7, 2026 04:17
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Feb 7, 2026

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization for the Serve controller by throttling the recording of the serve_deployment_replica_healthy metric. This is achieved by adding a time-based cache to DeploymentState, which avoids redundant Gauge.set() calls for replicas with unchanged health status. The implementation is clean, and the new logic is well-tested with unit tests covering various scenarios like cache hits, TTL expiration, and cache cleanup. My only suggestion is to use time.monotonic() instead of time.time() for measuring time intervals to make the caching logic robust against system clock changes.

every control-loop iteration while still refreshing the metric often
enough for Prometheus export.
"""
now = time.time()
Severity: medium

time.time() is not guaranteed to be monotonic. If the system clock is adjusted backwards, now - cached[1] could be negative, which could lead to the gauge not being updated for a long time, even past the intended interval. It's recommended to use time.monotonic() for measuring time durations to avoid this issue.

Suggested change
now = time.time()
now = time.monotonic()
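To illustrate the suggestion, here is a minimal sketch of the interval check written against `time.monotonic()`; the helper name `should_record` is hypothetical, not from the PR.

```python
import time


def should_record(last_record_time: float, interval_s: float) -> bool:
    # time.monotonic() is immune to wall-clock adjustments (NTP steps,
    # manual clock changes), so the elapsed delta can never be negative
    # and the throttle can never get stuck waiting out a backwards jump,
    # which is the failure mode time.time() allows.
    return time.monotonic() - last_record_time >= interval_s


last = time.monotonic()
assert should_record(last, 0.0)         # interval of 0: always due
assert not should_record(last, 3600.0)  # still well inside the interval
```

The tradeoff is that monotonic timestamps are only meaningful within one process, which is fine here since the cache lives on a single controller.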
