Skip to content

[Serve] Skip routing stats collection for deployments without record_routing_stats#60810

Open
abrarsheikh wants to merge 3 commits intomasterfrom
60680-abrar-routing
Open

[Serve] Skip routing stats collection for deployments without record_routing_stats#60810
abrarsheikh wants to merge 3 commits intomasterfrom
60680-abrar-routing

Conversation

@abrarsheikh
Copy link
Contributor

@abrarsheikh abrarsheikh commented Feb 6, 2026

Summary

  • The controller was spending ~9% of its control loop time making periodic record_routing_stats.remote() calls to every replica, even when the user hadn't defined a record_routing_stats method on their deployment class. In that case, the remote call simply returns {}, making it pure overhead.
  • This PR adds a has_user_routing_stats_method boolean to ReplicaMetadata, reported during replica initialization. The controller uses this to skip the remote call entirely when the method is absent.
image

Test plan

  • test_deployment_state.py -- 102 unit tests pass
  • test_controller_recovery.py::test_recover_start_from_replica_actor_names -- 2 tests pass
  • test_metrics.py::test_routing_stats_delay_metric -- updated to define record_routing_stats on test deployment
  • test_metrics.py::test_routing_stats_error_metric -- no changes needed (already defines the method)

related to #60680

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested a review from a team as a code owner February 6, 2026 21:11
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Feb 6, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a valuable optimization by skipping routing stats collection for deployments that don't define the record_routing_stats method. The implementation is clear and the test updates are appropriate. I've included a few suggestions to enhance maintainability, primarily by refactoring the ReplicaMetadata tuple into a dataclass to make the code more robust against future modifications. I also noted a minor redundancy that could be simplified. Overall, this is a solid improvement.

Comment on lines +1077 to +1080
has_user_routing_stats_method = (
self._user_callable_wrapper is not None
and self._user_callable_wrapper.has_user_routing_stats_method
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

self._user_callable_wrapper is initialized in ReplicaBase.__init__ and is not expected to be None at this point. The is not None check is redundant and can be removed to simplify the code.

        has_user_routing_stats_method = (
            self._user_callable_wrapper.has_user_routing_stats_method
        )

for replica in deployment_dict[id]:
ref = replica.get_actor_handle().initialize_and_get_metadata.remote()
_, version, _, _, _, _, _, _, _, _ = ray.get(ref)
_, version, _, _, _, _, _, _, _, _, _ = ray.get(ref)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This large tuple unpacking is fragile. If ReplicaMetadata is refactored to a dataclass as suggested in python/ray/serve/_private/replica.py, this can be simplified to be more readable and robust.

        metadata = ray.get(ref)
        version = metadata.version

actor_handle = ray.get_actor(replica_name, namespace=SERVE_NAMESPACE)
ref = actor_handle.initialize_and_get_metadata.remote()
_, version, _, _, _, _, _, _, _, _ = ray.get(ref)
_, version, _, _, _, _, _, _, _, _, _ = ray.get(ref)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This large tuple unpacking is fragile. If ReplicaMetadata is refactored to a dataclass as suggested in python/ray/serve/_private/replica.py, this can be simplified to be more readable and robust.

        metadata = ray.get(ref)
        version = metadata.version

Copy link
Contributor

@eicherseiji eicherseiji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagates _user_record_routing_stats presence info to ActorReplicaWrapper so _should_record_routing_stats can return early

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants