[Serve] Optimize stop_replicas() to avoid pop-all/re-add cycle#60832
Open
abrarsheikh wants to merge 2 commits intomasterfrom
Open
[Serve] Optimize stop_replicas() to avoid pop-all/re-add cycle#60832abrarsheikh wants to merge 2 commits intomasterfrom
abrarsheikh wants to merge 2 commits intomasterfrom
Conversation
Signed-off-by: abrar <abrar@anyscale.com>
Contributor
There was a problem hiding this comment.
Code Review
This pull request optimizes stop_replicas to avoid a costly cycle of popping and re-adding all replicas when only a few need to be stopped. This is achieved by introducing a remove(replica_id) method in ReplicaStateContainer and using it in a loop within stop_replicas. The benchmarks clearly show significant performance improvements for stopping a small number of replicas.
My main concern, detailed in a specific comment, is a potential performance regression when stopping a large number of replicas, due to the O(k*N) complexity of the new approach. I've suggested an alternative that would be efficient for all cases.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
stop_replicas()pops every replica across all 7 states, checks set membership, and re-adds the vast majority back. Each re-add triggersupdate_actor_details()which rebuilds aReplicaDetailspydantic object. When stopping 2 out of 4096 replicas, 4094 replicas get needlessly popped, rebuilt, and re-added.Fix
Add a
remove(replica_ids)method toReplicaStateContainerthat performs a single O(N) pass with O(1) set lookups. Non-matching replicas stay in place — no re-add, no update_state call. Early-exits once all targets are found, and only rebuilds the list for states where a match was found.Benchmark results
Benchmark script - AI
Scenario 1 — Downscale: stop 2 out of N (typical)
upd_stateupd_stateScenario 2 — Teardown: stop ALL N (no regression)
Memory is flat and nearly identical for both — when stopping everything, neither approach re-adds anything.
Scenario 3 — Moderate downscale: stop 10% of N
upd_stateupd_stateThe old code's memory grows linearly with N because
pop()allocates a temporary list of all replicas, thenadd()rebuildsReplicaDetailsfor each re-inserted one. The new code only allocates aremaininglist for states where a match is found, and never touches non-matching replicas.related to #60680