[Serve] Immediately terminate unscheduled replicas #52416
zcin
left a comment
Looking great! Could you also add an e2e test? It can start an application with a fake custom resource; then, when the application is terminated, the replica should not wait for the graceful shutdown timeout.
zcin
left a comment
Almost there! Unfortunately, we have to preemptively guard against flaky tests; I left the details in a comment!
zcin
left a comment
Awesome!
@zcin
investigation results:
The cause of the bug is that when the new code in I recommend avoiding another call to
Signed-off-by: Dhakshin Suriakannu <[email protected]>
Re-signed and force-pushed the commits since I forgot to sign one. This is now causing the above error.
@dhakshin32
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Signed-off-by: Dhakshin <[email protected]>
Co-authored-by: Cindy Zhang <[email protected]>
Signed-off-by: Dhakshin <[email protected]>
This pull request has been automatically closed because there has been no more activity in the 14 days since it was marked stale. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for your contribution!
Why are these changes needed?
This PR fixes an issue where deployments would hang indefinitely during shutdown when they contained replicas in the PENDING_ALLOCATION state. Previously, Ray Serve would attempt to gracefully stop these unallocated replicas by calling replica.stop(), which waits for non-existent actors to be created first. The fix skips the graceful shutdown step for replicas that haven't been allocated yet, allowing deployments to shut down properly without getting stuck waiting for actors that don't exist.
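The idea behind the fix can be sketched in isolation. The snippet below is a minimal illustration, not Ray Serve's actual code: `FakeReplica`, `ReplicaState`, and `stop_replicas` are hypothetical stand-ins, and the `graceful` flag is an assumed way of modelling "skip the graceful shutdown step".

```python
import enum
from dataclasses import dataclass, field


class ReplicaState(enum.Enum):
    # Hypothetical subset of replica states, for illustration only.
    PENDING_ALLOCATION = "PENDING_ALLOCATION"
    RUNNING = "RUNNING"


@dataclass
class FakeReplica:
    """Hypothetical stand-in for a replica wrapper; not Ray's real class."""

    name: str
    state: ReplicaState
    events: list = field(default_factory=list)

    def stop(self, graceful: bool = True) -> None:
        # Record which shutdown path was taken so the behavior is observable.
        self.events.append("graceful_stop" if graceful else "force_stop")


def stop_replicas(replicas) -> None:
    """Stop every replica, skipping the graceful path for unallocated ones."""
    for replica in replicas:
        # A replica still pending allocation has no actor to wait on,
        # so a graceful stop could only block on something that never exists.
        allocated = replica.state is not ReplicaState.PENDING_ALLOCATION
        replica.stop(graceful=allocated)


replicas = [
    FakeReplica("r1", ReplicaState.RUNNING),
    FakeReplica("r2", ReplicaState.PENDING_ALLOCATION),
]
stop_replicas(replicas)
```

In this sketch the running replica takes the graceful path while the unallocated one is stopped immediately, which mirrors the behavior described above under the stated assumptions.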
Related issue number
#52416