
Conversation

@dhakshin32 (Contributor) commented Apr 17, 2025

Why are these changes needed?

This PR fixes an issue where deployments would hang indefinitely during shutdown if they contained replicas in the PENDING_ALLOCATION state. Previously, Ray Serve would attempt to gracefully stop these unallocated replicas by calling replica.stop(), which blocks waiting for actors that were never created. The fix skips the graceful shutdown step for replicas that haven't been allocated yet, allowing deployments to shut down promptly instead of getting stuck waiting on actors that don't exist.
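
A minimal sketch of the behavior described above, assuming simplified stand-in names (`ReplicaStartupStatus`, `DeploymentReplica`, and this `_stop_replica` are illustrative, not the actual Ray Serve controller code):

```python
# Hedged, self-contained sketch -- NOT the actual Ray Serve controller
# code; class and method names are simplified stand-ins.
from enum import Enum


class ReplicaStartupStatus(Enum):
    PENDING_ALLOCATION = "PENDING_ALLOCATION"  # actor not yet created
    SUCCEEDED = "SUCCEEDED"


class DeploymentReplica:
    def __init__(self, startup_status: ReplicaStartupStatus):
        self.startup_status = startup_status

    def stop(self, graceful: bool = True) -> None:
        if graceful:
            # Waits on the replica actor to drain; if the actor was never
            # allocated, this is the call that used to hang indefinitely.
            print("gracefully draining replica actor...")
        else:
            print("force-killing replica immediately")


def _stop_replica(replica: DeploymentReplica) -> None:
    # Core of the fix: only stop gracefully if the actor actually exists.
    graceful = replica.startup_status is not ReplicaStartupStatus.PENDING_ALLOCATION
    replica.stop(graceful=graceful)


_stop_replica(DeploymentReplica(ReplicaStartupStatus.PENDING_ALLOCATION))
```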

Related issue number

#52416

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dhakshin32 dhakshin32 changed the title skip shutdown [Serve] Immediately terminate unscheduled replicas Apr 17, 2025
@dhakshin32 dhakshin32 marked this pull request as ready for review April 17, 2025 22:28
@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label Apr 18, 2025
@mascharkh mascharkh added the serve Ray Serve Related Issue label Apr 19, 2025
@zcin zcin self-requested a review April 21, 2025 22:32
@zcin zcin self-assigned this Apr 21, 2025
@zcin (Contributor) left a comment

Looking great! Could you also add an e2e test? It can start an application with a fake custom resource; then, when it is terminated, the replica should not wait for the graceful shutdown timeout. A sketch follows below.
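
A hedged sketch of the kind of e2e test being suggested, not the test that landed in the PR: the deployment name, the fake resource name, and the 60-second bound are all illustrative, and `_blocking=False` is assumed to make `serve.run` return without waiting for the never-ready replica.

```python
import time

import ray
from ray import serve


@serve.deployment(
    ray_actor_options={"resources": {"fake_custom_resource": 1}},  # unsatisfiable
    graceful_shutdown_timeout_s=1000,
)
class NeverScheduled:
    def __call__(self) -> str:
        return "unreachable"


def test_shutdown_skips_unallocated_replica():
    ray.init()
    # Assumption: _blocking=False returns without waiting for the replica
    # to become ready (it never will -- no node has fake_custom_resource).
    serve.run(NeverScheduled.bind(), _blocking=False)

    start = time.monotonic()
    serve.shutdown()
    elapsed = time.monotonic() - start

    # Before the fix, shutdown could hang waiting on the nonexistent
    # actor; with it, this should finish far below the 1000 s timeout.
    assert elapsed < 60
    ray.shutdown()
```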

@zcin (Contributor) left a comment


Almost there! Unfortunately, we just have to preemptively guard against flaky tests; I left the details in a comment.

@zcin (Contributor) left a comment


Awesome!

@zcin zcin added the go add ONLY when ready to merge, run all tests label Apr 23, 2025
@dhakshin32 (Contributor, Author) commented:

@zcin python/ray/serve/tests/test_deploy_app.py::test_deploy_multi_app_deleting seems to be failing repeatedly due to a connection issue. Is this a flaky test of some kind?

@zcin (Contributor) commented Apr 28, 2025

investigation results:

test_update_config_graceful_shutdown_timeout is failing because the replica that is initially deployed with graceful_shutdown_timeout_s=1000 doesn't get updated correctly, so the replica stays around for the remainder of the test suite, breaking all the tests that come after it.

The cause of the bug is that when the new code in _stop_replica calls check_started, it resets the deployment config, which overrides graceful_shutdown_timeout_s back to 1000.

I recommend avoiding another call to check_started and instead caching the startup status results, along the lines of the sketch below.
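
A minimal sketch of that caching approach, with illustrative names (`ReplicaWrapper`, `_poll_actor`, and the `startup_status` property are stand-ins; the real check_started lives on Ray Serve's replica wrapper and has side effects, including the config reset, that this omits):

```python
from enum import Enum
from typing import Optional


class ReplicaStartupStatus(Enum):
    PENDING_ALLOCATION = "PENDING_ALLOCATION"
    SUCCEEDED = "SUCCEEDED"


class ReplicaWrapper:
    def __init__(self):
        self._cached_status: Optional[ReplicaStartupStatus] = None

    def check_started(self) -> ReplicaStartupStatus:
        # Expensive poll with side effects; cache its result so that
        # stop paths can consult the status without re-triggering it.
        status = self._poll_actor()
        self._cached_status = status
        return status

    @property
    def startup_status(self) -> ReplicaStartupStatus:
        # _stop_replica reads the cache instead of calling
        # check_started again, avoiding the config reset.
        return self._cached_status or ReplicaStartupStatus.PENDING_ALLOCATION

    def _poll_actor(self) -> ReplicaStartupStatus:
        # Stand-in for checking whether the actor has been allocated.
        return ReplicaStartupStatus.PENDING_ALLOCATION
```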

Signed-off-by: Dhakshin Suriakannu <[email protected]>
@dhakshin32 (Contributor, Author) commented:

[2025-04-30T04:40:22Z]  * branch                  refs/pull/52416/head -> FETCH_HEAD
[2025-04-30T04:40:22Z] # FETCH_HEAD is now `17e540f212d4138532e10ee7a69c2b9f4ff23801`
[2025-04-30T04:40:22Z] $ git checkout -f 64411d1d7c1abc968919bddc75233bf1338a8ffb
[2025-04-30T04:40:22Z] fatal: reference is not a tree: 64411d1d7c1abc968919bddc75233bf1338a8ffb
[2025-04-30T04:40:22Z] ⚠️ Warning: Checkout failed! checking out commit "64411d1d7c1abc968919bddc75233bf1338a8ffb": exit status 128 (Attempt 3/3)
[2025-04-30T04:40:22Z] 🚨 Error: checking out commit "64411d1d7c1abc968919bddc75233bf1338a8ffb": exit status 128

Re-signed and force-pushed the commits since I forgot to sign one. This is causing the above error now.

@zcin (Contributor) commented Apr 30, 2025

@dhakshin32 test_deploy_app is passing now, but it seems test_recover_deleting_application is failing now. Let me know if you need help debugging.

@github-actions bot commented Jun 8, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 8, 2025
@dhakshin32 dhakshin32 requested a review from a team as a code owner June 14, 2025 19:30
@github-actions github-actions bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 15, 2025
@github-actions bot commented:

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 29, 2025
@github-actions bot commented:

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Jul 13, 2025
