
Conversation

@dhakshin32 (Contributor) commented Apr 17, 2025

Why are these changes needed?

This PR fixes an issue where deployments would hang indefinitely during shutdown if they contained replicas in the PENDING_ALLOCATION state. Previously, Ray Serve would attempt to gracefully stop these unallocated replicas by calling replica.stop(), which blocks waiting for actors that were never created. The fix skips the graceful shutdown step for replicas that haven't been allocated yet, allowing deployments to shut down promptly instead of getting stuck waiting on actors that don't exist.
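
A minimal sketch of the behavior described above, assuming simplified stand-in names (`ReplicaStartupStatus`, `DeploymentReplica`, and this `_stop_replica` are illustrative, not the actual Ray Serve controller code):

```python
# Hedged, self-contained sketch -- NOT the actual Ray Serve controller
# code; class and method names are simplified stand-ins.
from enum import Enum


class ReplicaStartupStatus(Enum):
    PENDING_ALLOCATION = "PENDING_ALLOCATION"  # actor not yet created
    SUCCEEDED = "SUCCEEDED"


class DeploymentReplica:
    def __init__(self, startup_status: ReplicaStartupStatus):
        self.startup_status = startup_status

    def stop(self, graceful: bool = True) -> None:
        if graceful:
            # Waits on the replica actor to drain; if the actor was never
            # allocated, this is the call that used to hang indefinitely.
            print("gracefully draining replica actor...")
        else:
            print("force-killing replica immediately")


def _stop_replica(replica: DeploymentReplica) -> None:
    # Core of the fix: only stop gracefully if the actor actually exists.
    graceful = replica.startup_status is not ReplicaStartupStatus.PENDING_ALLOCATION
    replica.stop(graceful=graceful)


_stop_replica(DeploymentReplica(ReplicaStartupStatus.PENDING_ALLOCATION))
```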

Related issue number

#52416

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dhakshin32 dhakshin32 changed the title skip shutdown [Serve] Immediately terminate unscheduled replicas Apr 17, 2025
@dhakshin32 dhakshin32 marked this pull request as ready for review April 17, 2025 22:28
@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label Apr 18, 2025
@mascharkh mascharkh added the serve Ray Serve Related Issue label Apr 19, 2025
@zcin zcin self-requested a review April 21, 2025 22:32
@zcin zcin self-assigned this Apr 21, 2025
@zcin (Contributor) left a comment

Looking great! Could you also add an e2e test? It can start an application with a fake custom resource; then, when it is terminated, the replica should not wait for the graceful shutdown timeout. A sketch follows below.
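
A hedged sketch of the kind of e2e test being suggested, not the test that landed in the PR: the deployment name, the fake resource name, and the 60-second bound are all illustrative, and `_blocking=False` is assumed to make `serve.run` return without waiting for the never-ready replica.

```python
import time

import ray
from ray import serve


@serve.deployment(
    ray_actor_options={"resources": {"fake_custom_resource": 1}},  # unsatisfiable
    graceful_shutdown_timeout_s=1000,
)
class NeverScheduled:
    def __call__(self) -> str:
        return "unreachable"


def test_shutdown_skips_unallocated_replica():
    ray.init()
    # Assumption: _blocking=False returns without waiting for the replica
    # to become ready (it never will -- no node has fake_custom_resource).
    serve.run(NeverScheduled.bind(), _blocking=False)

    start = time.monotonic()
    serve.shutdown()
    elapsed = time.monotonic() - start

    # Before the fix, shutdown could hang waiting on the nonexistent
    # actor; with it, this should finish far below the 1000 s timeout.
    assert elapsed < 60
    ray.shutdown()
```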

@zcin (Contributor) left a comment


Almost there! Unfortunately, we just have to preemptively guard against flaky tests; I left the details in a comment.

@zcin (Contributor) left a comment


Awesome!

@zcin zcin added the go add ONLY when ready to merge, run all tests label Apr 23, 2025
@dhakshin32 (Contributor, Author) commented:

@zcin python/ray/serve/tests/test_deploy_app.py::test_deploy_multi_app_deleting seems to be failing repeatedly due to a connection issue. Is this a flaky test of some kind?

@zcin (Contributor) commented Apr 28, 2025

investigation results:

test_update_config_graceful_shutdown_timeout is failing because the replica that is initially deployed with graceful_shutdown_timeout_s=1000 doesn't get updated correctly, so the replica stays around for the remainder of the test suite, breaking all the tests that come after it.

The cause of the bug is that when the new code in _stop_replica calls check_started, it resets the deployment config, which overrides graceful_shutdown_timeout_s back to 1000.

I recommend avoiding another call to check_started and instead caching the startup status results, along the lines of the sketch below.
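
A minimal sketch of that caching approach, with illustrative names (`ReplicaWrapper`, `_poll_actor`, and the `startup_status` property are stand-ins; the real check_started lives on Ray Serve's replica wrapper and has side effects, including the config reset, that this omits):

```python
from enum import Enum
from typing import Optional


class ReplicaStartupStatus(Enum):
    PENDING_ALLOCATION = "PENDING_ALLOCATION"
    SUCCEEDED = "SUCCEEDED"


class ReplicaWrapper:
    def __init__(self):
        self._cached_status: Optional[ReplicaStartupStatus] = None

    def check_started(self) -> ReplicaStartupStatus:
        # Expensive poll with side effects; cache its result so that
        # stop paths can consult the status without re-triggering it.
        status = self._poll_actor()
        self._cached_status = status
        return status

    @property
    def startup_status(self) -> ReplicaStartupStatus:
        # _stop_replica reads the cache instead of calling
        # check_started again, avoiding the config reset.
        return self._cached_status or ReplicaStartupStatus.PENDING_ALLOCATION

    def _poll_actor(self) -> ReplicaStartupStatus:
        # Stand-in for checking whether the actor has been allocated.
        return ReplicaStartupStatus.PENDING_ALLOCATION
```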

Signed-off-by: Dhakshin Suriakannu <[email protected]>
@dhakshin32 (Contributor, Author) commented:

[2025-04-30T04:40:22Z]  * branch                  refs/pull/52416/head -> FETCH_HEAD
[2025-04-30T04:40:22Z] # FETCH_HEAD is now `17e540f212d4138532e10ee7a69c2b9f4ff23801`
[2025-04-30T04:40:22Z] $ git checkout -f 64411d1d7c1abc968919bddc75233bf1338a8ffb
[2025-04-30T04:40:22Z] fatal: reference is not a tree: 64411d1d7c1abc968919bddc75233bf1338a8ffb
[2025-04-30T04:40:22Z] ⚠️ Warning: Checkout failed! checking out commit "64411d1d7c1abc968919bddc75233bf1338a8ffb": exit status 128 (Attempt 3/3)
[2025-04-30T04:40:22Z] 🚨 Error: checking out commit "64411d1d7c1abc968919bddc75233bf1338a8ffb": exit status 128

Re-signed and force-pushed the commits since I forgot to sign one. This is causing the above error now.

@zcin (Contributor) commented Apr 30, 2025

@dhakshin32 test_deploy_app is passing now, but it seems test_recover_deleting_application is failing now. Let me know if you need help debugging.

@github-actions bot commented Jun 8, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 8, 2025
@dhakshin32 dhakshin32 requested a review from a team as a code owner June 14, 2025 19:30
@github-actions github-actions bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 15, 2025
@github-actions bot commented:

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 29, 2025
@github-actions bot commented:

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Jul 13, 2025
