
[Autoscaler][V2] Fix autoscaler terminating more nodes than exist of a type #52760


Open · wants to merge 2 commits into master

Conversation

ryanaoleary (Contributor)

Why are these changes needed?

We're seeing an issue where the contents of worker_to_delete_set are incorrect, which stops the autoscaler from reconciling correctly because the following assertion fails:

assert num_workers_dict[to_delete_instance.node_type] >= 0

Currently, in _get_workers_delete_info, if multiple nodes in the same worker group have a pending deletion, only the first one is added to worker_to_delete_set. This set is then used in _initialize_scale_request here:

if to_delete_instance.cloud_instance_id in worker_to_delete_set:
  # If the instance is already in the workersToDelete field of
  # any worker group, skip it.
  continue

but since not all pending-deletion workers are added to the set, the above assertion fails and every subsequent call to _initialize_scale_request is unsuccessful.
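
For illustration, here is a minimal sketch of the collection behavior the fix is aiming for: every worker listed in a group's scaleStrategy.workersToDelete ends up in worker_to_delete_set, not just the first one. The field paths follow the KubeRay RayCluster CR; the function shape and return values below are assumptions for illustration, not the exact code changed in this PR.

# Hypothetical sketch (not the actual patch): collect *every* pending-deletion
# worker from the RayCluster CR so _initialize_scale_request can skip all of them.
def _get_workers_delete_info(raycluster: dict):
    worker_groups_with_pending_deletes = set()
    worker_to_delete_set = set()
    for group in raycluster.get("spec", {}).get("workerGroupSpecs", []):
        workers_to_delete = (
            group.get("scaleStrategy", {}).get("workersToDelete") or []
        )
        if not workers_to_delete:
            continue
        worker_groups_with_pending_deletes.add(group.get("groupName"))
        for worker_name in workers_to_delete:
            # The bug described above: previously only the first worker per group
            # was recorded, so the remaining workers in the same group were not
            # skipped later and the per-type count could go negative.
            worker_to_delete_set.add(worker_name)
    return worker_groups_with_pending_deletes, worker_to_delete_set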

Related issue number

Closes #52264

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary ryanaoleary requested a review from a team as a code owner May 2, 2025 23:40
ryanaoleary (Contributor, Author) commented on May 2, 2025

cc: @kevin85421 I tested this with the V2 autoscaler unit tests, but I haven't been able to manually replicate the issue in a way that would let me test the fix with my dev image. I added a guard and logging in 503eeea, since it seems like bad behavior for the autoscaler to block all future scale-up and terminate calls (both of which rely on _initialize_scale_request) if the worker_to_delete_set is out of sync.
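
For reference, a minimal sketch of the kind of guard described above, using the names from the snippets quoted in this PR; decrement_worker_count is a hypothetical helper for illustration, not the actual change in 503eeea. Instead of asserting, the count is clamped and a warning is logged so one out-of-sync entry cannot permanently block reconciliation.

import logging

logger = logging.getLogger(__name__)

def decrement_worker_count(num_workers_dict: dict, node_type: str, cloud_instance_id: str) -> None:
    # Hypothetical guard replacing the assert: warn and clamp instead of raising,
    # so a stale workersToDelete entry does not block every later call that
    # relies on _initialize_scale_request.
    new_count = num_workers_dict.get(node_type, 0) - 1
    if new_count < 0:
        logger.warning(
            "Attempted to delete more %s workers than are currently tracked; "
            "instance %s is likely already pending deletion.",
            node_type,
            cloud_instance_id,
        )
        new_count = 0
    num_workers_dict[node_type] = new_count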

kevin85421 (Member) left a comment

@ryanaoleary, I recall you mentioned that a user can consistently reproduce the issue. Could you build an image for them to verify it? Additionally, if possible, please do your best to reproduce the issue so we can add tests to prevent it from happening again.

@@ -249,8 +249,14 @@ def _initialize_scale_request(
# any worker group, skip it.
continue

num_workers_dict[to_delete_instance.node_type] -= 1
assert num_workers_dict[to_delete_instance.node_type] >= 0
if num_workers_dict[to_delete_instances.node_type] <= 0:
Review comment (Member):
Revert this change; it may bury some issues deeper.

Successfully merging this pull request may close these issues:

[Autoscaler][V2] Autoscaler fails to delete idle KubeRay Pod
3 participants