
[Autoscaler][V2] Fix autoscaler terminating more nodes than exist of a type #52760


Merged: 12 commits merged into ray-project:master on May 15, 2025

Conversation

@ryanaoleary (Contributor) commented May 2, 2025

Why are these changes needed?

We're seeing an issue where the value of worker_to_delete_set is incorrect, which prevents the autoscaler from reconciling correctly because the following assertion fails:

assert num_workers_dict[to_delete_instance.node_type] >= 0

Currently in _get_workers_delete_info, if there are multiple nodes in the same worker group with a pending deletion, only the first one is added to worker_to_delete_set. This set is then used in _initialize_scale_request here:

if to_delete_instance.cloud_instance_id in worker_to_delete_set:
  # If the instance is already in the workersToDelete field of
  # any worker group, skip it.
  continue

but since not all workers are added to the set, the above assertion fails and future calls to _initialize_scale_request are unsuccessful.
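
For illustration only, here's a minimal sketch of the accounting the fix restores. The helper name and data shapes are simplified placeholders rather than the actual provider code; the point is that every pending-delete worker in a group must be recorded, not just the first one, so that _initialize_scale_request skips all of them:

from collections import defaultdict
from typing import Dict, List, Set, Tuple

def collect_workers_to_delete(
    pending_delete_pods_by_group: Dict[str, List[str]],
) -> Tuple[Set[str], Dict[str, List[str]]]:
    """Record every pod pending deletion (hypothetical, simplified helper)."""
    worker_to_delete_set: Set[str] = set()
    workers_to_delete: Dict[str, List[str]] = defaultdict(list)
    for group, pod_names in pending_delete_pods_by_group.items():
        # The bug was equivalent to only recording pod_names[0] per group,
        # which left the remaining pods unskipped downstream and drove
        # num_workers_dict below zero.
        for pod_name in pod_names:
            worker_to_delete_set.add(pod_name)
            workers_to_delete[group].append(pod_name)
    return worker_to_delete_set, workers_to_delete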

Related issue number

#52264

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary ryanaoleary requested a review from a team as a code owner May 2, 2025 23:40
@ryanaoleary (Contributor, Author) commented May 2, 2025

cc: @kevin85421 I tested this with the V2 autoscaler unit tests, but I haven't been able to manually replicate the issue so that I could test the fix with my dev image. I added a guard and logging in 503eeea, since it seems like bad behavior for the autoscaler to block all future scale-up and terminate calls (both rely on _initialize_scale_request) if the workers_to_delete set is out of sync.

@kevin85421 (Member) left a comment

@ryanaoleary, I recall you mentioned that a user can consistently reproduce the issue. Could you build an image for them to verify it? Additionally, if possible, please do your best to reproduce the issue so we can add tests to prevent it from happening again.

@ryanaoleary ryanaoleary requested a review from kevin85421 May 5, 2025 22:30
@kevin85421 (Member) left a comment

Chatted with @ryanaoleary offline. The customer can't use the dev image. Let's merge this PR and ask the users to verify whether the issue is resolved.

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label May 5, 2025
@kevin85421 (Member) commented

cc @jjyao @edoakes for merging this PR. Thanks!

@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label May 7, 2025
@ryanaoleary (Contributor, Author) commented

I wrote a unit test that sets replicas and minReplicas to 0 and then attempts to terminate a node in the same reconciliation (I haven't pushed it yet because it's failing); this results in the error occurring even with the fix:

            num_workers_dict[to_delete_instance.node_type] -= 1
>           assert num_workers_dict[to_delete_instance.node_type] >= 0
E           AssertionError

../instance_manager/cloud_providers/kuberay/cloud_provider.py:253: AssertionError

python/ray/autoscaler/v2/tests/test_node_provider.py::KubeRayProviderIntegrationTest.test_scale_down_pods_with_replica_limit ⨯
2025-05-08 18:36:39,480 INFO cloud_provider.py:121 -- Terminating worker pods: ['raycluster-autoscaler-worker-small-group-dkz2r']
2025-05-08 18:36:39,480 INFO cloud_provider.py:480 -- Fetched pod data at resource version .
2025-05-08 18:36:39,481 INFO cloud_provider.py:331 -- Submitting a scale request: KubeRayProvider.ScaleRequest(desired_num_workers=defaultdict(<class 'int'>, {'small-group': 0, 'gpu-group': 1, 'tpu-group': 1}), workers_to_delete=defaultdict(<class 'list'>, {'small-group': [CloudInstance(cloud_instance_id='raycluster-autoscaler-worker-small-group-dkz2r', node_type='small-group', node_kind=2, is_running=False, request_id=None)]}), worker_groups_without_pending_deletes=set(), worker_groups_with_pending_deletes=set())

since cur_instances still contains the instance to be deleted, but the value of replicas has already been updated to a lower value. This is why I think we need the change from b4f32fa that adds a guard and a warning (rather than an assertion) when a value in num_workers_dict goes below 0: the autoscaler shouldn't keep erroring on every subsequent iteration that calls _initialize_scale_request if a user updates replicas/minReplicas manually. We can instead log a warning and allow it to reconcile on the next iteration; since the Pod will terminate after the scale request is submitted, workers_to_delete will clear and the value of replicas/minReplicas will become accurate. What do you think @kevin85421 @jjyao?
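
For reference, a rough sketch of the kind of guard I'm describing from b4f32fa (the exact diff may differ; the helper wrapping here is only illustrative):

import logging
from collections import defaultdict

logger = logging.getLogger(__name__)

def decrement_desired_count(num_workers_dict: defaultdict, node_type: str) -> None:
    # Clamp at zero and warn instead of asserting, so a stale count (e.g. after
    # replicas/minReplicas was lowered while a delete was still pending) doesn't
    # wedge every later call to _initialize_scale_request.
    num_workers_dict[node_type] -= 1
    if num_workers_dict[node_type] < 0:
        logger.warning(
            "Desired count for node type %s dropped below 0; clamping to 0 "
            "and reconciling on the next iteration.",
            node_type,
        )
        num_workers_dict[node_type] = 0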

@kevin85421 (Member) commented

Moving our Slack messages here:

In my opinion, we shouldn’t allow users to manually update replicas; they should only be able to adjust the minimum and maximum replica counts.

I think we shouldn't have the autoscaler completely crash if a user updates replicas/minReplicas manually

This makes sense, but I’m worried it could bury other issues even deeper. Could you ask the users if they’re trying to manually lower both minReplicas and replicas?

If we can confirm that the issue is caused by either _get_workers_delete_info or manually lowering both minReplicas and replicas, I’m OK with accepting b4f32fa.

@ryanaoleary ryanaoleary requested a review from jjyao May 12, 2025 16:56
@ryanaoleary (Contributor, Author) commented

(Quoting @kevin85421's comment above.)

I added a unit test in 9a6b143 that verifies the provider can handle multiple node deletions of the same type in one autoscaler iteration. This should cover the bug, fixed by this PR, of workers_to_delete not being counted correctly. I don't think they're manually updating replicas or minReplicas; can we merge this PR as-is and then revisit the guard/logging change in b4f32fa if it's still required based on their response?
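
Roughly, the new test asserts something like the following (this reuses the hypothetical collect_workers_to_delete helper sketched in the PR description, not the real provider API, so it's only an approximation of the actual test in 9a6b143):

import unittest

class MultipleDeletesSameGroupTest(unittest.TestCase):
    def test_all_pending_deletes_of_same_type_are_recorded(self):
        # Two pods of the same worker group are pending deletion in one iteration.
        pending = {"small-group": ["small-group-aaaaa", "small-group-bbbbb"]}
        to_delete_set, per_group = collect_workers_to_delete(pending)
        # Both pods must be recorded; otherwise the second one is counted again
        # downstream and the per-type count can go negative.
        self.assertEqual(to_delete_set, {"small-group-aaaaa", "small-group-bbbbb"})
        self.assertEqual(len(per_group["small-group"]), 2)

if __name__ == "__main__":
    unittest.main()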

@kevin85421 (Member) commented

can we merge this PR as-is and then revisit the guard/logging change in b4f32fa if it's still required based on their response?

SG. I will review the new changes.

@kevin85421 (Member) left a comment

@ryanaoleary ryanaoleary requested a review from kevin85421 May 13, 2025 00:21
@kevin85421 (Member) left a comment

Approving to unblock the user. I will open a follow-up PR with some improvements.

@kevin85421 (Member) commented

cc @jjyao would you mind merging this PR?

pending_deletes,
finished_deletes,
workers_to_delete,
) = self.provider._get_workers_delete_info(
A Collaborator commented

We just had a conversation about not testing private methods (i.e., implementation details): #52812 (comment). Can we test the public interface instead?

@kevin85421 (Member) commented May 13, 2025

I don’t agree with this philosophy in this case. Unit tests are much more appropriate here. We should

(1) Add more real e2e tests for KubeRay.
(2) Add more unit tests for well-defined functions.

We should avoid tests in between these two, where we need to mock behaviors; it's pretty common for the mocked behavior to be wrong.

A Collaborator commented

Yea, is there a way to test it through the public APIs?

A Member commented

Let's chat in-person.

pending_deletes,
finished_deletes,
workers_to_delete,
) = self.provider._get_workers_delete_info(
A Collaborator commented

_get_workers_delete_info is a static method, so we don't need to create a KubeRayProvider instance at all.

@ryanaoleary (Contributor, Author) commented

Oh, oops, yeah. I was using the provider to get the list of cloud instances, but since it's KubeRay we can just use the Pod names directly. This should be fixed in 2b10b1a.
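
As a generic illustration of that pattern (hypothetical class and method, not the real KubeRayProvider API), a @staticmethod can be exercised on the class itself with plain Pod names, with no provider instance or Kubernetes client involved:

from typing import Iterable, List, Set, Tuple

class ExampleProvider:
    @staticmethod
    def split_pending_deletes(
        live_pod_names: Iterable[str], pods_marked_for_delete: Set[str]
    ) -> Tuple[List[str], List[str]]:
        # Pods marked for deletion that still exist are pending; the rest are done.
        live = set(live_pod_names)
        pending = sorted(p for p in pods_marked_for_delete if p in live)
        finished = sorted(p for p in pods_marked_for_delete if p not in live)
        return pending, finished

# Called directly on the class in a test, the way a static method allows.
pending, finished = ExampleProvider.split_pending_deletes(
    ["worker-a", "worker-b"], {"worker-b", "worker-gone"}
)
assert pending == ["worker-b"] and finished == ["worker-gone"]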

@ryanaoleary ryanaoleary requested a review from jjyao May 15, 2025 00:00
@jjyao jjyao merged commit aa4deb3 into ray-project:master May 15, 2025
5 checks passed
@jjyao (Collaborator) commented May 15, 2025

Merging to unblock the fix. @kevin85421 will follow up on some of the test cleanup.

iamjustinhsu pushed a commit to iamjustinhsu/ray that referenced this pull request May 15, 2025
lk-chen pushed a commit to lk-chen/ray that referenced this pull request May 17, 2025
@JKBGIT1 commented May 26, 2025

Hello guys, what's the ETA for release, please?

vickytsang pushed a commit to ROCm/ray that referenced this pull request Jun 3, 2025
rebel-scottlee pushed a commit to rebellions-sw/ray that referenced this pull request Jun 21, 2025
Labels: clusters, community-backlog, community-contribution (Contributed by the community), go (add ONLY when ready to merge, run all tests), stability
7 participants