[Autoscaler][V2] Fix autoscaler terminating more nodes than exist of a type #52760
Conversation
Signed-off-by: Ryan O'Leary <[email protected]>
cc: @kevin85421 I tested this with the V2 autoscaler unit tests, but haven't been able to manually replicate the issue so that I could test the fix with my dev image. I added a guard and logging in 503eeea, since it seems like bad behavior for the autoscaler to block all future scale-up and terminate calls (since they both rely on _initialize_scale_request).
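For context, a minimal sketch of what such a guard could look like, assuming the change replaces a hard assertion with an error log plus a filter; the function and variable names here are hypothetical, not the actual code in 503eeea:

```python
import logging

logger = logging.getLogger(__name__)


# Hypothetical guard: instead of asserting that every worker pending
# deletion is a known instance (which raises and blocks all subsequent
# scale-up/terminate calls), log the unknown workers and drop them.
def filter_known_workers(to_delete, cur_instances):
    unknown = [w for w in to_delete if w not in cur_instances]
    if unknown:
        logger.error(
            "Workers pending deletion not found among current instances: %s",
            unknown,
        )
    return [w for w in to_delete if w in cur_instances]
```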
@ryanaoleary, I recall you mentioned that a user can consistently reproduce the issue. Could you build an image for them to verify it? Additionally, if possible, please do your best to reproduce the issue so we can add tests to prevent it from happening again.
Signed-off-by: Ryan O'Leary <[email protected]>
Chatted with @ryanaoleary offline. The customer can't use the dev image. Let's merge this PR and ask the users to verify whether the issue is resolved.
I wrote a unit test that sets
Moving our Slack messages here: in my opinion, we shouldn't allow users to manually update replicas; they should only be able to adjust the minimum and maximum replica counts.
This makes sense, but I'm worried it could bury other issues even deeper. Could you ask the users if they're trying to manually lower both minReplicas and replicas? If we can confirm that the issue is caused either by _get_workers_delete_info or by manually lowering both minReplicas and replicas, I'm OK with accepting b4f32fa.
Signed-off-by: Ryan O'Leary <[email protected]>
I added a unit test in 9a6b143 that verifies that the provider can handle multiple node deletions of the same type in one autoscaler iteration. This should cover the bug with _get_workers_delete_info.
SG. I will review the new changes.
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Approving to unblock the user. I will open a follow-up PR with some improvements.
cc @jjyao would you mind merging this PR?
(
    pending_deletes,
    finished_deletes,
    workers_to_delete,
) = self.provider._get_workers_delete_info(
We just had a conversation about not testing private methods (i.e., implementation details): #52812 (comment). Can we test the public interface instead?
I don’t agree with this philosophy in this case. Unit tests are much more appropriate here. We should
(1) Add more real e2e tests for KubeRay.
(2) Add more unit tests for well-defined functions.
We should avoid the tests in between these two, where we need to mock behaviors; it's pretty common for the mocked behavior to be wrong.
Yea, is there a way to test it through the public APIs?
Let's chat in-person.
(
    pending_deletes,
    finished_deletes,
    workers_to_delete,
) = self.provider._get_workers_delete_info(
_get_workers_delete_info is a static method, so we don't need to create a KubeRayProvider instance at all.
Oh oops yeah, I was using the provider to get the list of cloud instances, but since it's KubeRay we can just use the Pod names. This should be fixed in 2b10b1a.
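For illustration, a rough sketch of what calling the static method directly with Pod names could look like. The spec layout follows the RayCluster CRD's scaleStrategy.workersToDelete field, and the argument order and return values are assumptions based on this thread, not copied from the merged test:

```python
from ray.autoscaler.v2.instance_manager.cloud_providers.kuberay.cloud_provider import (  # noqa: E501
    KubeRayProvider,
)

# One worker group with two Pods pending deletion.
ray_cluster_spec = {
    "workerGroupSpecs": [
        {
            "groupName": "small-group",
            "scaleStrategy": {
                "workersToDelete": ["worker-pod-0", "worker-pod-1"],
            },
        }
    ]
}
# Pods that still exist in the cluster; KubeRay uses Pod names as cloud
# instance IDs, so no KubeRayProvider instance is needed.
node_set = {"worker-pod-0", "worker-pod-1"}

(
    pending_deletes,
    finished_deletes,
    workers_to_delete,
) = KubeRayProvider._get_workers_delete_info(ray_cluster_spec, node_set)

# With the fix, both Pods should be tracked, not just the first.
assert workers_to_delete == {"worker-pod-0", "worker-pod-1"}
```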
Signed-off-by: Ryan O'Leary <[email protected]>
Merging to unblock the fix. @kevin85421 will follow up on some of the test cleanup.
Hello guys, what's the ETA for the release, please?
Why are these changes needed?
We're seeing an issue where the value of worker_to_delete_set is incorrect, which stops the autoscaler from reconciling correctly because an assertion in _initialize_scale_request fails. Currently, in _get_workers_delete_info, if there are multiple nodes in the same worker group with a pending deletion, only the first one is added to worker_to_delete_set. This set is then used in _initialize_scale_request, but since not all workers are added to the set, the assertion fails and all future calls to _initialize_scale_request are unsuccessful.
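To make the failure mode concrete, here is a simplified sketch of the per-group loop, a paraphrase of the logic rather than the exact source: the buggy version breaks out of the loop after recording the first pending worker, so worker_to_delete_set never contains the rest.

```python
# Buggy shape: at most one worker per group ends up in worker_to_delete_set.
for worker in workers_to_delete:  # the group's scaleStrategy.workersToDelete
    if worker in node_set:  # Pod still exists, so its deletion is pending
        worker_groups_with_pending_deletes.add(group_name)
        worker_to_delete_set.add(worker)
        break  # BUG: remaining pending workers are never recorded

# Fixed shape: record every worker, then flag the group if any is pending.
for worker in workers_to_delete:
    worker_to_delete_set.add(worker)
    if worker in node_set:
        worker_groups_with_pending_deletes.add(group_name)
```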
Related issue number
#52264
Checks
- I've signed off every commit (using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a new API (e.g., a method in Tune), I've added it in doc/source/tune/api/ under the corresponding .rst file.