[ca] fix: undo Scale operation when a scaleUpRequest times out #9132

thatmidwesterncoder wants to merge 2 commits into kubernetes:master
Conversation
Welcome @thatmidwesterncoder!
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: thatmidwesterncoder

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Hi @thatmidwesterncoder. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 90291a6 to c06cd0a
@elmiko / @jackfrancis, hello. Seeing how you've reviewed another community PR, I'm wondering if you could look into this one as well (or please redirect to someone else). Thank you in advance, much appreciated.
this seems like a nice fix, thank you @thatmidwesterncoder

/ok-to-test
elmiko left a comment
this patch makes sense to me, and i appreciate the addition of the tests too. i have a couple questions and we definitely need to get another reviewer to examine this.
```go
// Attempt to revert the failed scale-up by decreasing target size.
// This prevents cloud providers from indefinitely retrying failed provisioning attempts.
if scaleUpRequest.Increase > 0 {
	klog.V(2).Infof("Reverting timed-out scale-up for node group %v by decreasing target size by %d",
```
i wonder if this log should also be warning level?
how frequently does this get emitted during a failure scenario?
> i wonder if this log should also be warning level?
That is a good point - I missed that the log on line 302 is a warning as well, so it makes sense to use the same level.
> how frequently does this get emitted during a failure scenario?
It essentially gets emitted every N minutes equivalent to the max-node-provision-time flag. This is per node group though, so if the autoscaler is attempting to scale M node groups it would appear N*M times. All in all not very often, and it won't show up unless something bad is going on with the infrastructure provider.
> It essentially gets emitted every N minutes equivalent to the max-node-provision-time flag.
this doesn't seem overly chatty to me, i was concerned that if we made this a warning level log then perhaps it would create too much spam in the logs. but i think it's slow enough that even with multiple node groups in failure it would not be too noisy.
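For context on the verbosity discussion, here is a tiny, self-contained sketch of the difference between the two log levels. It is not the autoscaler's code; the node group name and increase value are made up for illustration, and only the klog calls mirror the quoted diff.

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	// Simulate running with low verbosity, e.g. an operator that did not pass -v=2.
	flag.Set("v", "1")
	flag.Parse()

	// Suppressed at -v=1: V(2) messages only appear when verbosity is 2 or higher.
	klog.V(2).Infof("Reverting timed-out scale-up for node group %v by decreasing target size by %d", "ng1", 4)

	// Always emitted regardless of -v, which is why a warning is easier to spot.
	klog.Warningf("Reverting timed-out scale-up for node group %v by decreasing target size by %d", "ng1", 4)

	klog.Flush()
}
```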
```go
assert.NoError(t, err)

// Verify DecreaseTargetSize was called even though it returned an error
mockedNodeGroup.AssertCalled(t, "DecreaseTargetSize", -4)
```
do we have any way to detect that the error path was utilized?
yes, I updated the test to monitor for the events that would only be triggered during the error condition (FailedToRevertScaleUp).
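As a rough sketch of that kind of event-based assertion (not the actual test in this PR): the `record.FakeRecorder` wiring and the direct `Eventf` call below are stand-ins for the real clusterstate setup, and only the `FailedToRevertScaleUp` reason comes from the comment above.

```go
package clusterstate_test

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

func TestFailedToRevertScaleUpEventSketch(t *testing.T) {
	recorder := record.NewFakeRecorder(10)

	// Stand-in for the clusterstate code under test, which would emit
	// FailedToRevertScaleUp when DecreaseTargetSize returns an error.
	recorder.Eventf(&corev1.Node{}, corev1.EventTypeWarning, "FailedToRevertScaleUp",
		"failed to decrease target size for node group %s", "ng1")

	select {
	case event := <-recorder.Events:
		// FakeRecorder delivers events as formatted strings, so a substring
		// check on the reason is enough to prove the error path ran.
		assert.Contains(t, event, "FailedToRevertScaleUp")
	case <-time.After(time.Second):
		t.Fatal("expected a FailedToRevertScaleUp event, got none")
	}
}
```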
```go
	assert.Nil(t, clusterstate.scaleUpRequests["ng1"])
}

func TestExpiredScaleUpRevertsPartialIncrease(t *testing.T) {
```
given the title of this test, i would have expected that one of the two nodes requested for scale up has failed. could you highlight the differences from the TestExpiredScaleUpRevertsTargetSize, i'm having trouble seeing them.
Ah - I should have added some nodes to the mocked node group to make it a little more clear. Essentially I wanted to prove that when some nodes come online as ready, but not enough to satisfy the scale request, the autoscaler would still time out waiting for the nodes, resulting in the entire scale-up operation getting undone. I updated it to demonstrate that a little better, hopefully.
thanks! i figured that is what you wanted to test, i was having difficulty seeing the difference.
Force-pushed from c06cd0a to ae55033
elmiko left a comment
thanks for the updates!
/lgtm
given the nature of this change i definitely think we should get another set of eyes on the review. cc @jackfrancis @towca
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR modifies the cluster-autoscaler to automatically revert (decrease) the target size of a node group when a scale-up request times out.
Currently, when a scale-up fails due to infrastructure provider capacity limits (e.g., CAPI infrastructure provider cannot provision new nodes), the autoscaler removes the scale-up request from tracking and puts the node group in backoff, but it does not decrease the target size back to its original value. This causes the infrastructure provider to indefinitely retry provisioning the failed nodes, even after the workload that triggered the scale-up has disappeared. Manual intervention is required to scale the cluster back down.
With this fix, when a scale-up request times out:

- the autoscaler calls `DecreaseTargetSize()` on the node group to revert the increase

This allows clusters running close to infrastructure capacity to automatically recover from failed scale-ups without manual intervention.
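A minimal sketch of the intended timeout handling, assuming a simplified `ScaleUpRequest` struct and a hypothetical `handleExpiredScaleUp` helper; `cloudprovider.NodeGroup` and its `DecreaseTargetSize` method are the real cluster-autoscaler interfaces, the rest is illustrative:

```go
package clusterstate

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
	"k8s.io/klog/v2"
)

// ScaleUpRequest is a simplified stand-in for the tracked scale-up state.
type ScaleUpRequest struct {
	NodeGroup       cloudprovider.NodeGroup
	Increase        int
	ExpectedAddTime time.Time
}

// handleExpiredScaleUp sketches the new timeout behaviour: once the request
// has exceeded max-node-provision-time, revert the target size so the cloud
// provider stops retrying the failed provisioning.
func handleExpiredScaleUp(req *ScaleUpRequest, now time.Time) {
	if now.Before(req.ExpectedAddTime) {
		return // still within max-node-provision-time
	}
	// Existing behaviour (dropping the request from tracking and backing off
	// the node group) is omitted here; the new part is the revert below.
	if req.Increase > 0 {
		klog.Warningf("Reverting timed-out scale-up for node group %s by decreasing target size by %d",
			req.NodeGroup.Id(), req.Increase)
		if err := req.NodeGroup.DecreaseTargetSize(-req.Increase); err != nil {
			klog.Warningf("Failed to revert scale-up for node group %s: %v", req.NodeGroup.Id(), err)
		}
	}
}
```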
Which issue(s) this PR fixes:
Fixes #9120
Special notes for your reviewer:
- The `DecreaseTargetSize` method is called with a negative value equal to the original `Increase` from the scale-up request
- Existing tests were updated with a `WithOnScaleUp` handler since the test provider now needs to handle the decrease operation
- `TestExpiredScaleUpRevertsTargetSize` - verifies target size is decreased on timeout
- `TestExpiredScaleUpRevertsTargetSizeHandlesError` - verifies graceful handling when decrease fails
- `TestExpiredScaleUpRevertsPartialIncrease` - verifies correct decrease amount for partial increases

Does this PR introduce a user-facing change?