Fix LoadBalancer hang by cleaning up orphaned NIC on quota failures#6386
Fix LoadBalancer hang by cleaning up orphaned NIC on quota failures#6386cgiradkar wants to merge 1 commit into
Conversation
|
Welcome @cgiradkar! |
|
Hi @cgiradkar. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
ec0f07e to
73dc45e
Compare
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
73dc45e to
0520278
Compare
|
Hey @cgiradkar thanks for the PR! I'll take a look at the Jira ticket you tagged me in, but was AI used to generate this PR? It looks so from the PR description generated. If so, you must declare it was generated with AI per https://www.kubernetes.dev/docs/guide/pull-requests/#ai-guidance |
When VM provisioning fails due to Azure quota exhaustion, CAPZ leaves the NIC orphaned (without VirtualMachine.ID). This causes cloud-provider-azure's LoadBalancer reconciler to fail cluster-wide when querying NICs. This change implements defensive cleanup of orphaned NICs after a 1-minute grace period when quota errors are detected. The Machine is then marked as permanently Failed to avoid recreation loops. Changes: - Add quota error detection using azcore.ResponseError with multiple error codes - Add special handling for OperationNotAllowed (validates "quota" in message) - Snapshot VMRunningCondition before reconcile to preserve LastTransitionTime - Track quota failures via VMProvisionFailedReason condition - Delete orphaned NIC via service layer if quota error persists after grace period - Mark Machine as Failed with FailureReason=CreateError after cleanup - Add unit tests for condition-based grace period and quota detection Fixes cluster-wide LoadBalancer failures during quota exhaustion while avoiding recreation loops by marking Machines as permanently Failed. Signed-off-by: Chetan Giradkar <cgiradka@redhat.com>
0520278 to
dece0c5
Compare
Thanks for pointing it out, I have declared that in the description now. |
bryan-cox
left a comment
There was a problem hiding this comment.
Thanks for the PR, @cgiradkar. The problem is real — currently, quota errors from the Azure API aren't wrapped as azure.ReconcileError (terminal or transient) in azure/services/async/async.go:125-127, so they fall through to the catch-all error return in reconcileNormal. The Machine retries indefinitely, the NIC stays orphaned, and the LB stays broken. This PR addresses that gap.
However, I have several concerns about the approach and implementation, detailed in inline comments below. The biggest questions:
1. Recreation loops with MHC
This PR sets SetFailureReason(azure.CreateError) to mark the Machine as permanently Failed. The PR description claims "Machine stays Failed until admin intervention." However, per CAPI MHC docs: "If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately." This means MHC will delete the Machine and MachineSet will create a replacement — the exact recreation loop this PR aims to prevent. Does this approach only work when MHC is not configured for the MachineSet?
2. Scope is narrow
This only handles quota errors (4 Azure error codes). Orphaned NICs can result from any VM creation failure (throttling, timeouts, capacity constraints, transient API errors). Your cloud-provider-azure PR #10164 handles orphaned NICs comprehensively at the CCM layer. What's the review status on that PR? It seems like the more complete fix.
3. Simpler alternative?
Would classifying quota errors as terminal ReconcileError in azure/services/async/async.go be a simpler path? The existing terminal error handling in reconcileNormal (lines 317-324) already does SetFailureReason + SetVMState(Failed) + event emission. The recreation loop concern applies equally to both approaches, so the simpler path would get the same result with significantly less new code and no new grace period mechanism.
4. AI disclosure
Per Kubernetes AI guidance, please disclose if AI was used to generate this PR or its description. The PR structure (commented-out alternative strategies, detailed "Special notes for your reviewer" sections) suggests it may have been.
This review was AI-assisted using Claude Code.
| } | ||
| } | ||
|
|
||
| /* |
There was a problem hiding this comment.
Commented-out code should not be shipped. These ~25 lines of commented-out retry logic should be removed entirely. This block also contains a syntax error (}. instead of } on the closing line of setRetryCount) that proves it was never compiled.
Alternative strategies belong in the PR description or a linked design doc, not in production source. CAPZ's own CONTRIBUTING.md (lines 53-54) says: "Expect reviewers to request that you avoid common go style mistakes in your PRs."
Please remove the entire /* ... */ block.
| if graceExpired { | ||
| log.Info("Grace period expired, cleaning up orphaned NIC", "error", err.Error()) | ||
| if nicErr := ams.DeleteOrphanedNIC(ctx); nicErr != nil { | ||
| log.Error(nicErr, "Failed to delete orphaned NIC during quota failure cleanup: %s", nicErr.Error()) |
There was a problem hiding this comment.
Bug: log.Error uses structured logging (logr), not format strings. The %s will appear literally in the log output, and nicErr.Error() is passed as a malformed key-value pair.
The error is already the first argument to log.Error and is automatically included in the structured output. Compare to the existing pattern at line 319:
log.Error(err, "failed to reconcile AzureMachine", "name", machineScope.Name())This should be:
log.Error(nicErr, "Failed to delete orphaned NIC during quota failure cleanup")|
|
||
| machineScope.SetFailureReason(azure.CreateError) | ||
| machineScope.SetFailureMessage(err) | ||
| machineScope.SetNotReady() |
There was a problem hiding this comment.
Two issues with this block:
1. Missing SetVMState(infrav1.Failed). Compare to the existing terminal error handling pattern at lines 320-323 of this same file:
machineScope.SetFailureReason(azure.CreateError)
machineScope.SetFailureMessage(err)
machineScope.SetNotReady()
machineScope.SetVMState(infrav1.Failed) // <-- missing from this PROmitting SetVMState leaves Status.VMState inconsistent — failure reason is set but VM state doesn't reflect it.
2. Missing Kubernetes event emission. The existing error handling patterns in this function emit events via amr.Recorder.Eventf(...) for significant state changes (see line 307, line 318, line 336). An event as significant as "cleaning up orphaned NIC and permanently marking Machine as Failed due to quota exhaustion" should emit an event — operators watching kubectl get events need this.
| // Snapshot before ams.Reconcile: virtualmachines.Reconcile calls UpdatePutStatus(VMRunningCondition, ..., err), | ||
| // overwriting the reason to infrav1.FailedReason. A post-reconcile grace check would see "Failed", not "VMProvisionFailedReason", | ||
| // and re-marking afterward would reset LastTransitionTime, so the grace window would never elapse. | ||
| quotaFailureCondSnapshot := getQuotaProvisionFailedCondition(machineScope.AzureMachine) |
There was a problem hiding this comment.
The snapshot pattern is correct but dangerously subtle. It works because:
- Snapshot captures
VMRunningConditionwithReason=VMProvisionFailedReason+ originalLastTransitionTime ams.Reconcile→UpdatePutStatusoverwrites the condition toReason=FailedReason(azure/scope/machine.go:809)markQuotaProvisionFailedConditionthen re-overwrites it back toVMProvisionFailedReason, preserving the originalLastTransitionTime
This creates a condition ping-pong every reconcile loop (VMProvisionFailedReason → FailedReason → VMProvisionFailedReason), and correctness depends on knowing the internal behavior of UpdatePutStatus. If that function's behavior changes, this logic silently breaks.
Consider using an annotation (e.g., azure.infrastructure.cluster.x-k8s.io/quota-failure-timestamp) instead of repurposing LastTransitionTime as a timer. Annotations are the standard CAPI pattern for controller-internal state tracking and don't conflict with the condition management framework.
| if respErr.ErrorCode == code { | ||
| if code == azureOperationNotAllowedCode { | ||
| errMsg := strings.ToLower(respErr.Error()) | ||
| return strings.Contains(errMsg, "quota") |
There was a problem hiding this comment.
Fragile detection: depends on Azure's error message wording. ResponseError.Error() (source) reads the response body and formats it into a multi-line diagnostic string. This check depends on the word "quota" appearing somewhere in that formatted output.
If Azure changes the wording (e.g., "resource limit" instead of "quota"), this detection silently fails and orphaned NICs won't be cleaned up. Consider also checking respErr.StatusCode as a secondary signal, or document this as a known best-effort heuristic.
| } | ||
|
|
||
| // deleteOrphanedNIC deletes the network interface service. | ||
| func (s *azureMachineService) deleteOrphanedNIC(ctx context.Context) error { |
There was a problem hiding this comment.
Several issues with this method:
-
String coupling across packages.
service.Name() == "interfaces"depends on theserviceNameconstant inazure/services/networkinterfaces/networkinterfaces.go:31. If that constant changes, this silently returnsnil— the caller believes deletion succeeded when nothing happened. -
Silent success when service not found. If "interfaces" isn't in
s.services, the method returnsnil. This should return an error likefmt.Errorf("network interfaces service not found"). -
Misleading name.
DeleteOrphanedNICimplies selective deletion of only orphaned NICs. In reality,service.Delete(ctx)callsnetworkinterfaces.Delete(), which iterates allNICSpecs()and deletes every NIC. -
Undocumented side effect.
networkinterfaces.Delete()callsUpdateDeleteStatus(NetworkInterfaceReadyCondition, ...)at line 120, modifyingNetworkInterfaceReadyConditionon the AzureMachine. This is untested.
Alternative: Consider whether the existing ams.Delete(ctx) (which already deletes NICs as part of the reverse-order teardown at azuremachine_reconciler.go:155-168) could be reused.
| } | ||
|
|
||
| func TestQuotaHandling(t *testing.T) { | ||
| g := NewWithT(t) |
There was a problem hiding this comment.
gomega NewWithT scoping issue. This g := NewWithT(t) uses the outer test's t, but each t.Run(...) subtest gets its own *testing.T. When a gomega assertion inside a subtest fails, it reports against the outer t, not the subtest's t. Each subtest should create its own:
t.Run("detects quota errors", func(t *testing.T) {
g := NewWithT(t) // use the subtest's t
...
})Also missing test cases:
DeleteOrphanedNICreturning an error (the error path at line 441-442 is untested)- The
OperationNotAllowedtest constructsResponseErrorby settingBodydirectly viaio.NopCloser, bypassing the SDK's response pipeline — this may not reflect how real Azure responses are processed
| } | ||
|
|
||
| const ( | ||
| quotaFailureGracePeriod = 1 * time.Minute |
There was a problem hiding this comment.
nit: Why 1 minute? Quota errors are deterministic rejections — if Azure says you exceeded your quota at time T, it'll still be exceeded at T+60s unless someone increases the quota (unlikely within 60s). During this grace period, the orphaned NIC continues causing the cluster-wide LB failure this PR aims to fix.
If the goal is to distinguish "VM creation still in progress" from "VM creation rejected," the existing IsOperationNotDoneError check already handles the former. A quota error is an explicit rejection, not a pending operation.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes cluster-wide LoadBalancer reconciliation failures caused by orphaned Azure NICs when VM provisioning fails due to quota exhaustion.
Problem: When Azure VM creation fails due to quota exhaustion, CAPZ leaves the NIC orphaned (created but with
VirtualMachine.ID = nil). The cloud-provider-azure CCM enumerates all NICs during LoadBalancer backend pool reconciliation and crashes when encountering these orphaned NICs, causing cluster-wide LoadBalancer failures—not just for the affected NodePool.Solution: Implement defensive NIC cleanup with a 1-minute grace period when quota errors are detected, then mark the Machine as permanently Failed to prevent recreation loops.
Which issue(s) this PR fixes:
Fixes # (ARO-HCP internal: ARO-25867)
Special notes for your reviewer:
Critical Non-Obvious Pattern: Condition Snapshot Before Reconcile
Lines 139-143 snapshot the condition before
ams.Reconcile()for a subtle but critical reason:virtualmachines.ReconcilecallsUpdatePutStatus(VMRunningCondition, ..., err)which overwrites the condition reason toinfrav1.FailedReason. If we checked the grace period after reconcile, we'd see "Failed" instead of "VMProvisionFailedReason", and re-marking would resetLastTransitionTimeto "now". The grace window would restart every reconcile loop and never elapse.The snapshot preserves the original
LastTransitionTimefrom the first quota failure, allowing the grace period to work correctly across multiple reconcile attempts.Implementation Approach: Conditions Over Annotations
This implementation uses existing CAPI
Status.ConditionsAPI instead of custom annotations for grace period tracking:VMRunningCondition.LastTransitionTimeto track when quota failure first occurredVMRunningCondition.Status = FalsewithReason = VMProvisionFailedReasonLastTransitionTimeacross reconcile loops to enable accurate grace period trackingNon-Obvious: Machine Is NOT Deleted
Critical behavioral note: After NIC cleanup, this code marks the Machine as Failed but does NOT delete the Machine object. This is intentional.
Why not delete? Deleting the Machine triggers CAPI MachineSet to create a replacement (to maintain replica count). If quota persists across zones, this creates an infinite recreation loop. Therefore:
OperationNotAllowed Handling (lines 51-54)
#Ref: https://learn.microsoft.com/en-us/azure/azure-resource-manager/troubleshooting/error-resource-quota?tabs=azure-cli
OperationNotAllowedis ambiguous — it can indicate quota issues OR other authorization problems. We validate the error message contains "quota" to avoid cleaning up NICs for unrelatedOperationNotAllowederrors (e.g., RBAC issues, policy violations).Two Strategies (One Active, One Commented)
Active Strategy:
return ctrl.Result{}, nil- Permanent failure, no retriesCommented Strategy (lines 106-130): Exponential backoff retry mechanism
Edge Cases Handled
QuotaExceeded,OperationNotAllowed,ResourceQuotaExceeded,DeploymentQuotaExceeded) with special validation forOperationNotAllowedDeleteOrphanedNICfinds the "interfaces" service by name and calls itsDelete, reusing existing service deletion logicTest Fixtures (lines 249-282)
Two separate test fixtures handle quota scenarios:
getFakeAzureMachineServiceWithQuotaError: Returns quota error on Reconcile (for initial failure + grace period scenarios)getFakeAzureMachineServiceWithQuotaErrorAndNICDelete: Additionally mocks successful NIC deletion (for grace-expired scenario)This separation makes test intent clear and avoids conditional logic in fixtures.
Alternatives Considered
Annotation-based grace period: Rejected because:
LastTransitionTimeprovides exactly what we needImmediate Machine deletion: Rejected because:
New Custom FailureReason="QuotaExhausted": Rejected in favor of reusing
azure.CreateError. Benefits:CreateErrorRollout Considerations
Safe to deploy directly:
Backport candidate: This fixes a critical production issue (cluster-wide LB failures). Consider backporting to actively supported CAPZ releases.
TODOs:
Release note: