Skip to content

Fix LoadBalancer hang by cleaning up orphaned NIC on quota failures#6386

Draft
cgiradkar wants to merge 1 commit into
kubernetes-sigs:mainfrom
cgiradkar:feature/quota-failure-nic-cleanup
Draft

Fix LoadBalancer hang by cleaning up orphaned NIC on quota failures#6386
cgiradkar wants to merge 1 commit into
kubernetes-sigs:mainfrom
cgiradkar:feature/quota-failure-nic-cleanup

Conversation

@cgiradkar

@cgiradkar cgiradkar commented Jun 22, 2026

Copy link
Copy Markdown

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes cluster-wide LoadBalancer reconciliation failures caused by orphaned Azure NICs when VM provisioning fails due to quota exhaustion.

This PR was written in part with the assistance of generative AI

Problem: When Azure VM creation fails due to quota exhaustion, CAPZ leaves the NIC orphaned (created but with VirtualMachine.ID = nil). The cloud-provider-azure CCM enumerates all NICs during LoadBalancer backend pool reconciliation and crashes when encountering these orphaned NICs, causing cluster-wide LoadBalancer failures—not just for the affected NodePool.

Solution: Implement defensive NIC cleanup with a 1-minute grace period when quota errors are detected, then mark the Machine as permanently Failed to prevent recreation loops.

Which issue(s) this PR fixes:
Fixes # (ARO-HCP internal: ARO-25867)

Special notes for your reviewer:

Critical Non-Obvious Pattern: Condition Snapshot Before Reconcile

Lines 139-143 snapshot the condition before ams.Reconcile() for a subtle but critical reason:

virtualmachines.Reconcile calls UpdatePutStatus(VMRunningCondition, ..., err) which overwrites the condition reason to infrav1.FailedReason. If we checked the grace period after reconcile, we'd see "Failed" instead of "VMProvisionFailedReason", and re-marking would reset LastTransitionTime to "now". The grace window would restart every reconcile loop and never elapse.

The snapshot preserves the original LastTransitionTime from the first quota failure, allowing the grace period to work correctly across multiple reconcile attempts.

Implementation Approach: Conditions Over Annotations

This implementation uses existing CAPI Status.Conditions API instead of custom annotations for grace period tracking:

  • Leverages VMRunningCondition.LastTransitionTime to track when quota failure first occurred
  • Sets VMRunningCondition.Status = False with Reason = VMProvisionFailedReason
  • Preserves LastTransitionTime across reconcile loops to enable accurate grace period tracking

Non-Obvious: Machine Is NOT Deleted

Critical behavioral note: After NIC cleanup, this code marks the Machine as Failed but does NOT delete the Machine object. This is intentional.

Why not delete? Deleting the Machine triggers CAPI MachineSet to create a replacement (to maintain replica count). If quota persists across zones, this creates an infinite recreation loop. Therefore:

  • LoadBalancer is unblocked (primary goal achieved - NIC deleted)
  • No Azure resource consumption (NIC deleted, VM never created)
  • No recreation loop (Machine stays Failed)
  • Admin must eventually delete/update NodePool to trigger replacement

OperationNotAllowed Handling (lines 51-54)

#Ref: https://learn.microsoft.com/en-us/azure/azure-resource-manager/troubleshooting/error-resource-quota?tabs=azure-cli
OperationNotAllowed is ambiguous — it can indicate quota issues OR other authorization problems. We validate the error message contains "quota" to avoid cleaning up NICs for unrelated OperationNotAllowed errors (e.g., RBAC issues, policy violations).

Two Strategies (One Active, One Commented)

Active Strategy: return ctrl.Result{}, nil - Permanent failure, no retries

  • Machine stays Failed until admin intervention (NodePool delete/update/scale)

Commented Strategy (lines 106-130): Exponential backoff retry mechanism

  • Uncomment to enable self-healing behavior
  • Backoff: 15min → 30min → 60min, then permanent failure
  • Rejected for initial implementation because quota exhaustion is typically a capacity planning issue, not transient

Edge Cases Handled

  1. Grace period prevents premature cleanup: 1-minute wait allows transient quota issues to resolve before NIC deletion
  2. Condition-based idempotency: Multiple reconcile loops don't restart grace period if condition already set (LastTransitionTime doesn't change)
  3. Quota detection is comprehensive: Handles 4 Azure error codes (QuotaExceeded, OperationNotAllowed, ResourceQuotaExceeded, DeploymentQuotaExceeded) with special validation for OperationNotAllowed
  4. Service layer cleanup (azuremachine_reconciler.go): DeleteOrphanedNIC finds the "interfaces" service by name and calls its Delete, reusing existing service deletion logic

Test Fixtures (lines 249-282)

Two separate test fixtures handle quota scenarios:

  • getFakeAzureMachineServiceWithQuotaError: Returns quota error on Reconcile (for initial failure + grace period scenarios)
  • getFakeAzureMachineServiceWithQuotaErrorAndNICDelete: Additionally mocks successful NIC deletion (for grace-expired scenario)

This separation makes test intent clear and avoids conditional logic in fixtures.

Alternatives Considered

Annotation-based grace period: Rejected because:

  • Adding new annotations faces API review friction
  • Conditions are the idiomatic CAPI way to track state transitions
  • LastTransitionTime provides exactly what we need

Immediate Machine deletion: Rejected because:

  • Creates recreation loop if quota persists across all zones
  • Wastes Azure API calls
  • Makes problem less visible to operators

New Custom FailureReason="QuotaExhausted": Rejected in favor of reusing azure.CreateError. Benefits:

  • Minimal API surface change
  • Quota details preserved in condition message
  • Existing MHC rules already handle CreateError

Rollout Considerations

Safe to deploy directly:

  • Fully backward compatible (only activates on quota errors)
  • Uses existing CAPI condition API
  • No schema changes or migrations required
  • Defensive: only cleans NICs after confirmed quota failure + grace period

Backport candidate: This fixes a critical production issue (cluster-wide LB failures). Consider backporting to actively supported CAPZ releases.

TODOs:

  • squashed commits
  • includes documentation (inline code comments)
  • adds unit tests
  • cherry-pick candidate (needs maintainer input on which versions)

Release note:

Fix cluster-wide LoadBalancer failures caused by orphaned NICs during quota exhaustion. When VM creation fails due to Azure quota limits, CAPZ now cleans up orphaned NICs after a 1-minute grace period and marks the Machine as Failed to prevent recreation loops.

@kubernetes-prow kubernetes-prow Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Jun 22, 2026
@kubernetes-prow

Copy link
Copy Markdown
Contributor

Welcome @cgiradkar!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-azure 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-azure has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@kubernetes-prow kubernetes-prow Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 22, 2026
@kubernetes-prow

Copy link
Copy Markdown
Contributor

Hi @cgiradkar. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubernetes-prow

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sbueringer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgiradkar cgiradkar force-pushed the feature/quota-failure-nic-cleanup branch 2 times, most recently from ec0f07e to 73dc45e Compare June 22, 2026 16:34
@kubernetes-prow

Copy link
Copy Markdown
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubernetes-prow kubernetes-prow Bot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 22, 2026
@cgiradkar cgiradkar force-pushed the feature/quota-failure-nic-cleanup branch from 73dc45e to 0520278 Compare June 23, 2026 14:28
@bryan-cox

Copy link
Copy Markdown
Contributor

Hey @cgiradkar thanks for the PR! I'll take a look at the Jira ticket you tagged me in, but was AI used to generate this PR? It looks so from the PR description generated.

If so, you must declare it was generated with AI per https://www.kubernetes.dev/docs/guide/pull-requests/#ai-guidance

When VM provisioning fails due to Azure quota exhaustion, CAPZ leaves
the NIC orphaned (without VirtualMachine.ID). This causes cloud-provider-azure's
LoadBalancer reconciler to fail cluster-wide when querying NICs.

This change implements defensive cleanup of orphaned NICs after a 1-minute
grace period when quota errors are detected. The Machine is then marked as
permanently Failed to avoid recreation loops.

Changes:
- Add quota error detection using azcore.ResponseError with multiple error codes
- Add special handling for OperationNotAllowed (validates "quota" in message)
- Snapshot VMRunningCondition before reconcile to preserve LastTransitionTime
- Track quota failures via VMProvisionFailedReason condition
- Delete orphaned NIC via service layer if quota error persists after grace period
- Mark Machine as Failed with FailureReason=CreateError after cleanup
- Add unit tests for condition-based grace period and quota detection

Fixes cluster-wide LoadBalancer failures during quota exhaustion while
avoiding recreation loops by marking Machines as permanently Failed.

Signed-off-by: Chetan Giradkar <cgiradka@redhat.com>
@cgiradkar cgiradkar force-pushed the feature/quota-failure-nic-cleanup branch from 0520278 to dece0c5 Compare June 24, 2026 12:58
@cgiradkar

Copy link
Copy Markdown
Author

Hey @cgiradkar thanks for the PR! I'll take a look at the Jira ticket you tagged me in, but was AI used to generate this PR? It looks so from the PR description generated.

If so, you must declare it was generated with AI per https://www.kubernetes.dev/docs/guide/pull-requests/#ai-guidance

Thanks for pointing it out, I have declared that in the description now.
This PR was created using AI after multiple rounds of revision and the description was also drafted using AI but it conveys my thought process and alternatives considered during investigation of the original issue.

@bryan-cox bryan-cox left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @cgiradkar. The problem is real — currently, quota errors from the Azure API aren't wrapped as azure.ReconcileError (terminal or transient) in azure/services/async/async.go:125-127, so they fall through to the catch-all error return in reconcileNormal. The Machine retries indefinitely, the NIC stays orphaned, and the LB stays broken. This PR addresses that gap.

However, I have several concerns about the approach and implementation, detailed in inline comments below. The biggest questions:

1. Recreation loops with MHC

This PR sets SetFailureReason(azure.CreateError) to mark the Machine as permanently Failed. The PR description claims "Machine stays Failed until admin intervention." However, per CAPI MHC docs: "If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately." This means MHC will delete the Machine and MachineSet will create a replacement — the exact recreation loop this PR aims to prevent. Does this approach only work when MHC is not configured for the MachineSet?

2. Scope is narrow

This only handles quota errors (4 Azure error codes). Orphaned NICs can result from any VM creation failure (throttling, timeouts, capacity constraints, transient API errors). Your cloud-provider-azure PR #10164 handles orphaned NICs comprehensively at the CCM layer. What's the review status on that PR? It seems like the more complete fix.

3. Simpler alternative?

Would classifying quota errors as terminal ReconcileError in azure/services/async/async.go be a simpler path? The existing terminal error handling in reconcileNormal (lines 317-324) already does SetFailureReason + SetVMState(Failed) + event emission. The recreation loop concern applies equally to both approaches, so the simpler path would get the same result with significantly less new code and no new grace period mechanism.

4. AI disclosure

Per Kubernetes AI guidance, please disclose if AI was used to generate this PR or its description. The PR structure (commented-out alternative strategies, detailed "Special notes for your reviewer" sections) suggests it may have been.


This review was AI-assisted using Claude Code.

}
}

/*

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out code should not be shipped. These ~25 lines of commented-out retry logic should be removed entirely. This block also contains a syntax error (}. instead of } on the closing line of setRetryCount) that proves it was never compiled.

Alternative strategies belong in the PR description or a linked design doc, not in production source. CAPZ's own CONTRIBUTING.md (lines 53-54) says: "Expect reviewers to request that you avoid common go style mistakes in your PRs."

Please remove the entire /* ... */ block.

if graceExpired {
log.Info("Grace period expired, cleaning up orphaned NIC", "error", err.Error())
if nicErr := ams.DeleteOrphanedNIC(ctx); nicErr != nil {
log.Error(nicErr, "Failed to delete orphaned NIC during quota failure cleanup: %s", nicErr.Error())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: log.Error uses structured logging (logr), not format strings. The %s will appear literally in the log output, and nicErr.Error() is passed as a malformed key-value pair.

The error is already the first argument to log.Error and is automatically included in the structured output. Compare to the existing pattern at line 319:

log.Error(err, "failed to reconcile AzureMachine", "name", machineScope.Name())

This should be:

log.Error(nicErr, "Failed to delete orphaned NIC during quota failure cleanup")


machineScope.SetFailureReason(azure.CreateError)
machineScope.SetFailureMessage(err)
machineScope.SetNotReady()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues with this block:

1. Missing SetVMState(infrav1.Failed). Compare to the existing terminal error handling pattern at lines 320-323 of this same file:

machineScope.SetFailureReason(azure.CreateError)
machineScope.SetFailureMessage(err)
machineScope.SetNotReady()
machineScope.SetVMState(infrav1.Failed)  // <-- missing from this PR

Omitting SetVMState leaves Status.VMState inconsistent — failure reason is set but VM state doesn't reflect it.

2. Missing Kubernetes event emission. The existing error handling patterns in this function emit events via amr.Recorder.Eventf(...) for significant state changes (see line 307, line 318, line 336). An event as significant as "cleaning up orphaned NIC and permanently marking Machine as Failed due to quota exhaustion" should emit an event — operators watching kubectl get events need this.

// Snapshot before ams.Reconcile: virtualmachines.Reconcile calls UpdatePutStatus(VMRunningCondition, ..., err),
// overwriting the reason to infrav1.FailedReason. A post-reconcile grace check would see "Failed", not "VMProvisionFailedReason",
// and re-marking afterward would reset LastTransitionTime, so the grace window would never elapse.
quotaFailureCondSnapshot := getQuotaProvisionFailedCondition(machineScope.AzureMachine)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snapshot pattern is correct but dangerously subtle. It works because:

  1. Snapshot captures VMRunningCondition with Reason=VMProvisionFailedReason + original LastTransitionTime
  2. ams.ReconcileUpdatePutStatus overwrites the condition to Reason=FailedReason (azure/scope/machine.go:809)
  3. markQuotaProvisionFailedCondition then re-overwrites it back to VMProvisionFailedReason, preserving the original LastTransitionTime

This creates a condition ping-pong every reconcile loop (VMProvisionFailedReasonFailedReasonVMProvisionFailedReason), and correctness depends on knowing the internal behavior of UpdatePutStatus. If that function's behavior changes, this logic silently breaks.

Consider using an annotation (e.g., azure.infrastructure.cluster.x-k8s.io/quota-failure-timestamp) instead of repurposing LastTransitionTime as a timer. Annotations are the standard CAPI pattern for controller-internal state tracking and don't conflict with the condition management framework.

if respErr.ErrorCode == code {
if code == azureOperationNotAllowedCode {
errMsg := strings.ToLower(respErr.Error())
return strings.Contains(errMsg, "quota")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile detection: depends on Azure's error message wording. ResponseError.Error() (source) reads the response body and formats it into a multi-line diagnostic string. This check depends on the word "quota" appearing somewhere in that formatted output.

If Azure changes the wording (e.g., "resource limit" instead of "quota"), this detection silently fails and orphaned NICs won't be cleaned up. Consider also checking respErr.StatusCode as a secondary signal, or document this as a known best-effort heuristic.

}

// deleteOrphanedNIC deletes the network interface service.
func (s *azureMachineService) deleteOrphanedNIC(ctx context.Context) error {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several issues with this method:

  1. String coupling across packages. service.Name() == "interfaces" depends on the serviceName constant in azure/services/networkinterfaces/networkinterfaces.go:31. If that constant changes, this silently returns nil — the caller believes deletion succeeded when nothing happened.

  2. Silent success when service not found. If "interfaces" isn't in s.services, the method returns nil. This should return an error like fmt.Errorf("network interfaces service not found").

  3. Misleading name. DeleteOrphanedNIC implies selective deletion of only orphaned NICs. In reality, service.Delete(ctx) calls networkinterfaces.Delete(), which iterates all NICSpecs() and deletes every NIC.

  4. Undocumented side effect. networkinterfaces.Delete() calls UpdateDeleteStatus(NetworkInterfaceReadyCondition, ...) at line 120, modifying NetworkInterfaceReadyCondition on the AzureMachine. This is untested.

Alternative: Consider whether the existing ams.Delete(ctx) (which already deletes NICs as part of the reverse-order teardown at azuremachine_reconciler.go:155-168) could be reused.

}

func TestQuotaHandling(t *testing.T) {
g := NewWithT(t)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gomega NewWithT scoping issue. This g := NewWithT(t) uses the outer test's t, but each t.Run(...) subtest gets its own *testing.T. When a gomega assertion inside a subtest fails, it reports against the outer t, not the subtest's t. Each subtest should create its own:

t.Run("detects quota errors", func(t *testing.T) {
    g := NewWithT(t) // use the subtest's t
    ...
})

Also missing test cases:

  • DeleteOrphanedNIC returning an error (the error path at line 441-442 is untested)
  • The OperationNotAllowed test constructs ResponseError by setting Body directly via io.NopCloser, bypassing the SDK's response pipeline — this may not reflect how real Azure responses are processed

}

const (
quotaFailureGracePeriod = 1 * time.Minute

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why 1 minute? Quota errors are deterministic rejections — if Azure says you exceeded your quota at time T, it'll still be exceeded at T+60s unless someone increases the quota (unlikely within 60s). During this grace period, the orphaned NIC continues causing the cluster-wide LB failure this PR aims to fix.

If the goal is to distinguish "VM creation still in progress" from "VM creation rejected," the existing IsOperationNotDoneError check already handles the former. A quota error is an explicit rejection, not a pending operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

2 participants