AWSManagedMachinePool cannot recover when launch template referenced in status is deleted externally #5901

@Anindya-Banerjee

Description

/kind bug

What steps did you take and what happened:
When the EC2 launch template behind an AWSManagedMachinePool with amiType: CUSTOM and an awsLaunchTemplate spec is deleted outside of CAPA, the controller enters a permanent reconciliation failure loop. It cannot recover without manual intervention to clear the stale launch template ID from the resource's .status.

The controller logs:

Failed to reconcile launch template: operation error EC2: DescribeLaunchTemplateVersions,
https response error StatusCode: 400, RequestID: ...,
api error InvalidLaunchTemplateId.NotFound: The specified launch template,
with template ID lt-XXXXXXXXXXXXXXXXX, does not exist.

The LaunchTemplateReady condition is permanently set to False with reason LaunchTemplateReconcileFailed.

What did you expect to happen:

When the launch template stored in .status.launchTemplateID no longer exists in AWS, the controller should detect this and create a new launch template (the same behavior as initial provisioning), rather than failing permanently.

Anything else you would like to add:

How to reproduce

  1. Create an AWSManagedMachinePool that uses a custom AMI (amiType: CUSTOM with an awsLaunchTemplate spec)
  2. Wait for the machine pool to become ready — CAPA creates a launch template and stores its ID in .status.launchTemplateID
  3. Delete the launch template directly in AWS (e.g., via the EC2 console or CLI)
  4. Wait for the next reconciliation cycle
  5. The controller fails permanently with InvalidLaunchTemplateId.NotFound and never recovers

Root cause analysis

The ReconcileLaunchTemplate function in pkg/cloud/services/ec2/launchtemplate.go has two lookup patterns:

Name-based lookups (handle NotFound gracefully):

  • GetLaunchTemplate(name) — uses LaunchTemplateName in DescribeLaunchTemplateVersions. When the template doesn't exist, it returns nil (not an error), and the caller correctly creates a new template.
  • GetLaunchTemplateID(name) — same pattern.

ID-based lookups (do NOT handle NotFound):

  • GetLaunchTemplateLatestVersion(id) — uses LaunchTemplateId in DescribeLaunchTemplateVersions. When the template doesn't exist, AWS returns a 400 InvalidLaunchTemplateId.NotFound error, which is propagated as-is.
  • PruneLaunchTemplateVersions(id) — same pattern.

Both ID-based functions are called with scope.GetLaunchTemplateIDStatus(), which returns the stale ID from .status. Even though the name-based lookup in step 1 of reconciliation returns nil (correctly detecting the template is gone), the subsequent ID-based operations use the old status value and fail.
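The distinction the two lookup styles hinge on is which EC2 error code comes back for a missing template. A minimal, self-contained sketch of a classifier is below; the apiError interface and stubAPIError type are stand-ins mirroring the shape of smithy.APIError from the AWS SDK for Go v2, not CAPA's actual helpers. InvalidLaunchTemplateId.NotFound is the code from the log above; InvalidLaunchTemplateName.NotFoundException is the analogous EC2 code for name-based lookups.

```go
package main

import "errors"

// apiError mirrors the shape of smithy.APIError from the AWS SDK for Go v2.
// A local stand-in keeps this sketch self-contained.
type apiError interface {
	error
	ErrorCode() string
}

// stubAPIError is a minimal concrete error for illustration.
type stubAPIError struct{ code string }

func (e *stubAPIError) Error() string     { return e.code }
func (e *stubAPIError) ErrorCode() string { return e.code }

// isLaunchTemplateNotFound reports whether err means the launch template is
// gone, regardless of whether the lookup was by name or by ID. Handling both
// codes in one place is what the ID-based functions are currently missing.
func isLaunchTemplateNotFound(err error) bool {
	var ae apiError
	if !errors.As(err, &ae) {
		return false
	}
	switch ae.ErrorCode() {
	case "InvalidLaunchTemplateId.NotFound",
		"InvalidLaunchTemplateName.NotFoundException":
		return true
	}
	return false
}
```

With a helper like this, GetLaunchTemplateLatestVersion and PruneLaunchTemplateVersions could treat a stale ID the same way GetLaunchTemplate already treats a missing name.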

The relevant call chain in ReconcileLaunchTemplate:

  1. GetLaunchTemplate(name) → returns nil (template gone by name)
  2. Code path to create new template may or may not execute
  3. GetLaunchTemplateLatestVersion(scope.GetLaunchTemplateIDStatus()) → uses stale ID → fails
  4. PruneLaunchTemplateVersions(scope.GetLaunchTemplateIDStatus()) → uses stale ID → fails

Suggested fix

The ID-based functions (GetLaunchTemplateLatestVersion, PruneLaunchTemplateVersions) should handle InvalidLaunchTemplateId.NotFound errors, similar to how GetLaunchTemplate handles LaunchTemplateNameNotFound. When the stored ID is stale:

  1. Clear .status.launchTemplateID and .status.launchTemplateVersion
  2. Fall through to the "no existing launch template found, creating" code path
  3. Create a new launch template and update status accordingly

Alternatively, the reconciler could validate the status ID early in the reconciliation loop and clear it if the referenced template no longer exists.
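The alternative can be sketched as a small pre-check; poolStatus is an illustrative stand-in for the relevant .status fields, and the exists callback stands for any ID-resolving EC2 lookup that maps NotFound to false.

```go
package main

// poolStatus is an illustrative stand-in for the AWSManagedMachinePool
// .status fields involved; the real scope accessors differ.
type poolStatus struct {
	LaunchTemplateID      string
	LaunchTemplateVersion string
}

// validateLaunchTemplateStatus runs early in reconciliation: if the stored ID
// no longer resolves in EC2, the stale references are cleared so the rest of
// the loop behaves exactly like initial provisioning.
func validateLaunchTemplateStatus(s *poolStatus, exists func(id string) (bool, error)) error {
	if s.LaunchTemplateID == "" {
		return nil // nothing recorded yet; normal creation path applies
	}
	ok, err := exists(s.LaunchTemplateID)
	if err != nil {
		return err
	}
	if !ok {
		// The referenced template is gone: drop the stale status so the
		// "no existing launch template found, creating" path runs.
		s.LaunchTemplateID = ""
		s.LaunchTemplateVersion = ""
	}
	return nil
}
```

This variant keeps the ID-based functions unchanged and concentrates the recovery logic in one place, at the cost of one extra DescribeLaunchTemplates-style call per reconcile.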

Impact

In our environment, this bug affected 7 out of 21 clusters simultaneously. The launch templates were likely deleted by the continuous version churn described in #5819 (which also affects our CAPA version). Once deleted, the clusters entered a permanent failure state that persisted for over 60 days. The EKS node groups themselves continued running, but CAPA could not reconcile any changes to them.

The only workaround is to manually patch the status subresource:

kubectl patch awsmanagedmachinepool <name> -n <namespace> \
  --type=merge --subresource=status \
  -p '{"status":{"launchTemplateID":"","launchTemplateVersion":""}}'

Environment:

  • Cluster-api-provider-aws version: v2.10.0 (also likely affects v2.9.x and v2.10.1/v2.10.2 — the code path has not changed)
  • Kubernetes version: v1.34.3-eks-ac2d5a0
  • OS: latest EKS-optimized Bottlerocket AMI
