/kind bug
What steps did you take and what happened:
When an `AWSManagedMachinePool` with `amiType: CUSTOM` and an `awsLaunchTemplate` spec has its EC2 launch template deleted externally (outside of CAPA), the controller enters a permanent reconciliation failure loop. It cannot recover without manual intervention to clear the stale launch template ID from the resource's `.status`.
The controller logs:

```
Failed to reconcile launch template: operation error EC2: DescribeLaunchTemplateVersions, https response error StatusCode: 400, RequestID: ..., api error InvalidLaunchTemplateId.NotFound: The specified launch template, with template ID lt-XXXXXXXXXXXXXXXXX, does not exist.
```
The `LaunchTemplateReady` condition is permanently set to `False` with reason `LaunchTemplateReconcileFailed`.
What did you expect to happen:
When the launch template stored in `.status.launchTemplateID` no longer exists in AWS, the controller should detect this and create a new launch template (the same behavior as initial provisioning), rather than failing permanently.
Anything else you would like to add:
How to reproduce
- Create an `AWSManagedMachinePool` with a custom AMI (which triggers `amiType: CUSTOM` and `awsLaunchTemplate`)
- Wait for the machine pool to become ready; CAPA creates a launch template and stores its ID in `.status.launchTemplateID`
- Delete the launch template directly in AWS (e.g., via the EC2 console or CLI)
- Wait for the next reconciliation cycle
- The controller fails permanently with `InvalidLaunchTemplateId.NotFound` and never recovers
Root cause analysis
The `ReconcileLaunchTemplate` function in `pkg/cloud/services/ec2/launchtemplate.go` has two lookup patterns:

Name-based lookups (handle NotFound gracefully):
- `GetLaunchTemplate(name)`: uses `LaunchTemplateName` in `DescribeLaunchTemplateVersions`. When the template doesn't exist, it returns `nil` (not an error), and the caller correctly creates a new template.
- `GetLaunchTemplateID(name)`: same pattern.

ID-based lookups (do NOT handle NotFound):
- `GetLaunchTemplateLatestVersion(id)`: uses `LaunchTemplateId` in `DescribeLaunchTemplateVersions`. When the template doesn't exist, AWS returns a 400 `InvalidLaunchTemplateId.NotFound` error, which is propagated as-is.
- `PruneLaunchTemplateVersions(id)`: same pattern.

Both ID-based functions are called with `scope.GetLaunchTemplateIDStatus()`, which returns the stale ID from `.status`. Even though the name-based lookup in step 1 of reconciliation returns `nil` (correctly detecting that the template is gone), the subsequent ID-based operations use the old status value and fail.
The relevant call chain in `ReconcileLaunchTemplate`:

1. `GetLaunchTemplate(name)` → returns `nil` (template gone by name)
2. The code path to create a new template may or may not execute
3. `GetLaunchTemplateLatestVersion(scope.GetLaunchTemplateIDStatus())` → uses the stale ID → fails
4. `PruneLaunchTemplateVersions(scope.GetLaunchTemplateIDStatus())` → uses the stale ID → fails
Suggested fix
The ID-based functions (`GetLaunchTemplateLatestVersion`, `PruneLaunchTemplateVersions`) should handle `InvalidLaunchTemplateId.NotFound` errors, similar to how `GetLaunchTemplate` handles `LaunchTemplateNameNotFound`. When the stored ID is stale:
- Clear `.status.launchTemplateID` and `.status.launchTemplateVersion`
- Fall through to the "no existing launch template found, creating" code path
- Create a new launch template and update status accordingly
Alternatively, the reconciler could validate the status ID early in the reconciliation loop and clear it if the referenced template no longer exists.
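As a rough illustration of the suggested recovery, the decision logic can be simulated in isolation. This is a hypothetical sketch, not CAPA code: `lookupLatestVersion` stands in for `GetLaunchTemplateLatestVersion` (hard-wired to report the template as missing), and the `status` struct stands in for the machine pool's `.status` fields:

```go
package main

import "fmt"

// status stands in for the machine pool's .status fields referenced above.
type status struct {
	LaunchTemplateID      string
	LaunchTemplateVersion string
}

// lookupLatestVersion is a hypothetical stand-in for GetLaunchTemplateLatestVersion;
// found=false models an InvalidLaunchTemplateId.NotFound response from EC2.
func lookupLatestVersion(id string) (version string, found bool) {
	return "", false // simulate a template deleted outside of CAPA
}

// needsNewTemplate returns true when the reconciler should take the
// "no existing launch template found, creating" code path.
func needsNewTemplate(st *status) bool {
	if st.LaunchTemplateID == "" {
		return true // initial provisioning
	}
	if _, found := lookupLatestVersion(st.LaunchTemplateID); !found {
		// Stale ID: clear the stored status and fall through to creation,
		// mirroring how the name-based lookups already behave.
		st.LaunchTemplateID = ""
		st.LaunchTemplateVersion = ""
		return true
	}
	return false
}

func main() {
	st := &status{LaunchTemplateID: "lt-0123456789abcdef0", LaunchTemplateVersion: "3"}
	fmt.Println(needsNewTemplate(st))        // true
	fmt.Printf("%q\n", st.LaunchTemplateID)  // "" (cleared)
}
```

Either placement of this check (inside the ID-based functions or early in the reconcile loop) converges on the same outcome: the stale status is cleared and reconciliation proceeds as if provisioning for the first time.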
Impact
In our environment, this bug affected 7 out of 21 clusters simultaneously. The launch templates were likely deleted by the continuous version churn described in #5819 (which also affects our CAPA version). Once deleted, the clusters entered a permanent failure state that persisted for over 60 days. The EKS node groups themselves continued running, but CAPA could not reconcile any changes to them.
The only workaround is to manually patch the status subresource:

```shell
kubectl patch awsmanagedmachinepool <name> -n <namespace> \
  --type=merge --subresource=status \
  -p '{"status":{"launchTemplateID":"","launchTemplateVersion":""}}'
```
Environment:
- Cluster-api-provider-aws version: v2.10.0 (also likely affects v2.9.x and v2.10.1/v2.10.2; the code path has not changed)
- Kubernetes version: (use `kubectl version`): v1.34.3-eks-ac2d5a0
- OS (e.g. from `/etc/os-release`): latest EKS-optimized Bottlerocket AMI