AWSManagedMachinePool cannot recover when launch template referenced in status is deleted externally #5901

@Anindya-Banerjee

Description

/kind bug

What steps did you take and what happened:
When the EC2 launch template behind an AWSManagedMachinePool with amiType: CUSTOM and an awsLaunchTemplate spec is deleted outside of CAPA, the controller enters a permanent reconciliation failure loop. It cannot recover without manual intervention to clear the stale launch template ID from the resource's .status.

The controller logs:

Failed to reconcile launch template: operation error EC2: DescribeLaunchTemplateVersions,
https response error StatusCode: 400, RequestID: ...,
api error InvalidLaunchTemplateId.NotFound: The specified launch template,
with template ID lt-XXXXXXXXXXXXXXXXX, does not exist.

The LaunchTemplateReady condition is permanently set to False with reason LaunchTemplateReconcileFailed.

What did you expect to happen:

When the launch template stored in .status.launchTemplateID no longer exists in AWS, the controller should detect this and create a new launch template (the same behavior as initial provisioning), rather than failing permanently.

Anything else you would like to add:

How to reproduce

  1. Create an AWSManagedMachinePool that uses a custom AMI (amiType: CUSTOM with an awsLaunchTemplate spec)
  2. Wait for the machine pool to become ready — CAPA creates a launch template and stores its ID in .status.launchTemplateID
  3. Delete the launch template directly in AWS (e.g., via the EC2 console or CLI)
  4. Wait for the next reconciliation cycle
  5. The controller fails permanently with InvalidLaunchTemplateId.NotFound and never recovers

Root cause analysis

The ReconcileLaunchTemplate function in pkg/cloud/services/ec2/launchtemplate.go has two lookup patterns:

Name-based lookups (handle NotFound gracefully):

  • GetLaunchTemplate(name) — uses LaunchTemplateName in DescribeLaunchTemplateVersions. When the template doesn't exist, it returns nil (not an error), and the caller correctly creates a new template.
  • GetLaunchTemplateID(name) — same pattern.

ID-based lookups (do NOT handle NotFound):

  • GetLaunchTemplateLatestVersion(id) — uses LaunchTemplateId in DescribeLaunchTemplateVersions. When the template doesn't exist, AWS returns a 400 InvalidLaunchTemplateId.NotFound error, which is propagated as-is.
  • PruneLaunchTemplateVersions(id) — same pattern.

Both ID-based functions are called with scope.GetLaunchTemplateIDStatus(), which returns the stale ID from .status. Even though the name-based lookup in step 1 of reconciliation returns nil (correctly detecting the template is gone), the subsequent ID-based operations use the old status value and fail.
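The distinction the two lookup styles hinge on is which EC2 error code comes back for a missing template. A minimal, self-contained sketch of a classifier is below; the apiError interface and stubAPIError type are stand-ins mirroring the shape of smithy.APIError from the AWS SDK for Go v2, not CAPA's actual helpers. InvalidLaunchTemplateId.NotFound is the code from the log above; InvalidLaunchTemplateName.NotFoundException is the analogous EC2 code for name-based lookups.

```go
package main

import "errors"

// apiError mirrors the shape of smithy.APIError from the AWS SDK for Go v2.
// A local stand-in keeps this sketch self-contained.
type apiError interface {
	error
	ErrorCode() string
}

// stubAPIError is a minimal concrete error for illustration.
type stubAPIError struct{ code string }

func (e *stubAPIError) Error() string     { return e.code }
func (e *stubAPIError) ErrorCode() string { return e.code }

// isLaunchTemplateNotFound reports whether err means the launch template is
// gone, regardless of whether the lookup was by name or by ID. Handling both
// codes in one place is what the ID-based functions are currently missing.
func isLaunchTemplateNotFound(err error) bool {
	var ae apiError
	if !errors.As(err, &ae) {
		return false
	}
	switch ae.ErrorCode() {
	case "InvalidLaunchTemplateId.NotFound",
		"InvalidLaunchTemplateName.NotFoundException":
		return true
	}
	return false
}
```

With a helper like this, GetLaunchTemplateLatestVersion and PruneLaunchTemplateVersions could treat a stale ID the same way GetLaunchTemplate already treats a missing name.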

The relevant call chain in ReconcileLaunchTemplate:

  1. GetLaunchTemplate(name) → returns nil (template gone by name)
  2. Code path to create new template may or may not execute
  3. GetLaunchTemplateLatestVersion(scope.GetLaunchTemplateIDStatus()) → uses stale ID → fails
  4. PruneLaunchTemplateVersions(scope.GetLaunchTemplateIDStatus()) → uses stale ID → fails

Suggested fix

The ID-based functions (GetLaunchTemplateLatestVersion, PruneLaunchTemplateVersions) should handle InvalidLaunchTemplateId.NotFound errors, similar to how GetLaunchTemplate handles LaunchTemplateNameNotFound. When the stored ID is stale:

  1. Clear .status.launchTemplateID and .status.launchTemplateVersion
  2. Fall through to the "no existing launch template found, creating" code path
  3. Create a new launch template and update status accordingly

Alternatively, the reconciler could validate the status ID early in the reconciliation loop and clear it if the referenced template no longer exists.
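The alternative can be sketched as a small pre-check; poolStatus is an illustrative stand-in for the relevant .status fields, and the exists callback stands for any ID-resolving EC2 lookup that maps NotFound to false.

```go
package main

// poolStatus is an illustrative stand-in for the AWSManagedMachinePool
// .status fields involved; the real scope accessors differ.
type poolStatus struct {
	LaunchTemplateID      string
	LaunchTemplateVersion string
}

// validateLaunchTemplateStatus runs early in reconciliation: if the stored ID
// no longer resolves in EC2, the stale references are cleared so the rest of
// the loop behaves exactly like initial provisioning.
func validateLaunchTemplateStatus(s *poolStatus, exists func(id string) (bool, error)) error {
	if s.LaunchTemplateID == "" {
		return nil // nothing recorded yet; normal creation path applies
	}
	ok, err := exists(s.LaunchTemplateID)
	if err != nil {
		return err
	}
	if !ok {
		// The referenced template is gone: drop the stale status so the
		// "no existing launch template found, creating" path runs.
		s.LaunchTemplateID = ""
		s.LaunchTemplateVersion = ""
	}
	return nil
}
```

This variant keeps the ID-based functions unchanged and concentrates the recovery logic in one place, at the cost of one extra DescribeLaunchTemplates-style call per reconcile.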

Impact

In our environment, this bug affected 7 out of 21 clusters simultaneously. The launch templates were likely deleted by the continuous version churn described in #5819 (which also affects our CAPA version). Once deleted, the clusters entered a permanent failure state that persisted for over 60 days. The EKS node groups themselves continued running, but CAPA could not reconcile any changes to them.

The only workaround is to manually patch the status subresource:

kubectl patch awsmanagedmachinepool <name> -n <namespace> \
  --type=merge --subresource=status \
  -p '{"status":{"launchTemplateID":"","launchTemplateVersion":""}}'

Environment:

  • Cluster-api-provider-aws version: v2.10.0 (also likely affects v2.9.x and v2.10.1/v2.10.2 — the code path has not changed)
  • Kubernetes version: v1.34.3-eks-ac2d5a0
  • OS: latest EKS-optimized Bottlerocket AMI
