e2e-aks BYO node-pool test failing since ~2026-06-09

### Summary

The `pull-cluster-api-provider-azure-e2e-aks` presubmit has been failing on **every PR since ~2026-06-09**. The failing subtest is the **AKS BYO (self-managed VMSS) node-pool** test: the BYO VMs provision in Azure but their kubelet never registers as a node. The trigger is the auto-selected Kubernetes version rolling from **v1.35.4 → v1.35.5** (AKS began offering 1.35.5 control planes around June 9).

This is **not** caused by any particular PR (it reproduces on unrelated PRs).

> **Update / correction:** An earlier version of this issue hypothesized an **image-builder regression** in the self-managed node reference image between 1.35.4 and 1.35.5. **That has been ruled out** — see "Root cause analysis" below. The v1.35.5 node image is fine; the problem is the **self-managed-node → AKS-managed-control-plane join path** at 1.35.5.

### Failing test

```
Workload cluster creation > Creating an AKS cluster for node pool tests
  [Managed Kubernetes] with a single control plane node and 1 node
```
Fails at `test/e2e/aks_byo_node.go:183` (the `Eventually` asserting `len(nodes) == ExpectedWorkerNodes + 2`, line 175), timing out after 1500s:

```
Expected
    <int32>: 3
to equal
    <int32>: 5
```

The 3 AKS managed-pool nodes register; the 2 self-managed BYO VMSS nodes never do.

### When it broke + version correlation

| | Last passing (Jun 8) | Failing (Jun 9–10) |
|---|---|---|
| BYO `MachinePool` `spec.version` | **v1.35.4** | **v1.35.5** |
| BYO nodes registered? | ✅ `byo-pool000000/1` `node-describe.txt` present | ❌ no BYO `node-describe.txt` |
| AKS managed-pool nodes | ✅ present | ✅ present |

### Evidence the VMs come up but never join

From the management-cluster CAPZ controller logs and resource dumps on a failing run:

- The VMSS instances provision successfully: `Scale Set VM is Succeeded` for `byo-pool/virtualMachines/0` and `/1`.
- But the `AzureMachinePoolMachine`s stay `ready=false` (continuously `Requeuing ... state="Succeeded" ready=false`).
- The owning `Machine` is stuck `Pending`, `InfrastructureReady=False (NodeProvisioning)`, `NodeHealthy=Unknown` ("Waiting for AzureMachinePoolMachine to report spec.providerID"), no `nodeRef`.
- `BootstrapSucceeded=True` / `BootstrapConfigReady=True (DataSecretProvided)` — bootstrap data is fine.
- No `node-describe.txt` is collected for the BYO nodes (they never registered with the AKS API server).

So the VM boots in Azure (provisioningState `Succeeded`), but the kubelet never joins — isolated to **self-managed node bring-up against the AKS control plane at v1.35.5**. The AKS control plane and managed node pools are healthy in every run.

### Root cause analysis (image exonerated)

The BYO `AzureMachinePool` uses `image: nil`, so CAPZ resolves the default community-gallery reference image `capi-ubun2-2404` at the **exact k8s version** (here `1.35.5`). That image is built/published by image-builder.

The image is **not** the problem:

- `KUBERNETES_VERSION` / `CAPZ_GALLERY_VERSION` default to `stable-1.35`, which currently resolves to **v1.35.5**. So the **main self-managed e2e** (`pull-cluster-api-provider-azure-e2e`) uses the **same `capi-ubun2-2404` v1.35.5 image**.
- That job **passes** (e.g. on the same PRs where `-e2e-aks` fails) — i.e., the 1.35.5 image boots and its kubelet **joins a kubeadm control plane successfully**.
- The v1.35.5 image has been published in the community gallery since **2026-05-12**; if it were broken for joining, self-managed e2e would have been failing for ~4 weeks. It hasn't been.
- Reviewed all `images/capi` image-builder commits between the 1.35.4 build (HEAD `aa0c7ae7f`, ~Apr 15) and the 1.35.5 build (HEAD ~`567d5b08b`, ~May 12): no change that would break Ubuntu-node join. The usual suspect `5e16c2e83` ("move kubelet `--system-reserved` to kubelet-config") is excluded twice over — it merged **June 3** (after the May 12 image build) **and** is gated behind `kubernetes_enable_automatic_resource_sizing`, which is `false` for Azure.

Conclusion: the failure is specific to the **BYO join path** — a self-managed CAPZ node joining an **AKS-managed** control plane at 1.35.5 — and surfaced when AKS rolled its control planes from 1.35.4 to 1.35.5 (~June 9). Likely areas: the kubeadm/TLS bootstrap-token join against the AKS API server, kubelet client-cert/CA handling, or a version-skew/behavioral change between self-managed 1.35.5 nodes and the AKS 1.35.5 control plane.

### Scope (not PR-specific)

Same failure, same test, same timeout on unrelated PRs since ~Jun 9: #6325, #6347, #6348, #6350, #6353. Job was green Jun 5–8.

### Example runs

- ✅ Last passing (v1.35.4), PR #6330, Jun 8: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/6330/pull-cluster-api-provider-azure-e2e-aks/2064087176676642816
- ❌ Failing (v1.35.5), PR #6347, Jun 10: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/6347/pull-cluster-api-provider-azure-e2e-aks/2064511153769287680
- ❌ Failing, PR #6325, Jun 9: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/6325/pull-cluster-api-provider-azure-e2e-aks/2064387571605049344
- ❌ Failing, PR #6353, Jun 10: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/6353/pull-cluster-api-provider-azure-e2e-aks/2064525499652116480

### Suggested next steps

- Debug the **BYO-to-AKS join** at 1.35.5: pull the kubelet/cloud-init/`kubeadm join` logs from a BYO VMSS instance (not auto-collected, since the node never joins) to see why the kubelet doesn't register against the AKS API server.
- Check for bootstrap-token / TLS-bootstrap / client-cert behavior changes between AKS 1.35.4 and 1.35.5 control planes.
- As a stopgap to unblock the job, pin the e2e/BYO Kubernetes version to v1.35.4 (known-good) until the 1.35.5 BYO-join issue is understood.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

e2e-aks BYO node-pool test failing since ~2026-06-09 #6354

Summary

Failing test

When it broke + version correlation

Evidence the VMs come up but never join

Root cause analysis (image exonerated)

Scope (not PR-specific)

Example runs

Suggested next steps

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

	Last passing (Jun 8)	Failing (Jun 9–10)
BYO `MachinePool` `spec.version`	v1.35.4	v1.35.5
BYO nodes registered?	✅ `byo-pool000000/1` `node-describe.txt` present	❌ no BYO `node-describe.txt`
AKS managed-pool nodes	✅ present	✅ present

Uh oh!

e2e-aks BYO node-pool test failing since ~2026-06-09 #6354

Description

Summary

Failing test

When it broke + version correlation

Evidence the VMs come up but never join

Root cause analysis (image exonerated)

Scope (not PR-specific)

Example runs

Suggested next steps

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions