Summary
The pull-cluster-api-provider-azure-e2e-aks presubmit has been failing on every PR since ~2026-06-09. The failing subtest is the AKS BYO (self-managed VMSS) node-pool test: the BYO VMs provision in Azure but their kubelet never registers as a node. The trigger is the auto-selected Kubernetes version rolling from v1.35.4 → v1.35.5 (AKS began offering 1.35.5 control planes around June 9).
This is not caused by any particular PR (it reproduces on unrelated PRs).
Update / correction: An earlier version of this issue hypothesized an image-builder regression in the self-managed node reference image between 1.35.4 and 1.35.5. That has been ruled out — see "Root cause analysis" below. The v1.35.5 node image is fine; the problem is the self-managed-node → AKS-managed-control-plane join path at 1.35.5.
Failing test
Workload cluster creation > Creating an AKS cluster for node pool tests
[Managed Kubernetes] with a single control plane node and 1 node
Fails at test/e2e/aks_byo_node.go:183 (the Eventually asserting len(nodes) == ExpectedWorkerNodes + 2, line 175), timing out after 1500s:
Expected
<int32>: 3
to equal
<int32>: 5
The 3 AKS managed-pool nodes register; the 2 self-managed BYO VMSS nodes never do.
When it broke + version correlation
|
Last passing (Jun 8) |
Failing (Jun 9–10) |
BYO MachinePool spec.version |
v1.35.4 |
v1.35.5 |
| BYO nodes registered? |
✅ byo-pool000000/1 node-describe.txt present |
❌ no BYO node-describe.txt |
| AKS managed-pool nodes |
✅ present |
✅ present |
Evidence the VMs come up but never join
From the management-cluster CAPZ controller logs and resource dumps on a failing run:
- The VMSS instances provision successfully:
Scale Set VM is Succeeded for byo-pool/virtualMachines/0 and /1.
- But the
AzureMachinePoolMachines stay ready=false (continuously Requeuing ... state="Succeeded" ready=false).
- The owning
Machine is stuck Pending, InfrastructureReady=False (NodeProvisioning), NodeHealthy=Unknown ("Waiting for AzureMachinePoolMachine to report spec.providerID"), no nodeRef.
BootstrapSucceeded=True / BootstrapConfigReady=True (DataSecretProvided) — bootstrap data is fine.
- No
node-describe.txt is collected for the BYO nodes (they never registered with the AKS API server).
So the VM boots in Azure (provisioningState Succeeded), but the kubelet never joins — isolated to self-managed node bring-up against the AKS control plane at v1.35.5. The AKS control plane and managed node pools are healthy in every run.
Root cause analysis (image exonerated)
The BYO AzureMachinePool uses image: nil, so CAPZ resolves the default community-gallery reference image capi-ubun2-2404 at the exact k8s version (here 1.35.5). That image is built/published by image-builder.
The image is not the problem:
KUBERNETES_VERSION / CAPZ_GALLERY_VERSION default to stable-1.35, which currently resolves to v1.35.5. So the main self-managed e2e (pull-cluster-api-provider-azure-e2e) uses the same capi-ubun2-2404 v1.35.5 image.
- That job passes (e.g. on the same PRs where
-e2e-aks fails) — i.e., the 1.35.5 image boots and its kubelet joins a kubeadm control plane successfully.
- The v1.35.5 image has been published in the community gallery since 2026-05-12; if it were broken for joining, self-managed e2e would have been failing for ~4 weeks. It hasn't been.
- Reviewed all
images/capi image-builder commits between the 1.35.4 build (HEAD aa0c7ae7f, ~Apr 15) and the 1.35.5 build (HEAD ~567d5b08b, ~May 12): no change that would break Ubuntu-node join. The usual suspect 5e16c2e83 ("move kubelet --system-reserved to kubelet-config") is excluded twice over — it merged June 3 (after the May 12 image build) and is gated behind kubernetes_enable_automatic_resource_sizing, which is false for Azure.
Conclusion: the failure is specific to the BYO join path — a self-managed CAPZ node joining an AKS-managed control plane at 1.35.5 — and surfaced when AKS rolled its control planes from 1.35.4 to 1.35.5 (~June 9). Likely areas: the kubeadm/TLS bootstrap-token join against the AKS API server, kubelet client-cert/CA handling, or a version-skew/behavioral change between self-managed 1.35.5 nodes and the AKS 1.35.5 control plane.
Scope (not PR-specific)
Same failure, same test, same timeout on unrelated PRs since ~Jun 9: #6325, #6347, #6348, #6350, #6353. Job was green Jun 5–8.
Example runs
Suggested next steps
- Debug the BYO-to-AKS join at 1.35.5: pull the kubelet/cloud-init/
kubeadm join logs from a BYO VMSS instance (not auto-collected, since the node never joins) to see why the kubelet doesn't register against the AKS API server.
- Check for bootstrap-token / TLS-bootstrap / client-cert behavior changes between AKS 1.35.4 and 1.35.5 control planes.
- As a stopgap to unblock the job, pin the e2e/BYO Kubernetes version to v1.35.4 (known-good) until the 1.35.5 BYO-join issue is understood.
Summary
The
pull-cluster-api-provider-azure-e2e-akspresubmit has been failing on every PR since ~2026-06-09. The failing subtest is the AKS BYO (self-managed VMSS) node-pool test: the BYO VMs provision in Azure but their kubelet never registers as a node. The trigger is the auto-selected Kubernetes version rolling from v1.35.4 → v1.35.5 (AKS began offering 1.35.5 control planes around June 9).This is not caused by any particular PR (it reproduces on unrelated PRs).
Failing test
Fails at
test/e2e/aks_byo_node.go:183(theEventuallyassertinglen(nodes) == ExpectedWorkerNodes + 2, line 175), timing out after 1500s:The 3 AKS managed-pool nodes register; the 2 self-managed BYO VMSS nodes never do.
When it broke + version correlation
MachinePoolspec.versionbyo-pool000000/1node-describe.txtpresentnode-describe.txtEvidence the VMs come up but never join
From the management-cluster CAPZ controller logs and resource dumps on a failing run:
Scale Set VM is Succeededforbyo-pool/virtualMachines/0and/1.AzureMachinePoolMachines stayready=false(continuouslyRequeuing ... state="Succeeded" ready=false).Machineis stuckPending,InfrastructureReady=False (NodeProvisioning),NodeHealthy=Unknown("Waiting for AzureMachinePoolMachine to report spec.providerID"), nonodeRef.BootstrapSucceeded=True/BootstrapConfigReady=True (DataSecretProvided)— bootstrap data is fine.node-describe.txtis collected for the BYO nodes (they never registered with the AKS API server).So the VM boots in Azure (provisioningState
Succeeded), but the kubelet never joins — isolated to self-managed node bring-up against the AKS control plane at v1.35.5. The AKS control plane and managed node pools are healthy in every run.Root cause analysis (image exonerated)
The BYO
AzureMachinePoolusesimage: nil, so CAPZ resolves the default community-gallery reference imagecapi-ubun2-2404at the exact k8s version (here1.35.5). That image is built/published by image-builder.The image is not the problem:
KUBERNETES_VERSION/CAPZ_GALLERY_VERSIONdefault tostable-1.35, which currently resolves to v1.35.5. So the main self-managed e2e (pull-cluster-api-provider-azure-e2e) uses the samecapi-ubun2-2404v1.35.5 image.-e2e-aksfails) — i.e., the 1.35.5 image boots and its kubelet joins a kubeadm control plane successfully.images/capiimage-builder commits between the 1.35.4 build (HEADaa0c7ae7f, ~Apr 15) and the 1.35.5 build (HEAD ~567d5b08b, ~May 12): no change that would break Ubuntu-node join. The usual suspect5e16c2e83("move kubelet--system-reservedto kubelet-config") is excluded twice over — it merged June 3 (after the May 12 image build) and is gated behindkubernetes_enable_automatic_resource_sizing, which isfalsefor Azure.Conclusion: the failure is specific to the BYO join path — a self-managed CAPZ node joining an AKS-managed control plane at 1.35.5 — and surfaced when AKS rolled its control planes from 1.35.4 to 1.35.5 (~June 9). Likely areas: the kubeadm/TLS bootstrap-token join against the AKS API server, kubelet client-cert/CA handling, or a version-skew/behavioral change between self-managed 1.35.5 nodes and the AKS 1.35.5 control plane.
Scope (not PR-specific)
Same failure, same test, same timeout on unrelated PRs since ~Jun 9: #6325, #6347, #6348, #6350, #6353. Job was green Jun 5–8.
Example runs
Suggested next steps
kubeadm joinlogs from a BYO VMSS instance (not auto-collected, since the node never joins) to see why the kubelet doesn't register against the AKS API server.