e2e-aks BYO node-pool test failing since ~2026-06-09 #6354

Description

@mboersma

Summary

The pull-cluster-api-provider-azure-e2e-aks presubmit has been failing on every PR since ~2026-06-09. The failing subtest is the AKS BYO (self-managed VMSS) node-pool test: the BYO VMs provision in Azure but their kubelet never registers as a node. The trigger is the auto-selected Kubernetes version rolling from v1.35.4 → v1.35.5 (AKS began offering 1.35.5 control planes around June 9).

This is not caused by any particular PR (it reproduces on unrelated PRs).

Update / correction: An earlier version of this issue hypothesized an image-builder regression in the self-managed node reference image between 1.35.4 and 1.35.5. That has been ruled out — see "Root cause analysis" below. The v1.35.5 node image is fine; the problem is the self-managed-node → AKS-managed-control-plane join path at 1.35.5.

Failing test

Workload cluster creation > Creating an AKS cluster for node pool tests
  [Managed Kubernetes] with a single control plane node and 1 node

Fails at test/e2e/aks_byo_node.go:183 (the Eventually that starts at line 175 and asserts len(nodes) == ExpectedWorkerNodes + 2), timing out after 1500s:

Expected
    <int32>: 3
to equal
    <int32>: 5

The 3 AKS managed-pool nodes register; the 2 self-managed BYO VMSS nodes never do.
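
The wait that times out can be pictured as a poll loop over the workload cluster's node list. The sketch below is Python, not the Go/Gomega code in test/e2e/aks_byo_node.go; `wait_for_node_count` and `list_nodes` are hypothetical names, and the 10s poll interval is an assumption. The expected count of 5 and the 1500s timeout come from the failure output above, and ExpectedWorkerNodes = 3 is inferred from 5 = ExpectedWorkerNodes + 2.

```python
import time
from typing import Callable, Sequence


def wait_for_node_count(
    list_nodes: Callable[[], Sequence[str]],  # hypothetical: returns current node names
    expected: int,
    timeout_s: float = 1500.0,                # timeout reported in the failing run
    interval_s: float = 10.0,                 # assumed poll interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll until the node count equals `expected`, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        count = len(list_nodes())
        if count == expected:
            return count
        if clock() >= deadline:
            raise TimeoutError(
                f"expected {expected} nodes, saw {count} after {timeout_s}s"
            )
        sleep(interval_s)
```

In the failing runs the node list only ever contains the 3 managed-pool nodes, so a call such as `wait_for_node_count(list_nodes, expected=3 + 2)` would exhaust the timeout and raise.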

When it broke + version correlation

| | Last passing (Jun 8) | Failing (Jun 9–10) |
|---|---|---|
| BYO MachinePool spec.version | v1.35.4 | v1.35.5 |
| BYO nodes registered? | ✅ byo-pool000000/1 node-describe.txt present | ❌ no BYO node-describe.txt |
| AKS managed-pool nodes | ✅ present | ✅ present |

Evidence the VMs come up but never join

From the management-cluster CAPZ controller logs and resource dumps on a failing run:

  • The VMSS instances provision successfully: Scale Set VM is Succeeded for byo-pool/virtualMachines/0 and /1.
  • But the AzureMachinePoolMachines stay ready=false (continuously Requeuing ... state="Succeeded" ready=false).
  • The owning Machine is stuck Pending, InfrastructureReady=False (NodeProvisioning), NodeHealthy=Unknown ("Waiting for AzureMachinePoolMachine to report spec.providerID"), no nodeRef.
  • BootstrapSucceeded=True / BootstrapConfigReady=True (DataSecretProvided) — bootstrap data is fine.
  • No node-describe.txt is collected for the BYO nodes (they never registered with the AKS API server).

So the VM boots in Azure (provisioningState Succeeded), but the kubelet never joins — isolated to self-managed node bring-up against the AKS control plane at v1.35.5. The AKS control plane and managed node pools are healthy in every run.
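
The triage reasoning above reduces to three signals per BYO machine: the VM's provisioningState, whether bootstrap data was provided, and whether the Machine has a nodeRef. A minimal Python sketch of that decision follows; the function name and the returned labels are hypothetical and not part of CAPZ or Cluster API.

```python
def classify_join_state(
    vm_provisioning_state: str,  # e.g. "Succeeded", as reported for the VMSS instance
    bootstrap_ready: bool,       # e.g. BootstrapConfigReady=True (DataSecretProvided)
    has_node_ref: bool,          # True once the Machine has a nodeRef
) -> str:
    """Return a coarse label for where node bring-up stopped (illustrative only)."""
    if has_node_ref:
        return "joined"
    if not bootstrap_ready:
        return "bootstrap-data-missing"
    if vm_provisioning_state != "Succeeded":
        return "vm-not-provisioned"
    # VM is up and bootstrap data exists, yet no node registered.
    return "vm-up-kubelet-never-registered"
```

With the values seen on the failing runs, `classify_join_state("Succeeded", True, False)` returns `"vm-up-kubelet-never-registered"`, which is why the investigation focuses on the join path rather than on provisioning or bootstrap data.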

Root cause analysis (image exonerated)

The BYO AzureMachinePool uses image: nil, so CAPZ resolves the default community-gallery reference image capi-ubun2-2404 at the exact k8s version (here 1.35.5). That image is built/published by image-builder.

The image is not the problem:

  • KUBERNETES_VERSION / CAPZ_GALLERY_VERSION default to stable-1.35, which currently resolves to v1.35.5. So the main self-managed e2e (pull-cluster-api-provider-azure-e2e) uses the same capi-ubun2-2404 v1.35.5 image.
  • That job passes (e.g. on the same PRs where -e2e-aks fails) — i.e., the 1.35.5 image boots and its kubelet joins a kubeadm control plane successfully.
  • The v1.35.5 image has been published in the community gallery since 2026-05-12; if it were broken for joining, self-managed e2e would have been failing for ~4 weeks. It hasn't been.
  • Reviewed all images/capi image-builder commits between the 1.35.4 build (HEAD aa0c7ae7f, ~Apr 15) and the 1.35.5 build (HEAD ~567d5b08b, ~May 12): no change that would break Ubuntu-node join. The usual suspect 5e16c2e83 ("move kubelet --system-reserved to kubelet-config") is excluded twice over — it merged June 3 (after the May 12 image build) and is gated behind kubernetes_enable_automatic_resource_sizing, which is false for Azure.

Conclusion: the failure is specific to the BYO join path — a self-managed CAPZ node joining an AKS-managed control plane at 1.35.5 — and surfaced when AKS rolled its control planes from 1.35.4 to 1.35.5 (~June 9). Likely areas: the kubeadm/TLS bootstrap-token join against the AKS API server, kubelet client-cert/CA handling, or a version-skew/behavioral change between self-managed 1.35.5 nodes and the AKS 1.35.5 control plane.

Scope (not PR-specific)

Same failure, same test, same timeout on unrelated PRs since ~Jun 9: #6325, #6347, #6348, #6350, #6353. The job was green Jun 5–8.

Example runs

Suggested next steps

  • Debug the BYO-to-AKS join at 1.35.5: pull the kubelet/cloud-init/kubeadm join logs from a BYO VMSS instance (not auto-collected, since the node never joins) to see why the kubelet doesn't register against the AKS API server.
  • Check for bootstrap-token / TLS-bootstrap / client-cert behavior changes between AKS 1.35.4 and 1.35.5 control planes.
  • As a stopgap to unblock the job, pin the e2e/BYO Kubernetes version to v1.35.4 (known-good) until the 1.35.5 BYO-join issue is understood.
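
One possible way to pull the logs named in the first step is the Azure CLI's `az vmss run-command invoke`, which runs a shell script on a single VMSS instance. The sketch below only builds and runs that command from Python. The resource group is a placeholder, `byo-pool` is taken from the controller log lines quoted earlier, and the log paths (the kubelet journal and /var/log/cloud-init-output.log) are assumptions about the Ubuntu reference image rather than something confirmed in this issue.

```python
import subprocess
from typing import List

# Assumed log sources on the node; adjust if the image logs elsewhere.
LOG_SCRIPT = (
    "journalctl -u kubelet --no-pager -n 200; "
    "tail -n 200 /var/log/cloud-init-output.log"
)


def build_log_command(resource_group: str, vmss_name: str, instance_id: int) -> List[str]:
    """Build the az CLI argv that runs LOG_SCRIPT on one VMSS instance."""
    return [
        "az", "vmss", "run-command", "invoke",
        "--resource-group", resource_group,
        "--name", vmss_name,
        "--instance-id", str(instance_id),
        "--command-id", "RunShellScript",
        "--scripts", LOG_SCRIPT,
    ]


def fetch_logs(resource_group: str, vmss_name: str, instance_id: int) -> str:
    """Run the command and return the raw CLI output (JSON text from az)."""
    result = subprocess.run(
        build_log_command(resource_group, vmss_name, instance_id),
        capture_output=True,
        text=True,
        check=True,  # raise if az exits non-zero
    )
    return result.stdout
```

For example, `fetch_logs("<resource-group>", "byo-pool", 0)` would target instance 0 of the BYO scale set; it requires a logged-in `az` session with access to the subscription that holds the test cluster's resources.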
