ASO AgentPool status lag causes provider ID count mismatch in AzureASOManagedMachinePool during scale operations #5985

@andreipantelimon

/kind bug

What steps did you take and what happened:
When the AKS autoscaler scales up an Agent Pool managed by ASO, the statuses that AzureASOManagedMachinePool reports to the CAPI MachinePool get out of sync, which puts the CAPI object into a NotReady state.

  • T+0 minutes: The CAPZ ASO-managed cluster is stable; one node pool is reconciled and ready with x nodes, and there are no scale-up events
  • T+2 minutes: The AKS autoscaler triggers a scale-up event (the same behaviour occurs on scale-down)
  • T+4 minutes: Azure creates 2 new VMs, and they get provider IDs
  • T+4 minutes: The AzureASOManagedMachinePool controller reconciles normally, queries the existing nodes via the controller-runtime client, and sees x+2 nodes. It then updates its spec.providerIDList with those nodes.
  • T+4 minutes: The AzureASOManagedMachinePool controller reads the ASO AgentPool status.count (still x at this point) and uses it to update its own status.replicas and the CAPI MachinePool spec.replicas.
  • T+4 minutes: The CAPI MachinePool now has:
    • spec.replicas: x (modified by the CAPZ controller, inherited from the ASO AgentPool status)
    • status.readyReplicas: x+2 (count of the actual Nodes, from the infrastructureRef AzureASOManagedMachinePool)
  • Due to this mismatch, CAPI puts the MachinePool into Scaling and NotReady status
  • T+60 minutes: If no other scale events happen, ASO reconciles the new count and the MachinePool eventually goes into a Ready state
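
To make the timeline above concrete, here is a minimal Python sketch of the mismatch. It is a simplified model, not CAPZ or CAPI code: `MachinePoolView`, `reconcile`, the provider ID strings, and the rule that the pool is Ready only when both counts agree are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MachinePoolView:
    """Simplified stand-in for the fields named in the timeline (not the real CAPI types)."""
    spec_replicas: int   # copied from the ASO AgentPool status.count
    ready_replicas: int  # derived from the observed Nodes / providerIDList

    @property
    def ready(self) -> bool:
        # Assumption: the pool counts as Ready only when both values agree.
        return self.spec_replicas == self.ready_replicas


def reconcile(aso_status_count: int, provider_ids: list[str]) -> MachinePoolView:
    """Mimic one reconcile pass: replicas from ASO status, readiness from observed nodes."""
    return MachinePoolView(spec_replicas=aso_status_count, ready_replicas=len(provider_ids))


x = 3
# Hypothetical provider IDs; real ones have a different shape.
nodes = [f"azure:///hypothetical/vm/{i}" for i in range(x)]

# T+0: stable pool.
before = reconcile(aso_status_count=x, provider_ids=nodes)

# T+4: two new VMs are visible as Nodes, but ASO status.count is still stale.
nodes += [f"azure:///hypothetical/vm/{i}" for i in range(x, x + 2)]
during = reconcile(aso_status_count=x, provider_ids=nodes)

# T+60: ASO re-syncs and status.count catches up.
after = reconcile(aso_status_count=x + 2, provider_ids=nodes)

print(before.ready, during.ready, after.ready)
```

In this model the middle state has spec_replicas = 3 and ready_replicas = 5, which corresponds to the x versus x+2 mismatch described above.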

What did you expect to happen:
Given how the controllers work, some delay after a scale event is expected, but the default of 1 hour is too long. With frequent scale events, the NotReady state can persist for a long time.
We expect the statuses to sync within a couple of minutes.
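
As a rough illustration of why the resync period matters, the sketch below estimates how long the pool stays NotReady for a given set of scale events. It is a toy model under loud assumptions: the ASO status is refreshed only on a fixed grid (at multiples of `resync_period` minutes), every scale event leaves the count stale until the next refresh, and nothing else triggers a reconcile. The function names are hypothetical.

```python
def next_resync(t: int, period: int) -> int:
    """First resync time strictly after minute t, assuming resyncs at multiples of period."""
    return (t // period + 1) * period


def not_ready_minutes(scale_events: list[int], resync_period: int) -> int:
    """Total minutes spent NotReady: union of [event, next resync) intervals."""
    total = 0
    covered_until = 0
    for t in sorted(scale_events):
        end = next_resync(t, resync_period)
        start = max(t, covered_until)  # skip time already counted by an earlier event
        if end > start:
            total += end - start
            covered_until = end
    return total


# One scale event at T+2 with a 60-minute resync, as in the timeline.
print(not_ready_minutes([2], 60))          # 58
# Three scale events, 60-minute vs 5-minute resync.
print(not_ready_minutes([2, 30, 70], 60))  # 108
print(not_ready_minutes([2, 30, 70], 5))   # 13
```

Under these assumptions, shortening the resync period from 60 to 5 minutes cuts the NotReady time for the three-event case from 108 to 13 minutes, which is in line with the "couple of minutes" expectation.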

Environment:

  • cluster-api-provider-azure version: v1.20.2
  • Kubernetes version: (use kubectl version): v1.30.7
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
