fix: (azure)machinepools: stop continuous reconciliation of resources by sorting ProviderIDs #6011

damdo · 2025-12-18T17:47:13Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes continuous reconciliation of AzureMachinePool and MachinePool resources that causes the E2E clusterctl upgrade test to fail with: resourceVersions didn't stay stable

The ProviderIDList field in AzureMachinePool.Spec was being populated from results of client.List() calls, which do not guarantee a deterministic order. When the same set of machines is returned in a different order across reconciliations, the ProviderIDList slice changes, causing a spec update and bumping the resourceVersion—even though the underlying set of provider IDs is identical.

Why did this start failing now?
This issue only started showing up after the CAPI v1.11 bump.
This is because CAPI v1.11 introduced a new validation step in the clusterctl upgrade E2E test framework via kubernetes-sigs/cluster-api#12546 that checks resourceVersions remain stable after upgrade completion:

"If this is the last step of the upgrade sequence check that the resourceVersions are stable, i.e. it verifies there are no continuous reconciles when everything should be stable."

The underlying bug has likely always existed, but was never detected before because we didn't have this stability check in the upgrade test.
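To illustrate what such a stability check amounts to, here is a minimal sketch that compares two snapshots of resourceVersions taken some time apart. It is not the CAPI framework's actual code: the function name `changedResources`, the map layout, and the example keys and versions are all made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// changedResources returns the sorted keys whose resourceVersion differs
// between the two snapshots, or which appear in only one of them.
// Hypothetical helper, not part of the CAPI test framework.
func changedResources(before, after map[string]string) []string {
	var changed []string
	for key, rv := range before {
		if got, ok := after[key]; !ok || got != rv {
			changed = append(changed, key)
		}
	}
	for key := range after {
		if _, ok := before[key]; !ok {
			changed = append(changed, key)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	// Illustrative snapshots keyed by "Kind/name"; the values are made up.
	before := map[string]string{"AzureMachinePool/mp-0": "1001", "MachinePool/mp-0": "2001"}
	after := map[string]string{"AzureMachinePool/mp-0": "1002", "MachinePool/mp-0": "2001"}

	// Any entry here means something was still being written during the window.
	fmt.Println(changedResources(before, after))
}
```

A non-empty result corresponds to the "resourceVersions didn't stay stable" failure: some controller wrote to the object while everything was expected to be idle.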

How does this PR solve it?
Sorts providerIDs slices before assigning them to ProviderIDList fields, ensuring deterministic ordering regardless of the order returned by client.List() or Azure API calls.

This PR stacks on top of #6010 so:

/hold

until #6010 is merged

Release note:

fix: (azure)machinepools: stop continuous reconciliation of resources by sorting ProviderIDs

damdo · 2025-12-18T17:47:29Z

Depends on #6010

codecov · 2025-12-18T17:57:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.41%. Comparing base (01f24d8) to head (39b272d).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #6011   +/-   ##
=======================================
  Coverage   44.41%   44.41%           
=======================================
  Files         280      280           
  Lines       25377    25379    +2     
=======================================
+ Hits        11271    11273    +2     
  Misses      13293    13293           
  Partials      813      813


damdo · 2025-12-18T18:00:06Z

/assign @mboersma @nojnhuh @jackfrancis @willie-yao @bryan-cox

nojnhuh · 2025-12-18T22:18:07Z

I'd like to see this test pass a few times in a row to make sure we're confident this is fixing what we think it is.

/test pull-cluster-api-provider-azure-apiversion-upgrade

jackfrancis · 2025-12-18T23:35:05Z

@damdo anything interesting from the last test failure?

(thank you so much for looking into this! 🍻)

mboersma · 2025-12-22T18:37:10Z

/retest

mboersma · 2025-12-22T20:09:42Z

/test pull-cluster-api-provider-azure-apiversion-upgrade

1 passed.

damdo · 2026-01-05T11:25:37Z

/test pull-cluster-api-provider-azure-apiversion-upgrade

damdo · 2026-01-08T17:41:34Z

/test pull-cluster-api-provider-azure-apiversion-upgrade

nojnhuh · 2026-01-09T03:14:43Z

It looks like this test should be working better now.

/test pull-cluster-api-provider-azure-apiversion-upgrade

nojnhuh · 2026-01-09T05:39:32Z

/test pull-cluster-api-provider-azure-apiversion-upgrade

damdo · 2026-01-09T10:02:25Z

Timed out for unrelated reasons.

/test pull-cluster-api-provider-azure-apiversion-upgrade

damdo · 2026-01-09T15:42:00Z

Timed out for unrelated reasons.

/test pull-cluster-api-provider-azure-apiversion-upgrade

nojnhuh · 2026-01-09T19:55:20Z

I think it's safe to say this isn't the cause of the upgrade test flakes since the AzureMachinePools created in those tests only have one replica (so one provider ID).

The more likely cause is from CAPZ updating the cloud-init data for the VMSS when CAPI periodically updates the KubeadmConfig's bootstrap data and storing the hash in an annotation on the AzureMachinePool. If that update every 10 minutes happens within the 2-minute window that the framework expects resourceVersions to remain stable, then the test fails.

I think this change is still worth it, but we shouldn't expect it to fix the upgrade test.

jackfrancis · 2026-01-09T20:18:32Z

> I think it's safe to say this isn't the cause of the upgrade test flakes since the AzureMachinePools created in those tests only have one replica (so one provider ID).
>
> The more likely cause is from CAPZ updating the cloud-init data for the VMSS when CAPI periodically updates the KubeadmConfig's bootstrap data and storing the hash in an annotation on the AzureMachinePool. If that update every 10 minutes happens within the 2-minute window that the framework expects resourceVersions to remain stable, then the test fails.
>
> I think this change is still worth it, but we shouldn't expect it to fix the upgrade test.

For the purposes of this test should we manually update the bootstrap data refresh time to be greater than the P50 total time duration of the test run (plus some extra overhead) so that we can get the test outcome that this test is actually looking for?

nojnhuh · 2026-01-09T20:41:06Z

> For the purposes of this test should we manually update the bootstrap data refresh time to be greater than the P50 total time duration of the test run (plus some extra overhead) so that we can get the test outcome that this test is actually looking for?

SGTM

nojnhuh · 2026-01-10T05:49:59Z

> For the purposes of this test should we manually update the bootstrap data refresh time to be greater than the P50 total time duration of the test run (plus some extra overhead) so that we can get the test outcome that this test is actually looking for?

SGTM

Testing this out in #6031

damdo · 2026-01-12T15:24:13Z

Thanks @nojnhuh @jackfrancis

Would it make sense to base #6031 on top of this PR to make sure we cover both issues?
That way, if #6031 consistently goes green, we can merge both. TY.

k8s-ci-robot · 2026-01-23T17:33:06Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jackfrancis. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

damdo · 2026-01-25T09:32:27Z

@mboersma @nojnhuh @jackfrancis even though the main cause of the upgrade issue has been identified and fixed, I think this could still be a good QoL improvement.
Are we up for merging it? TY

nojnhuh · 2026-01-25T17:38:39Z

Change looks good overall. Can we update the unit tests?

damdo · 2026-01-26T19:11:05Z

@nojnhuh Done, thanks!


damdo · 2026-01-28T00:15:40Z

/test pull-cluster-api-provider-azure-apiversion-upgrade

k8s-ci-robot · 2026-01-28T01:28:50Z

@damdo: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cluster-api-provider-azure-apiversion-upgrade	`39b272d`	link	true	`/test pull-cluster-api-provider-azure-apiversion-upgrade`


github-project-automation bot added this to CAPZ Planning Dec 18, 2025

github-project-automation bot moved this to Todo in CAPZ Planning Dec 18, 2025

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Dec 18, 2025

k8s-ci-robot requested a review from jsturtevant December 18, 2025 17:47

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 18, 2025

k8s-ci-robot requested a review from willie-yao December 18, 2025 17:47

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 18, 2025

damdo mentioned this pull request Dec 18, 2025

Update Kubernetes versions to 1.33 #5994

Merged

4 tasks

k8s-ci-robot assigned bryan-cox, jackfrancis, mboersma, nojnhuh and willie-yao Dec 18, 2025

mboersma mentioned this pull request Jan 13, 2026

[WIP] Fix upgrade test #6041

Closed

4 tasks

k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 22, 2026

k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 23, 2026

damdo force-pushed the fix-machinepools-continuous-reconciliation-w-providerid-sort branch 2 times, most recently from 0d33307 to 513dd29 Compare January 23, 2026 19:41

damdo force-pushed the fix-machinepools-continuous-reconciliation-w-providerid-sort branch from 513dd29 to 659025e Compare January 26, 2026 19:10

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 26, 2026

fix: (azure)machinepools: stop continuous reconciliation of resources…

39b272d

… by sorting ProviderIDs

damdo force-pushed the fix-machinepools-continuous-reconciliation-w-providerid-sort branch from 659025e to 39b272d Compare January 27, 2026 20:49
