@damdo
Member

@damdo damdo commented Dec 18, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes continuous reconciliation of AzureMachinePool and MachinePool resources that causes the E2E clusterctl upgrade test to fail with: resourceVersions didn't stay stable

The ProviderIDList field in AzureMachinePool.Spec was being populated from results of client.List() calls, which do not guarantee a deterministic order. When the same set of machines is returned in a different order across reconciliations, the ProviderIDList slice changes, causing a spec update and bumping the resourceVersion—even though the underlying set of provider IDs is identical.
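
To illustrate the mechanism described above, here is a minimal Go sketch. It is not CAPZ code: the helper `sameOrder` and the provider ID strings are made up for illustration. It shows that an order-sensitive comparison treats two listings of the same provider IDs as different:

```go
package main

import (
	"fmt"
	"slices"
)

// sameOrder reports whether two provider ID slices are element-wise equal,
// which is how an order-sensitive spec comparison would see them.
// (Hypothetical helper, not part of CAPZ.)
func sameOrder(a, b []string) bool {
	return slices.Equal(a, b)
}

func main() {
	// Hypothetical provider IDs, as two separate List() calls might return them.
	first := []string{"azure:///vm/0", "azure:///vm/1"}
	second := []string{"azure:///vm/1", "azure:///vm/0"}

	// Same set, different order: the slices compare as unequal, so a
	// controller writing this slice into the spec would see a change.
	fmt.Println(sameOrder(first, second)) // prints "false"
}
```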

Why did this start failing now?
This issue only started showing up after the CAPI v1.11 bump.
This is because CAPI v1.11 introduced a new validation step in the clusterctl upgrade E2E test framework via kubernetes-sigs/cluster-api#12546 that checks resourceVersions remain stable after upgrade completion:

"If this is the last step of the upgrade sequence check that the resourceVersions are stable, i.e. it verifies there are no continuous reconciles when everything should be stable."
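
As a rough illustration of what such a stability check amounts to, the sketch below compares two snapshots of resourceVersions taken some time apart. This is not the CAPI test framework's implementation: the function `changedResources`, the key format, and the version values are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// changedResources returns the keys whose resourceVersion differs between
// two snapshots, or that exist in only one of them. Keys here are
// hypothetical "Kind/namespace/name" identifiers.
func changedResources(before, after map[string]string) []string {
	var changed []string
	for key, rv := range before {
		if afterRV, ok := after[key]; !ok || afterRV != rv {
			changed = append(changed, key)
		}
	}
	for key := range after {
		if _, ok := before[key]; !ok {
			changed = append(changed, key)
		}
	}
	sort.Strings(changed) // deterministic output despite map iteration order
	return changed
}

func main() {
	before := map[string]string{
		"AzureMachinePool/default/mp-0": "1001",
		"MachinePool/default/mp-0":      "2002",
	}
	after := map[string]string{
		"AzureMachinePool/default/mp-0": "1005", // bumped by a reconcile
		"MachinePool/default/mp-0":      "2002",
	}
	// Any non-empty result means the resources did not stay stable.
	fmt.Println(changedResources(before, after))
}
```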

The underlying bug has likely always existed, but was never detected before because we didn't have this stability check in the upgrade test.

How does this PR solve it?
Sorts providerIDs slices before assigning them to ProviderIDList fields, ensuring deterministic ordering regardless of the order returned by client.List() or Azure API calls.
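
The idea can be sketched as follows. This is a minimal illustration under assumptions, not the code in this PR: `sortedProviderIDs` is a hypothetical helper and the provider ID strings are invented.

```go
package main

import (
	"fmt"
	"slices"
)

// sortedProviderIDs returns a sorted copy of ids, leaving the input untouched.
// (Hypothetical helper; the actual change may sort in place instead.)
func sortedProviderIDs(ids []string) []string {
	out := slices.Clone(ids)
	slices.Sort(out)
	return out
}

func main() {
	// The same set of provider IDs, returned in two different orders.
	first := []string{"azure:///vm/1", "azure:///vm/0"}
	second := []string{"azure:///vm/0", "azure:///vm/1"}

	// After sorting, both listings produce the same slice, so assigning it
	// to a ProviderIDList-style field yields no spurious change.
	fmt.Println(slices.Equal(sortedProviderIDs(first), sortedProviderIDs(second))) // prints "true"
}
```

Sorting a copy keeps the caller's slice intact, which matters if the input comes from a shared cache; sorting in place is equally valid when the slice is owned by the caller.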

This PR stacks on top of #6010 so:

/hold

until #6010 is merged

Release note:

fix: (azure)machinepools: stop continuous reconciliation of resources by sorting ProviderIDs

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold, release-note, and kind/bug labels Dec 18, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes label Dec 18, 2025
@k8s-ci-robot k8s-ci-robot added the size/M label Dec 18, 2025
@damdo
Member Author

damdo commented Dec 18, 2025

Depends on #6010

@damdo damdo mentioned this pull request Dec 18, 2025
4 tasks
@codecov

codecov bot commented Dec 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.41%. Comparing base (01f24d8) to head (39b272d).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6011   +/-   ##
=======================================
  Coverage   44.41%   44.41%           
=======================================
  Files         280      280           
  Lines       25377    25379    +2     
=======================================
+ Hits        11271    11273    +2     
  Misses      13293    13293           
  Partials      813      813           


@nojnhuh
Contributor

nojnhuh commented Dec 18, 2025

I'd like to see this test pass a few times in a row to make sure we're confident this is fixing what we think it is.

/test pull-cluster-api-provider-azure-apiversion-upgrade

@jackfrancis
Contributor

@damdo anything interesting from the last test failure?

(thank you so much for looking into this! 🍻)

@mboersma
Contributor

/retest

@mboersma
Contributor

/test pull-cluster-api-provider-azure-apiversion-upgrade

1 passed.

@damdo
Member Author

damdo commented Jan 5, 2026

/test pull-cluster-api-provider-azure-apiversion-upgrade

@damdo
Member Author

damdo commented Jan 8, 2026

/test pull-cluster-api-provider-azure-apiversion-upgrade

@nojnhuh
Contributor

nojnhuh commented Jan 9, 2026

It looks like this test should be working better now.

/test pull-cluster-api-provider-azure-apiversion-upgrade

@nojnhuh
Contributor

nojnhuh commented Jan 9, 2026

/test pull-cluster-api-provider-azure-apiversion-upgrade

@damdo
Member Author

damdo commented Jan 9, 2026

Timed out for unrelated reasons.

/test pull-cluster-api-provider-azure-apiversion-upgrade

@damdo
Member Author

damdo commented Jan 9, 2026

Timed out for unrelated reasons.

/test pull-cluster-api-provider-azure-apiversion-upgrade

@nojnhuh
Contributor

nojnhuh commented Jan 9, 2026

I think it's safe to say this isn't the cause of the upgrade test flakes since the AzureMachinePools created in those tests only have one replica (so one provider ID).

The more likely cause is that, whenever CAPI periodically updates the KubeadmConfig's bootstrap data, CAPZ updates the cloud-init data for the VMSS and stores the hash in an annotation on the AzureMachinePool. If that update, which happens every 10 minutes, lands within the 2-minute window in which the framework expects resourceVersions to remain stable, then the test fails.

I think this change is still worth it, but we shouldn't expect it to fix the upgrade test.

@jackfrancis
Contributor

> I think it's safe to say this isn't the cause of the upgrade test flakes since the AzureMachinePools created in those tests only have one replica (so one provider ID).
>
> The more likely cause is that, whenever CAPI periodically updates the KubeadmConfig's bootstrap data, CAPZ updates the cloud-init data for the VMSS and stores the hash in an annotation on the AzureMachinePool. If that update, which happens every 10 minutes, lands within the 2-minute window in which the framework expects resourceVersions to remain stable, then the test fails.
>
> I think this change is still worth it, but we shouldn't expect it to fix the upgrade test.

For the purposes of this test should we manually update the bootstrap data refresh time to be greater than the P50 total time duration of the test run (plus some extra overhead) so that we can get the test outcome that this test is actually looking for?

@nojnhuh
Contributor

nojnhuh commented Jan 9, 2026

> For the purposes of this test should we manually update the bootstrap data refresh time to be greater than the P50 total time duration of the test run (plus some extra overhead) so that we can get the test outcome that this test is actually looking for?

SGTM

@nojnhuh
Contributor

nojnhuh commented Jan 10, 2026

> > For the purposes of this test should we manually update the bootstrap data refresh time to be greater than the P50 total time duration of the test run (plus some extra overhead) so that we can get the test outcome that this test is actually looking for?
>
> SGTM

Testing this out in #6031

@damdo
Member Author

damdo commented Jan 12, 2026

Thanks @nojnhuh @jackfrancis

Would it make sense to base #6031 on top of this PR so that we cover both issues?
Then, if #6031 consistently goes green, we can merge both. TY.

@mboersma mboersma mentioned this pull request Jan 13, 2026
4 tasks
@k8s-ci-robot k8s-ci-robot added and removed the needs-rebase label Jan 22, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jackfrancis. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XS label and removed the size/M label Jan 23, 2026
@damdo damdo force-pushed the fix-machinepools-continuous-reconciliation-w-providerid-sort branch 2 times, most recently from 0d33307 to 513dd29 on January 23, 2026 19:41
@damdo
Member Author

damdo commented Jan 25, 2026

@mboersma @nojnhuh @jackfrancis even though the main cause of the upgrade issue has been identified and fixed, I think this could still be a good QoL improvement.
Are we up for merging it? TY

@nojnhuh
Contributor

nojnhuh commented Jan 25, 2026

Change looks good overall. Can we update the unit tests?

@damdo damdo force-pushed the fix-machinepools-continuous-reconciliation-w-providerid-sort branch from 513dd29 to 659025e on January 26, 2026 19:10
@k8s-ci-robot k8s-ci-robot added the size/M label and removed the size/XS label Jan 26, 2026
@damdo
Member Author

damdo commented Jan 26, 2026

@nojnhuh Done thanks!

@damdo damdo force-pushed the fix-machinepools-continuous-reconciliation-w-providerid-sort branch from 659025e to 39b272d on January 27, 2026 20:49
@damdo
Member Author

damdo commented Jan 28, 2026

/test pull-cluster-api-provider-azure-apiversion-upgrade

@k8s-ci-robot
Contributor

@damdo: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-azure-apiversion-upgrade | 39b272d | link | true | `/test pull-cluster-api-provider-azure-apiversion-upgrade` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

cncf-cla: yes, do-not-merge/hold, kind/bug, release-note, size/M

Projects

Status: Todo

7 participants