fix: (azure)machinepools: stop continuous reconciliation of resources by sorting ProviderIDs #6011
Conversation
Depends on #6010
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #6011 +/- ##
=======================================
  Coverage   44.41%   44.41%
=======================================
  Files         280      280
  Lines       25377    25379    +2
=======================================
+ Hits        11271    11273    +2
  Misses      13293    13293
  Partials      813      813
I'd like to see this test pass a few times in a row to make sure we're confident this is fixing what we think it is.
/test pull-cluster-api-provider-azure-apiversion-upgrade
@damdo anything interesting from the last test failure? (Thank you so much for looking into this! 🍻)
/retest
/test pull-cluster-api-provider-azure-apiversion-upgrade
1 passed.
/test pull-cluster-api-provider-azure-apiversion-upgrade
/test pull-cluster-api-provider-azure-apiversion-upgrade
It looks like this test should be working better now.
/test pull-cluster-api-provider-azure-apiversion-upgrade
/test pull-cluster-api-provider-azure-apiversion-upgrade
Timed out for unrelated reasons.
/test pull-cluster-api-provider-azure-apiversion-upgrade
Timed out for unrelated reasons.
/test pull-cluster-api-provider-azure-apiversion-upgrade
I think it's safe to say this isn't the cause of the upgrade test flakes, since the AzureMachinePools created in those tests only have one replica (so one provider ID). The more likely cause is CAPZ updating the cloud-init data for the VMSS when CAPI periodically updates the KubeadmConfig's bootstrap data, and storing the hash in an annotation on the AzureMachinePool. If that update, which happens every 10 minutes, lands within the 2-minute window during which the framework expects resourceVersions to remain stable, then the test fails. I think this change is still worth it, but we shouldn't expect it to fix the upgrade test.
For the purposes of this test, should we manually update the bootstrap data refresh time to be greater than the P50 total time duration of the test run (plus some extra overhead), so that we can get the test outcome that this test is actually looking for?
SGTM |
Testing this out in #6031 |
Thanks @nojnhuh @jackfrancis! Would it make sense to base #6031 on top of this PR to make sure we cover both issues?
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
(force-pushed 0d33307 to 513dd29)
@mboersma @nojnhuh @jackfrancis even though the main cause of the upgrade issue has been identified and fixed, I think this could still be a good QoL improvement.
Change looks good overall. Can we update the unit tests?
(force-pushed 513dd29 to 659025e)
@nojnhuh Done, thanks!
… by sorting ProviderIDs
(force-pushed 659025e to 39b272d)
/test pull-cluster-api-provider-azure-apiversion-upgrade
@damdo: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes continuous reconciliation of AzureMachinePool and MachinePool resources that causes the E2E clusterctl upgrade test to fail with: resourceVersions didn't stay stable.

The ProviderIDList field in AzureMachinePool.Spec was being populated from the results of client.List() calls, which do not guarantee a deterministic order. When the same set of machines is returned in a different order across reconciliations, the ProviderIDList slice changes, causing a spec update and bumping the resourceVersion, even though the underlying set of provider IDs is identical.

Why did this start failing now?
This issue only started showing up after the CAPI v1.11 bump.
This is because CAPI v1.11 introduced a new validation step in the clusterctl upgrade E2E test framework via kubernetes-sigs/cluster-api#12546 that checks resourceVersions remain stable after upgrade completion.

The underlying bug has likely always existed, but was never detected before because we didn't have this stability check in the upgrade test.
How does this PR solve it?
Sorts providerIDs slices before assigning them to ProviderIDList fields, ensuring deterministic ordering regardless of the order returned by client.List() or Azure API calls.

This PR stacks on top of #6010, so:
/hold
until #6010 is merged
Release note: