Update our ADR on Azure differencing with latest discussions #4962

theunrepentantgeek · 2025-09-29T01:14:49Z

What this PR does

Updates the existing ADR with information from several recent discussions and makes a recommendation on how to proceed.

Recommend using the markdown view to review.

How does this PR make you feel?

Checklist

this PR contains documentation

docs/hugo/content/design/ADR-2022-11-Change-Detection/_index.md

matthchr · 2025-10-06T23:28:23Z

docs/hugo/content/design/ADR-2022-11-Change-Detection.md

+
+**Potential mitigations**:
+
+* Generally speaking, Azure identifiers are expected to be "case-preserving, case-insensitive", so we should compare those string fields in a case-insensitive manner.


Only required for IDs (names, etc). Enum values and/or regions as we call out (also VM SKUs, which are somewhat enum-y) are all supposed to not be normalized but sometimes are.

tl;dr: I think we'll be playing whack-a-mole with fields if our approach doesn't handle them in a holistic fashion.

matthchr · 2025-10-06T23:30:32Z

docs/hugo/content/design/ADR-2022-11-Change-Detection.md


-Some resources have arrays of items. These arrays may not be guaranteed to be returned in the same order as they were specified. 
+* In the common case where the items in an array have a known identifier (e.g., when they are sub-resources with an `id` field), use that identifier to match them up. Where the items in an array do not have a known identifier, compare them by index.
+* Add specific configuration for array properties to specify how they should be compared — whether by an id field or by index. Consistent with our similar configuration used elsewhere, we'd require every array property to be annotated to ensure we don't do the wrong thing by default.


FWIW Swagger actually has this field. It's called x-ms-identifiers and it can be used to call out which field to use as the key.

I think it has pretty wide adoption because there's a linter that checks for it. With that said, you can say [] in which case we'd have to assume "by index"

matthchr · 2025-10-06T23:37:15Z

docs/hugo/content/design/ADR-2022-11-Change-Detection.md

-TBC
+Recommendation: Implement differencing as follows:
+
+* Update our code generator to add a `DiffWith()` method to every spec type that accumulates a list of the differences found


Can we expand on the actual algorithm here some more? We cover above for example the fact that we'll cover the status type back to spec and discard the readonly fields, but that's not really covered here.

Expanded with considerably more detail.

matthchr · 2025-10-06T23:54:17Z

docs/hugo/content/design/ADR-2022-11-Change-Detection.md

+
+* Update our code generator to add a `DiffWith()` method to every spec type that accumulates a list of the differences found
+* Add helper methods to a package `genruntime/diff` to help with common comparison tasks
+* Introduce `operatorStatus` on each top level status type, with a `knownDifferences map[string]string` field to capture the differences found.


I'm not sure that I follow the need for knownDifferences (or maybe I am misunderstanding what it is doing?)

My thinking was the algo was basically:

PUT, get result of PUT and store in status (we already do this today so no change from today). Store the last observed generation in condition (We also already do this today). OR write a hash of spec to an annotation (I don't think this is needed I think generation is enough but maybe I'm missing some case where it is not).

Next time around reconcile loop, there are two cases: case A: generation has changed from observedGeneration of condition, and case B: generation has not changed.

Case A: generation has changed from observedGeneration of condition. Means user has changed the spec so we issue a PUT (effectively same codepath as today).

Case B: generation has not changed from observedGeneration. Convert the current status to the spec shape (so discarding known readonly fields, etc) and hold that in-memory. This is expected . Issue a GET and update status. Now convert the newly updated status to the spec shape (discarding known readonly fields, etc). This is actual. Diff expected and actual. If there are any differences, log them and issue a PUT, otherwise no-op.

I think this is basically what Terraform does and what I outlined in this comment, although I think I've explained it better here than I did there.

IMO this has a lot of advantages:

No extra storage required.

Only diffing Azure with itself, so we don't need to worry about normalization or policy because they should be being applied universally on all requests. We're effectively just diffing "last GET" with "current GET" rather than true "goal -> actual", but the result is the same if we use generation to get us the rest of the way there, which Kubernetes already computes for us.

There are some small issues like for example using generation doesn't account for the fact that there are fields in spec that don't contribute to the upstream Azure shape, such as the CEL expression stuff, so if users change that we'll re-PUT. This is where using a hash + excluding those fields from the hash for case A would maybe be a slight improvement, but we could tackle that later if we really wanted to.

When everything works properly, I think this approach will work - but I think it falls down if/when something goes wrong, as follows:

Scenario: Resource in Azure has recently drifted from desired state

spec₀ is our desired configuration
status₀ was observed before the resource drifted

Time passes.
Time to reconcile.

Work out what we expect the resource to look like
spec_expected := convertToSpec( status₀ )

status₁ := convertToStatus( GET₁ )
spec_actual := convertToSpec( status₁ )

Diff( spec_expected, spec_actual) indicates there are changes, so Azure needs updating.

But then updating Azure fails.

Maybe we never issue the PUT, having already saved the resource state

Maybe the PUT is rejected by ARM

Maybe the PUT succeeds, but the RP crashes

Regardless, the resource has NOT been updated

Time passes.
Time to reconcile.

Work out what we expect the resource to look like
spec_expected := convertToSpec( status₁ )

status₂ := convertToStatus( GET₂ )
spec_actual := convertToSpec( status₂ )

Diff( spec_expected, spec_actual) indicates there are NO changes because the resource didn't change between GET₁ and GET₂.

busted.

We can try to finesse the failure modes to make sure we only commit status changes after we've updated Azure - but that has two problems:

We don't update status until after we've finished a potentially long running operation (which users would notice)

We run the risk of missing a failure mode.

If we are always comparing with the original status of the resource - status₀ - then we can't be caught out by odd failure modes.

Introducing OperatorStatus.knownDifferences is an attempt to mitigate the large size of status₀ by only keeping the bits that matter.

We don't update status until after we've finished a potentially long running operation (which users would notice)

I don't think this is an issue. This is already the behavior we have today and it's also the behavior that Azure "wants" us to have. See handleCreateorUpdateSuccess for example where we only write the status after the create/update LRO completes.

What I mean by this is the behavior Azure "wants" us to have: it's the way Azure's LRO/APIs are built - you issue a PUT, you get a polling URL, you poll that URL to completion and at the end of that polling the current resource state is returned to you. Obviously that doesn't stop you from breaking out of that flow and issuing separate GETs while the resource is in a transitioning state, but if you look at what all of the SDKs do automatically it's always "update, wait, get status" - get status happens at the end..

The point you've raised about Azure interactions failing is a good one, we can't know if the changes we made landed or not in the case we get some connection error or internal server error, but I think as you say that is taken care of by not committing the status until the end. Though it's worth noting that it seems very likely that unless the service behavior is very weird, the update we're making to drive back to goal state is going to result in a status whose spec component is the same as before, so for the purposes of the diffing algorithm there may not be a difference between the "old status" and the "new status". There are probably some resources where this isn't true for but it definitely won't be the majority of them.

We run the risk of missing a failure mode.

What failure mode though? The above pattern seems safe to me? In fact I think it's actually "the same" as using known differences, isn't it? The two patterns are either:

Get Status after PUT and store someplace (in status)

Get Status after PUT, compute knownDifferences and store someplace.

In both cases, we rely on getting the status after the PUT, and rely on making an update via kube-apiserver to our object (either an annotation or status), so I am not clear on how there's any real difference between the two from a reliability perspective? The only case that I can see where there would be a difference would be if somebody went and updated the status in k8s to something that wasn't derived from Azure, which they could also do to knownDifferences (either in status or as an annotation).

Maybe we should have a meeting where we talk this over more? The more I look at the two proposals the more they actually look like variants of the same idea (which is good), but I'm still not really seeing what paying the storage (and documentation) cost of adding a field for KnownDifferences actually buys us - we already have the data that was derived from (the whole status), so I don't think there's any reason to actually use it?

spec₀ + knownDifferences₀ = status₀
or (moving spec to the other side of the equation)
knownDifferences₀ = status₀ - spec₀

We already have status₀ and spec₀ (and we have to have them), so we can compute knownDifferences anytime we like on the fly, can't we?

In the case where the spec hasn't changed the two are (I think) identical. Where they maybe differ is when the spec has changed, with knownDifferences you would add those differences to the new spec₁ and compare w/ the GET and see a difference (at least in theory, though there are maybe some edge-cases where the new spec just sets one of the KnownDifferences so there is no diff - it's not clear to me if we want to send to Azure in this case or not. On the one hand it feels safe to avoid a PUT, OTOH that assumes that there's no difference between a field being defaulted versus a field being set by the user, and I suspect there are cases where those two things are different and we would need to send a PUT to realize it), whereas if we don't keep knownDifferences we would just check generation of the status versus generation of the spec and blindly PUT regardless of diff.

TBH I see flaws with both of those approaches so it's not clear to me that one is objectively better than the other. I think PUT-ing whenever user changes generation is maybe a bit safer but obviously does eat a PUT request quota. Still, I'm not convinced the difference here is worth the cost of storing knownDifferences (especially because I think actually doing generation-based is safer)

theunrepentantgeek added 2 commits September 28, 2025 23:03

Revise with current thinking

ba08694

Revise draft

6de9a8b

theunrepentantgeek requested review from davefellows, jakjang, juan-lee and matthchr as code owners September 29, 2025 01:14

github-project-automation bot added this to Azure Service Operator Roadmap Sep 29, 2025

github-project-automation bot moved this to In Progress in Azure Service Operator Roadmap Sep 29, 2025

theunrepentantgeek self-assigned this Sep 29, 2025

matthchr reviewed Oct 6, 2025

View reviewed changes

Update ADR with PR feedback

269e68e

matthchr mentioned this pull request Nov 17, 2025

Ignore volatile properties when deciding whether to reconcile a resource #2026

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update our ADR on Azure differencing with latest discussions #4962

Update our ADR on Azure differencing with latest discussions #4962

Uh oh!

theunrepentantgeek commented Sep 29, 2025

Uh oh!

Uh oh!

matthchr Oct 6, 2025

Uh oh!

matthchr Oct 6, 2025

Uh oh!

matthchr Oct 6, 2025

Uh oh!

theunrepentantgeek Oct 15, 2025

Uh oh!

matthchr Oct 6, 2025

Uh oh!

theunrepentantgeek Oct 16, 2025 •

edited

Loading

Uh oh!

matthchr Nov 12, 2025

Uh oh!

matthchr Nov 12, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants


		Potential mitigations:

		* Generally speaking, Azure identifiers are expected to be "case-preserving, case-insensitive", so we should compare those string fields in a case-insensitive manner.

Update our ADR on Azure differencing with latest discussions #4962

Are you sure you want to change the base?

Update our ADR on Azure differencing with latest discussions #4962

Uh oh!

Conversation

theunrepentantgeek commented Sep 29, 2025

What this PR does

How does this PR make you feel?

Checklist

Uh oh!

Uh oh!

matthchr Oct 6, 2025

Choose a reason for hiding this comment

Uh oh!

matthchr Oct 6, 2025

Choose a reason for hiding this comment

Uh oh!

matthchr Oct 6, 2025

Choose a reason for hiding this comment

Uh oh!

theunrepentantgeek Oct 15, 2025

Choose a reason for hiding this comment

Uh oh!

matthchr Oct 6, 2025

Choose a reason for hiding this comment

Uh oh!

theunrepentantgeek Oct 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

matthchr Nov 12, 2025

Choose a reason for hiding this comment

Uh oh!

matthchr Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

theunrepentantgeek Oct 16, 2025 •

edited

Loading

matthchr Nov 12, 2025 •

edited

Loading