-
Notifications
You must be signed in to change notification settings - Fork 228
Update our ADR on Azure differencing with latest discussions #4962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
||
| **Potential mitigations**: | ||
|
|
||
| * Generally speaking, Azure identifiers are expected to be "case-preserving, case-insensitive", so we should compare those string fields in a case-insensitive manner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only required for IDs (names, etc). Enum values and/or regions as we call out (also VM SKUs, which are somewhat enum-y) are all supposed to not be normalized but sometimes are.
tl;dr: I think we'll be playing whack-a-mole with fields if our approach doesn't handle them in a holistic fashion.
|
|
||
| Some resources have arrays of items. These arrays may not be guaranteed to be returned in the same order as they were specified. | ||
| * In the common case where the items in an array have a known identifier (e.g., when they are sub-resources with an `id` field), use that identifier to match them up. Where the items in an array do not have a known identifier, compare them by index. | ||
| * Add specific configuration for array properties to specify how they should be compared — whether by an id field or by index. Consistent with our similar configuration used elsewhere, we'd require every array property to be annotated to ensure we don't do the wrong thing by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW Swagger actually has this field. It's called x-ms-identifiers and it can be used to call out which field to use as the key.
I think it has pretty wide adoption because there's a linter that checks for it. With that said, you can say [] in which case we'd have to assume "by index"
| TBC | ||
| Recommendation: Implement differencing as follows: | ||
|
|
||
| * Update our code generator to add a `DiffWith()` method to every spec type that accumulates a list of the differences found |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we expand on the actual algorithm here some more? We cover above for example the fact that we'll cover the status type back to spec and discard the readonly fields, but that's not really covered here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expanded with considerably more detail.
|
|
||
| * Update our code generator to add a `DiffWith()` method to every spec type that accumulates a list of the differences found | ||
| * Add helper methods to a package `genruntime/diff` to help with common comparison tasks | ||
| * Introduce `operatorStatus` on each top level status type, with a `knownDifferences map[string]string` field to capture the differences found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that I follow the need for knownDifferences (or maybe I am misunderstanding what it is doing?)
My thinking was the algo was basically:
- PUT, get result of PUT and store in status (we already do this today so no change from today). Store the last observed generation in condition (We also already do this today). OR write a hash of spec to an annotation (I don't think this is needed I think generation is enough but maybe I'm missing some case where it is not).
- Next time around reconcile loop, there are two cases: case A: generation has changed from observedGeneration of condition, and case B: generation has not changed.
- Case A: generation has changed from observedGeneration of condition. Means user has changed the spec so we issue a PUT (effectively same codepath as today).
- Case B: generation has not changed from observedGeneration. Convert the current status to the spec shape (so discarding known readonly fields, etc) and hold that in-memory. This is
expected. Issue a GET and update status. Now convert the newly updated status to the spec shape (discarding known readonly fields, etc). This isactual. Diffexpectedandactual. If there are any differences, log them and issue aPUT, otherwise no-op.
I think this is basically what Terraform does and what I outlined in this comment, although I think I've explained it better here than I did there.
IMO this has a lot of advantages:
- No extra storage required.
- Only diffing Azure with itself, so we don't need to worry about normalization or policy because they should be being applied universally on all requests. We're effectively just diffing "last GET" with "current GET" rather than true "goal -> actual", but the result is the same if we use generation to get us the rest of the way there, which Kubernetes already computes for us.
There are some small issues like for example using generation doesn't account for the fact that there are fields in spec that don't contribute to the upstream Azure shape, such as the CEL expression stuff, so if users change that we'll re-PUT. This is where using a hash + excluding those fields from the hash for case A would maybe be a slight improvement, but we could tackle that later if we really wanted to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When everything works properly, I think this approach will work - but I think it falls down if/when something goes wrong, as follows:
Scenario: Resource in Azure has recently drifted from desired state
spec0 is our desired configuration
status0 was observed before the resource drifted
Time passes.
Time to reconcile.
Work out what we expect the resource to look like
specexpected := convertToSpec( status0 )
status1 := convertToStatus( GET1 )
specactual := convertToSpec( status1 )
Diff( specexpected, specactual) indicates there are changes, so Azure needs updating.
But then updating Azure fails.
- Maybe we never issue the PUT, having already saved the resource state
- Maybe the PUT is rejected by ARM
- Maybe the PUT succeeds, but the RP crashes
- Regardless, the resource has NOT been updated
Time passes.
Time to reconcile.
Work out what we expect the resource to look like
specexpected := convertToSpec( status1 )
status2 := convertToStatus( GET2 )
specactual := convertToSpec( status2 )
Diff( specexpected, specactual) indicates there are NO changes because the resource didn't change between GET1 and GET2.
busted.
We can try to finesse the failure modes to make sure we only commit status changes after we've updated Azure - but that has two problems:
- We don't update status until after we've finished a potentially long running operation (which users would notice)
- We run the risk of missing a failure mode.
If we are always comparing with the original status of the resource - status0 - then we can't be caught out by odd failure modes.
Introducing OperatorStatus.knownDifferences is an attempt to mitigate the large size of status0 by only keeping the bits that matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't update status until after we've finished a potentially long running operation (which users would notice)
I don't think this is an issue. This is already the behavior we have today and it's also the behavior that Azure "wants" us to have. See handleCreateorUpdateSuccess for example where we only write the status after the create/update LRO completes.
What I mean by this is the behavior Azure "wants" us to have: it's the way Azure's LRO/APIs are built - you issue a PUT, you get a polling URL, you poll that URL to completion and at the end of that polling the current resource state is returned to you. Obviously that doesn't stop you from breaking out of that flow and issuing separate GETs while the resource is in a transitioning state, but if you look at what all of the SDKs do automatically it's always "update, wait, get status" - get status happens at the end..
The point you've raised about Azure interactions failing is a good one, we can't know if the changes we made landed or not in the case we get some connection error or internal server error, but I think as you say that is taken care of by not committing the status until the end. Though it's worth noting that it seems very likely that unless the service behavior is very weird, the update we're making to drive back to goal state is going to result in a status whose spec component is the same as before, so for the purposes of the diffing algorithm there may not be a difference between the "old status" and the "new status". There are probably some resources where this isn't true for but it definitely won't be the majority of them.
We run the risk of missing a failure mode.
What failure mode though? The above pattern seems safe to me? In fact I think it's actually "the same" as using known differences, isn't it? The two patterns are either:
- Get Status after PUT and store someplace (in status)
- Get Status after PUT, compute knownDifferences and store someplace.
In both cases, we rely on getting the status after the PUT, and rely on making an update via kube-apiserver to our object (either an annotation or status), so I am not clear on how there's any real difference between the two from a reliability perspective? The only case that I can see where there would be a difference would be if somebody went and updated the status in k8s to something that wasn't derived from Azure, which they could also do to knownDifferences (either in status or as an annotation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should have a meeting where we talk this over more? The more I look at the two proposals the more they actually look like variants of the same idea (which is good), but I'm still not really seeing what paying the storage (and documentation) cost of adding a field for KnownDifferences actually buys us - we already have the data that was derived from (the whole status), so I don't think there's any reason to actually use it?
spec0 + knownDifferences0 = status0
or (moving spec to the other side of the equation)
knownDifferences0 = status0 - spec0
We already have status0 and spec0 (and we have to have them), so we can compute knownDifferences anytime we like on the fly, can't we?
In the case where the spec hasn't changed the two are (I think) identical. Where they maybe differ is when the spec has changed, with knownDifferences you would add those differences to the new spec1 and compare w/ the GET and see a difference (at least in theory, though there are maybe some edge-cases where the new spec just sets one of the KnownDifferences so there is no diff - it's not clear to me if we want to send to Azure in this case or not. On the one hand it feels safe to avoid a PUT, OTOH that assumes that there's no difference between a field being defaulted versus a field being set by the user, and I suspect there are cases where those two things are different and we would need to send a PUT to realize it), whereas if we don't keep knownDifferences we would just check generation of the status versus generation of the spec and blindly PUT regardless of diff.
TBH I see flaws with both of those approaches so it's not clear to me that one is objectively better than the other. I think PUT-ing whenever user changes generation is maybe a bit safer but obviously does eat a PUT request quota. Still, I'm not convinced the difference here is worth the cost of storing knownDifferences (especially because I think actually doing generation-based is safer)
What this PR does
Updates the existing ADR with information from several recent discussions and makes a recommendation on how to proceed.
Recommend using the markdown view to review.
How does this PR make you feel?
Checklist