-
Notifications
You must be signed in to change notification settings - Fork 181
Reconcile remote datacenters independently #2535
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@zimnx: GitHub didn't allow me to request PR reviews from the following users: zimnx. Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: zimnx The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
b3235c0
to
433e060
Compare
057822f
to
61cac51
Compare
61cac51
to
110646f
Compare
110646f
to
f8a89a1
Compare
d4b239d
to
978ed02
Compare
d651b70
to
9519be5
Compare
5487493
to
15fbeb0
Compare
15fbeb0
to
606e28a
Compare
clean rebase to pick up dependent PRs, ready to review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My overall perspective - not reflected in the attached comments - is that the MakeRemoteXXX
functions encapsulate significant logic in a way that (1) is not clear from the function name/signature, (2) vastly differs between types.
Attaching some comments, but in subsequent rounds I would like to get the function signatures (including mostly nonexistent docstrings) more in check with what actually the logic is.
} | ||
|
||
func MakeRemoteEndpointSlices(sc *scyllav1alpha1.ScyllaDBCluster, remoteNamespaces map[string]*corev1.Namespace, remoteControllers map[string]metav1.Object, remoteServiceLister remotelister.GenericClusterLister[corev1listers.ServiceLister], remotePodLister remotelister.GenericClusterLister[corev1listers.PodLister], managingClusterDomain string) ([]metav1.Condition, map[string][]*discoveryv1.EndpointSlice, error) { | ||
func MakeRemoteEndpointSlices(sc *scyllav1alpha1.ScyllaDBCluster, dc *scyllav1alpha1.ScyllaDBClusterDatacenter, remoteNamespace *corev1.Namespace, remoteController metav1.Object, remoteNamespaces map[string]*corev1.Namespace, remoteServiceLister remotelister.GenericClusterLister[corev1listers.ServiceLister], remotePodLister remotelister.GenericClusterLister[corev1listers.PodLister], managingClusterDomain string) ([]metav1.Condition, []*discoveryv1.EndpointSlice, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In human speak, what does this function do? We're adding arguments to an already outrageous signature (8 arguments total) and I'm struggling to wrap my head around the semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In human speak, what does this function do?
It returns required remote EndpointSlices objects reconciled by ScyllaDBCluster controller.
We're adding arguments to an already outrageous signature (8 arguments total) and I'm struggling to wrap my head around the semantics.
There's one new argument - the DC which required objects should be returned. Previously these functions were returning a map of required objects with DC as a key. After the change, instead of map, we accept DC as a parameter and return required objects only for the DC requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns required remote EndpointSlices objects reconciled by ScyllaDBCluster controller.
I carefully looked into the implementation as this explanation didn't help me understand what the semantics of this function were. This description and the location sounds as if this function was something as simple as a getter, while the implementation does way way way more:
- for each non-local DC:
- computes a set of labels for the remote EndpointSlice (take from the scyllacluster or fall back to a hardocded default)
- populate the conventional protocols/ports
- computes the effective value of broadcast address type (from the cluster spec or the default) and based on it has completely different logic (just once on startup)
- has two different strategies for pod/LB broadcast types
- for podIP-based, go look into the respective pods and hardwire the IPs (effectively almost "reimplement kube-proxy")
- for ingress-based, go look in the ingress status to assemble the host address
- in each of these cases, include readiness changes in the result
The code health problem I'm trying to point at stems from the dissonance between the (overwhelming) complexity and the presentation of this function as a mere getter for a list of objects.
I realize that this existed before this PR (however is only relevant for multi-DC).
In a broader context (beyond this PR) this kind of dissonance should not be happening, and my expectation from newly written code is that pieces of logic are separated in a way that preserves unit-testability and SRP, where possible.
The main takeaway from me is that it took solid 30 minutes of reading to (1) understand this function "before" and (2) understand the change in this function. Even if the reader is new to the codebase (like myself) this is a problem.
Based on this (somewhat lengthy) rationale, the code health expectation is that, given a function, it's clear to the reader what the function does. I would not go as far as enforcing a rule that every func must have a docstring, but set an expectation that if a glance at the name/signature (or a short func body) isn't enough to comprehend the semantics, then something is wrong.
There is no code change required in this comment - I'm seeking shared understanding. Adding a docstring or splitting up the logic / making reasonable parts of it unit-testable sound like valid code health improvements in the current situation, but the fact that the situation was created earlier on and this PR only touches it "a little bit" would make such a request awkward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of if is part of building required object. Common parts could be extracted into shared functions - I'm all for it, but I don't see benefit of extracting every single block into separate function used in one place just for sake of principle.
As discussed offline, I'm not in favor of refactoring any logic other than what this PR has in title, so leaving as is.
remoteConfigMapControllerDegradedCondition = "RemoteConfigMapControllerDegraded" | ||
remoteSecretControllerProgressingCondition = "RemoteSecretControllerProgressing" | ||
remoteSecretControllerDegradedCondition = "RemoteSecretControllerDegraded" | ||
remoteRemoteOwnerControllerDatacenterProgressingConditionFormat = "RemoteRemoteOwnerControllerDatacenter%sProgressing" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting the format argument in the middle of the name doesn't sit right with me. Of all the reasons, it makes grepping very awkward.
Is there a reason why it's not e.g. RemoteXXXProgressing-DC-%s
? (I'm not sure if -
is legal in a condition name, but you get the idea)
The suggestion would be to keep the condition names as they are, and wrap the condition name generator logic in a generic function like return fmt.Sprintf("%s-DC-%s", condName, dcName)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aggregate conditions based on their suffix. The ones having *DC*Progressing
suffix are aggregated into DC*Progressing
condition and ones having *Progressing
are aggreated into final Progressing
condition, same with Degraded
and Available
.
Having it in the middle is more readable imo, as currently it reads as a sentence: RemoteNamespaceController(for)DatacenterXYZ(is)Progressing and is consistent with other types and their conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I've started looking into why this didn't sit right with me and did some more research, I arrived at the below conclusions (and would like to change the request of this comment as follows at the end).
This list is effectively an expanded Cartesian product
["Remote"]
x
["RemoteOwner", "ScyllaDBDatacenter", "Namespace", "Service", "EndpointSlice", "Endpoints", "ConfigMap", "Secret"]x
["ControllerDatacenter"]
x
["%s" (to be filled in on a different level, but contributes to cardinality)]
x
["Progressing", "Degraded"]
and any place that interacts with this list which means at least:
- the ladder of
RunSync
calls - the ladder of
SyncRemoteNamespacedObject
calls - the ladder of condition types (suffixes) in
aggregateDatacenterStatusConditions
need to be unrolled, leading to substantial repetition of logic, which I see as difficult to accept from the maintainability point of view:
- over time leading to slight divergence in logic between cases that should be solved by design instead (note the "Available" case in
aggregateDatacenterStatusConditions
), - making unit-testing of the logic in a general case not possible (or at least much more difficult because there is no pattern),
- requiring ongoing maintenance of all instances of the unrolled Cartesian product (e.g. if an additional value gets added) which also poses the risk of ommissions.
As said in the 1st paragraph, I arrived at a conclusion that the order of the fields in the product is not the biggest issue here - it's that the unrolling/repetition is necessary all over the place.
This problem predates this specific PR, but is prevalent across the ScyllaDBCluster implementation (note an earlier discussion on this topic #2576).
To make it actionable in a single PR and to keep my overall goal in mind (which is to curb the cardinality explosion of just about every piece of logic) reasonable, I suggest that:
- cardinality explosion in the list of condition names be resolved: by breaking up into chunks according to the algebra above (and assembled at call site)
- no (or little) new logic copy-pasta caused by the necessity of unrolling be added in this PR (the
aggregateDatacenterStatusConditions
looks like a good example)
I'm happy to offer some refactoring energy if that proves very complex to do on the code written so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored conditions added in this PR to suggested "abstraction".
@@ -414,6 +422,55 @@ func (scc *Controller) sync(ctx context.Context, key string) error { | |||
return utilerrors.NewAggregate(errs) | |||
} | |||
|
|||
func (scc *Controller) aggregateDatacenterStatusConditions(generation int64, conditions *[]metav1.Condition, dc *scyllav1alpha1.ScyllaDBClusterDatacenter) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the signature of this function is broader than necessary:
- the receiver is unused; suggestion: make it a free function instead of a method
- this function does not use the entire
dc
, justdc.Name
; suggestion: replace thedc
argument withdcName string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no longer there, I reused existing function aggregating condition by their suffixes
remoteConfigMapControllerDegradedCondition = "RemoteConfigMapControllerDegraded" | ||
remoteSecretControllerProgressingCondition = "RemoteSecretControllerProgressing" | ||
remoteSecretControllerDegradedCondition = "RemoteSecretControllerDegraded" | ||
remoteRemoteOwnerControllerDatacenterProgressingConditionFormat = "RemoteRemoteOwnerControllerDatacenter%sProgressing" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I've started looking into why this didn't sit right with me and did some more research, I arrived at the below conclusions (and would like to change the request of this comment as follows at the end).
This list is effectively an expanded Cartesian product
["Remote"]
x
["RemoteOwner", "ScyllaDBDatacenter", "Namespace", "Service", "EndpointSlice", "Endpoints", "ConfigMap", "Secret"]x
["ControllerDatacenter"]
x
["%s" (to be filled in on a different level, but contributes to cardinality)]
x
["Progressing", "Degraded"]
and any place that interacts with this list which means at least:
- the ladder of
RunSync
calls - the ladder of
SyncRemoteNamespacedObject
calls - the ladder of condition types (suffixes) in
aggregateDatacenterStatusConditions
need to be unrolled, leading to substantial repetition of logic, which I see as difficult to accept from the maintainability point of view:
- over time leading to slight divergence in logic between cases that should be solved by design instead (note the "Available" case in
aggregateDatacenterStatusConditions
), - making unit-testing of the logic in a general case not possible (or at least much more difficult because there is no pattern),
- requiring ongoing maintenance of all instances of the unrolled Cartesian product (e.g. if an additional value gets added) which also poses the risk of ommissions.
As said in the 1st paragraph, I arrived at a conclusion that the order of the fields in the product is not the biggest issue here - it's that the unrolling/repetition is necessary all over the place.
This problem predates this specific PR, but is prevalent across the ScyllaDBCluster implementation (note an earlier discussion on this topic #2576).
To make it actionable in a single PR and to keep my overall goal in mind (which is to curb the cardinality explosion of just about every piece of logic) reasonable, I suggest that:
- cardinality explosion in the list of condition names be resolved: by breaking up into chunks according to the algebra above (and assembled at call site)
- no (or little) new logic copy-pasta caused by the necessity of unrolling be added in this PR (the
aggregateDatacenterStatusConditions
looks like a good example)
I'm happy to offer some refactoring energy if that proves very complex to do on the code written so far.
Refactors ScyllaDBCluster controller to reconcile each datacenter independenly. Errors or connection issues of down datacenter no longer affects reconcilation of other datacenter.
606e28a
to
645b2c5
Compare
…center Existing Controller Conditions were changed to represent status of reconcilation of particular Datacenter. On top of these, aggregated per Datacenter Conditions were added.
Test verifies if Operator is able to reconcile healthy datacenters when any is down, and whether it fully reconciles the cluster when down DC is restored.
645b2c5
to
2a9c192
Compare
Refactors ScyllaDBCluster controller to reconcile each datacenter independenly.
Errors or connection issues of down datacenter no longer affects reconcilation of other datacenter.
Requires:
Fixes #2494