DEP: Design proposal for OnDelete Strategy #879
anveshreddy18 wants to merge 6 commits into gardener:master
Conversation
Force-pushed from b15721c to 6261c9b
Force-pushed from 6261c9b to 4670638
…ondelete update process. * The options being: a) statefulset component b) new controller that watches sts updates
Force-pushed from 4670638 to bd52516
/assign @shreyas-s-rao @unmarshall
ishan16696
left a comment
Thanks @anveshreddy18, a few comments from my initial look.
shreyas-s-rao
left a comment
@anveshreddy18 thanks a lot for the well-written PR. Some comments from me:
> ## Summary
>
> This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.

Suggested change:

> This proposal recommends supporting the `OnDelete` update strategy to be used by etcd-druid to update the StatefulSet pods backing the Etcd cluster members. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during spec updates to the Etcd cluster.
> - **Leader**: The etcd member responsible for handling client write requests and coordinating replication.
> - **Follower**: An etcd member that replicates data from the leader and can serve linearizable reads.
> - **Participating pod**: A pod whose etcd container is part of the quorum (the member is either a leader or a follower).
> - **Non-participating pod**: A pod whose etcd container is not part of the quorum (the member may be down, restarting, or not yet joined).
Still a bit unclear about this. A member may be restarting, but would it not still be part of the quorum? Wouldn't that make it a participating pod? A member that's not yet joined (or is a learner) is clearly a non-participating pod, but it seems like a member that's down (unhealthy) would also still be part of the quorum - just that if one member is down in a 3-member cluster, then only 2 members are healthy, and if 2 are down, then the cluster has lost quorum.
Please clarify this.
Nit: there's a "YES" label missing on the arrow between "Is this the original unhealthy pod" and "Wait for pod to be ready".
> ```yaml
> updateStrategy: OnDelete # or RollingUpdate
> ```
>
> - **Default value**: `OnDelete`
Shouldn't the default be RollingUpdate, to not break the current behaviour? Eventually it can be defaulted to OnDelete, maybe a few releases later, but not from the very beginning. Can you please mention this as a note maybe?
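For illustration, a minimal sketch of how the per-cluster field could look on the Etcd custom resource, assuming the field placement quoted above and the reviewer's suggested initial default; everything apart from `spec.updateStrategy` is an example placeholder, not the finalized API:

```yaml
# Illustrative sketch only; the updateStrategy field is taken from the quoted
# DEP context, the other values are example placeholders.
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main          # example name
spec:
  replicas: 3
  updateStrategy: OnDelete # or RollingUpdate; the review above suggests RollingUpdate as the initial default
```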
> This is a per-cluster choice. Operators can set different strategies for different Etcd clusters. Changing the field on a live cluster is supported and triggers a seamless transition (see [Transitioning Between Strategies](#transitioning-between-strategies)).
>
> The decision to use a spec field instead of a feature gate is documented in [Decision Record 003](../decisions/003-ondelete-as-spec-field-not-feature-gate.md). The key reasons are: per-cluster control, no forced migration at maturity, and both strategies remaining available indefinitely.
I don't see this decision record file in the PR or in your fork. Please add it.
> Transitioning between `RollingUpdate` and `OnDelete` is seamless and requires no manual intervention beyond changing the `spec.updateStrategy` field on the Etcd custom resource.
>
> **Switching from RollingUpdate to OnDelete:**
Same question as above. If there's an ongoing rolling update (that's stuck for some reason) and the strategy is changed to OnDelete, how would the OnDelete controller handle it? I would assume this should work fine since it checks the pod controller-revision-hash to know which ones were already updated, and only considers the remaining un-updated ones for an OnDelete-style update, correct?
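A minimal sketch of the revision-hash check described here, assuming the controller lists the member pods and compares their `controller-revision-hash` label against the StatefulSet's `status.updateRevision`; the helper name and surrounding wiring are illustrative, not the actual etcd-druid implementation:

```go
// Illustrative sketch: detect pods that still run the old revision. Pods already
// recreated at the new revision (e.g. by an interrupted RollingUpdate) are skipped,
// so switching strategies mid-update only processes the remaining outdated pods.
package ondelete

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func outdatedPods(sts *appsv1.StatefulSet, pods []corev1.Pod) []corev1.Pod {
	var outdated []corev1.Pod
	for _, pod := range pods {
		if pod.Labels[appsv1.ControllerRevisionHashLabelKey] != sts.Status.UpdateRevision {
			outdated = append(outdated, pod)
		}
	}
	return outdated
}
```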
> The OnDelete controller exposes the following metrics:
>
> - `etcd_druid_ondelete_update_duration_seconds`: Time from the first detection of an `updateRevision` change to the completion of all pod updates. Labeled by etcd cluster name.
Suggested change:

> - `etcddruid_ondelete_update_duration_seconds`: Time from the first detection of an `updateRevision` change to the completion of all pod updates. Labeled by etcd cluster name.
This is the current convention, but if it seems wrong, we can discuss changing it in the current code too, during implementation.
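For illustration only, a hedged sketch of how the two metrics quoted below could be registered following the `etcddruid` prefix convention mentioned here; the exact names, labels, and buckets are assumptions drawn from this discussion, not finalized definitions:

```go
// Illustrative sketch of possible metric definitions following the existing
// etcddruid_* naming convention; names, help texts and buckets are assumptions.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	onDeleteUpdateDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "etcddruid",
			Name:      "ondelete_update_duration_seconds",
			Help:      "Time from detecting an updateRevision change to completion of all pod updates.",
			Buckets:   prometheus.ExponentialBuckets(30, 2, 8), // 30s up to ~64min
		},
		[]string{"etcd"}, // etcd cluster name
	)
	onDeleteReconcileCycles = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "etcddruid",
			Name:      "ondelete_reconcile_cycles_total",
			Help:      "Number of reconciliation cycles needed to complete a full pod update.",
		},
		[]string{"etcd"},
	)
)

func init() {
	prometheus.MustRegister(onDeleteUpdateDuration, onDeleteReconcileCycles)
}
```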
> The OnDelete controller exposes the following metrics:
>
> - `etcd_druid_ondelete_update_duration_seconds`: Time from the first detection of an `updateRevision` change to the completion of all pod updates. Labeled by etcd cluster name.
> - `etcd_druid_ondelete_reconcile_cycles_total`: Number of reconciliation cycles required to complete a full pod update. Labeled by etcd cluster name.
> 3. The OnDelete controller's predicate no longer matches this StatefulSet. The Kubernetes StatefulSet controller resumes managing pod updates in its default ordinal order.
> 4. If there were outdated pods that the OnDelete controller had not yet processed, the StatefulSet controller picks them up and rolls them in the standard highest-to-lowest ordinal order.
>
> ### VPA and HVPA Interaction
Thanks for adding the section. There's no need to mention HVPA, since it's no longer used.
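Relatedly, for the switch-back steps quoted above (the OnDelete controller's predicate no longer matching the StatefulSet), a minimal sketch of what such a predicate could look like with controller-runtime; this is an assumption about the implementation, not etcd-druid code:

```go
// Illustrative sketch: a predicate that makes the OnDelete controller ignore
// StatefulSets that are back on the RollingUpdate strategy, so the built-in
// StatefulSet controller resumes handling their pod updates.
package ondelete

import (
	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func onDeleteStrategyPredicate() predicate.Predicate {
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		sts, ok := obj.(*appsv1.StatefulSet)
		if !ok {
			return false
		}
		return sts.Spec.UpdateStrategy.Type == appsv1.OnDeleteStatefulSetStrategyType
	})
}
```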
> - **Backup-restore container health in update ordering**: The current design does not consider the health of the backup-restore sidecar container when deciding which pod to update next. The rationale is that backup-restore health does not affect quorum, and prioritizing it could lead to unnecessary leader elections (for example, if a pod with an unhealthy backup-restore happens to be the leader). If future operational experience shows value in considering backup-restore health as a secondary sorting criterion, the priority order can be extended. The following state diagram illustrates what a backup-restore-aware ordering would look like:
>
> <div align="center">
> <img src="assets/06-OnDelete-StateDiagram-With-Etcdbr-Health.png" alt="OnDelete state diagram with backup-restore health awareness (future scope)" width="700">
I feel we don't need this diagram; it would just confuse readers of this DEP, since you state we don't worry about etcdbr health in this DEP.
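For readers who want the ordering in code form rather than a diagram, here is a minimal sketch of a priority-based candidate ordering consistent with the terminology in this DEP (non-participating pods first, the leader last to avoid an avoidable leader election); the type and ranking are illustrative assumptions, not the proposal's exact algorithm:

```go
// Illustrative sketch of ordering member pods for OnDelete updates.
// memberState and the ranking are assumptions based on this DEP's terminology.
package ondelete

import "sort"

type memberState struct {
	PodName       string
	Participating bool // part of quorum (leader or follower)
	IsLeader      bool
}

// updateOrder sorts members so non-participating pods are updated first,
// then followers, and the current leader last.
func updateOrder(members []memberState) []memberState {
	rank := func(m memberState) int {
		switch {
		case !m.Participating:
			return 0
		case !m.IsLeader:
			return 1
		default:
			return 2
		}
	}
	sort.SliceStable(members, func(i, j int) bool {
		// A backup-restore-health tie-breaker (future scope) could be added here.
		return rank(members[i]) < rank(members[j])
	})
	return members
}
```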
@anveshreddy18 please rebase the PR as well, to fix failing tests.
> ## Summary
>
> This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.

Suggested change:

> This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically rolls out the pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.
> ## Terminology
>
> - **OnDelete**: A StatefulSet update strategy where pods are only updated when they are explicitly deleted. The StatefulSet controller does not automatically roll pods on template changes.

Suggested change:

> - **OnDelete**: A StatefulSet update strategy where pods are only updated when they are explicitly deleted. The StatefulSet controller does not automatically roll out the pods on template changes.
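For reference, this is how the strategy appears on the underlying StatefulSet object (standard Kubernetes API; the object name is just an example):

```yaml
# Standard Kubernetes field: with OnDelete, the StatefulSet controller only
# recreates a pod at the new revision after that pod has been deleted.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd-main  # example name
spec:
  updateStrategy:
    type: OnDelete # Kubernetes default is RollingUpdate
```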
> - **Participating pod**: A pod whose etcd container is part of the quorum (the member is either a leader or a follower).
> - **Non-participating pod**: A pod whose etcd container is not part of the quorum (the member may be down, restarting, or not yet joined).
I'm not fully convinced by these 2 terminologies and their definitions. I understand what you're trying to say by defining "participating and non-participating pods", but IMO it's hard to define such a state and to take decisions on that basis, as member pods can be temporarily down as well as permanently down.
Now, let's take it case by case:
- If a member is temporarily down (pod restart), then it will come up ... nothing to be done from our side 🤷♂️.
- If a member is permanently down, then there are 2 cases again:
  - If only 1 member is down, then a single-member restoration will/should trigger, and it should come up as healthy.
  - If more than 1 member is unhealthy, then it's a permanent quorum loss; nobody can do anything, and it requires manual intervention.
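To make the quorum arithmetic in this thread concrete, a small sketch of the plain majority-quorum math (not etcd-druid code):

```go
// Plain etcd majority-quorum arithmetic used in the discussion above.
package quorum

// quorumSize returns the minimum number of healthy members needed for quorum.
func quorumSize(members int) int {
	return members/2 + 1
}

// hasQuorum reports whether the cluster still has quorum with the given number
// of healthy members. For a 3-member cluster: 2 healthy -> true, 1 healthy -> false.
func hasQuorum(members, healthy int) bool {
	return healthy >= quorumSize(members)
}
```

So in a 3-member cluster, one member down keeps quorum, while two members down (the RollingUpdate scenario discussed below) loses it until a pod rejoins.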
> <img src="assets/06-rolling-update-state-diagram.png" alt="RollingUpdate state diagram showing quorum loss when an unhealthy pod exists" width="500">
> </div>
>
> The StatefulSet controller starts from Pod N (the highest ordinal), terminates it, and waits for the new pod to become ready. If the terminated pod is not the originally unhealthy one, quorum is lost with 2 members down.
Most likely it will be a transient quorum loss, right? As the Nth pod (highest ordinal) will eventually come up.
> The StatefulSet controller starts from Pod N (the highest ordinal), terminates it, and waits for the new pod to become ready. If the terminated pod is not the originally unhealthy one, quorum is lost with 2 members down.
>
> For a single-node etcd cluster, both `RollingUpdate` and `OnDelete` produce the same outcome since there is only one pod to update. The benefit of `OnDelete` is specific to multi-node clusters where update ordering matters.
Should we keep using RollingUpdate for a singleton cluster? Since there is no effect of moving to OnDelete there, why make it complicated?
On scale-up, just update the updateStrategy to OnDelete, wdyt?
/cc @shreyas-s-rao
For single-node it doesn't make much of a difference. But we can avoid the switching between update strategies when scale-up happens. And of course, if the Etcd spec says `replicas: 1, updateStrategy: OnDelete`, then druid must respect that and set it in the sts spec accordingly.
> The OnDelete controller detects outdated pods by comparing `controller-revision-hash` labels. This hash is computed from the pod template spec only and does not include `volumeClaimTemplates`. This means that a change to `storageCapacity` or `storageClass` alone (without any pod template change) will not be detected by the OnDelete controller. The PVC resize flow must handle pod deletion independently in such cases.
>
> The details of the PVC resize flow (orphan-delete of the StatefulSet, per-pod PVC replacement, interaction with the OnDelete controller) will be covered in a separate proposal.
Let's not mention the extra details which are not decided yet; maybe just mention the issue no. #481 to follow.
How to categorize this PR?
/area documentation
/kind enhancement
What this PR does / why we need it:
This PR adds the enhancement proposal for the `OnDelete` update strategy undertaken by `etcd-druid`.
Which issue(s) this PR fixes:
Writes proposal for #588
Special notes for your reviewer:
Release note: