DEP: Design proposal for OnDelete Strategy #879
anveshreddy18 wants to merge 6 commits into gardener:master
Conversation
Force-pushed from b15721c to 6261c9b
Force-pushed from 6261c9b to 4670638
…ondelete update process. * The options being: a) statefulset component b) new controller that watches sts updates
Force-pushed from 4670638 to bd52516
/assign @shreyas-s-rao @unmarshall
ishan16696
left a comment
Thanks @anveshreddy18, a few comments from my initial look.
shreyas-s-rao
left a comment
@anveshreddy18 thanks a lot for the well-written PR. Some comments from me:
> ## Summary
>
> This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.

Suggested change:

> This proposal recommends supporting the `OnDelete` update strategy to be used by etcd-druid to update the StatefulSet pods backing the Etcd cluster members. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during spec updates to the Etcd cluster.
> - **Leader**: The etcd member responsible for handling client write requests and coordinating replication.
> - **Follower**: An etcd member that replicates data from the leader and can serve linearizable reads.
> - **Participating pod**: A pod whose etcd container is part of the quorum (the member is either a leader or a follower).
> - **Non-participating pod**: A pod whose etcd container is not part of the quorum (the member may be down, restarting, or not yet joined).
Still a bit unclear about this. A member may be restarting, but would it not still be part of the quorum? Wouldn't that make it a participating pod? A member that's not yet joined (or is a learner) is clearly a non-participating pod, but it seems like a member that's down (unhealthy) would also still be part of the quorum - just that if one member is down in a 3-member cluster, then only 2 members are healthy, and if 2 are down, then the cluster has lost quorum.
Please clarify this.
Nit: there's a "YES" label missing on the arrow between "Is this the original unhealthy pod" and "Wait for pod to be ready".
> ```yaml
> updateStrategy: OnDelete # or RollingUpdate
> ```
>
> - **Default value**: `OnDelete`
Shouldn't the default be RollingUpdate, to not break the current behaviour? Eventually it can be defaulted to OnDelete, maybe a few releases later, but not from the very beginning. Can you please mention this as a note maybe?
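For illustration, a minimal sketch of how the per-cluster field could look on the Etcd custom resource, assuming the field placement quoted above and the reviewer's suggested initial default; everything apart from `spec.updateStrategy` is an example placeholder, not the finalized API:

```yaml
# Illustrative sketch only; the updateStrategy field is taken from the quoted
# DEP context, the other values are example placeholders.
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main          # example name
spec:
  replicas: 3
  updateStrategy: OnDelete # or RollingUpdate; the review above suggests RollingUpdate as the initial default
```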
> This is a per-cluster choice. Operators can set different strategies for different Etcd clusters. Changing the field on a live cluster is supported and triggers a seamless transition (see [Transitioning Between Strategies](#transitioning-between-strategies)).
>
> The decision to use a spec field instead of a feature gate is documented in [Decision Record 003](../decisions/003-ondelete-as-spec-field-not-feature-gate.md). The key reasons are: per-cluster control, no forced migration at maturity, and both strategies remaining available indefinitely.
I don't see this decision record file in the PR or in your fork. Please add it.
> Transitioning between `RollingUpdate` and `OnDelete` is seamless and requires no manual intervention beyond changing the `spec.updateStrategy` field on the Etcd custom resource.
>
> **Switching from RollingUpdate to OnDelete:**
Same question as above. If there's an ongoing rolling update (that's stuck for some reason) and the strategy is changed to OnDelete, how would the OnDelete controller handle it? I would assume this should work fine since it checks the pod controller-revision-hash to know which ones were already updated, and only considers the remaining un-updated ones for an OnDelete-style update, correct?
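A minimal sketch of the revision-hash check described here, assuming the controller lists the member pods and compares their `controller-revision-hash` label against the StatefulSet's `status.updateRevision`; the helper name and surrounding wiring are illustrative, not the actual etcd-druid implementation:

```go
// Illustrative sketch: detect pods that still run the old revision. Pods already
// recreated at the new revision (e.g. by an interrupted RollingUpdate) are skipped,
// so switching strategies mid-update only processes the remaining outdated pods.
package ondelete

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func outdatedPods(sts *appsv1.StatefulSet, pods []corev1.Pod) []corev1.Pod {
	var outdated []corev1.Pod
	for _, pod := range pods {
		if pod.Labels[appsv1.ControllerRevisionHashLabelKey] != sts.Status.UpdateRevision {
			outdated = append(outdated, pod)
		}
	}
	return outdated
}
```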
> The OnDelete controller exposes the following metrics:
>
> - `etcd_druid_ondelete_update_duration_seconds`: Time from the first detection of an `updateRevision` change to the completion of all pod updates. Labeled by etcd cluster name.
Suggested change:

> - `etcddruid_ondelete_update_duration_seconds`: Time from the first detection of an `updateRevision` change to the completion of all pod updates. Labeled by etcd cluster name.
This is the current convention, but if it seems wrong, we can discuss changing it in the current code too, during implementation.
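For illustration only, a hedged sketch of how the two metrics quoted below could be registered following the `etcddruid` prefix convention mentioned here; the exact names, labels, and buckets are assumptions drawn from this discussion, not finalized definitions:

```go
// Illustrative sketch of possible metric definitions following the existing
// etcddruid_* naming convention; names, help texts and buckets are assumptions.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	onDeleteUpdateDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "etcddruid",
			Name:      "ondelete_update_duration_seconds",
			Help:      "Time from detecting an updateRevision change to completion of all pod updates.",
			Buckets:   prometheus.ExponentialBuckets(30, 2, 8), // 30s up to ~64min
		},
		[]string{"etcd"}, // etcd cluster name
	)
	onDeleteReconcileCycles = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "etcddruid",
			Name:      "ondelete_reconcile_cycles_total",
			Help:      "Number of reconciliation cycles needed to complete a full pod update.",
		},
		[]string{"etcd"},
	)
)

func init() {
	prometheus.MustRegister(onDeleteUpdateDuration, onDeleteReconcileCycles)
}
```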
> The OnDelete controller exposes the following metrics:
>
> - `etcd_druid_ondelete_update_duration_seconds`: Time from the first detection of an `updateRevision` change to the completion of all pod updates. Labeled by etcd cluster name.
> - `etcd_druid_ondelete_reconcile_cycles_total`: Number of reconciliation cycles required to complete a full pod update. Labeled by etcd cluster name.
> 3. The OnDelete controller's predicate no longer matches this StatefulSet. The Kubernetes StatefulSet controller resumes managing pod updates in its default ordinal order.
> 4. If there were outdated pods that the OnDelete controller had not yet processed, the StatefulSet controller picks them up and rolls them in the standard highest-to-lowest ordinal order.
>
> ### VPA and HVPA Interaction
Thanks for adding the section. There's no need to mention HVPA, since it's no longer used.
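Relatedly, for the switch-back steps quoted above (the OnDelete controller's predicate no longer matching the StatefulSet), a minimal sketch of what such a predicate could look like with controller-runtime; this is an assumption about the implementation, not etcd-druid code:

```go
// Illustrative sketch: a predicate that makes the OnDelete controller ignore
// StatefulSets that are back on the RollingUpdate strategy, so the built-in
// StatefulSet controller resumes handling their pod updates.
package ondelete

import (
	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func onDeleteStrategyPredicate() predicate.Predicate {
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		sts, ok := obj.(*appsv1.StatefulSet)
		if !ok {
			return false
		}
		return sts.Spec.UpdateStrategy.Type == appsv1.OnDeleteStatefulSetStrategyType
	})
}
```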
> - **Backup-restore container health in update ordering**: The current design does not consider the health of the backup-restore sidecar container when deciding which pod to update next. The rationale is that backup-restore health does not affect quorum, and prioritizing it could lead to unnecessary leader elections (for example, if a pod with an unhealthy backup-restore happens to be the leader). If future operational experience shows value in considering backup-restore health as a secondary sorting criterion, the priority order can be extended. The following state diagram illustrates what a backup-restore-aware ordering would look like:
>
> <div align="center">
> <img src="assets/06-OnDelete-StateDiagram-With-Etcdbr-Health.png" alt="OnDelete state diagram with backup-restore health awareness (future scope)" width="700">
I feel we don't need this diagram; it would just confuse readers of this DEP, since you state we don't worry about etcdbr health in this DEP.
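For readers who want the ordering in code form rather than a diagram, here is a minimal sketch of a priority-based candidate ordering consistent with the terminology in this DEP (non-participating pods first, the leader last to avoid an avoidable leader election); the type and ranking are illustrative assumptions, not the proposal's exact algorithm:

```go
// Illustrative sketch of ordering member pods for OnDelete updates.
// memberState and the ranking are assumptions based on this DEP's terminology.
package ondelete

import "sort"

type memberState struct {
	PodName       string
	Participating bool // part of quorum (leader or follower)
	IsLeader      bool
}

// updateOrder sorts members so non-participating pods are updated first,
// then followers, and the current leader last.
func updateOrder(members []memberState) []memberState {
	rank := func(m memberState) int {
		switch {
		case !m.Participating:
			return 0
		case !m.IsLeader:
			return 1
		default:
			return 2
		}
	}
	sort.SliceStable(members, func(i, j int) bool {
		// A backup-restore-health tie-breaker (future scope) could be added here.
		return rank(members[i]) < rank(members[j])
	})
	return members
}
```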
@anveshreddy18 please rebase the PR as well, to fix failing tests.
> ## Summary
>
> This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.

Suggested change:

> This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically rolls out the pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.
> ## Terminology
>
> - **OnDelete**: A StatefulSet update strategy where pods are only updated when they are explicitly deleted. The StatefulSet controller does not automatically roll pods on template changes.

Suggested change:

> - **OnDelete**: A StatefulSet update strategy where pods are only updated when they are explicitly deleted. The StatefulSet controller does not automatically roll out the pods on template changes.
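For reference, this is how the strategy appears on the underlying StatefulSet object (standard Kubernetes API; the object name is just an example):

```yaml
# Standard Kubernetes field: with OnDelete, the StatefulSet controller only
# recreates a pod at the new revision after that pod has been deleted.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd-main  # example name
spec:
  updateStrategy:
    type: OnDelete # Kubernetes default is RollingUpdate
```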
> - **Participating pod**: A pod whose etcd container is part of the quorum (the member is either a leader or a follower).
> - **Non-participating pod**: A pod whose etcd container is not part of the quorum (the member may be down, restarting, or not yet joined).
I'm not fully convinced by these 2 terminologies and their definitions. I understand what you're trying to say by defining "participating and non-participating pods", but IMO it's hard to define such a state and to take decisions on that basis, as member pods can be temporarily down as well as permanently down.
Now, let's take it case by case:
- If a member is temporarily down (pod restart), then it will come up ... nothing to be done from our side 🤷♂️.
- If a member is permanently down, then there are 2 cases again:
  - If only 1 member is down, then a single-member restoration will/should trigger, and it should come up as healthy.
  - If more than 1 member is unhealthy, then it's a permanent quorum loss; nobody can do anything, and it requires manual intervention.
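To make the quorum arithmetic in this thread concrete, a small sketch of the plain majority-quorum math (not etcd-druid code):

```go
// Plain etcd majority-quorum arithmetic used in the discussion above.
package quorum

// quorumSize returns the minimum number of healthy members needed for quorum.
func quorumSize(members int) int {
	return members/2 + 1
}

// hasQuorum reports whether the cluster still has quorum with the given number
// of healthy members. For a 3-member cluster: 2 healthy -> true, 1 healthy -> false.
func hasQuorum(members, healthy int) bool {
	return healthy >= quorumSize(members)
}
```

So in a 3-member cluster, one member down keeps quorum, while two members down (the RollingUpdate scenario discussed below) loses it until a pod rejoins.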
> <img src="assets/06-rolling-update-state-diagram.png" alt="RollingUpdate state diagram showing quorum loss when an unhealthy pod exists" width="500">
> </div>
>
> The StatefulSet controller starts from Pod N (the highest ordinal), terminates it, and waits for the new pod to become ready. If the terminated pod is not the originally unhealthy one, quorum is lost with 2 members down.
Most likely it will be a transient quorum loss, right? As the Nth pod (highest ordinal) will eventually come up.
> The StatefulSet controller starts from Pod N (the highest ordinal), terminates it, and waits for the new pod to become ready. If the terminated pod is not the originally unhealthy one, quorum is lost with 2 members down.
>
> For a single-node etcd cluster, both `RollingUpdate` and `OnDelete` produce the same outcome since there is only one pod to update. The benefit of `OnDelete` is specific to multi-node clusters where update ordering matters.
Should we keep using RollingUpdate for a singleton cluster? Since there is no effect of moving to OnDelete there, why make it complicated?
On scale-up, just update the updateStrategy to OnDelete, wdyt?
/cc @shreyas-s-rao
For single-node it doesn't make much of a difference. But we can avoid the switching between update strategies when scale-up happens. And of course, if the Etcd spec says `replicas: 1, updateStrategy: OnDelete`, then druid must respect that and set it in the sts spec accordingly.
> The OnDelete controller detects outdated pods by comparing `controller-revision-hash` labels. This hash is computed from the pod template spec only and does not include `volumeClaimTemplates`. This means that a change to `storageCapacity` or `storageClass` alone (without any pod template change) will not be detected by the OnDelete controller. The PVC resize flow must handle pod deletion independently in such cases.
>
> The details of the PVC resize flow (orphan-delete of the StatefulSet, per-pod PVC replacement, interaction with the OnDelete controller) will be covered in a separate proposal.
Let's not mention the extra details which are not decided yet; maybe just mention the issue no. #481 to follow.
How to categorize this PR?
/area documentation
/kind enhancement
What this PR does / why we need it:
This PR adds the enhancement proposal for the `OnDelete` update strategy undertaken by `etcd-druid`.
Which issue(s) this PR fixes:
Writes proposal for #588
Special notes for your reviewer:
Release note: