
DEP: Design proposal for OnDelete Strategy#879

Open
anveshreddy18 wants to merge 6 commits into gardener:master from anveshreddy18:proposal/ondelete

Conversation

@anveshreddy18
Member

How to categorize this PR?

/area documentation
/kind enhancement

What this PR does / why we need it:
This PR adds the enhancement proposal for OnDelete update strategy undertaken by etcd-druid.

Which issue(s) this PR fixes:
Writes proposal for #588

Special notes for your reviewer:

Release note:

Introduce DEP-06: Druid controlled pod updates with the help of StatefulSet OnDelete Strategy

@anveshreddy18 anveshreddy18 requested a review from a team as a code owner September 23, 2024 08:52
@gardener-robot gardener-robot added needs/review Needs review area/documentation Documentation related kind/enhancement Enhancement, improvement, extension size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 23, 2024
@ghost ghost added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Sep 23, 2024
…ondelete update process.

* The options being: a) statefulset component b) new controller that watches sts updates
@shreyas-s-rao
Member

/assign @shreyas-s-rao @unmarshall

@shreyas-s-rao shreyas-s-rao reopened this Jan 1, 2026
@gardener-robot gardener-robot added status/accepted Issue was accepted as something we need to work on and removed status/closed Issue is closed (either delivered or triaged) labels Jan 1, 2026
@gardener-ci-robot

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 14d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as active with /lifecycle active
  • Mark this PR as fresh with /remove-lifecycle stale
  • Mark this PR as rotten with /lifecycle rotten
  • Close this PR with /close

/lifecycle stale

@gardener-prow gardener-prow Bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Jan 31, 2026
@gardener-ci-robot


/lifecycle rotten

@gardener-prow gardener-prow Bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 7, 2026
@anveshreddy18 anveshreddy18 removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 11, 2026
@gardener-ci-robot


/lifecycle stale

@gardener-prow gardener-prow Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2026
@ishan16696 ishan16696 added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 10, 2026
Member

@ishan16696 ishan16696 left a comment


Thanks @anveshreddy18 , few comments from my initial look.

9 outdated comment threads on docs/proposals/06-sts-ondelete-strategy.md
@gardener-prow

gardener-prow Bot commented Apr 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ishan16696. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 20, 2026
Member

@shreyas-s-rao shreyas-s-rao left a comment


@anveshreddy18 thanks a lot for the well-written PR. Some comments from me:


## Summary

This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.
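The ordering described above can be sketched in Go. This is a hypothetical illustration, not the actual etcd-druid implementation: the `Role` values and `UpdateOrder` function are assumed names. The idea is that pods whose members are not part of the quorum are recreated first (they cost nothing extra), followers next, and the leader last, so at most one quorum member is down at any time and no avoidable leader election is triggered early.

```go
package main

import (
	"fmt"
	"sort"
)

// Role of the etcd member backing a pod. Lower values are updated first.
// Names and priorities are illustrative only.
type Role int

const (
	NonParticipating Role = iota // down, restarting, or not yet joined
	Follower
	Leader
)

type Pod struct {
	Name string
	Role Role
}

// UpdateOrder sorts pods so that non-participating members are recreated
// first, followers next, and the leader last, keeping the number of
// simultaneously-down quorum members at one.
func UpdateOrder(pods []Pod) []Pod {
	out := append([]Pod(nil), pods...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Role < out[j].Role })
	return out
}

func main() {
	pods := []Pod{
		{"etcd-0", Follower},
		{"etcd-1", Leader},
		{"etcd-2", NonParticipating},
	}
	for _, p := range UpdateOrder(pods) {
		fmt.Println(p.Name) // etcd-2, then etcd-0, then etcd-1
	}
}
```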
Member


Suggested change
This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.
This proposal recommends supporting the `OnDelete` update strategy to be used by etcd-druid to update the StatefulSet pods backing the Etcd cluster members. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during spec updates to the Etcd cluster.

- **Leader**: The etcd member responsible for handling client write requests and coordinating replication.
- **Follower**: An etcd member that replicates data from the leader and can serve linearizable reads.
- **Participating pod**: A pod whose etcd container is part of the quorum (the member is either a leader or a follower).
- **Non-participating pod**: A pod whose etcd container is not part of the quorum (the member may be down, restarting, or not yet joined).
Member


Still a bit unclear about this. A member may be restarting, but would still be part of the quorum right? Wouldn't that make it a participating pod? A member that's not yet joined (or is a learner) is clearly a non-participating pod, but it seems like a member that's down (unhealthy) would also still be part of the quorum - just that if one member is down in a 3-member cluster, then only 2 members are healthy, and if 2 are down, then the cluster has lost quorum.

Please clarify this.
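One way to make the distinction the reviewer is asking about concrete (a sketch with assumed definitions, not the proposal's final wording): quorum size is computed from the member *list*, which includes members that are currently down, while "participating" refers to whether a member is currently healthy and voting. A down member still counts toward cluster size, so it raises the quorum threshold without contributing to it.

```go
package main

import "fmt"

// Quorum returns the minimum number of healthy members needed for the
// cluster to make progress: floor(n/2) + 1 of the member list, which
// includes members that are currently down.
func Quorum(members int) int { return members/2 + 1 }

// HasQuorum reports whether enough members are healthy. A down member
// still counts toward the member list, but not toward the healthy count.
func HasQuorum(members, healthy int) bool { return healthy >= Quorum(members) }

func main() {
	fmt.Println(Quorum(3))       // 2
	fmt.Println(HasQuorum(3, 2)) // true: one member down, quorum intact
	fmt.Println(HasQuorum(3, 1)) // false: two members down, quorum lost
}
```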

Member


Nit: there's a "YES" missing on the arrow from "Is this the original unhealthy pod" to "Wait for pod to be ready".

```yaml
updateStrategy: OnDelete # or RollingUpdate
```

- **Default value**: `OnDelete`
Member


Shouldn't the default be RollingUpdate, to not break the current behaviour? Eventually it can be defaulted to OnDelete, maybe a few releases later, but not from the very beginning. Can you please mention this as a note maybe?


This is a per-cluster choice. Operators can set different strategies for different Etcd clusters. Changing the field on a live cluster is supported and triggers a seamless transition (see [Transitioning Between Strategies](#transitioning-between-strategies)).

The decision to use a spec field instead of a feature gate is documented in [Decision Record 003](../decisions/003-ondelete-as-spec-field-not-feature-gate.md). The key reasons are: per-cluster control, no forced migration at maturity, and both strategies remaining available indefinitely.
Member


I don't see this decision record file in the PR or in your fork. Please add it.


Transitioning between `RollingUpdate` and `OnDelete` is seamless and requires no manual intervention beyond changing the `spec.updateStrategy` field on the Etcd custom resource.

**Switching from RollingUpdate to OnDelete:**
Member


Same question as above. If there's an ongoing rolling update (that's stuck for some reason) and the strategy is changed to OnDelete, how would OnDelete controller handle it? I would assume this should work fine since it checks the pod controller-revision-hash to know which ones were already updated, and only considers the remaining un-updated ones for ondelete-style update, correct?
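The check the reviewer describes can be sketched as follows (plain structs instead of the Kubernetes API types; only the `controller-revision-hash` label key is taken from the real StatefulSet controller): a pod is outdated when its revision label differs from the StatefulSet's `status.updateRevision`, regardless of which strategy produced the current state. Pods already rolled by a previous `RollingUpdate` carry the new hash and are skipped.

```go
package main

import "fmt"

// Label set on each pod by the Kubernetes StatefulSet controller.
const revisionLabel = "controller-revision-hash"

type Pod struct {
	Name   string
	Labels map[string]string
}

// OutdatedPods returns the pods whose revision label does not match the
// StatefulSet's updateRevision. A mid-rollout switch to OnDelete therefore
// only processes the pods a stuck RollingUpdate had not yet reached.
func OutdatedPods(pods []Pod, updateRevision string) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.Labels[revisionLabel] != updateRevision {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{"etcd-0", map[string]string{revisionLabel: "rev-2"}}, // already updated
		{"etcd-1", map[string]string{revisionLabel: "rev-1"}}, // still outdated
	}
	for _, p := range OutdatedPods(pods, "rev-2") {
		fmt.Println(p.Name) // etcd-1
	}
}
```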


The OnDelete controller exposes the following metrics:

- `etcd_druid_ondelete_update_duration_seconds`: Time from the first detection of an `updateRevision` change to the completion of all pod updates. Labeled by etcd cluster name.
Member


Suggested change
- `etcd_druid_ondelete_update_duration_seconds`: Time from the first detection of an `updateRevision` change to the completion of all pod updates. Labeled by etcd cluster name.
- `etcddruid_ondelete_update_duration_seconds`: Time from the first detection of an `updateRevision` change to the completion of all pod updates. Labeled by etcd cluster name.

This is the current convention, but if it seems wrong, we can discuss changing it in the current code too, during implementation.

The OnDelete controller exposes the following metrics:

- `etcd_druid_ondelete_update_duration_seconds`: Time from the first detection of an `updateRevision` change to the completion of all pod updates. Labeled by etcd cluster name.
- `etcd_druid_ondelete_reconcile_cycles_total`: Number of reconciliation cycles required to complete a full pod update. Labeled by etcd cluster name.
Member


Same as above

3. The OnDelete controller's predicate no longer matches this StatefulSet. The Kubernetes StatefulSet controller resumes managing pod updates in its default ordinal order.
4. If there were outdated pods that the OnDelete controller had not yet processed, the StatefulSet controller picks them up and rolls them in the standard highest-to-lowest ordinal order.

### VPA and HVPA Interaction
Member


Thanks for adding the section. There's no need to mention HVPA, since it's no longer used.

- **Backup-restore container health in update ordering**: The current design does not consider the health of the backup-restore sidecar container when deciding which pod to update next. The rationale is that backup-restore health does not affect quorum, and prioritizing it could lead to unnecessary leader elections (for example, if a pod with an unhealthy backup-restore happens to be the leader). If future operational experience shows value in considering backup-restore health as a secondary sorting criterion, the priority order can be extended. The following state diagram illustrates what a backup-restore-aware ordering would look like:

<div align="center">
<img src="assets/06-OnDelete-StateDiagram-With-Etcdbr-Health.png" alt="OnDelete state diagram with backup-restore health awareness (future scope)" width="700">
Member


I feel we don't need this diagram, it would just confuse readers of this DEP, since you state we don't worry about etcdbr health in this DEP.

@shreyas-s-rao
Member

@anveshreddy18 please rebase the PR as well, to fix failing tests.

@shreyas-s-rao shreyas-s-rao removed their assignment Apr 23, 2026

## Summary

This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.
Member


Suggested change
This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically restarts pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.
This proposal recommends changing the StatefulSet update strategy used by etcd-druid from `RollingUpdate` to `OnDelete`. With `OnDelete`, the Kubernetes StatefulSet controller no longer automatically rolls out the pods when the pod template changes. Instead, a new dedicated controller in etcd-druid takes responsibility for deleting and recreating pods in a carefully chosen order that accounts for etcd member health and cluster role. The goal is to prevent unintended quorum loss during pod updates.


## Terminology

- **OnDelete**: A StatefulSet update strategy where pods are only updated when they are explicitly deleted. The StatefulSet controller does not automatically roll pods on template changes.
Member


Suggested change
- **OnDelete**: A StatefulSet update strategy where pods are only updated when they are explicitly deleted. The StatefulSet controller does not automatically roll pods on template changes.
- **OnDelete**: A StatefulSet update strategy where pods are only updated when they are explicitly deleted. The StatefulSet controller does not automatically roll out the pods on template changes.

Comment on lines +25 to +26
- **Participating pod**: A pod whose etcd container is part of the quorum (the member is either a leader or a follower).
- **Non-participating pod**: A pod whose etcd container is not part of the quorum (the member may be down, restarting, or not yet joined).
Member


I'm not fully convinced by these two terminologies and their definitions. I understand what you're trying to say by defining "participating" and "non-participating" pods, but IMO it's hard to define such a state and to take decisions on that basis, since member pods can be down temporarily as well as permanently.
Now, let's take it case by case:

  1. If a member is temporarily down (pod restart), it will come back up ... nothing to be done from our side 🤷‍♂️.
  2. If a member is permanently down, there are two cases again:
    • If only 1 member is down, a single-member restoration will/should trigger, and the member should come up healthy.
    • If more than 1 member is unhealthy, it's a permanent quorum loss; nobody can do anything, and it requires manual intervention.

<img src="assets/06-rolling-update-state-diagram.png" alt="RollingUpdate state diagram showing quorum loss when an unhealthy pod exists" width="500">
</div>

The StatefulSet controller starts from Pod N (the highest ordinal), terminates it, and waits for the new pod to become ready. If the terminated pod is not the originally unhealthy one, quorum is lost with 2 members down.
Member


Most likely it will be a transient quorum loss, right? The Nth pod (highest ordinal) will eventually come back up.
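The scenario can be worked through numerically (a sketch assuming a 3-member cluster; the variable names are illustrative): with one member already unhealthy, terminating a healthy pod leaves 1 of 3 members healthy, below the quorum of 2. If the recreated pod becomes ready, quorum recovers, so the loss is transient; but the rolling update itself stays blocked waiting on the originally unhealthy pod, so the window can repeat.

```go
package main

import "fmt"

// quorum is floor(n/2) + 1 of the member list.
func quorum(n int) int { return n/2 + 1 }

func main() {
	const members = 3
	healthyBefore := 2 // one member already unhealthy

	// The StatefulSet controller terminates the highest-ordinal pod,
	// which is not the unhealthy one: a second member goes down.
	healthyDuring := healthyBefore - 1

	fmt.Println(healthyBefore >= quorum(members)) // true: quorum intact before the update
	fmt.Println(healthyDuring >= quorum(members)) // false: quorum lost while the pod restarts

	// Once the recreated pod rejoins, 2 of 3 members are healthy again,
	// so the loss is transient, unless the restarted pod cannot become
	// ready, in which case the quorum loss persists.
	healthyAfter := healthyDuring + 1
	fmt.Println(healthyAfter >= quorum(members)) // true
}
```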


The StatefulSet controller starts from Pod N (the highest ordinal), terminates it, and waits for the new pod to become ready. If the terminated pod is not the originally unhealthy one, quorum is lost with 2 members down.

For a single-node etcd cluster, both `RollingUpdate` and `OnDelete` produce the same outcome since there is only one pod to update. The benefit of `OnDelete` is specific to multi-node clusters where update ordering matters.
Member


Should we keep using RollingUpdate for a singleton cluster? Since moving to OnDelete has no effect there, why make it complicated?
On scale-up, just update the updateStrategy to OnDelete, wdyt?
/cc @shreyas-s-rao

Member


For single-node it doesn't make much of a difference. But we can avoid switching between update strategies when scale-up happens. And of course, if the Etcd spec says `replicas: 1, updateStrategy: OnDelete`, then druid must respect that and set it in the sts spec accordingly.


The OnDelete controller detects outdated pods by comparing `controller-revision-hash` labels. This hash is computed from the pod template spec only and does not include `volumeClaimTemplates`. This means that a change to `storageCapacity` or `storageClass` alone (without any pod template change) will not be detected by the OnDelete controller. The PVC resize flow must handle pod deletion independently in such cases.

The details of the PVC resize flow (orphan-delete of the StatefulSet, per-pod PVC replacement, interaction with the OnDelete controller) will be covered in a separate proposal.
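The property described above can be illustrated with a toy hash (this is not the real StatefulSet controller hash, which uses FNV over the serialized revision; the struct and field names here are invented for the illustration): because only the pod template feeds the hash, two specs that differ solely in storage produce identical revisions, so the OnDelete controller sees no outdated pods.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Toy stand-in for a StatefulSet spec: one field represents the pod
// template, the other represents volumeClaimTemplates.
type StatefulSetSpec struct {
	PodTemplateImage string // stands in for the pod template spec
	StorageCapacity  string // part of volumeClaimTemplates, NOT hashed
}

// revisionHash mimics the relevant property of the controller's revision
// hash: it covers only the pod template, not volumeClaimTemplates.
func revisionHash(s StatefulSetSpec) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s.PodTemplateImage))
	return h.Sum32()
}

func main() {
	a := StatefulSetSpec{PodTemplateImage: "etcd:v3.5", StorageCapacity: "10Gi"}
	b := StatefulSetSpec{PodTemplateImage: "etcd:v3.5", StorageCapacity: "25Gi"}
	// Same hash despite the storage change: the OnDelete controller sees
	// no outdated pods, so the PVC resize flow must delete pods itself.
	fmt.Println(revisionHash(a) == revisionHash(b)) // true
}
```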
Member


Let's not mention extra details that are not decided yet; maybe just reference issue #481 to follow.

@gardener-prow gardener-prow Bot requested a review from shreyas-s-rao April 23, 2026 16:53

Labels

area/documentation Documentation related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review needs/second-opinion Needs second review by someone else size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/accepted Issue was accepted as something we need to work on
