From 0190222c7b32fbe6ec84c5e62ec7cb10eaf58044 Mon Sep 17 00:00:00 2001 From: ravisantoshgudimetla Date: Tue, 14 Jan 2025 15:17:08 -0800 Subject: [PATCH] PDB based taint tolerance --- keps/prod-readiness/sig-apps/4987.yaml | 3 + .../4987-taint-tolerance-pdb/README.md | 824 ++++++++++++++++++ .../4987-taint-tolerance-pdb/kep.yaml | 36 + 3 files changed, 863 insertions(+) create mode 100644 keps/prod-readiness/sig-apps/4987.yaml create mode 100644 keps/sig-apps/4987-taint-tolerance-pdb/README.md create mode 100644 keps/sig-apps/4987-taint-tolerance-pdb/kep.yaml diff --git a/keps/prod-readiness/sig-apps/4987.yaml b/keps/prod-readiness/sig-apps/4987.yaml new file mode 100644 index 00000000000..799de1f4b29 --- /dev/null +++ b/keps/prod-readiness/sig-apps/4987.yaml @@ -0,0 +1,3 @@ +kep-number: 4987 +alpha: + approver: "@soltysh" diff --git a/keps/sig-apps/4987-taint-tolerance-pdb/README.md b/keps/sig-apps/4987-taint-tolerance-pdb/README.md new file mode 100644 index 00000000000..4fa62c37898 --- /dev/null +++ b/keps/sig-apps/4987-taint-tolerance-pdb/README.md @@ -0,0 +1,824 @@ + +# KEP-4987: PDBBasedTaintTolerance + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew 
Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] 
Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary +Currently, tolerations allow users to specify how long a pod can tolerate a taint. This proposal aims to introduce an additional dimension of toleration based on the Pod Disruption Budget. + + + +## Motivation +The current approach of using tolerationSeconds in pod specifications has two key limitations: +- Most application developers and cluster operators lack the detailed knowledge to set tolerationSeconds accurately, leading to guesswork and potential issues. Some clusters have had to implement validating webhooks to prevent indefinite tolerations. +- Taint-based "evictions" ultimately result in [pod deletion](https://github.com/kubernetes/kubernetes/blob/1b3d7d06c59c44de2775975a599130a2f89539fb/pkg/controller/tainteviction/taint_eviction.go#L146), but some workloads require stronger disruption handling guarantees. Cluster admins and developers have had to write custom webhooks to intercept these deletion requests. + + +### Goals +- Allow users to express the intention that the PDB should be honored instead of directly deleting the pod once the tolerationSeconds are up. This means that when a pod's toleration period is not specified and instead the new API is chosen, the system will first check the Pod Disruption Budget (PDB) constraints before deciding to evict the pod. If the PDB allows for the disruption, the pod can be safely evicted; otherwise, the pod will remain running to ensure application availability and stability. 
+- Reduce the need for custom webhooks and other complex solutions by offering built-in support for PDB-based tolerations, simplifying cluster management and reducing operational overhead. +- Enhance the ability to maintain high availability for critical workloads by ensuring that PDB constraints are respected during node maintenance, scaling activities, and other disruptive events. + + +### Non-Goals +- Changes to Pod Disruption Budget APIs. This proposal does not aim to modify the existing Pod Disruption Budget (PDB) APIs or their functionality. Instead, it focuses on leveraging the current PDB mechanisms to enhance toleration strategies. The goal is to integrate PDB checks into the toleration process without altering how PDBs are defined, managed, or enforced. This ensures backward compatibility and avoids introducing additional complexity to the existing PDB framework. +- While the proposal aims to improve the reliability and control of pod evictions, it does not specifically target performance optimization of the eviction process. + + +## Proposal +Historically, the tolerations API specified [tolerationSeconds](https://github.com/kubernetes/kubernetes/blob/1b3d7d06c59c44de2775975a599130a2f89539fb/pkg/apis/core/types.go#L3242-#L3247), which defines the duration for which a pod can tolerate a taint before it is deleted. +As part of this proposal, a new boolean field will be added to the tolerations API to indicate whether the eviction API should be used instead of direct deletion. + + +When this field is set, the NoExecute taint manager will use the eviction API instead of directly deleting the pod. The outcome is that when a pod is running a critical workload and the taint, while important enough to move workloads, is not important enough to disrupt them, the pod is removed only when its Pod Disruption Budget allows the disruption. This will be useful in cases where we want stronger guarantees around the availability of workloads. 
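The choice the proposal describes, evicting rather than deleting when the new field is set, can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the taint manager's actual code: `removalAction`, `Action`, and the trimmed-down `Toleration` struct are hypothetical names, and the `PDBBasedTaintTolerance` feature gate is modeled as a plain boolean.

```go
package main

import "fmt"

// Toleration mirrors only the fields relevant to this sketch.
// It is a simplified stand-in, not the real core API type.
type Toleration struct {
	Key               string
	TolerationSeconds *int64
	UseEvictionAPI    bool // field proposed by this KEP
}

// Action is a hypothetical name for how the taint manager removes a pod.
type Action string

const (
	ActionDelete Action = "delete" // current behavior: direct pod deletion
	ActionEvict  Action = "evict"  // proposed behavior: eviction API, subject to PDBs
)

// removalAction picks the removal path for a pod on a NoExecute-tainted node.
// gateEnabled stands in for the PDBBasedTaintTolerance feature gate.
func removalAction(t Toleration, gateEnabled bool) Action {
	if gateEnabled && t.UseEvictionAPI {
		return ActionEvict
	}
	return ActionDelete
}

func main() {
	// "example.com/maintenance" is an illustrative taint key.
	t := Toleration{Key: "example.com/maintenance", UseEvictionAPI: true}
	fmt.Println(removalAction(t, true))  // gate on, field set
	fmt.Println(removalAction(t, false)) // gate off, field ignored
}
```

With the gate off or the field unset, the sketch falls back to deletion, which matches the stated goal of keeping today's behavior as the default.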
+ + + + +### User Stories (Optional) + +#### Story 1 +As a cluster operator, I want to ensure that critical workloads are not disrupted during node maintenance. For example, in a deployment of Cassandra, which requires high availability and consistency, I can use the PDB-based toleration to configure my pods to respect PDB constraints. Previously, I had to disable the taint manager altogether because it used the delete API instead of the eviction API. + +#### Story 2 +As an application developer, I want to avoid writing custom webhooks to handle pod evictions. For example, in a deployment of Elasticsearch, which requires careful handling of node disruptions to maintain data availability and cluster health, I can use PDB-based toleration. This allows Kubernetes to manage pod evictions based on PDB constraints, simplifying my deployment process and reducing operational overhead. + + +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations +We are proposing a new field called `UseEvictionAPI` whose default value is false. In this mode, the taint manager will behave exactly as it does today. It's possible we introduce a bug in the implementation. The bug can cause: +- Pods not being deleted because eviction is triggered and there is a PDB associated, breaking the user expectation +- Pods being deleted even though `UseEvictionAPI` is set to true + +The mitigation currently is that this field will be disabled by default in alpha phase behind a feature gate for people to try out and give feedback. In beta phase when it is enabled by default, people will only see issues or bugs when `UseEvictionAPI` is set to true. Since people would have tried this feature in beta, we would have had time to fix any issues. + + +## Design Details +```go +// Toleration represents the toleration object that can be attached to a pod. +// The pod this Toleration is attached to tolerates any taint that matches +// the triple <key,value,effect> using the matching operator <operator>. 
+type Toleration struct { + // Key is the taint key that the toleration applies to. Empty means match all taint keys. + // If the key is empty, operator must be Exists; this combination means to match all values and all keys. + // +optional + Key string + // Operator represents a key's relationship to the value. + // Valid operators are Exists and Equal. Defaults to Equal. + // Exists is equivalent to wildcard for value, so that a pod can + // tolerate all taints of a particular category. + // +optional + Operator TolerationOperator + // Value is the taint value the toleration matches to. + // If the operator is Exists, the value should be empty, otherwise just a regular string. + // +optional + Value string + // Effect indicates the taint effect to match. Empty means match all taint effects. + // When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute. + // +optional + Effect TaintEffect + // TolerationSeconds represents the period of time the toleration (which must be + // of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default, + // it is not set, which means tolerate the taint forever (do not evict). Zero and + // negative values will be treated as 0 (evict immediately) by the system. + // +optional + TolerationSeconds *int64 + // UseEvictionAPI indicates that the taint manager should use the eviction API instead of directly deleting the pod. + // This field cannot be used in conjunction with tolerationSeconds, as the pod will continue to run until the matching + // Pod Disruption Budget (PDB) allows for its disruption. This ensures that critical workloads are not disrupted + // unnecessarily and that PDB constraints are respected. + // + // Example scenarios where UseEvictionAPI would be beneficial: + // 1. During node maintenance, where you want to ensure that critical pods are not evicted unless the PDB allows it. + // 2. 
When dealing with taints that are important but not critical enough to disrupt essential services immediately. + // + // +optional + UseEvictionAPI bool +} +``` + + + +### Test Plan +Unit, integration and E2E tests cover the existing taint-manager mechanics. Additionally, unit and integration tests will be added to cover the API validation and the behavioral change of the taint manager with the feature gate enabled and disabled. + + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +##### Integration tests + + + + + +##### e2e tests + + + +### Graduation Criteria +This will be added as an alpha field on the tolerations API with a backward-compatible default, and we will ensure that tolerationSeconds and this new field are mutually exclusive. After sufficient exposure this field would be promoted to beta, and then to GA in successive releases. The feature gate for this field will be `PDBBasedTaintTolerance`. + +#### Alpha +- Complete feature behind a feature gate +- Have proper unit and integration tests + +#### Beta +- Gather feedback from developers and end users +- Additional tests are in Testgrid and linked in KEP + +#### GA +Gather examples from users who are benefiting from this feature + + + +### Upgrade / Downgrade Strategy +- Upgrades: When upgrading from a release without this feature to a release with this feature, we will default `UseEvictionAPI` to false. This gives users the same behavior as before. +- Downgrades: When downgrading from a release with this feature to a release without this feature, there are two cases: + - If `UseEvictionAPI` is set to true, the taint manager will not honor this field, which is expected. + - If `UseEvictionAPI` is set to false, the user won't see any difference in behavior. 
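The rule stated in the Graduation Criteria, that tolerationSeconds and `UseEvictionAPI` are mutually exclusive, could be enforced at validation time roughly as sketched below. This is only an illustration under stated assumptions: `validateToleration`, `errMutuallyExclusive`, and the reduced `Toleration` struct are hypothetical and do not reflect the real validation code in kube-apiserver.

```go
package main

import (
	"errors"
	"fmt"
)

// Toleration is a reduced stand-in for the API type, keeping only
// the two fields that the mutual-exclusivity rule concerns.
type Toleration struct {
	TolerationSeconds *int64
	UseEvictionAPI    bool
}

// errMutuallyExclusive is a hypothetical error for this sketch.
var errMutuallyExclusive = errors.New(
	"useEvictionAPI cannot be set together with tolerationSeconds")

// validateToleration rejects tolerations that set both fields.
func validateToleration(t Toleration) error {
	if t.UseEvictionAPI && t.TolerationSeconds != nil {
		return errMutuallyExclusive
	}
	return nil
}

func main() {
	secs := int64(300)

	// Only the new field is set: accepted.
	fmt.Println(validateToleration(Toleration{UseEvictionAPI: true}))

	// Both fields are set: rejected.
	fmt.Println(validateToleration(Toleration{
		TolerationSeconds: &secs,
		UseEvictionAPI:    true,
	}))
}
```

A toleration that sets only tolerationSeconds, or neither field, passes unchanged, so existing pod specs remain valid.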
+ +We will ensure that `UseEvictionAPI` is properly validated before it is persisted. + + +### Version Skew Strategy + + +If the feature gate is enabled, the taint controller will honor PDB instead of directly deleting the pods. If the feature gate is not enabled, the taint controller will run as usual. This feature has no node runtime or network implications. + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: "PDBBasedTaintTolerance" + - Components depending on the feature gate: + - kube-controller-manager + - kube-apiserver + +###### Does enabling the feature change any default behavior? +No. The user still needs to set `UseEvictionAPI` in the tolerations spec. + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? +Yes. Using the feature gate is the only way to enable/disable this feature. + + +###### What happens if we reenable the feature if it was previously rolled back? +The taint manager starts honoring PDBs again. +###### Are there any tests for feature enablement/disablement? +Yes, unit and integration tests cover the feature both enabled and disabled. + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? +Already running workloads can be impacted if this feature is enabled. However, the user needs to explicitly set `useEvictionAPI` on the tolerations of the pod spec to use this feature. + + +###### What specific metrics should inform a rollback? +If the count of pods that are not getting deleted is increasing, we can start issuing a rollback. The `kube_pod_deletion_timestamp` metric also tracks when a pod was deleted, but it is experimental. + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? 
+No + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? +None + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? +We can add two new metrics and use them to see whether the feature is in use by workloads: +- `pdb_blocked_successful_evictions`: counts the number of successful evictions by the taint controller +- `pdb_blocked_failed_evictions`: counts the number of failed evictions by the taint controller + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [x] Events + - Event Reason: Pod eviction failed +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [x] Metrics + - Metric name: `pdb_blocked_successful_evictions` and `pdb_blocked_failed_evictions` + - [Optional] Aggregation method: + - Components exposing the metric: kube-controller-manager +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? +None for now, but this can be revisited. + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? +None. It is part of the taint controller in kube-controller-manager. + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + +Instead of a delete call, the taint controller will issue an eviction API call. +###### Will enabling / using this feature result in introducing new API types? +No + + +###### Will enabling / using this feature result in any new calls to the cloud provider? 
+ + +No +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + +Yes. API type(s): pod.Spec.tolerations +A new bool field, which is 1 byte. +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? +No, unless the operation covers both eviction and deletion instead of just deletion. + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? +No + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? +No + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? +The controller won't be able to make progress; all currently queued resources are re-queued. This feature doesn't change the current behavior of the controller in this regard. +###### What are other known failure modes? + - + + +###### What steps should be taken if SLOs are not being met to determine the problem? +- Check the `pdb_blocked_successful_evictions` and `pdb_blocked_failed_evictions` metrics and see whether they differ with and without the feature gate +- Verify that kube-apiserver is healthy +- Check whether the taint manager is enabled +## Implementation History + + + +## Drawbacks + + +This adds more complexity to the taint controller and can put strain on the PDB controller. It can also block cluster upgrades if the customer starts using this feature. +## Alternatives +Leave the feature as it is and allow third-party operators to implement it. This is the current situation. However, with native PDB support in the taint controller, third-party operators can focus on application-level APIs. 
+ + +## Infrastructure Needed (Optional) + + \ No newline at end of file diff --git a/keps/sig-apps/4987-taint-tolerance-pdb/kep.yaml b/keps/sig-apps/4987-taint-tolerance-pdb/kep.yaml new file mode 100644 index 00000000000..0b8d8e4f65d --- /dev/null +++ b/keps/sig-apps/4987-taint-tolerance-pdb/kep.yaml @@ -0,0 +1,36 @@ +title: Taint Tolerance Tied to Pod Disruption Budget +kep-number: 4987 +authors: + - "@ravisantoshgudimetla" +owning-sig: sig-apps +status: implementable +creation-date: 2024-12-10 +reviewers: + - "@soltysh" + - "@atosatto" +approvers: + - "@soltysh" +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.33" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.33" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: PDBBasedTaintTolerance + components: + - kube-apiserver + - kube-controller-manager +disable-supported: true + +# # The following PRR answers are required at beta release +# metrics: +# - my_feature_metric