diff --git a/keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md b/keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md
@@ -0,0 +1,380 @@
+# KEP-4988 Serve pagination from cache
+
+<!-- toc -->
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [Risks and Mitigations](#risks-and-mitigations)
+    - [Client setting limit while not supporting pagination](#client-setting-limit-while-not-supporting-pagination)
+    - [Memory overhead](#memory-overhead)
+    - [Pagination request hitting another apiserver](#pagination-request-hitting-another-apiserver)
+    - [Delegating slow pagination to etcd](#delegating-slow-pagination-to-etcd)
+    - [Increased watch contention](#increased-watch-contention)
+  - [Test Plan](#test-plan)
+      - [Prerequisite testing updates](#prerequisite-testing-updates)
+      - [Unit tests](#unit-tests)
+      - [Integration tests](#integration-tests)
+      - [e2e tests](#e2e-tests)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+<!-- /toc -->
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+The kube-apiserver's caching mechanism (watchcache) efficiently serves requests
+for the latest observed state. However, `LIST` requests for previous states,
+either via pagination or by specifying a `resourceVersion`, bypass the cache and
+are served directly from etcd. This significantly increases the performance cost,
+and in aggregate, can cause stability issues. This is especially pronounced when
+dealing with large resources, as transferring large data blobs through multiple
+systems can create significant memory pressure. This document proposes an
+enhancement to the kube-apiserver's caching layer that enables serving all
+`LIST` requests efficiently from the cache.
+
+## Motivation
+
+When the API server serves `LIST` requests directly from etcd, it introduces
+significant stability and reliability concerns:
+
+*   **Unpredictable Memory Pressure:** Retrieving data from etcd and constructing
+    responses involves significant memory allocations on the API server.
+    The volume of data retrieved from etcd can vary drastically depending on
+    object sizes. This results in unpredictable memory pressure, making it difficult
+    to provision resources effectively and increasing the risk of Out-of-Memory (OOM) errors.
+*   **Ineffective API Priority and Fairness (APF) Throttling:** The API server's
+    overload protection mechanism, API Priority and Fairness (APF), primarily
+    throttles based on the *predicted cost* of a request, which is derived from
+    factors like latency and object count. While these factors provide some
+    indication of computational cost, they do not accurately reflect the memory
+    footprint. Crucially, we lack visibility into the per-request memory allocations.
+    Therefore, APF cannot effectively throttle requests based on actual memory usage,
+    leaving the API server vulnerable to memory exhaustion.
+
+These issues with serving data directly from etcd lead to unpredictable and volatile API server memory usage.
+
+Remarkably, the API server already maintains all the necessary data in the watchcache.
+By enabling all `LIST` requests to be served from the watchcache, we can
+significantly reduce memory pressure and improve the effectiveness of APF throttling,
+leading to a more stable and reliable API server.
+
+### Goals
+
+- Reduce memory allocations by supporting all types of LIST requests from cache
+
+### Non-Goals
+
+- Change semantics of the `LIST` request
+- Support indexing when serving all types of requests
+- Enforce that no client requests are served from etcd
+
+## Proposal
+
+Leveraging the recent rewrite of the watchcache storage layer to use a B-tree
+(https://github.com/kubernetes/kubernetes/pull/126754), we propose to use
+B-tree snapshots to serve the remaining types of `LIST` requests.
+
+While the proposed mechanism can serve all types of requests, we limit
+enablement to pagination for now.
+
+Mechanism:
+1. **Snapshot Creation:** When a watch event is received, the cacher will create
+   a snapshot of the B-tree based cache using the efficient [Clone()] method.
+   This creates a lazy copy that duplicates only the necessary tree structure,
+   resulting in minimal overhead. The watch cache already stores the history of
+   watch events, so the B-tree will contain pointers to memory that is already
+   in use, and no actual copies of the objects are needed.
+2. **Snapshot Storage:** The snapshot will be stored in a tree data structure,
+   keyed by resourceVersion. A tree allows efficient lookup of the next smaller
+   element, which is needed because resourceVersions are not contiguous.
+3. **Serving Subsequent Pages:** When a subsequent request with a continue token
+   arrives, the API server will:
+   - Extract the resourceVersion from the continue token.
+   - Look up the next smaller snapshot and return a response based on it.
+   - There are two edge cases relating to the requested resourceVersion:
+     - It is smaller than any available snapshot, meaning the snapshot was
+       cleaned up (see Snapshot Cleanup below). In that case we fall back to
+       serving from etcd.
+     - It is larger than the latest snapshot, meaning it is a future
+       resourceVersion or the watch cache is behind. In that case we can execute
+       a consistent read from etcd to either confirm that the resourceVersion is
+       in the future or learn that we can wait for the watch cache to catch up.
+4. **Snapshot Cleanup:** Snapshots will be subject to the same Time-To-Live (TTL)
+   mechanism as watch events. We will reuse that process, which limits
+   events to 10,000 and a 75s window (can be overridden by the request timeout).
+   We also need to remember to purge the snapshots during cache re-initialization.
+
+[Clone()]: https://pkg.go.dev/github.com/google/btree#BTree.Clone
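
The storage, lookup, and cleanup steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the `snapshot`, `snapshotStore`, and `lookupResult` names are hypothetical, a sorted slice stands in for the tree keyed by resourceVersion, and a plain `[]string` stands in for the cloned B-tree. An empty store is treated like the "too old" case so that the caller falls back to etcd.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// lookupResult tells the caller how to serve a paginated request.
type lookupResult int

const (
	served    lookupResult = iota // a usable snapshot was found
	tooOld                        // older than every snapshot: fall back to etcd
	tooRecent                     // newer than the latest snapshot: confirm with etcd or wait
)

// snapshot stands in for one cloned B-tree (hypothetical type).
type snapshot struct {
	rv      uint64    // resourceVersion of the event that produced it
	created time.Time // used by TTL cleanup
	keys    []string  // placeholder for the cloned tree contents
}

// snapshotStore keeps snapshots sorted by ascending resourceVersion.
type snapshotStore struct {
	items []snapshot
}

// add records a snapshot; events are assumed to arrive in increasing rv order.
func (s *snapshotStore) add(rv uint64, now time.Time, keys []string) {
	s.items = append(s.items, snapshot{rv: rv, created: now, keys: keys})
}

// lookup finds the snapshot with the largest rv that is <= the requested rv.
func (s *snapshotStore) lookup(rv uint64) (snapshot, lookupResult) {
	n := len(s.items)
	if n == 0 || rv < s.items[0].rv {
		return snapshot{}, tooOld
	}
	if rv > s.items[n-1].rv {
		return snapshot{}, tooRecent
	}
	// i is the index of the first snapshot with rv strictly above the request.
	i := sort.Search(n, func(i int) bool { return s.items[i].rv > rv })
	return s.items[i-1], served
}

// evict drops snapshots older than ttl, then trims to at most maxCount.
func (s *snapshotStore) evict(now time.Time, ttl time.Duration, maxCount int) {
	drop := 0
	for drop < len(s.items) && now.Sub(s.items[drop].created) > ttl {
		drop++
	}
	if excess := len(s.items) - drop - maxCount; excess > 0 {
		drop += excess
	}
	s.items = s.items[drop:]
}

// demo runs the three lookup outcomes and one cleanup pass.
func demo() []string {
	start := time.Unix(0, 0)
	s := &snapshotStore{}
	s.add(10, start, []string{"a"})
	s.add(15, start.Add(30*time.Second), []string{"a", "b"})
	s.add(22, start.Add(60*time.Second), []string{"b"})

	var out []string
	report := func(rv uint64) {
		snap, res := s.lookup(rv)
		out = append(out, fmt.Sprintf("rv=%d result=%d snapshot=%d", rv, res, snap.rv))
	}
	report(17) // between snapshots: served from the snapshot at rv 15
	report(5)  // older than every snapshot
	report(30) // newer than the latest snapshot
	s.evict(start.Add(80*time.Second), 75*time.Second, 10000)
	report(12) // the rv 10 snapshot expired, so rv 12 is now too old
	return out
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

A request at resourceVersion 17 is served from the snapshot taken at 15 because no event for this resource occurred in between, which is why a "largest key not above the request" lookup is sufficient even though resourceVersions are not contiguous.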
+
+### Risks and Mitigations
+
+#### Client setting limit while not supporting pagination
+
+#### Memory overhead
+
+We do not expect significant memory overhead. The B-tree stores only pointers
+to the actual objects, not the objects themselves. The objects are already
+cached to serve watches, so snapshots should add only a small overhead for the
+B-tree structure itself, which is negligible compared to the size of the
+cached objects.
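
The pointer-sharing argument can be illustrated with a small sketch. It is a simplification under stated assumptions: the `object` type and both helper functions are hypothetical, and a slice of pointers stands in for the B-tree, so the copy here is eager rather than the lazy structural copy a tree clone would perform. The point it shows is only that a snapshot references the same objects as the live view instead of duplicating them.

```go
package main

import "fmt"

// object stands in for a cached API object (hypothetical type).
type object struct {
	name    string
	payload []byte
}

// snapshotOf copies only the pointers, mimicking how a tree snapshot
// references objects instead of duplicating them.
func snapshotOf(live []*object) []*object {
	snap := make([]*object, len(live))
	copy(snap, live)
	return snap
}

// sharedCount reports how many entries in snap point at the very same
// object as the corresponding entry in live.
func sharedCount(live, snap []*object) int {
	n := 0
	for i := range snap {
		if i < len(live) && snap[i] == live[i] {
			n++
		}
	}
	return n
}

func main() {
	live := []*object{
		{name: "pod-a", payload: make([]byte, 1<<20)},
		{name: "pod-b", payload: make([]byte, 1<<20)},
	}
	snap := snapshotOf(live)
	fmt.Println(sharedCount(live, snap)) // both entries are shared

	// An update replaces the pointer in the live view only. The snapshot keeps
	// referencing the old object, which the watch history retains anyway.
	live[0] = &object{name: "pod-a", payload: make([]byte, 1<<20)}
	fmt.Println(sharedCount(live, snap)) // only the untouched entry is shared
}
```

Each snapshot therefore costs roughly one pointer per entry it does not share structurally with its neighbors, independent of how large the referenced objects are.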
+
+#### Delegating slow pagination to etcd
+
+To avoid breaking users, the proposal still allows pagination requests older than
+75s to pass through to etcd. This can have a large performance impact if the
+resource is large. However, this still seems safer than:
+* Increasing the watch cache size 4 times to match etcd.
+* Blocking requests older than 75s.
+
+### Test Plan
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+- Ensure the pagination is well tested
+
+##### Unit tests
+
+- `k8s.io/apiserver/pkg/storage/cacher`: `2024-12-12` - `<test coverage>`
+
+##### Integration tests
+
+<!--
+Integration tests are contained in k8s.io/kubernetes/test/integration.
+Integration tests allow control of the configuration parameters used to start the binaries under test.
+This is different from e2e tests which do not allow configuration of parameters.
+Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
+-->
+
+<!--
+This question should be filled when targeting a release.
+For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
+
+For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
+https://storage.googleapis.com/k8s-triage/index.html
+-->
+
+- <test>: <link to test coverage>
+
+##### e2e tests
+
+Given we're only modifying kube-apiserver, integration tests are sufficient.
+
+### Graduation Criteria
+
+#### Alpha
+
+- Feature implemented behind a feature gate
+- Feature is covered with unit and integration tests
+
+#### Beta
+
+- Feature is enabled by default
+
+#### GA
+
+TODO
+
+### Upgrade / Downgrade Strategy
+
+The feature is purely in-memory, so upgrade/downgrade doesn't require any
+specific considerations.
+
+### Version Skew Strategy
+
+The feature touches only kube-apiserver, and coordination between individual
+instances is not needed.
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [X] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: PaginationFromCache
+  - Components depending on the feature gate: kube-apiserver
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node?
+
+###### Does enabling the feature change any default behavior?
+
+Yes, paginated `LIST` requests to kube-apiserver will no longer require a request to etcd.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes, by disabling the feature gate in kube-apiserver.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+The feature is purely in-memory, so it will work the same as when enabled for the first time.
+
+###### Are there any tests for feature enablement/disablement?
+
+The feature is purely in-memory, so tests for feature enablement/disablement would
+not provide additional value on top of the feature tests themselves.
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+
+###### What specific metrics should inform a rollback?
+
+<!--
+What signals should users be paying attention to when the feature is young
+that might indicate a serious problem?
+-->
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+<!--
+Describe manual testing that was done and the outcomes.
+Longer term, we may want to require automated upgrade/rollback tests, but we
+are missing a bunch of machinery and tooling and can't do that now.
+-->
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+No
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+This is a control-plane feature, not a workload feature.
+
+###### How can someone using this feature know that it is working for their instance?
+
+This is a control-plane feature, not a workload feature.
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+[API call latency SLO](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+[API call latency SLI](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+No
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+No
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+No
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+No
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+No, we expect the [API call latency SLI](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md) to improve.
+
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+Overall, we expect the cost of serving pagination to go down. However, caching
+might increase RAM usage if a client reads the first page but never
+paginates further. We expect that most controllers will read all pages.
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+This is a kube-apiserver feature; it does not work if kube-apiserver is unavailable.
+
+###### What are other known failure modes?
+
+No
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+Disable the feature gate.
+
+## Implementation History
+
+## Drawbacks
+
+<!--
+Why should this KEP _not_ be implemented?
+-->
+
+## Alternatives
+
+<!--
+What other approaches did you consider, and why did you rule them out? These do
+not need to be as detailed as the proposal, but should include enough
+information to express the idea and why it was not acceptable.
+-->
+
+## Infrastructure Needed (Optional)
+
+<!--
+Use this section if you need things from the project/SIG. Examples include a
+new subproject, repos requested, or GitHub details. Listing these here allows a
+SIG to get the process for these resources started right away.
+-->