Commit 9e39954

Snapshottable API server cache
1 parent 3fb4087 commit 9e39954

2 files changed

+407
-0
lines changed

Lines changed: 380 additions & 0 deletions
@@ -0,0 +1,380 @@
1+
# KEP-4988 Serve pagination from cache
2+
<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [Client setting limit while not supporting pagination](#client-setting-limit-while-not-supporting-pagination)
    - [Memory overhead](#memory-overhead)
    - [Pagination request hitting another apiserver](#pagination-request-hitting-another-apiserver)
    - [Delegating slow pagination to etcd](#delegating-slow-pagination-to-etcd)
    - [Increased watch contention](#increased-watch-contention)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
      - [Unit tests](#unit-tests)
      - [Integration tests](#integration-tests)
      - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
    - [GA](#ga)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->
39+
40+
## Release Signoff Checklist
41+
42+
Items marked with (R) are required *prior to targeting to a milestone / release*.
43+
44+
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
45+
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
46+
- [ ] (R) Design details are appropriately documented
47+
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
48+
- [ ] e2e Tests for all Beta API Operations (endpoints)
49+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
50+
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
51+
- [ ] (R) Graduation criteria is in place
52+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
53+
- [ ] (R) Production readiness review completed
54+
- [ ] (R) Production readiness review approved
55+
- [ ] "Implementation History" section is up-to-date for milestone
56+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
57+
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
58+
59+
[kubernetes.io]: https://kubernetes.io/
60+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
61+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
62+
[kubernetes/website]: https://git.k8s.io/website
63+
64+
## Summary
65+
The kube-apiserver's caching mechanism (watchcache) efficiently serves requests
for the latest observed state. However, `LIST` requests for previous states,
either via pagination or by specifying a `resourceVersion`, bypass the cache and
are served directly from etcd. This significantly increases the performance cost
and, in aggregate, can cause stability issues. This is especially pronounced when
dealing with large resources, as transferring large data blobs through multiple
systems can create significant memory pressure. This document proposes an
enhancement to the kube-apiserver's caching layer that enables serving all
`LIST` requests efficiently from the cache.
75+
76+
## Motivation
77+
When the API server serves `LIST` requests directly from etcd, it introduces
significant stability and reliability concerns:
80+
81+
* **Unpredictable Memory Pressure:** Retrieving data from etcd and constructing
82+
responses involves significant memory allocations on the API server.
83+
The volume of data retrieved from etcd can vary drastically depending on
84+
object sizes. This results in unpredictable memory pressure, making it difficult
85+
to provision resources effectively and increasing the risk of Out-of-Memory (OOM) errors.
86+
* **Ineffective API Priority and Fairness (APF) Throttling:** The API server's
87+
overload protection mechanism, API Priority and Fairness (APF), primarily
88+
throttles based on the *predicted cost* of a request, which is derived from
89+
factors like latency and object count. While these factors provide some
90+
indication of computational cost, they do not accurately reflect the memory
91+
footprint. Crucially, we lack visibility into the per-request memory allocations.
92+
Therefore, APF cannot effectively throttle requests based on actual memory usage,
93+
leaving the API server vulnerable to memory exhaustion.
94+
95+
These issues with serving data directly from etcd lead to unpredictable and volatile API server memory usage.
96+
97+
Remarkably, the API server already maintains all the necessary data in the watchcache.
98+
By enabling all `LIST` requests to be served from the watchcache, we can
99+
significantly reduce memory pressure and improve the effectiveness of APF throttling,
100+
leading to a more stable and reliable API server.
101+
102+
### Goals
103+
- Reduce memory allocations by serving all types of `LIST` requests from the cache
105+
106+
### Non-Goals
107+
- Change semantics of the `LIST` request
- Support indexing when serving all types of requests
- Enforce that no client requests are served from etcd
111+
112+
## Proposal
113+
Leveraging the recent rewrite of the watchcache storage layer to use a B-tree
(https://github.com/kubernetes/kubernetes/pull/126754), we propose to use
B-tree snapshots to serve the remaining types of `LIST` requests.

While the proposed mechanism can serve all types of requests, we
limit enablement to pagination for now.
120+
Mechanism:
1. **Snapshot Creation:** When a watch event is received, the cacher will create
   a snapshot of the B-tree based cache using the efficient [Clone()] method.
   This creates a lazy copy that duplicates only the necessary tree structure, resulting in
   minimal overhead. The watch cache already stores the history of watch events, so
   the B-tree will contain pointers to memory that is already in use, with no need for actual copies.
2. **Snapshot Storage:** The snapshots will be stored in a tree data structure
   keyed by resourceVersion. A tree allows efficient lookup of the nextSmaller element,
   which is needed because resourceVersions are not contiguous.
3. **Serving Subsequent Pages:** When a subsequent request with a continue token
   arrives, the API server will:
   - Extract the resourceVersion from the continue token.
   - Look up the nextSmaller snapshot and return a response based on it.
   - Handle two edge cases relating to the requested resourceVersion:
     - It's smaller than any available snapshot, meaning the snapshot was cleaned up (see Snapshot Cleanup below).
       In that case we fall back to serving from etcd.
     - It's larger than the latest snapshot, meaning it's a future resourceVersion or
       the watch cache is behind. In that case we can execute a consistent read from etcd
       to either confirm that it is a future resourceVersion or learn that we can wait for the watch cache to catch up.
4. **Snapshot Cleanup:** Snapshots will be subject to the same Time-To-Live (TTL)
   mechanism as watch events. We will reuse the existing process, which limits
   events to 10,000 and a 75s window (can be overridden by the request timeout).
   We also need to purge the snapshots during cache re-initialization.
144+
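Extracting the resourceVersion from a continue token can be sketched as follows. This is a minimal sketch that assumes the token is unpadded URL-safe base64 over a JSON object with `v`, `rv` and `start` fields. That layout and the helper names `encode_continue` and `decode_continue` are assumptions made for illustration and are not specified by this KEP. Python is used for brevity; kube-apiserver itself is written in Go.

```python
import base64
import json


def encode_continue(resource_version: int, start_key: str) -> str:
    """Build a token in the assumed layout (used here only to feed the decoder)."""
    payload = json.dumps(
        {"v": "meta.k8s.io/v1", "rv": resource_version, "start": start_key}
    )
    # Assumed encoding: URL-safe base64 with the padding stripped.
    return base64.urlsafe_b64encode(payload.encode()).decode().rstrip("=")


def decode_continue(token: str) -> tuple[int, str]:
    """Return (resourceVersion, start key) from a token in the assumed layout."""
    padded = token + "=" * (-len(token) % 4)  # restore the stripped padding
    data = json.loads(base64.urlsafe_b64decode(padded))
    return int(data["rv"]), data["start"]
```

The decoded resourceVersion is what selects the snapshot to serve from; the start key tells the server where in that snapshot the next page begins.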
145+
[Clone()]: https://pkg.go.dev/github.com/google/btree#BTree.Clone
146+
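The storage, lookup and cleanup steps of the mechanism above can be sketched as a small store keyed by resourceVersion. This is a simplified illustration, not the proposed implementation: the class `SnapshotStore`, its method names and the returned labels (`"cache"`, `"etcd"`, `"consistent_read"`) are hypothetical, a sorted list with binary search stands in for the tree, and nextSmaller is read here as "largest resourceVersion not greater than the requested one". Only the 75s window and the 10,000 limit come from the proposal.

```python
import bisect
from dataclasses import dataclass, field
from typing import Any

EVENT_FRESH_SECONDS = 75.0  # window named in the proposal
MAX_SNAPSHOTS = 10_000      # event limit named in the proposal


@dataclass
class SnapshotStore:
    """Hypothetical store of cache snapshots keyed by resourceVersion."""

    rvs: list[int] = field(default_factory=list)  # sorted resourceVersions
    # resourceVersion -> (creation time, snapshot)
    entries: dict[int, tuple[float, Any]] = field(default_factory=dict)

    def add(self, rv: int, snapshot: Any, now: float) -> None:
        # Watch events arrive in increasing resourceVersion order.
        if self.rvs and rv <= self.rvs[-1]:
            raise ValueError("resourceVersions must be added in increasing order")
        self.rvs.append(rv)
        self.entries[rv] = (now, snapshot)

    def lookup(self, rv: int) -> tuple[str, Any]:
        if not self.rvs or rv < self.rvs[0]:
            return "etcd", None  # older than any snapshot: already cleaned up
        if rv > self.rvs[-1]:
            # Future resourceVersion or the cache is behind: the caller
            # resolves this with a consistent read from etcd.
            return "consistent_read", None
        idx = bisect.bisect_right(self.rvs, rv) - 1  # largest stored rv <= rv
        return "cache", self.entries[self.rvs[idx]][1]

    def cleanup(self, now: float) -> None:
        # Drop the oldest snapshots while over capacity or past the window.
        while self.rvs and (
            len(self.rvs) > MAX_SNAPSHOTS
            or now - self.entries[self.rvs[0]][0] > EVENT_FRESH_SECONDS
        ):
            del self.entries[self.rvs.pop(0)]

    def reset(self) -> None:
        # Purge everything, as on cache re-initialization.
        self.rvs.clear()
        self.entries.clear()
```

Because resourceVersions are not contiguous, a request for resourceVersion 17 against snapshots at 10, 15 and 22 is answered from the snapshot at 15, which is the state the client observed when it received the previous page.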
147+
### Risks and Mitigations
148+
149+
#### Client setting limit while not supporting pagination
150+
151+
#### Memory overhead
152+
The overhead should be small: the B-tree stores only pointers to the actual objects, not the objects themselves.
The objects are already cached to serve watch, so snapshots should only add a small
overhead for the B-tree structure itself, which is negligible compared to the
size of the cached objects.
157+
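The pointer-sharing argument can be illustrated with a small persistent tree. This sketch uses a path-copying binary search tree rather than the copy-on-write B-tree from `github.com/google/btree`, so it shows the idea and not the actual implementation. `Node`, `put` and `get` are hypothetical names, and the dictionaries stand in for cached objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: str
    obj: object  # reference to the cached object, never a copy
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def put(root: Optional[Node], key: str, obj: object) -> Node:
    """Return a new root; only the nodes on the path to `key` are recreated."""
    if root is None:
        return Node(key, obj)
    if key < root.key:
        return Node(root.key, root.obj, put(root.left, key, obj), root.right)
    if key > root.key:
        return Node(root.key, root.obj, root.left, put(root.right, key, obj))
    return Node(key, obj, root.left, root.right)


def get(root: Optional[Node], key: str) -> object:
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return None if root is None else root.obj


pod_a, pod_b, pod_c = {"name": "a"}, {"name": "b"}, {"name": "c"}
v1 = put(put(put(None, "b", pod_b), "a", pod_a), "c", pod_c)
snapshot = v1  # taking a snapshot is just keeping the old root
pod_c_updated = {"name": "c", "updated": True}
v2 = put(v1, "c", pod_c_updated)  # a later watch event updates "c"
```

After the update, `snapshot` still resolves `"c"` to the original object, while `v2` shares the untouched subtree and every unchanged object with it. The extra memory per snapshot is limited to the recreated tree nodes.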
158+
#### Delegating slow pagination to etcd
159+
To avoid breaking users, the proposal still allows pagination requests older than
75s to pass through to etcd. This can have a huge performance impact if the resource is
large. However, this still seems safer than:
* Increasing the watch cache size 4 times to match etcd.
* Blocking requests older than 75s.
165+
166+
### Test Plan
167+
168+
[x] I/we understand the owners of the involved components may require updates to
169+
existing tests to make this code solid enough prior to committing the changes necessary
170+
to implement this enhancement.
171+
172+
##### Prerequisite testing updates
173+
174+
- Ensure the pagination is well tested
175+
176+
##### Unit tests
177+
178+
- `k8s/apiserver/pkg/storage/cache`: `2024-12-12` - `<test coverage>`
179+
180+
##### Integration tests
181+
182+
<!--
183+
Integration tests are contained in k8s.io/kubernetes/test/integration.
184+
Integration tests allow control of the configuration parameters used to start the binaries under test.
185+
This is different from e2e tests which do not allow configuration of parameters.
186+
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
187+
-->
188+
189+
<!--
190+
This question should be filled when targeting a release.
191+
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
192+
193+
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
194+
https://storage.googleapis.com/k8s-triage/index.html
195+
-->
196+
197+
- <test>: <link to test coverage>
198+
199+
##### e2e tests
200+
201+
Given we're only modifying kube-apiserver, integration tests are sufficient.
202+
203+
### Graduation Criteria
204+
205+
#### Alpha
206+
207+
- Feature implemented behind a feature gate
208+
- Feature is covered with unit and integration tests
209+
210+
#### Beta
211+
212+
- Feature is enabled by default
213+
214+
#### GA
215+
216+
TODO
217+
218+
### Upgrade / Downgrade Strategy
219+
The feature is purely in-memory, so upgrade/downgrade doesn't require any
specific considerations.
222+
223+
### Version Skew Strategy
224+
The feature touches only kube-apiserver, and coordination between individual
instances is not needed.
227+
228+
## Production Readiness Review Questionnaire
229+
230+
### Feature Enablement and Rollback
231+
232+
###### How can this feature be enabled / disabled in a live cluster?
233+
- [X] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: PaginationFromCache
  - Components depending on the feature gate: kube-apiserver
- [ ] Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control
    plane?
  - Will enabling / disabling the feature require downtime or reprovisioning
    of a node?
243+
244+
###### Does enabling the feature change any default behavior?
245+
Yes, paginated `LIST` requests to kube-apiserver will no longer require a request to etcd.
247+
248+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
249+
250+
Yes, via disabling the feature-gate in kube-apiserver.
251+
252+
###### What happens if we reenable the feature if it was previously rolled back?
253+
The feature is purely in-memory, so it will work just as if it were enabled for the first time.
255+
256+
###### Are there any tests for feature enablement/disablement?
257+
The feature is purely in-memory, so enablement/disablement tests would not provide
additional value on top of the feature tests themselves.
260+
261+
### Rollout, Upgrade and Rollback Planning
262+
263+
###### How can a rollout or rollback fail? Can it impact already running workloads?
264+
265+
266+
###### What specific metrics should inform a rollback?
267+
268+
<!--
269+
What signals should users be paying attention to when the feature is young
270+
that might indicate a serious problem?
271+
-->
272+
273+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
274+
275+
<!--
276+
Describe manual testing that was done and the outcomes.
277+
Longer term, we may want to require automated upgrade/rollback tests, but we
278+
are missing a bunch of machinery and tooling and can't do that now.
279+
-->
280+
281+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
282+
No
284+
285+
### Monitoring Requirements
286+
287+
###### How can an operator determine if the feature is in use by workloads?
288+
This is a control-plane feature, not a workload feature.
290+
291+
###### How can someone using this feature know that it is working for their instance?
292+
This is a control-plane feature, not a workload feature.
294+
295+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
296+
297+
[API call latency SLO](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)
298+
299+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
300+
301+
[API call latency SLI](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)
302+
303+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
304+
305+
### Dependencies
306+
307+
###### Does this feature depend on any specific services running in the cluster?
308+
309+
No
310+
311+
### Scalability
312+
313+
###### Will enabling / using this feature result in any new API calls?
314+
315+
No
316+
317+
###### Will enabling / using this feature result in introducing new API types?
318+
319+
No
320+
321+
###### Will enabling / using this feature result in any new calls to the cloud provider?
322+
323+
No
324+
325+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
326+
327+
No
328+
329+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
330+
331+
No, we expect the [API call latency SLI](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md) to improve.
332+
333+
334+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
335+
Overall, we expect the cost of serving pagination to go down. However, caching
might increase RAM usage if a client reads the first page but never
paginates further. We expect that most controllers will read all pages.
339+
340+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
341+
342+
No
343+
344+
### Troubleshooting
345+
346+
###### How does this feature react if the API server and/or etcd is unavailable?
347+
This is a kube-apiserver feature, so it does not work if kube-apiserver is unavailable.
349+
350+
###### What are other known failure modes?
351+
352+
No
353+
354+
###### What steps should be taken if SLOs are not being met to determine the problem?
355+
Disable the feature gate.
357+
358+
## Implementation History
359+
360+
## Drawbacks
361+
362+
<!--
363+
Why should this KEP _not_ be implemented?
364+
-->
365+
366+
## Alternatives
367+
368+
<!--
369+
What other approaches did you consider, and why did you rule them out? These do
370+
not need to be as detailed as the proposal, but should include enough
371+
information to express the idea and why it was not acceptable.
372+
-->
373+
374+
## Infrastructure Needed (Optional)
375+
376+
<!--
377+
Use this section if you need things from the project/SIG. Examples include a
378+
new subproject, repos requested, or GitHub details. Listing these here allows a
379+
SIG to get the process for these resources started right away.
380+
-->
