Snapshottable API server cache
serathius committed Jan 16, 2025
1 parent 3fb4087 commit 9e39954
Showing 2 changed files with 407 additions and 0 deletions.
380 changes: 380 additions & 0 deletions keps/sig-api-machinery/4988-serve-pagination-from-cache/README.md
@@ -0,0 +1,380 @@
# KEP-4988 Serve pagination from cache

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Risks and Mitigations](#risks-and-mitigations)
- [Client setting limit while not supporting pagination](#client-setting-limit-while-not-supporting-pagination)
- [Memory overhead](#memory-overhead)
- [Pagination request hitting another apiserver](#pagination-request-hitting-another-apiserver)
- [Delegating slow pagination to etcd](#delegating-slow-pagination-to-etcd)
- [Increased watch contention](#increased-watch-contention)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
- [Integration tests](#integration-tests)
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [GA](#ga)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

The kube-apiserver's caching mechanism (watchcache) efficiently serves requests
for the latest observed state. However, `LIST` requests for previous states,
either via pagination or by specifying a `resourceVersion`, bypass the cache and
are served directly from etcd. This significantly increases the performance cost,
and in aggregate, can cause stability issues. This is especially pronounced when
dealing with large resources, as transferring large data blobs through multiple
systems can create significant memory pressure. This document proposes an
enhancement to the kube-apiserver's caching layer to enable efficiently serving all
`LIST` requests from the cache.

## Motivation

When the API server serves `LIST` requests directly from etcd, it introduces
significant stability and reliability concerns:

* **Unpredictable Memory Pressure:** Retrieving data from etcd and constructing
responses involves significant memory allocations on the API server.
The volume of data retrieved from etcd can vary drastically depending on
object sizes. This results in unpredictable memory pressure, making it difficult
to provision resources effectively and increasing the risk of Out-of-Memory (OOM) errors.
* **Ineffective API Priority and Fairness (APF) Throttling:** The API server's
overload protection mechanism, API Priority and Fairness (APF), primarily
throttles based on the *predicted cost* of a request, which is derived from
factors like latency and object count. While these factors provide some
indication of computational cost, they do not accurately reflect the memory
footprint. Crucially, we lack visibility into the per-request memory allocations.
Therefore, APF cannot effectively throttle requests based on actual memory usage,
leaving the API server vulnerable to memory exhaustion.

These issues with serving data directly from etcd lead to unpredictable and volatile API server memory usage.

Remarkably, the API server already maintains all the necessary data in the watchcache.
By enabling all `LIST` requests to be served from the watchcache, we can
significantly reduce memory pressure and improve the effectiveness of APF throttling,
leading to a more stable and reliable API server.

### Goals

- Reduce memory allocations by serving all types of `LIST` requests from the cache

### Non-Goals

- Change semantics of the `LIST` request
- Support indexing when serving all types of requests
- Enforce that no client requests are served from etcd

## Proposal

Leveraging the recent rewrite of the watchcache storage layer to use a B-tree
(https://github.com/kubernetes/kubernetes/pull/126754), we propose to utilize
B-tree snapshots to serve the remaining types of `LIST` requests.

While we propose a mechanism that can serve all types of requests, we
limit the enablement to pagination for now.

Mechanism:
1. **Snapshot Creation:** When a watch event is received, the cacher will create
   a snapshot of the B-tree based cache using the efficient [Clone()] method.
   This creates a lazy copy that duplicates only the necessary tree structure,
   resulting in minimal overhead. The watch cache already stores the history of
   watch events, so the B-tree will contain pointers to memory that is already
   in use, with no need for actual copies of the objects.
2. **Snapshot Storage:** Snapshots will be stored in a tree data structure
   keyed by resourceVersion. A tree allows efficient lookup of the nextSmaller
   element, which is needed because resourceVersions are not contiguous.
3. **Serving Subsequent Pages:** When a subsequent request with a continue token
   arrives, the API server will:
   - Extract the resourceVersion from the continue token.
   - Look up the nextSmaller snapshot and return a response based on it.
   - Handle two edge cases relating to the requested resourceVersion:
     - It's smaller than any available snapshot, meaning the snapshot was cleaned up (see below).
       In that case we fall back to serving from etcd.
     - It's larger than the latest snapshot, meaning it's a future resourceVersion or
       the watch cache is behind. In that case we can execute a consistent read from etcd
       to either confirm that it is a future resourceVersion or learn that we can wait
       for the watch cache to catch up.
4. **Snapshot Cleanup:** Snapshots will be subject to the same Time-To-Live (TTL)
   mechanism as watch events. We will reuse that process, which limits
   events to 10,000 and a 75s window (can be overridden by the request timeout).
   We also need to purge the snapshots during cache re-initialization.

[Clone()]: https://pkg.go.dev/github.com/google/btree#BTree.Clone
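The lookup in steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration and not the cacher's code: a sorted slice with binary search stands in for the tree, and `snapshot`, `snapshotStore`, `outcome`, and the treatment of an empty store are all hypothetical names and assumptions introduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// snapshot is a hypothetical stand-in for one cloned B-tree of cached objects.
type snapshot struct {
	resourceVersion uint64
}

// outcome says how a continuation request would be handled in this sketch.
type outcome string

const (
	serveFromSnapshot outcome = "serve from snapshot"
	fallBackToEtcd    outcome = "fall back to etcd"
	confirmWithEtcd   outcome = "consistent read from etcd"
)

// snapshotStore keeps snapshots sorted by ascending resourceVersion.
// A sorted slice is used here in place of the tree named in the proposal.
type snapshotStore struct {
	snapshots []snapshot
}

// add appends a snapshot. It assumes resourceVersions arrive in increasing
// order, as watch events do.
func (s *snapshotStore) add(rv uint64) {
	s.snapshots = append(s.snapshots, snapshot{resourceVersion: rv})
}

// lookup finds the snapshot with the largest resourceVersion <= rv
// (the "nextSmaller" element) and reports the two edge cases.
func (s *snapshotStore) lookup(rv uint64) (snapshot, outcome) {
	n := len(s.snapshots)
	if n == 0 {
		// Assumption: with nothing retained, delegate to etcd.
		return snapshot{}, fallBackToEtcd
	}
	if rv > s.snapshots[n-1].resourceVersion {
		// Future resourceVersion, or the watch cache is behind.
		return snapshot{}, confirmWithEtcd
	}
	// Index of the first snapshot strictly newer than rv.
	i := sort.Search(n, func(i int) bool {
		return s.snapshots[i].resourceVersion > rv
	})
	if i == 0 {
		// Older than every retained snapshot: it was cleaned up.
		return snapshot{}, fallBackToEtcd
	}
	return s.snapshots[i-1], serveFromSnapshot
}

func main() {
	store := &snapshotStore{}
	for _, rv := range []uint64{10, 20, 30} {
		store.add(rv)
	}
	for _, rv := range []uint64{5, 25, 30, 31} {
		snap, how := store.lookup(rv)
		fmt.Printf("requested=%d snapshot=%d outcome=%s\n", rv, snap.resourceVersion, how)
	}
}
```

In this sketch a request for resourceVersion 25 is answered from the snapshot taken at 20, because no event for this resource occurred between 20 and 25.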

### Risks and Mitigations

#### Client setting limit while not supporting pagination

#### Memory overhead

Snapshots should not add significant memory overhead. The B-tree only stores
pointers to the actual objects, not the objects themselves.
The objects are already cached to serve watch, so snapshots should only add a small
overhead for the B-tree structure itself, which is negligible compared to the
size of the cached objects.
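The pointer-sharing argument can be illustrated with a small sketch. It is a simplification under stated assumptions: a slice of pointers stands in for the B-tree, and `object` and `cloneIndex` are hypothetical names. The real [Clone()] is lazier still, since it shares tree nodes until one side modifies them.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a cached API object.
type object struct {
	name    string
	payload []byte
}

// cloneIndex copies only the pointers, never the objects they point to.
func cloneIndex(index []*object) []*object {
	out := make([]*object, len(index))
	copy(out, index)
	return out
}

func main() {
	large := &object{name: "pod-a", payload: make([]byte, 1<<20)}
	snapshotV1 := []*object{large}

	// A later snapshot adds one object but still shares "pod-a".
	snapshotV2 := cloneIndex(snapshotV1)
	snapshotV2 = append(snapshotV2, &object{name: "pod-b"})

	fmt.Println("shared object:", snapshotV1[0] == snapshotV2[0])
	fmt.Println("sizes:", len(snapshotV1), len(snapshotV2))
}
```

The 1 MiB payload exists once in memory no matter how many snapshots reference it; each snapshot only pays for its own index structure.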

#### Delegating slow pagination to etcd

To avoid breaking users, the proposal still allows pagination requests older than
75s to pass through to etcd. This can have a huge performance impact if the resource is
large. However, this still seems safer than:
* Increasing the watch cache size 4 times to match etcd.
* Blocking requests older than 75s.

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

- Ensure the pagination is well tested

##### Unit tests

- `k8s.io/apiserver/pkg/storage/cacher`: `2024-12-12` - `<test coverage>`

##### Integration tests

<!--
Integration tests are contained in k8s.io/kubernetes/test/integration.
Integration tests allow control of the configuration parameters used to start the binaries under test.
This is different from e2e tests which do not allow configuration of parameters.
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
-->

<!--
This question should be filled when targeting a release.
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
https://storage.googleapis.com/k8s-triage/index.html
-->

- <test>: <link to test coverage>

##### e2e tests

Given we're only modifying kube-apiserver, integration tests are sufficient.

### Graduation Criteria

#### Alpha

- Feature implemented behind a feature gate
- Feature is covered with unit and integration tests

#### Beta

- Feature is enabled by default

#### GA

TODO

### Upgrade / Downgrade Strategy

The feature is purely in-memory, so upgrade/downgrade doesn't require any
specific considerations.

### Version Skew Strategy

The feature touches only kube-apiserver, and no coordination between individual
instances is needed.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [X] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: PaginationFromCache
  - Components depending on the feature gate: kube-apiserver
- [ ] Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control
    plane?
  - Will enabling / disabling the feature require downtime or reprovisioning
    of a node?

###### Does enabling the feature change any default behavior?

Yes, paginated `LIST` requests to kube-apiserver will no longer require a request to etcd.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by disabling the feature gate in kube-apiserver.

###### What happens if we reenable the feature if it was previously rolled back?

The feature is purely in-memory, so it will behave as if it were enabled for the first time.

###### Are there any tests for feature enablement/disablement?

The feature is purely in-memory, so enablement/disablement tests will not provide
additional value on top of the feature tests themselves.

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?


###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

This is a control-plane feature, not a workload feature.

###### How can someone using this feature know that it is working for their instance?

This is a control-plane feature, not a workload feature.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

[API call latency SLO](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

[API call latency SLI](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

No

### Scalability

###### Will enabling / using this feature result in any new API calls?

No

###### Will enabling / using this feature result in introducing new API types?

No

###### Will enabling / using this feature result in any new calls to the cloud provider?

No

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No, we expect the [API call latency SLI](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md) to improve.


###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Overall, we expect the cost of serving pagination to go down. However, caching
might increase RAM usage if a client reads the first page but never
paginates. We expect that most controllers will read all pages.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

This is a kube-apiserver feature, so it does not work if kube-apiserver is unavailable.

###### What are other known failure modes?

None known.

###### What steps should be taken if SLOs are not being met to determine the problem?

Disable the feature gate.

## Implementation History

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

## Infrastructure Needed (Optional)

<!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
-->