We recently merged the initial iteration of the in-memory provider (#8799), but this was just the first step of the scale test implementation. This issue provides an overview of ongoing and upcoming tasks around scale testing.
In-Memory provider features:
- High-level:
  - P0 ClusterClass support (🌱 add ClusterClass support for in-memory provider #8807 @ykakarap)
  - P0 Deletion (🐛 fix cluster deletion in the in-memory API server #8818 @fabriziopandini)
  - Upgrade (@killianmuldoon)
  - KCP kube-proxy and CoreDNS reconciliation (🌱 CAPIM: Enable update for coreDNS and kube-proxy #8899 @killianmuldoon)
- Make it behave like a real infra provider:
  - P0 Provisioning duration (🌱 Add startup timeout to the in memory provider #8831 @fabriziopandini)
  - Errors
  - Configurable apiserver/etcd latency (see the sketch after this list)
- Low-level:
  - P0 apiserver: watches (🌱 Add watch to in-memory server multiplexer #8851 @killianmuldoon)
  - apiserver: label selector for list calls
    - Not a problem for resources cached in KCP: label selectors for cached resources are evaluated client-side in CR.
  - apiserver: improve field selector handling; return an error if the field selector is not supported (✨ Enable Kubernetes upgrades in CAPIM #8938 (comment) @killianmuldoon)
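For the configurable apiserver/etcd latency item above, here is a minimal sketch of how artificial latency could be injected into the in-memory server mux, assuming a simple HTTP middleware; the type and function names are hypothetical, not the actual CAPIM code:

```go
// Illustration only: artificial latency middleware for the in-memory API server.
package inmemoryserver

import (
	"math/rand"
	"net/http"
	"time"
)

// LatencyConfig holds the artificial latency range applied to every request.
type LatencyConfig struct {
	Min time.Duration
	Max time.Duration
}

// WithLatency wraps next and sleeps for a random duration in [Min, Max]
// before serving the request, to simulate a slower apiserver/etcd.
func WithLatency(cfg LatencyConfig, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if cfg.Max > 0 {
			delay := cfg.Min
			if cfg.Max > cfg.Min {
				delay += time.Duration(rand.Int63n(int64(cfg.Max - cfg.Min)))
			}
			time.Sleep(delay)
		}
		next.ServeHTTP(w, r)
	})
}
```

The in-memory apiserver handler would then be wrapped once at startup, e.g. `srv.Handler = WithLatency(cfg, srv.Handler)`, with the latency range coming from flags or provider configuration.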
e2e test and test framework:
- Implement scale test automation:
  - Cluster topologies:
    - Small workload cluster: x * (1 control plane node + 1 worker node)
    - Small-medium workload cluster: x * (3 control plane nodes + 10 worker nodes)
    - Medium workload cluster: x * (3 control plane nodes + 50 worker nodes)
    - Large workload cluster: x * (3 control plane nodes + 500 worker nodes)
    - Dimensions: # of MachineDeployments
  - Scenarios:
    - P0 Create & delete (🌱 Add Scale e2e - development only #8833 @ykakarap)
    - Create, upgrade & delete (@killianmuldoon)
    - Long-lived clusters (~ a few hours or a day, to catch memory leaks etc.)
    - Chaos testing: e.g. injecting failures like cluster not reachable, machine failures
    - More complex scenarios: e.g. the topology is actively changed (MD scale up etc.)
- Add MachineHealthCheck to the scaling test (@ykakarap)
- Automate scale testing in CI (prior art: KCP, k/k):
  - Metric collection and consumption after test completion
  - Tests should fail based on SLAs, e.g. machine creation slower than x minutes (see the sketch after this list)
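For the SLA-based test failures mentioned above, a rough sketch of what such a check could look like with Gomega; the helper name, the polling interval and the way Machines are listed are assumptions, not the actual e2e framework API:

```go
// Illustration only: fail the scale test if Machines don't reach Running within the SLA.
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// expectMachinesRunningWithin fails the test if not all Machines of the given
// Cluster reach the Running phase within the SLA duration.
func expectMachinesRunningWithin(ctx context.Context, c client.Client, namespace, clusterName string, sla time.Duration) {
	Eventually(func(g Gomega) {
		machines := &clusterv1.MachineList{}
		g.Expect(c.List(ctx, machines,
			client.InNamespace(namespace),
			client.MatchingLabels{clusterv1.ClusterNameLabel: clusterName},
		)).To(Succeed())
		g.Expect(machines.Items).ToNot(BeEmpty())
		for _, m := range machines.Items {
			g.Expect(m.Status.Phase).To(Equal(string(clusterv1.MachinePhaseRunning)),
				"machine %s is not Running yet", m.Name)
		}
	}, sla, 10*time.Second).Should(Succeed(), "SLA violated: not all machines Running within %s", sla)
}
```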
Metrics & observability:
- P0 Cluster API state metrics & dashboard (🌱 hack/observability: Add Grafana state dashboard, improve metrics #8834 @sbueringer)
- In-memory provider metrics & dashboard:
  - apiserver & etcd: server-side request metrics (prior art: kube-apiserver)
- Consider exposing more metrics in core CAPI, e.g.:
  - time until a Machine is running (see the sketch after this list)
  - queue additions (to figure out who is adding items)
- Consider writing alerts for problematic conditions
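For the "time until a Machine is running" idea above, a minimal sketch of how such a metric could be exposed via the controller-runtime metrics registry; the metric name, the buckets and the place where it would be observed are assumptions:

```go
// Illustration only: expose a histogram for the time from Machine creation to Running.
package machine

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

var machineTimeToRunning = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "capi_machine_time_to_running_seconds",
	Help:    "Time from Machine creation until the Machine reached the Running phase.",
	Buckets: prometheus.ExponentialBuckets(30, 2, 10), // 30s up to ~4h
})

func init() {
	// Registers the metric on the manager's existing /metrics endpoint.
	metrics.Registry.MustRegister(machineTimeToRunning)
}

// observeTimeToRunning would be called by the Machine controller exactly once,
// when a Machine transitions into the Running phase.
func observeTimeToRunning(m *clusterv1.Machine) {
	if m.Status.Phase == string(clusterv1.MachinePhaseRunning) {
		machineTimeToRunning.Observe(time.Since(m.CreationTimestamp.Time).Seconds())
	}
}
```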
Performance improvements:
- ✨ Add flags for configuring rate limits #8579
- 🐛 Prevent KCP to create many private keys for each reconcile #8617
- 🌱 Use ClusterCacheTracker consistently (instead of NewClusterClient) #8744
- 🌱 Remove unnecessary requeues #8743
- 🐛 ClusterCacheTracker: Stop pod caching when checking workload cluster #8850
- 🌱 Deprecate DefaultIndex usage and remove where not needed #8855
- 🌱 Use rest config from ClusterCacheTracker consistently #8894
- 🌱 optimize reconcileInterruptibleNodeLabel of machine controller #8852
- 🌱 controller/machine: use unstructured caching client #8896
- ✨ Use caching read for bootstrap config owner #8867
- 🌱 Kcp use one workload cluster for reconcile #8900
- 🌱 KCP: drop redundant get machines #8912
- 🌱 KCP: cache unstructured #8913
- 🌱 Cache unstructured in Cluster, MD and MS controller #8916
- 🌱 util: cache list calls in cluster to objects mapper #8918
- 🌱 cluster/topology: use cached MD list in get current state #8922
- 🌱 KCP: cache secrets between LookupOrGenerate and ensureCertificatesOwnerRef #8926
- 🌱 all: Add flags to enable block profiling #8934
- 🌱 cluster/topology: use cached Cluster get in Reconcile #8936
- 🌱 cache secrets in KCP, CABPK and ClusterCacheTracker #8940
- Speed up provisioning of the first set of worker machines by improving predicates on cluster watch #8835
- Watches on remote cluster expire every 10s #8893
Follow-up:
Anomalies found that we should further triage:
- /convert gets called a lot (even though we never use old apiVersions)
- When deploying > 1k clusters into a namespace, "list machines" in KCP becomes pretty slow and apiserver CPU usage gets very high (8-14 CPUs). Debug ideas: CPU profile (see the pprof sketch below), apiserver tracing.
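For the CPU profile debug idea above, a minimal sketch of serving pprof (CPU and block profiles) from a controller binary; the managers already expose this via a profiler address flag, so this is just an illustration of what that flag enables, and the address is an assumption:

```go
// Illustration only: serve pprof so CPU/block profiles can be collected while
// reproducing the slow "list machines" behaviour.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
	"runtime"
)

func main() {
	// Also enable block profiling (cf. #8934).
	runtime.SetBlockProfileRate(1)

	// Serve the profiler on localhost only.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A 30s CPU profile can then be captured with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.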
Backlog improvement ideas:
- KCP:
  - (breaking change) Create an issue requiring all KCP secrets to have the cluster-name label => then configure the KCP cache & client to only cache secrets with the cluster-name label (see the cache configuration sketch at the end of this section)
  - EnsureResource: resources are cached at the moment. Consider only caching PartialObjectMetadata instead.
  - Consider caching only the pods we care about (at least the control plane pods; check if we access other pods, e.g. kube-proxy, CoreDNS)
  - GetMachinesForCluster: cached call + wait-for-cache safeguards
  - Optimize etcd client creation (cache instead of recreate)
- Others:
  - Change all CAPI controllers to cache unstructured objects per default and use APIReader for uncached calls (like for regular typed objects); see the sketch after this list
  - Audit all usages of APIReader to check whether they are actually necessary
  - Run certain operations less frequently (e.g. apiVersion bump, reconcile labels)
  - Customize the controller work queue rate limiter
  - Buffered reconciling (avoid frequent reconciles of the same item within a short period of time)
  - Resync items spread out over time instead of all at once at resyncPeriod
  - Investigate whether a Reconciler re-reconciles all objects for every type it is watching (because resync is implemented on the informer level), e.g. whether the KCP controller reconciles after the KCP resync and again after the Cluster resync
  - Priority queue
  - Use the CR cache transform option to strip parts of objects we don't use (fields which are not part of the contract)
    - Trade-off: memory vs. processing time to strip fields; also not clear how to configure this up front, before we know the CRDs
    - => Based on the data we have, we don't know whether it's worth it at the moment, so we won't do it for now
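For reference, a minimal sketch of how some of the caching ideas above (unstructured caching per default, only caching Secrets that carry the cluster-name label, and the cache transform option) could be wired up with controller-runtime v0.15+ manager options; the selectors, the transform and the decision what to cache are assumptions, not the current CAPI configuration:

```go
// Illustration only: manager options combining a few of the caching ideas above.
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManager() (ctrl.Manager, error) {
	// Only cache Secrets that carry the cluster-name label (the "breaking change" idea above).
	hasClusterName, err := labels.NewRequirement("cluster.x-k8s.io/cluster-name", selection.Exists, nil)
	if err != nil {
		return nil, err
	}

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				// Serve unstructured reads from the cache as well (off by default).
				Unstructured: true,
			},
		},
		Cache: cache.Options{
			// Strip fields we never read from every cached object to save memory.
			DefaultTransform: func(obj interface{}) (interface{}, error) {
				if o, ok := obj.(client.Object); ok {
					o.SetManagedFields(nil)
				}
				return obj, nil
			},
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Secret{}: {Label: labels.NewSelector().Add(*hasClusterName)},
			},
		},
	})
}

func main() {
	if _, err := newManager(); err != nil {
		panic(err)
	}
}
```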