We recently merged the initial iteration of the in-memory provider (#8799), but this was just the first step of the scale test implementation. This issue provides an overview of ongoing and upcoming tasks around scale testing.
In-Memory provider features:
- High-level:
  - P0 ClusterClass support (🌱 add ClusterClass support for in-memory provider #8807 @ykakarap)
  - P0 Deletion (🐛 fix cluster deletion in the in-memory API server #8818 @fabriziopandini)
  - Upgrade (@killianmuldoon)
  - KCP kube-proxy and CoreDNS reconciliation (🌱 CAPIM: Enable update for coreDNS and kube-proxy #8899 @killianmuldoon)
- Make it behave like a real infra provider:
  - P0 Provisioning duration (🌱 Add startup timeout to the in memory provider #8831 @fabriziopandini)
  - Errors
  - Configurable apiserver/etcd latency (see the sketch after this list)
- Low-level:
  - P0 apiserver: watches (🌱 Add watch to in-memory server multiplexer #8851 @killianmuldoon)
  - apiserver: label selector for list calls
    - Not a problem for resources cached in KCP: label selectors for cached resources are evaluated client-side in CR.
  - apiserver: improve field selector handling; return an error if the field selector is not supported (✨ Enable Kubernetes upgrades in CAPIM #8938 (comment) @killianmuldoon)
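For the configurable apiserver/etcd latency item above, here is a minimal sketch of how artificial latency could be injected into the in-memory server mux, assuming a simple HTTP middleware; the type and function names are hypothetical, not the actual CAPIM code:

```go
// Illustration only: artificial latency middleware for the in-memory API server.
package inmemoryserver

import (
	"math/rand"
	"net/http"
	"time"
)

// LatencyConfig holds the artificial latency range applied to every request.
type LatencyConfig struct {
	Min time.Duration
	Max time.Duration
}

// WithLatency wraps next and sleeps for a random duration in [Min, Max]
// before serving the request, to simulate a slower apiserver/etcd.
func WithLatency(cfg LatencyConfig, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if cfg.Max > 0 {
			delay := cfg.Min
			if cfg.Max > cfg.Min {
				delay += time.Duration(rand.Int63n(int64(cfg.Max - cfg.Min)))
			}
			time.Sleep(delay)
		}
		next.ServeHTTP(w, r)
	})
}
```

The in-memory apiserver handler would then be wrapped once at startup, e.g. `srv.Handler = WithLatency(cfg, srv.Handler)`, with the latency range coming from flags or provider configuration.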
e2e test and test framework:
- Implement scale test automation:
  - Cluster topologies:
    - Small workload cluster: x * (1 control plane node + 1 worker node)
    - Small-medium workload cluster: x * (3 control plane nodes + 10 worker nodes)
    - Medium workload cluster: x * (3 control plane nodes + 50 worker nodes)
    - Large workload cluster: x * (3 control plane nodes + 500 worker nodes)
    - Dimensions: # of MachineDeployments
  - Scenarios:
    - P0 Create & delete (🌱 Add Scale e2e - development only #8833 @ykakarap)
    - Create, upgrade & delete (@killianmuldoon)
    - Long-lived clusters (~ a few hours or a day, to catch memory leaks etc.)
    - Chaos testing: e.g. injecting failures like cluster not reachable, machine failures
    - More complex scenarios: e.g. the topology is actively changed (MD scale up etc.)
- Add MachineHealthCheck to the scaling test (@ykakarap)
- Automate scale testing in CI (prior art: KCP, k/k):
  - Metric collection and consumption after test completion
  - Tests should fail based on SLAs, e.g. machine creation slower than x minutes (see the sketch after this list)
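For the SLA-based test failures mentioned above, a rough sketch of what such a check could look like with Gomega; the helper name, the polling interval and the way Machines are listed are assumptions, not the actual e2e framework API:

```go
// Illustration only: fail the scale test if Machines don't reach Running within the SLA.
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// expectMachinesRunningWithin fails the test if not all Machines of the given
// Cluster reach the Running phase within the SLA duration.
func expectMachinesRunningWithin(ctx context.Context, c client.Client, namespace, clusterName string, sla time.Duration) {
	Eventually(func(g Gomega) {
		machines := &clusterv1.MachineList{}
		g.Expect(c.List(ctx, machines,
			client.InNamespace(namespace),
			client.MatchingLabels{clusterv1.ClusterNameLabel: clusterName},
		)).To(Succeed())
		g.Expect(machines.Items).ToNot(BeEmpty())
		for _, m := range machines.Items {
			g.Expect(m.Status.Phase).To(Equal(string(clusterv1.MachinePhaseRunning)),
				"machine %s is not Running yet", m.Name)
		}
	}, sla, 10*time.Second).Should(Succeed(), "SLA violated: not all machines Running within %s", sla)
}
```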
Metrics & observability:
- P0 Cluster API state metrics & dashboard (🌱 hack/observability: Add Grafana state dashboard, improve metrics #8834 @sbueringer)
- In-memory provider metrics & dashboard:
  - apiserver & etcd: server-side request metrics (prior art: kube-apiserver)
- Consider exposing more metrics in core CAPI, e.g.:
  - time until a Machine is running (see the sketch after this list)
  - queue additions (to figure out who is adding items)
- Consider writing alerts for problematic conditions
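For the "time until a Machine is running" idea above, a minimal sketch of how such a metric could be exposed via the controller-runtime metrics registry; the metric name, the buckets and the place where it would be observed are assumptions:

```go
// Illustration only: expose a histogram for the time from Machine creation to Running.
package machine

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

var machineTimeToRunning = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "capi_machine_time_to_running_seconds",
	Help:    "Time from Machine creation until the Machine reached the Running phase.",
	Buckets: prometheus.ExponentialBuckets(30, 2, 10), // 30s up to ~4h
})

func init() {
	// Registers the metric on the manager's existing /metrics endpoint.
	metrics.Registry.MustRegister(machineTimeToRunning)
}

// observeTimeToRunning would be called by the Machine controller exactly once,
// when a Machine transitions into the Running phase.
func observeTimeToRunning(m *clusterv1.Machine) {
	if m.Status.Phase == string(clusterv1.MachinePhaseRunning) {
		machineTimeToRunning.Observe(time.Since(m.CreationTimestamp.Time).Seconds())
	}
}
```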
Performance improvements:
- ✨ Add flags for configuring rate limits #8579
- 🐛 Prevent KCP to create many private keys for each reconcile #8617
- 🌱 Use ClusterCacheTracker consistently (instead of NewClusterClient) #8744
- 🌱 Remove unnecessary requeues #8743
- 🐛 ClusterCacheTracker: Stop pod caching when checking workload cluster #8850
- 🌱 Deprecate DefaultIndex usage and remove where not needed #8855
- 🌱 Use rest config from ClusterCacheTracker consistently #8894
- 🌱 optimize reconcileInterruptibleNodeLabel of machine controller #8852
- 🌱 controller/machine: use unstructured caching client #8896
- ✨ Use caching read for bootstrap config owner #8867
- 🌱 Kcp use one workload cluster for reconcile #8900
- 🌱 KCP: drop redundant get machines #8912
- 🌱 KCP: cache unstructured #8913
- 🌱 Cache unstructured in Cluster, MD and MS controller #8916
- 🌱 util: cache list calls in cluster to objects mapper #8918
- 🌱 cluster/topology: use cached MD list in get current state #8922
- 🌱 KCP: cache secrets between LookupOrGenerate and ensureCertificatesOwnerRef #8926
- 🌱 all: Add flags to enable block profiling #8934
- 🌱 cluster/topology: use cached Cluster get in Reconcile #8936
- 🌱 cache secrets in KCP, CABPK and ClusterCacheTracker #8940
- Speed up provisioning of the first set of worker machines by improving predicates on cluster watch #8835
- Watches on remote cluster expire every 10s #8893
Follow-up:
Anomalies found that we should further triage:
- /convert gets called a lot (even though we never use old apiVersions)
- When deploying > 1k clusters into a namespace, "list machines" in KCP becomes pretty slow and apiserver CPU usage gets very high (8-14 CPUs). Debug ideas: CPU profile (see the pprof sketch below), apiserver tracing.
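For the CPU profile debug idea above, a minimal sketch of serving pprof (CPU and block profiles) from a controller binary; the managers already expose this via a profiler address flag, so this is just an illustration of what that flag enables, and the address is an assumption:

```go
// Illustration only: serve pprof so CPU/block profiles can be collected while
// reproducing the slow "list machines" behaviour.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
	"runtime"
)

func main() {
	// Also enable block profiling (cf. #8934).
	runtime.SetBlockProfileRate(1)

	// Serve the profiler on localhost only.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A 30s CPU profile can then be captured with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.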
Backlog improvement ideas:
- KCP:
  - (breaking change) Create an issue requiring all KCP secrets to have the cluster-name label => then configure the KCP cache & client to only cache secrets with the cluster-name label (see the cache configuration sketch at the end of this section)
  - EnsureResource: resources are cached at the moment. Consider only caching PartialObjectMetadata instead.
  - Consider caching only the pods we care about (at least the control plane pods; check if we access other pods, e.g. kube-proxy, CoreDNS)
  - GetMachinesForCluster: cached call + wait-for-cache safeguards
  - Optimize etcd client creation (cache instead of recreate)
- Others:
  - Change all CAPI controllers to cache unstructured objects per default and use APIReader for uncached calls (like for regular typed objects); see the sketch after this list
  - Audit all usages of APIReader to check whether they are actually necessary
  - Run certain operations less frequently (e.g. apiVersion bump, reconcile labels)
  - Customize the controller work queue rate limiter
  - Buffered reconciling (avoid frequent reconciles of the same item within a short period of time)
  - Resync items spread out over time instead of all at once at resyncPeriod
  - Investigate whether a Reconciler re-reconciles all objects for every type it is watching (because resync is implemented on the informer level), e.g. whether the KCP controller reconciles after the KCP resync and again after the Cluster resync
  - Priority queue
  - Use the CR cache transform option to strip parts of objects we don't use (fields which are not part of the contract)
    - Trade-off: memory vs. processing time to strip fields; also not clear how to configure this up front, before we know the CRDs
    - => Based on the data we have, we don't know whether it's worth it at the moment, so we won't do it for now
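For reference, a minimal sketch of how some of the caching ideas above (unstructured caching per default, only caching Secrets that carry the cluster-name label, and the cache transform option) could be wired up with controller-runtime v0.15+ manager options; the selectors, the transform and the decision what to cache are assumptions, not the current CAPI configuration:

```go
// Illustration only: manager options combining a few of the caching ideas above.
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManager() (ctrl.Manager, error) {
	// Only cache Secrets that carry the cluster-name label (the "breaking change" idea above).
	hasClusterName, err := labels.NewRequirement("cluster.x-k8s.io/cluster-name", selection.Exists, nil)
	if err != nil {
		return nil, err
	}

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				// Serve unstructured reads from the cache as well (off by default).
				Unstructured: true,
			},
		},
		Cache: cache.Options{
			// Strip fields we never read from every cached object to save memory.
			DefaultTransform: func(obj interface{}) (interface{}, error) {
				if o, ok := obj.(client.Object); ok {
					o.SetManagedFields(nil)
				}
				return obj, nil
			},
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Secret{}: {Label: labels.NewSelector().Add(*hasClusterName)},
			},
		},
	})
}

func main() {
	if _, err := newManager(); err != nil {
		panic(err)
	}
}
```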