This repository was archived by the owner on Jan 13, 2025. It is now read-only.

Control Plane Stabilization & Auto-Scaling to 500 Nodes #79

@vlerenc

Description


Feature (What you would like to be added):
For the time being, we collect here our findings about unstable control planes and what needs to be done about them (only a few points concern HVPA directly):

  • VPA has many shortcomings, e.g. it cannot deal properly with load spikes and sometimes recommends requests below current usage, which is one of the main reasons we cannot scale to larger clusters; see ☂️ Improve VPA Recommendations gardener/autoscaler#47
  • Once a single OOMKilled pod triggers new VPA recommendations, HVPA rolls them out to all replicas, recreating all of them, thereby terminating all connections and putting stress on the system during reinitialisation (e.g. taking down ETCD)
  • We use HVPA to mitigate glaring issues with VPA and to combine horizontal and vertical pod auto-scaling on the same metric, but we should possibly switch to request-based horizontal autoscaling once we have improved VPA, which would let us drop HVPA completely
  • Once a large cluster's control plane fails, it can no longer recover by itself: the components restart in a vicious cycle, and nodes need to be onboarded in a controlled way, for which standard Kubernetes provides no solution yet (batched/staged node onboarding so as not to overload the recovering control plane again and again)
  • Clustered ETCD is required to make the cluster more resilient and not let it die in a downward spiral when we update ETCD or something happens to the single instance we run
  • CoreDNS is not stable; we see unbalanced load patterns that we must address by means of node-local DNS or better vertical pod autoscaling (horizontal pod autoscaling is pretty much pointless here)
  • Calico Typha is recommended to be used together with the cluster-proportional autoscaler, but that's more of a community band-aid, as it only scales based on the number of nodes, regardless of their size/load, so it again boils down to a better VPA to get that problem under control
  • Our monitoring/logging stacks have a fixed size (also to control costs), but while we do not want to "pay" for excessive logging, the sizing should be more reasonable and match the basic needs of the control plane and the kubelets
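As an illustration of the request-based horizontal autoscaling mentioned above, here is a minimal sketch of a standard Kubernetes HPA that scales on CPU utilisation relative to the pods' configured requests. All names, namespaces, and thresholds are placeholders for illustration, not taken from this issue; the `autoscaling/v2` API is GA since Kubernetes 1.23, older clusters would use `autoscaling/v2beta2`.

```yaml
# Hypothetical sketch: scale a control-plane deployment based on CPU
# utilisation relative to its resource requests (request-based HPA).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-request-based-hpa   # placeholder name
  namespace: shoot--example         # placeholder namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kube-apiserver            # placeholder target workload
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80    # scale out when avg CPU exceeds 80% of requests
```

Because the target is expressed relative to requests, this only behaves well once VPA keeps the requests themselves accurate, which is why the switch is gated on the VPA improvements above.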

Motivation (Why is this needed?):

  • Stable control plane, even if spikes or load tests stress it
  • Support for large clusters of 500 nodes (or more)

Metadata

Assignees

No one assigned

    Labels

    • area/auto-scaling: Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related
    • area/robustness: Robustness, reliability, resilience related
    • component/hvpa: HVPA
    • kind/epic: Large multi-story topic
    • kind/roadmap: Roadmap BLI
    • lifecycle/rotten: Nobody worked on this for 12 months (final aging stage)
    • topology/shoot: Affects Shoot clusters
