This repository was archived by the owner on Jan 13, 2025. It is now read-only.

Control Plane Stabilization & Auto-Scaling to 500 Nodes #79

@vlerenc

Description


Feature (What you would like to be added):
For the time being, we collect here our findings about unstable control planes and what needs to be done about them (only a few points concern HVPA directly):

  • VPA has many shortcomings, e.g. it cannot deal properly with load spikes and sometimes recommends requests below current usage, which is one of the main reasons we cannot scale to larger clusters; see ☂️ Improve VPA Recommendations gardener/autoscaler#47
  • Once a single OOMKilled pod triggers new VPA recommendations, HVPA rolls them out to all replicas, recreating all of them, thereby terminating all connections and putting stress on the system during reinitialisation (e.g. taking down ETCD)
  • We use HVPA to mitigate glaring issues with VPA and to combine horizontal and vertical pod auto-scaling on the same metric, but we should possibly switch to request-based horizontal autoscaling once we have improved VPA, which would let us drop HVPA completely
  • Once a large cluster's control plane fails, it can no longer recover by itself: the components restart in a vicious cycle, and nodes need to be onboarded in a controlled way, for which standard Kubernetes provides no solution yet (batched/staged node onboarding so as not to overload the recovering control plane again and again)
  • Clustered ETCD is required to make the cluster more resilient and not let it die in a downward spiral when we update ETCD or something happens to the single instance we run
  • CoreDNS is not stable; we see unbalanced load patterns that we must address by means of node-local DNS or better vertical pod autoscaling (horizontal pod autoscaling is pretty much pointless here)
  • Calico Typha is recommended to be used together with the cluster-proportional autoscaler, but that's more of a community band-aid, as it only scales based on the number of nodes, regardless of their size/load, so it again boils down to a better VPA to get that problem under control
  • Our monitoring/logging stacks have a fixed size (also to control costs), but while we do not want to "pay" for excessive logging, the sizing should be more reasonable and match the basic needs of the control plane and the kubelets
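As an illustration of the request-based horizontal autoscaling mentioned above, here is a minimal sketch of a standard Kubernetes HPA that scales on CPU utilisation relative to the pods' configured requests. All names, namespaces, and thresholds are placeholders for illustration, not taken from this issue; the `autoscaling/v2` API is GA since Kubernetes 1.23, older clusters would use `autoscaling/v2beta2`.

```yaml
# Hypothetical sketch: scale a control-plane deployment based on CPU
# utilisation relative to its resource requests (request-based HPA).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-request-based-hpa   # placeholder name
  namespace: shoot--example         # placeholder namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kube-apiserver            # placeholder target workload
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80    # scale out when avg CPU exceeds 80% of requests
```

Because the target is expressed relative to requests, this only behaves well once VPA keeps the requests themselves accurate, which is why the switch is gated on the VPA improvements above.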

Motivation (Why is this needed?):

  • Stable control plane, even if spikes or load tests stress it
  • Support for large clusters of 500 nodes (or more)

Metadata

Assignees

No one assigned

    Labels

    • area/auto-scaling: Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related
    • area/robustness: Robustness, reliability, resilience related
    • component/hvpa: HVPA
    • kind/epic: Large multi-story topic
    • kind/roadmap: Roadmap BLI
    • lifecycle/rotten: Nobody worked on this for 12 months (final aging stage)
    • topology/shoot: Affects Shoot clusters
