[feature] Maximum Nodes Per Zone #6940

Open
@rrangith

Description

Which component are you using?:

cluster-autoscaler

Problem

The cluster-autoscaler (CA) currently supports a maximum number of nodes per cluster, but imposes no restriction on the number of nodes per zone. There are scenarios, however, where it would be useful for the CA to restrict the maximum number of nodes per zone. For example, if an application is heavily biased towards one zone (or, more generally, if the cluster is unbalanced for any reason), that zone can run out of IPs. The CA is currently not aware that a zone is out of IPs, so it continues to attempt scale-ups in the exhausted zone. This leads to a buildup of not-ready nodes and leaves the cluster in a degraded state.

The autoscaler should allow users to specify a maximum number of nodes per zone. With this feature, the autoscaler could refuse scale-ups beyond that maximum and so avoid scenarios such as IP exhaustion.

More generally, we would like to limit the number of nodes at a finer granularity than the maximum number of nodes per cluster. With the solutions below, we propose a way for users to customize how the autoscaler limits the maximum number of nodes for a nodegroup.

Proposed Solution

Our first proposed solution is to allow users to limit the maximum number of nodes per nodegroup via the autoscaler’s gRPC expander. To do so, the expander would return a list of similar nodegroups along with its bestOption. This lets the gRPC expander apply custom logic that filters out nodegroups based on certain characteristics, such as whether a zone is at capacity.

Currently, during a scale-up, CA computes the valid options and the similar nodegroups for each option. It then asks the expander for the best option.

Then the similar nodegroups get recomputed. The recomputation used to occur only when the bestOption nodegroup did not exist and had to be created. However, it was changed in this PR to always recompute. This means that if an expander changes the SimilarNodeGroups of the bestOption, its result is overwritten by the recomputation.

In order to allow users to limit the maximum number of nodes per nodegroup via the gRPC expander, we propose two changes to the autoscaler.

The first is to add a field to the autoscaler’s gRPC Option request that carries SimilarNodegroupIds. The autoscaler’s Options struct already includes SimilarNodegroups; we would just need to populate the proto request with the similar nodegroup IDs. Here is a PR which implements this. This would allow users to filter both the Options and each Option’s similar nodegroups in the gRPC expander. For example, they could remove options that have reached the max nodes in their zone.

The second change is to allow the autoscaler to trust the SimilarNodegroups returned by the expander, rather than recomputing the similar nodegroups after getting the best option. To do so, we should add a CLI option that makes the autoscaler trust the expander’s similar nodegroups and skip the recomputation as long as the bestOption nodegroup exists. If it doesn’t exist, the autoscaler creates it and computes the similar nodegroups as it does today. The recomputation is skipped only for users who enable this option; by default the behaviour stays the same. Here is a PR which implements this.
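The decision described in this second change can be sketched as a small function. This is only an illustration under assumptions: `trustExpander` stands in for the proposed CLI option (whose real name is not given here), and `similarNodeGroups`, `fromExpander`, and `recompute` are hypothetical names, not identifiers from the cluster-autoscaler code base.

```go
package main

import "fmt"

// similarNodeGroups picks the similar-nodegroup list used for a scale-up.
// trustExpander is a hypothetical stand-in for the proposed CLI option.
// recompute represents the autoscaler's existing recomputation step.
func similarNodeGroups(
	trustExpander bool,
	bestOptionExists bool,
	fromExpander []string,
	recompute func() []string,
) []string {
	// Only trust the expander when the option is enabled AND the best
	// option's nodegroup already exists; otherwise keep today's behaviour.
	if trustExpander && bestOptionExists {
		return fromExpander
	}
	return recompute()
}

func main() {
	fromExpander := []string{"ng-b"}
	recompute := func() []string { return []string{"ng-b", "ng-c"} }

	fmt.Println("option on, nodegroup exists: ", similarNodeGroups(true, true, fromExpander, recompute))
	fmt.Println("option off (default):        ", similarNodeGroups(false, true, fromExpander, recompute))
	fmt.Println("option on, nodegroup missing:", similarNodeGroups(true, false, fromExpander, recompute))
}
```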

With both of these changes, users can have their own max nodes per zone logic in their gRPC expander, by filtering out nodegroups from a zone that already has reached the max, while the default behaviour of the autoscaler would remain the same.
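As a rough sketch of what such user-side logic might look like, the snippet below filters options and their similar nodegroups by zone capacity. The `Option` and `ZoneState` types are simplified, hypothetical stand-ins, not the real gRPC proto messages, and the zone mapping and node counts are assumed to be maintained by the user's expander.

```go
package main

import "fmt"

// Option is a hypothetical, simplified view of an expansion option as the
// gRPC expander might receive it. It is not the real proto message.
type Option struct {
	NodeGroupID         string
	SimilarNodeGroupIDs []string
}

// ZoneState is hypothetical state kept on the expander side.
type ZoneState struct {
	ZoneOf      map[string]string // nodegroup ID -> zone
	NodesInZone map[string]int    // zone -> current node count
	MaxPerZone  int
}

// hasRoom reports whether the nodegroup's zone is still below the limit.
// Nodegroups with an unknown zone are not restricted in this sketch.
func (z ZoneState) hasRoom(nodeGroupID string) bool {
	zone, ok := z.ZoneOf[nodeGroupID]
	if !ok {
		return true
	}
	return z.NodesInZone[zone] < z.MaxPerZone
}

// FilterOptions drops options whose zone is full and, for the remaining
// options, drops similar nodegroups whose zone is full.
func FilterOptions(opts []Option, z ZoneState) []Option {
	var kept []Option
	for _, o := range opts {
		if !z.hasRoom(o.NodeGroupID) {
			continue
		}
		var similar []string
		for _, id := range o.SimilarNodeGroupIDs {
			if z.hasRoom(id) {
				similar = append(similar, id)
			}
		}
		o.SimilarNodeGroupIDs = similar // o is a copy; the input is untouched
		kept = append(kept, o)
	}
	return kept
}

func main() {
	state := ZoneState{
		ZoneOf:      map[string]string{"ng-a": "zone-1", "ng-b": "zone-2", "ng-c": "zone-3"},
		NodesInZone: map[string]int{"zone-1": 50, "zone-2": 10, "zone-3": 50},
		MaxPerZone:  50,
	}
	opts := []Option{
		{NodeGroupID: "ng-a", SimilarNodeGroupIDs: []string{"ng-b", "ng-c"}},
		{NodeGroupID: "ng-b", SimilarNodeGroupIDs: []string{"ng-a", "ng-c"}},
	}
	for _, o := range FilterOptions(opts, state) {
		fmt.Printf("%s similar=%v\n", o.NodeGroupID, o.SimilarNodeGroupIDs)
	}
}
```

Note that filtering the similar nodegroups only has an effect if the autoscaler does not recompute them afterwards, which is why both proposed changes are needed together.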

Overall, this gives more flexibility to gRPC expander users when picking a best option and its similar nodegroups. This flexibility can be used in a variety of use cases beyond just max nodes per zone. The disadvantage is that users will need to implement this logic on their side rather than relying on cluster-autoscaler to do it.

Alternative Solution

Another solution we considered involves putting the control logic in cluster-autoscaler. Overall, this solution requires much more code on the cluster-autoscaler side.

In order to enforce max nodes per zone, we must first understand that the autoscaler currently has no concept of a zone. It only has knowledge of nodegroups (ASG, VMSS, etc.) and the nodes that belong to them. Depending on the cloud provider, these nodegroups may or may not carry metadata that indicates which zone they belong to. Therefore, to enforce max nodes per zone, the implementation would be a more general “max nodes per nodegroup tag”.

This general feature can be applied to many other use cases, including:

  • Max number of nodes with a “spot” tag
  • Max number of nodes with a “gpu” tag
  • Max number of nodes for a certain instance type
  • Many more options

If a user does not specify any tags, then CA must behave the same as it currently does.

Implementation:
First, CA filters out invalid nodegroups, such as ones that have reached their max size. We would need to change this step to also filter out nodegroups that belong to a “tag set” that has reached its max size.

Next, CA balances the desired nodes across similar nodegroups here. This function also checks the max size, so we would also add in the check to see if the tag set has enough space. If not, there will be a scaleup failure.

After this the scaleup can succeed.
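The two checks above (filtering and the capacity check during balancing) could share one helper that computes how many nodes a nodegroup may still add. The sketch below is a minimal illustration under assumptions: `NodeGroup`, `TagLimit`, `headroom`, and `filterExpandable` are hypothetical names, not existing cluster-autoscaler types or functions.

```go
package main

import "fmt"

// NodeGroup is a hypothetical, simplified nodegroup with cloud tags.
type NodeGroup struct {
	ID         string
	Tags       map[string]string
	TargetSize int
	MaxSize    int
}

// TagLimit caps the total node count across all nodegroups tagged Key=Value.
type TagLimit struct {
	Key, Value string
	MaxNodes   int
}

func (l TagLimit) matches(ng NodeGroup) bool {
	v, ok := ng.Tags[l.Key]
	return ok && v == l.Value
}

// headroom returns how many nodes ng can still add without exceeding its
// own max size or any tag-set limit it falls under.
func headroom(ng NodeGroup, all []NodeGroup, limits []TagLimit) int {
	room := ng.MaxSize - ng.TargetSize
	for _, l := range limits {
		if !l.matches(ng) {
			continue
		}
		used := 0
		for _, other := range all {
			if l.matches(other) {
				used += other.TargetSize
			}
		}
		if r := l.MaxNodes - used; r < room {
			room = r
		}
	}
	if room < 0 {
		room = 0
	}
	return room
}

// filterExpandable keeps only nodegroups that can grow by at least one node.
func filterExpandable(all []NodeGroup, limits []TagLimit) []NodeGroup {
	var out []NodeGroup
	for _, ng := range all {
		if headroom(ng, all, limits) > 0 {
			out = append(out, ng)
		}
	}
	return out
}

func main() {
	groups := []NodeGroup{
		{ID: "ng-a", Tags: map[string]string{"zone": "zone-1"}, TargetSize: 4, MaxSize: 10},
		{ID: "ng-b", Tags: map[string]string{"zone": "zone-1"}, TargetSize: 5, MaxSize: 10},
		{ID: "ng-c", Tags: map[string]string{"zone": "zone-2"}, TargetSize: 2, MaxSize: 3},
	}
	limits := []TagLimit{{Key: "zone", Value: "zone-1", MaxNodes: 10}}
	for _, ng := range groups {
		fmt.Printf("%s can add %d node(s)\n", ng.ID, headroom(ng, groups, limits))
	}
}
```

With no limits configured, `headroom` reduces to the existing per-nodegroup max size check, which matches the requirement that CA behave as it does today when no tags are specified.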

In order for this implementation to hold, there are a few additional implementation details we would have to cover:

  • We need a cloud-agnostic way to access the tags on the nodegroups.
  • We need to keep track of (or efficiently compute) the count of nodes grouped by the specified tag set. We are not counting nodes by their Kubernetes labels, but by their nodegroup tags.
  • With each ScaleUp call, we would have to know the current state of how many nodes there are per tag set.
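One possible shape for these pieces is sketched below: a small interface that each cloud provider could implement to expose tags, plus a function that recomputes node counts per tag value on each scale-up. `TaggedNodeGroup`, `staticGroup`, and `countByTag` are hypothetical names introduced for illustration; they are not part of the existing cloud provider interface.

```go
package main

import "fmt"

// TaggedNodeGroup is a hypothetical cloud-agnostic view of a nodegroup.
// Each cloud provider would map its own metadata (ASG tags, VMSS tags,
// and so on) onto Tags().
type TaggedNodeGroup interface {
	ID() string
	Tags() map[string]string
	TargetSize() int
}

// staticGroup is a trivial in-memory implementation used for the example.
type staticGroup struct {
	id   string
	tags map[string]string
	size int
}

func (g staticGroup) ID() string              { return g.id }
func (g staticGroup) Tags() map[string]string { return g.tags }
func (g staticGroup) TargetSize() int         { return g.size }

// countByTag sums target sizes per value of the given tag key.
// Nodegroups without the tag are skipped. It uses nodegroup tags,
// not Kubernetes node labels.
func countByTag(groups []TaggedNodeGroup, key string) map[string]int {
	counts := make(map[string]int)
	for _, g := range groups {
		if v, ok := g.Tags()[key]; ok {
			counts[v] += g.TargetSize()
		}
	}
	return counts
}

func main() {
	groups := []TaggedNodeGroup{
		staticGroup{id: "ng-a", tags: map[string]string{"zone": "zone-1", "lifecycle": "spot"}, size: 4},
		staticGroup{id: "ng-b", tags: map[string]string{"zone": "zone-1"}, size: 5},
		staticGroup{id: "ng-c", tags: map[string]string{"zone": "zone-2", "lifecycle": "spot"}, size: 2},
	}
	fmt.Println("per zone:     ", countByTag(groups, "zone"))
	fmt.Println("per lifecycle:", countByTag(groups, "lifecycle"))
}
```

Recomputing from target sizes on every call is the simplest option; whether that is efficient enough, or whether the counts should be cached and updated incrementally, is left open here.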

Labels: area/cluster-autoscaler, kind/feature, lifecycle/stale