
PROPOSAL - extend 'scale' subresource API to support pod-deletion-cost #123541

Open
@thesuperzapper

Description


Background

The idea of letting users customize the way Deployments (ReplicaSets) remove Pods when replicas are decreased has been floating around since at least 2017, with related issues dating back to 2015.

Since Kubernetes 1.22, the controller.kubernetes.io/pod-deletion-cost annotation proposed in KEP-2255 is available in BETA.

There have been several other proposals, but this one is intended to supersede them.

Problem

Problem 1: It's too hard to get/update pod-deletion cost

It is currently too hard to get/update the controller.kubernetes.io/pod-deletion-cost annotation for all Pods in a Deployment/ReplicaSet. This makes it difficult to use pod-deletion-cost in practice.

The main issue is that the pod-deletion-cost annotation must be updated BEFORE the replicas count is decreased. This means that any system that wants to use pod-deletion-cost must:

  • Track which Pods are currently in the Deployment (possibly under multiple ReplicaSets)
  • Clean up any pod-deletion-cost annotations that were not used.
  • When scaling down, update the pod-deletion-cost of the Pods that will be deleted, and THEN update the replicas count.

This difficulty often prompts people to use the pod-deletion-cost annotation in ways that are NOT recommended, such as running a controller that updates the pod-deletion-cost annotation even when no scale-down is happening (a stated anti-pattern).
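
For concreteness, here is a rough client-go sketch of the workflow described above. The function name scaleDownWithDeletionCost, its arguments, and the cost value of -100 are all illustrative (not an existing API); the point is only that the annotations must be in place before spec.replicas is lowered.

package scaler

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// scaleDownWithDeletionCost annotates the intended victim Pods first,
// and only then lowers spec.replicas via the scale subresource.
func scaleDownWithDeletionCost(ctx context.Context, cs kubernetes.Interface,
	ns, deployment string, newReplicas int32, victims []string) error {

	// 1. Mark the Pods we would prefer to see deleted by patching the
	//    controller.kubernetes.io/pod-deletion-cost annotation onto each one.
	patch := []byte(`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"-100"}}}`)
	for _, pod := range victims {
		if _, err := cs.CoreV1().Pods(ns).Patch(
			ctx, pod, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
		); err != nil {
			return fmt.Errorf("annotating pod %s: %w", pod, err)
		}
	}

	// 2. Only AFTER the annotations are in place, lower spec.replicas so the
	//    ReplicaSet controller takes the costs into account when choosing victims.
	scale, err := cs.AppsV1().Deployments(ns).GetScale(ctx, deployment, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = newReplicas
	_, err = cs.AppsV1().Deployments(ns).UpdateScale(ctx, deployment, scale, metav1.UpdateOptions{})
	return err
}

Even this sketch omits the bookkeeping listed above (tracking Pods across multiple ReplicaSets and cleaning up stale annotations), which is where most of the real-world complexity lies.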

Problem 2: HorizontalPodAutoscaler can't use pod-deletion-cost

There is no sensible way to extend the HorizontalPodAutoscaler resource to make use of pod-deletion-cost when scaling Deployments. Introducing complicated Pod-specific logic to update pod-deletion-cost annotations in the HPA would inevitably be brittle.

Proposal

Overview

The general idea is to make it easier to read/write the controller.kubernetes.io/pod-deletion-cost annotation for all Pods in a Deployment/ReplicaSet. To achieve this, we can extend the existing Scale v1 subresource so that it can read and write these annotations directly.

Current State

We already have a special Scale v1 subresource, which can be used by autoscalers to do things like:

  • GET: the current replicas of a Deployment
  • GET: the current label selector of a Deployment
  • PATCH: the current replicas of a Deployment

Example 1: GET: /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale:

> curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale | yq .
kind: Scale
apiVersion: autoscaling/v1
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 2
status:
  replicas: 2
  selector: my-label=my-label-value

NOTE: the HorizontalPodAutoscaler already uses this API to do its scaling in a resource-agnostic way

Future State

We can extend the Scale v1 subresource with two new fields:

  • spec.podDeletionCosts: used to PATCH the controller.kubernetes.io/pod-deletion-cost annotation on specific Pods
  • status.podDeletionCosts: used to GET the current pod-deletion-cost of Pods in the Deployment
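
For illustration, a rough sketch of how the autoscaling/v1 Scale Go types could carry these fields. The podDeletionCosts field name comes from this proposal; the exact Go shape (a map of Pod name to cost) is an assumption, not an agreed design.

// Sketch only: the existing autoscaling/v1 ScaleSpec/ScaleStatus, extended
// with the proposed podDeletionCosts fields. Shape is illustrative, not final.
type ScaleSpec struct {
	// Replicas is the desired number of replicas (existing field).
	Replicas int32 `json:"replicas,omitempty"`

	// PodDeletionCosts maps Pod name to the desired pod-deletion-cost.
	// Each entry would be written through to the
	// controller.kubernetes.io/pod-deletion-cost annotation of the named Pod.
	PodDeletionCosts map[string]int32 `json:"podDeletionCosts,omitempty"`
}

type ScaleStatus struct {
	// Replicas is the observed number of replicas (existing field).
	Replicas int32 `json:"replicas"`

	// Selector is the label selector, in string form (existing field).
	Selector string `json:"selector,omitempty"`

	// PodDeletionCosts lists every managed Pod that currently has a non-empty
	// pod-deletion-cost annotation, keyed by Pod name.
	PodDeletionCosts map[string]int32 `json:"podDeletionCosts,omitempty"`
}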

Example 1: GET: /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale:

> curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale | yq .
kind: Scale
apiVersion: autoscaling/v1
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 3

  ## NOTE: this is empty, because this is a GET, not a PATCH 
  ## (we could make this be non-empty, but it would be redundant)
  podDeletionCosts: {}

status:
  replicas: 3
  selector: my-label=my-label-value

  ## all pods with non-empty `controller.kubernetes.io/pod-deletion-cost` annotations
  ## NOTE: this might include Pods from multiple ReplicaSets, if the Deployment is rolling out
  ## NOTE: a value of 0 means the annotation is explicitly set to 0, not that it is missing
  ## (Pods with no annotation are NOT included in this list, for efficiency)
  podDeletionCosts:
    pod-1: 0
    pod-2: 100
    pod-3: 200

Example 2: kubectl patch ... --subresource=scale:

# this command does two things:
#  1. scale the Deployment to 2 replicas
#  2. set the `pod-deletion-cost` of `pod-1` to -100 (making it much more likely to be deleted)
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --type='merge' \
  --patch='{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}'

Benefits / Drawbacks

The main benefits of this approach are:

  • Autoscaler systems can easily check what the current pod-deletion-costs are, and then update them during scale-down as appropriate, with no need to make hundreds of individual Pod GET requests.
  • Autoscaler systems can use a single PATCH to change spec.replicas AND update the pod-deletion-cost of Pods.
  • It does not require significant changes to how scaling is implemented. We can piggyback on the existing work of pod-deletion-cost annotations.

The main drawbacks are:

  • This still only applies to Deployments/ReplicaSets (because pod-deletion-cost is a feature of ReplicaSets)
  • It is slightly strange to automatically update the controller.kubernetes.io/pod-deletion-cost annotation:
    • However, we should remember that the Pods are "managed" by the Deployment, so it IS appropriate for the ReplicaSet controller to update the Pod's definition.

User Stories

User 1: Manual Scaling

As a user, I want to be able to scale down a Deployment and influence which Pods are deleted based on my knowledge of the current state of the system.

For example, say I am running a stateful application with 3 replicas:

  • I know that pod-1 is currently idle, but pod-2 and pod-3 are both busy.
  • I want to scale down to 2 replicas, but I want to make sure that pod-1 is deleted first, because it is idle.

To achieve this, I can do the following:

  1. Verify the Deployment is not currently rolling out (so there is only one active ReplicaSet)
  2. Use kubectl get ... --subresource=scale to see the current pod-deletion-cost of all Pods in the Deployment
  3. Use kubectl patch ... --subresource=scale to BOTH:
    • set replicas to 2
    • update the pod-deletion-cost of pod-1 to a value that makes it more likely to be deleted

User 2: Custom Autoscalers

As a developer of a custom autoscaler, I want to use application-specific metrics to influence which Pods are deleted during scale-down to minimize the impact on my application and its users.

To achieve this, I can do the following:

  1. Keep track of what the Pods are doing, so I can make informed decisions about which pods are best to delete.
  2. When the time comes to scale down:
    • Use the Scale subresource to read the pod-deletion-cost of all Pods in the Deployment
    • Use the Scale subresource to update the replicas AND the pod-deletion-cost of Pods as appropriate
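
If this proposal is accepted, the scale-down step above could collapse into a single request. The following client-go sketch assumes the proposed spec.podDeletionCosts field exists (it does not today) and uses an illustrative function name; the current costs would be read beforehand from status.podDeletionCosts of the same scale subresource, as in Example 1 above.

package scaler

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// scaleDownViaScaleSubresource sends one merge patch to the scale subresource
// that both lowers replicas and marks the preferred victim Pod (hypothetical,
// depends on the proposed podDeletionCosts field being implemented).
func scaleDownViaScaleSubresource(ctx context.Context, cs kubernetes.Interface,
	ns, deployment, victim string, newReplicas int32) error {

	patch := []byte(fmt.Sprintf(
		`{"spec":{"replicas":%d,"podDeletionCosts":{%q:-100}}}`,
		newReplicas, victim,
	))

	// Passing "scale" as the trailing subresource argument targets the Scale
	// subresource rather than the Deployment object itself.
	_, err := cs.AppsV1().Deployments(ns).Patch(
		ctx, deployment, types.MergePatchType, patch,
		metav1.PatchOptions{}, "scale",
	)
	return err
}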

User 3: HorizontalPodAutoscaler

At least initially, the HorizontalPodAutoscaler will not directly use this feature, because it is primarily concerned with scaling replicas based on a metric, and does not know about application-specific factors that might influence which Pods should be deleted.

However, this feature will make it easier for the HorizontalPodAutoscaler to be extended to have "pod-deletion-cost" awareness in the future.


    Labels

      • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
      • sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
      • sig/autoscaling: Categorizes an issue or PR as relevant to SIG Autoscaling.
      • wg/batch: Categorizes an issue or PR as relevant to WG Batch.
