Background
The idea of letting users customize the way Deployments (ReplicaSets) remove Pods when replicas
are decreased has been floating around since at least 2017, with other issues dating back to 2015.
Since Kubernetes 1.22, the `controller.kubernetes.io/pod-deletion-cost` annotation proposed in KEP-2255 is available in BETA.
There have been several other proposals, but this one should supersede them:
- PROPOSAL configurable down-scaling behaviour in ReplicaSets & Deployments #107598
- Deployment/ReplicaSet Downscale Pod Picker enhancements#3189
Problem
Problem 1: It's too hard to get/update pod-deletion-cost
It is currently too hard to get/update the `controller.kubernetes.io/pod-deletion-cost` annotation for all Pods in a Deployment/ReplicaSet. This makes it difficult to use `pod-deletion-cost` in practice.
The main issue is that the `pod-deletion-cost` annotation must be updated BEFORE the `replicas` count is decreased, which means that any system that wants to use `pod-deletion-cost` must:
- Track which Pods are currently in the Deployment (possibly under multiple ReplicaSets)
- Clean up any `pod-deletion-cost` annotations that were not used.
- When scaling down, update the `pod-deletion-cost` of the Pods that will be deleted, and THEN update the `replicas` count (sketched below).
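For illustration, a minimal sketch of what that ordering forces a system to do today with plain kubectl (the Deployment and Pod names are placeholders):

```bash
# 1. mark the Pod we want removed BEFORE touching replicas
#    (any real system must first work out which Pods currently belong to the Deployment)
kubectl annotate pod pod-1 \
  --namespace my-namespace \
  --overwrite \
  controller.kubernetes.io/pod-deletion-cost=-100

# 2. only then lower the replica count
kubectl scale deployment my-deployment \
  --namespace my-namespace \
  --replicas=2

# 3. afterwards, remember to clean up any annotations that were set but never "used"
```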
This difficulty often prompts people to use the `pod-deletion-cost` annotation in ways that are NOT recommended, such as running a controller that updates the `pod-deletion-cost` annotation even when no scale-down is happening (which is a stated anti-pattern).
Problem 2: HorizontalPodAutoscaler can't use pod-deletion-cost
There is no sensible way to extend the HorizontalPodAutoscaler resource to make use of `pod-deletion-cost` when scaling Deployments, because introducing complicated Pod-specific logic to update `pod-deletion-cost` annotations is inevitably going to be brittle.
Proposal
Overview
The general idea is to make it easier to read/write the `controller.kubernetes.io/pod-deletion-cost` annotation for all Pods in a Deployment/ReplicaSet. To achieve this, we can extend the existing `Scale` v1 subresource so it can read/write the `controller.kubernetes.io/pod-deletion-cost` annotations of Pods in the Deployment/ReplicaSet.
Current State
We already have a special `Scale` v1 subresource, which can be used by autoscalers to do things like:
- GET: the current replicas of a Deployment
- GET: the current label selector of a Deployment
- PATCH: the current replicas of a Deployment
Example 1: `GET /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale`:
```yaml
> curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale | yq .
kind: Scale
apiVersion: autoscaling/v1
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 2
status:
  replicas: 2
  selector: my-label=my-label-value
```
NOTE: the HorizontalPodAutoscaler already uses this API to do its scaling in a resource-agnostic way
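The same object can also be fetched with kubectl's `--subresource` flag (available in recent kubectl versions), which is a convenient way to inspect it by hand; a minimal sketch with placeholder names:

```bash
# read the Scale subresource of a Deployment directly via kubectl
kubectl get deployment my-deployment \
  --namespace my-namespace \
  --subresource=scale \
  --output yaml
```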
Future State
We can extend the `Scale` v1 subresource with two new fields:
- `spec.podDeletionCosts`: used to PATCH the `controller.kubernetes.io/pod-deletion-cost` annotation on specific Pods
- `status.podDeletionCosts`: used to GET the current `pod-deletion-cost` of Pods in the Deployment
Example 1: `GET /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale`:
```yaml
> curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale | yq .
kind: Scale
apiVersion: autoscaling/v1
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 3
  ## NOTE: this is empty, because this is a GET, not a PATCH
  ##       (we could make this be non-empty, but it would be redundant)
  podDeletionCosts: {}
status:
  replicas: 3
  selector: my-label=my-label-value
  ## all pods with non-empty `controller.kubernetes.io/pod-deletion-cost` annotations
  ## NOTE: this might include Pods from multiple ReplicaSets, if the Deployment is rolling out
  ## NOTE: a value of 0 means the annotation is explicitly set to 0, not that it is missing
  ##       (pods with no annotation are NOT included in this list, for efficiency)
  podDeletionCosts:
    pod-1: 0
    pod-2: 100
    pod-3: 200
```
Example 2: `kubectl patch ... --subresource=scale`:
```bash
# this command does two things:
#   1. scale the Deployment to 2 replicas
#   2. set the `pod-deletion-cost` of `pod-1` to -100 (making it much more likely to be deleted)
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --type='merge' \
  --patch='{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}'
```
Benefits / Drawbacks
The main benefits of this approach are:
- Autoscaler systems can easily check what the current `pod-deletion-cost` values are, and then update them during scale-down as appropriate. No need to make hundreds of Pod GET requests.
- Autoscaler systems can use a single PATCH to change `spec.replicas` AND update the `pod-deletion-cost` of Pods.
- It does not require significant changes to how scaling is implemented. We can piggyback on the existing work of `pod-deletion-cost` annotations.
The main drawbacks are:
- This still only applies to Deployments/ReplicaSets (because `pod-deletion-cost` is a feature of ReplicaSets).
- It is slightly strange to automatically update the `controller.kubernetes.io/pod-deletion-cost` annotation:
  - However, we should remember that the Pods are "managed" by the Deployment, so it IS appropriate for the ReplicaSet controller to update the Pod's definition.
User Stories
User 1: Manual Scaling
As a user, I want to be able to scale down a Deployment and influence which Pods are deleted based on my knowledge of the current state of the system.
For example, say I am running a stateful application with 3 replicas:
- I know that `pod-1` is currently idle, but `pod-2` and `pod-3` are both busy.
- I want to scale down to 2 replicas, but I want to make sure that `pod-1` is deleted first, because it is idle.
To achieve this, I can do the following:
- Verify the Deployment is not currently rolling out (so there is only one active ReplicaSet)
- Use `kubectl get ... --subresource=scale` to see the current `pod-deletion-cost` of all Pods in the Deployment
- Use `kubectl patch ... --subresource=scale` to BOTH (sketched below):
  - set `replicas` to `2`
  - update the `pod-deletion-cost` of `pod-1` to a value that makes it more likely to be deleted
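A minimal sketch of those steps, assuming this proposal is implemented and using placeholder names:

```bash
# 1. verify the Deployment is not mid-rollout (so only one ReplicaSet is active)
kubectl rollout status deployment/my-deployment --namespace my-namespace

# 2. read the current pod-deletion-cost values through the Scale subresource
kubectl get deployment my-deployment \
  --namespace my-namespace \
  --subresource=scale \
  --output yaml

# 3. in one request: drop to 2 replicas and make pod-1 the cheapest Pod to delete
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource=scale \
  --type=merge \
  --patch '{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}'
```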
User 2: Custom Autoscalers
As a developer of a custom autoscaler, I want to use application-specific metrics to influence which Pods are deleted during scale-down to minimize the impact on my application and its users.
To achieve this, I can do the following:
- Keep track of what the Pods are doing, so I can make informed decisions about which Pods are best to delete.
- When the time comes to scale down:
  - Use the `Scale` subresource to read the `pod-deletion-cost` of all Pods in the Deployment
  - Use the `Scale` subresource to update the `replicas` AND the `pod-deletion-cost` of Pods as appropriate (see the sketch below)
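A rough sketch of that scale-down step, assuming this proposal is implemented (the Pod name, the cost, and the target replica count are purely illustrative):

```bash
# read the existing costs for every annotated Pod in one request,
# instead of GETting each Pod individually
kubectl get --raw \
  /apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale \
  | yq '.status.podDeletionCosts'

# ...apply application-specific logic to pick the Pods that are cheapest to lose...

# write the decision and the new replica count back in a single PATCH
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource=scale \
  --type=merge \
  --patch '{"spec": {"replicas": 4, "podDeletionCosts": {"worker-7": -1000}}}'
```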
User 3: HorizontalPodAutoscaler
At least initially, the HorizontalPodAutoscaler will not directly use this feature, because it is primarily concerned with scaling `replicas` based on a metric, and does not know about application-specific factors that might influence which Pods should be deleted.
However, this feature will make it easier for the HorizontalPodAutoscaler to be extended to have "pod-deletion-cost" awareness in the future.