Background
The idea of letting users customize the way Deployments (ReplicaSets) remove Pods when replicas
are decreased has been floating around since at least 2017, with other issues dating back to 2015.
Since Kubernetes 1.22, the `controller.kubernetes.io/pod-deletion-cost` annotation proposed in KEP-2255 is available in BETA.
There have been several other proposals, but this one should supersede them:
- PROPOSAL configurable down-scaling behaviour in ReplicaSets & Deployments #107598
- Deployment/ReplicaSet Downscale Pod Picker enhancements#3189
Problem
Problem 1: It's too hard to get/update pod-deletion-cost
It is currently too hard to get/update the `controller.kubernetes.io/pod-deletion-cost` annotation for all Pods in a Deployment/ReplicaSet. This makes it difficult to use `pod-deletion-cost` in practice.
The main issue is that the `pod-deletion-cost` annotation must be updated BEFORE the `replicas` count is decreased, which means that any system that wants to use `pod-deletion-cost` must:
- Track which Pods are currently in the Deployment (possibly under multiple ReplicaSets)
- Clean up any `pod-deletion-cost` annotations that were not used.
- When scaling down, update the `pod-deletion-cost` of the Pods that will be deleted, and THEN update the `replicas` count (sketched below).
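For illustration, a minimal sketch of what that ordering forces a system to do today with plain kubectl (the Deployment and Pod names are placeholders):

```bash
# 1. mark the Pod we want removed BEFORE touching replicas
#    (any real system must first work out which Pods currently belong to the Deployment)
kubectl annotate pod pod-1 \
  --namespace my-namespace \
  --overwrite \
  controller.kubernetes.io/pod-deletion-cost=-100

# 2. only then lower the replica count
kubectl scale deployment my-deployment \
  --namespace my-namespace \
  --replicas=2

# 3. afterwards, remember to clean up any annotations that were set but never "used"
```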
This difficulty often prompts people to use the `pod-deletion-cost` annotation in ways that are NOT recommended, such as running a controller that updates the `pod-deletion-cost` annotation even when no scale-down is happening (which is a stated anti-pattern).
Problem 2: HorizontalPodAutoscaler can't use pod-deletion-cost
There is no sensible way to extend the HorizontalPodAutoscaler resource to make use of `pod-deletion-cost` when scaling Deployments, because introducing complicated Pod-specific logic to update `pod-deletion-cost` annotations is inevitably going to be brittle.
Proposal
Overview
The general idea is to make it easier to read/write the `controller.kubernetes.io/pod-deletion-cost` annotation for all Pods in a Deployment/ReplicaSet. To achieve this, we can extend the existing `Scale` v1 subresource so it can read/write the `controller.kubernetes.io/pod-deletion-cost` annotations of Pods in the Deployment/ReplicaSet.
Current State
We already have a special `Scale` v1 subresource, which can be used by autoscalers to do things like:
- GET: the current replicas of a Deployment
- GET: the current label selector of a Deployment
- PATCH: the current replicas of a Deployment
Example 1: `GET /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale`:
```yaml
> curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale | yq .
kind: Scale
apiVersion: autoscaling/v1
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 2
status:
  replicas: 2
  selector: my-label=my-label-value
```
NOTE: the HorizontalPodAutoscaler already uses this API to do its scaling in a resource-agnostic way
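The same object can also be fetched with kubectl's `--subresource` flag (available in recent kubectl versions), which is a convenient way to inspect it by hand; a minimal sketch with placeholder names:

```bash
# read the Scale subresource of a Deployment directly via kubectl
kubectl get deployment my-deployment \
  --namespace my-namespace \
  --subresource=scale \
  --output yaml
```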
Future State
We can extend the `Scale` v1 subresource with two new fields:
- `spec.podDeletionCosts`: used to PATCH the `controller.kubernetes.io/pod-deletion-cost` annotation on specific Pods
- `status.podDeletionCosts`: used to GET the current `pod-deletion-cost` of Pods in the Deployment
Example 1: `GET /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale`:
```yaml
> curl -s localhost:8001/apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale | yq .
kind: Scale
apiVersion: autoscaling/v1
metadata:
  name: my-deployment
  namespace: my-namespace
spec:
  replicas: 3
  ## NOTE: this is empty, because this is a GET, not a PATCH
  ##       (we could make this be non-empty, but it would be redundant)
  podDeletionCosts: {}
status:
  replicas: 3
  selector: my-label=my-label-value
  ## all pods with non-empty `controller.kubernetes.io/pod-deletion-cost` annotations
  ## NOTE: this might include Pods from multiple ReplicaSets, if the Deployment is rolling out
  ## NOTE: a value of 0 means the annotation is explicitly set to 0, not that it is missing
  ##       (pods with no annotation are NOT included in this list, for efficiency)
  podDeletionCosts:
    pod-1: 0
    pod-2: 100
    pod-3: 200
```
Example 2: `kubectl patch ... --subresource=scale`:
```bash
# this command does two things:
#   1. scale the Deployment to 2 replicas
#   2. set the `pod-deletion-cost` of `pod-1` to -100 (making it much more likely to be deleted)
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource='scale' \
  --type='merge' \
  --patch='{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}'
```
Benefits / Drawbacks
The main benefits of this approach are:
- Autoscaler systems can easily check what the current `pod-deletion-cost` values are, and then update them during scale-down as appropriate. No need to make hundreds of Pod GET requests.
- Autoscaler systems can use a single PATCH to change `spec.replicas` AND update the `pod-deletion-cost` of Pods.
- It does not require significant changes to how scaling is implemented. We can piggyback on the existing work of `pod-deletion-cost` annotations.
The main drawbacks are:
- This still only applies to Deployments/ReplicaSets (because `pod-deletion-cost` is a feature of ReplicaSets).
- It is slightly strange to automatically update the `controller.kubernetes.io/pod-deletion-cost` annotation:
  - However, we should remember that the Pods are "managed" by the Deployment, so it IS appropriate for the ReplicaSet controller to update the Pod's definition.
User Stories
User 1: Manual Scaling
As a user, I want to be able to scale down a Deployment and influence which Pods are deleted based on my knowledge of the current state of the system.
For example, say I am running a stateful application with 3 replicas:
- I know that `pod-1` is currently idle, but `pod-2` and `pod-3` are both busy.
- I want to scale down to 2 replicas, but I want to make sure that `pod-1` is deleted first, because it is idle.
To achieve this, I can do the following:
- Verify the Deployment is not currently rolling out (so there is only one active ReplicaSet)
- Use `kubectl get ... --subresource=scale` to see the current `pod-deletion-cost` of all Pods in the Deployment
- Use `kubectl patch ... --subresource=scale` to BOTH (sketched below):
  - set `replicas` to `2`
  - update the `pod-deletion-cost` of `pod-1` to a value that makes it more likely to be deleted
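A minimal sketch of those steps, assuming this proposal is implemented and using placeholder names:

```bash
# 1. verify the Deployment is not mid-rollout (so only one ReplicaSet is active)
kubectl rollout status deployment/my-deployment --namespace my-namespace

# 2. read the current pod-deletion-cost values through the Scale subresource
kubectl get deployment my-deployment \
  --namespace my-namespace \
  --subresource=scale \
  --output yaml

# 3. in one request: drop to 2 replicas and make pod-1 the cheapest Pod to delete
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource=scale \
  --type=merge \
  --patch '{"spec": {"replicas": 2, "podDeletionCosts": {"pod-1": -100}}}'
```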
User 2: Custom Autoscalers
As a developer of a custom autoscaler, I want to use application-specific metrics to influence which Pods are deleted during scale-down to minimize the impact on my application and its users.
To achieve this, I can do the following:
- Keep track of what the Pods are doing, so I can make informed decisions about which Pods are best to delete.
- When the time comes to scale down:
  - Use the `Scale` subresource to read the `pod-deletion-cost` of all Pods in the Deployment
  - Use the `Scale` subresource to update the `replicas` AND the `pod-deletion-cost` of Pods as appropriate (see the sketch below)
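A rough sketch of that scale-down step, assuming this proposal is implemented (the Pod name, the cost, and the target replica count are purely illustrative):

```bash
# read the existing costs for every annotated Pod in one request,
# instead of GETting each Pod individually
kubectl get --raw \
  /apis/apps/v1/namespaces/my-namespace/deployments/my-deployment/scale \
  | yq '.status.podDeletionCosts'

# ...apply application-specific logic to pick the Pods that are cheapest to lose...

# write the decision and the new replica count back in a single PATCH
kubectl patch deployment my-deployment \
  --namespace my-namespace \
  --subresource=scale \
  --type=merge \
  --patch '{"spec": {"replicas": 4, "podDeletionCosts": {"worker-7": -1000}}}'
```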
User 3: HorizontalPodAutoscaler
At least initially, the HorizontalPodAutoscaler will not directly use this feature, because it is primarily concerned with scaling `replicas` based on a metric, and does not know about application-specific factors that might influence which Pods should be deleted.
However, this feature will make it easier for the HorizontalPodAutoscaler to be extended to have "pod-deletion-cost" awareness in the future.