<!-- /toc -->

## Summary

This proposal introduces a plugin that enables users to set priorities for various resources and define maximum resource consumption limits for workloads across different resources.

## Motivation

A Kubernetes cluster typically consists of heterogeneous machines with varying SKUs for CPU, memory, GPU, and pricing. To efficiently utilize the different resources available in the cluster, users can set priorities for machines of different types and configure resource allocations for different workloads. Additionally, they may choose to delete pods running on low-priority nodes instead of high-priority ones.

### Use Cases

1. As a user of cloud services, my cluster contains some static but expensive VM instances and some dynamic but cheaper Spot instances. I want my workload to be deployed on the static VM instances first, and the Pods that are scaled up during business peaks to be placed on Spot instances. At the end of the business peak, the Pods on Spot instances should be the first to be scaled down.

### Goals

1. Develop a filter plugin to restrict the resource consumption on each kind of resource for different workloads.
2. Develop a score plugin to favor nodes matched by a high-priority kind of resource.
3. Automatically set deletion costs on Pods through a controller to control the scale-down sequence of workloads.

### Non-Goals

1. The scheduler will not delete pods.

## Proposal

### API

```yaml
apiVersion: scheduling.sigs.x-k8s.io/v1alpha1
kind: ResourcePolicy
# ...
    - pod-template-hash
  matchPolicy:
    ignoreTerminatingPod: true
  podSelector:
    matchExpressions:
      - key: key1
# ...
        key1: value3
```

```go
// ResourcePolicy sets priorities and maximum pod counts for the different
// kinds of resources (units) in the cluster.
type ResourcePolicy struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec ResourcePolicySpec
}

type ResourcePolicySpec struct {
	// MatchLabelKeys groups the matched pods, like matchLabelKeys in PodTopologySpread.
	MatchLabelKeys []string
	// MatchPolicy controls which pods are ignored when counting pods on a unit.
	MatchPolicy MatchPolicy
	// Strategy is either "required" or "prefer".
	Strategy string
	// PodSelector selects the pods this policy applies to (same namespace only).
	PodSelector metav1.LabelSelector
	// Units lists the kinds of resources that pods can be scheduled on.
	Units []Unit
}

type MatchPolicy struct {
	// IgnoreTerminatingPod skips pods with a non-zero deletionTimestamp.
	IgnoreTerminatingPod bool
}

type Unit struct {
	// Priority of the unit; units with higher priority score higher. Defaults to 0.
	Priority *int32
	// MaxCount is the maximum number of matched pods on this unit; nil means unlimited.
	MaxCount *int32
	// NodeSelector selects the nodes that belong to this unit.
	NodeSelector metav1.LabelSelector
}
```

Pods are matched by a ResourcePolicy in the same namespace when they match `.spec.podSelector`. If `.spec.matchPolicy.ignoreTerminatingPod` is `true`, pods with a non-zero `.metadata.deletionTimestamp` are ignored. A ResourcePolicy never matches pods in other namespaces, and one pod cannot be matched by more than one ResourcePolicy.
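
For illustration, a minimal sketch of this matching rule is shown below; the `matchesPolicy` helper and the package name are assumptions for the example, not part of the proposal.

```go
package resourcepolicy // hypothetical package name, for illustration only

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// matchesPolicy reports whether pod is matched by rp according to the rules
// above: same namespace, podSelector matches, and terminating pods are
// skipped when ignoreTerminatingPod is set.
func matchesPolicy(rp *ResourcePolicy, pod *corev1.Pod) (bool, error) {
	if rp.Namespace != pod.Namespace {
		return false, nil
	}
	if rp.Spec.MatchPolicy.IgnoreTerminatingPod && pod.DeletionTimestamp != nil {
		return false, nil
	}
	selector, err := metav1.LabelSelectorAsSelector(&rp.Spec.PodSelector)
	if err != nil {
		return false, err
	}
	return selector.Matches(labels.Set(pod.Labels)), nil
}
```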

Pods can only be scheduled on the units defined in `.spec.units`; this behavior can be changed by `.spec.strategy`. Each item in `.spec.units` contains the set of nodes that match its `NodeSelector` and describes one kind of resource in the cluster.

`.spec.units[].priority` defines the priority of each unit. Units with a higher priority get a higher score in the score plugin. If all units have the same priority, the ResourcePolicy only limits the maximum number of pods on these units. If `.spec.units[].priority` is not set, it defaults to 0.

`.spec.units[].maxCount` defines the maximum number of pods that can be scheduled on each unit. If `.spec.units[].maxCount` is not set, pods can always be scheduled on the unit unless it runs out of resources.
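
As an example (reusing the Go types above; the `node-type` node label and the concrete numbers are assumptions, not part of the proposal), the use case from the Motivation could be expressed with two units: a preferred static unit without a cap and a Spot unit capped at 10 pods.

```go
// exampleUnits returns two illustrative units: the "static" unit is preferred
// (higher priority, no maxCount), while the cheaper "spot" unit is capped at
// 10 matched pods.
func exampleUnits() []Unit {
	staticPriority, spotPriority, spotMax := int32(5), int32(1), int32(10)
	return []Unit{
		{
			Priority:     &staticPriority,
			NodeSelector: metav1.LabelSelector{MatchLabels: map[string]string{"node-type": "static"}},
		},
		{
			Priority:     &spotPriority,
			MaxCount:     &spotMax,
			NodeSelector: metav1.LabelSelector{MatchLabels: map[string]string{"node-type": "spot"}},
		},
	}
}
```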

`.spec.strategy` indicates how nodes that do not match any unit are treated. If the strategy is `required`, the pod can only be scheduled on nodes that match a unit in the resource policy. If the strategy is `prefer`, the pod can be scheduled on all nodes, but nodes that do not match any unit are only considered after the nodes that do. In other words, if the strategy is `required`, we return `unschedulable` for nodes that do not match any unit.

`.spec.matchLabelKeys` indicates how we group the pods matched by `podSelector` and `matchPolicy`; its behavior is like `matchLabelKeys` in `PodTopologySpread`.
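
A sketch of one way this grouping key could be computed is shown below; the `groupKey` helper is an assumption for illustration and reuses the package and imports from the matching sketch above.

```go
// groupKey groups pods matched by the same ResourcePolicy: pods whose values
// for every key in matchLabelKeys are equal are counted as one group,
// mirroring matchLabelKeys in PodTopologySpread.
func groupKey(rp *ResourcePolicy, pod *corev1.Pod) string {
	key := ""
	for _, k := range rp.Spec.MatchLabelKeys {
		key += k + "=" + pod.Labels[k] + ","
	}
	return key
}
```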

### Implementation Details

#### PreFilter
PreFilter checks whether the pod matches exactly one resource policy; if not, PreFilter rejects the pod. If it does, PreFilter counts the pods on each unit to determine which units are available for the pod and writes this information into cycleState.
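
A rough sketch of this cycleState bookkeeping is shown below, assuming the plugin imports the scheduler framework as `framework "k8s.io/kubernetes/pkg/scheduler/framework"`; the state key, the `preFilterState` struct, and the precomputed `podsPerUnit` counts are assumptions for illustration.

```go
// stateKey is an assumed key under which the plugin stores its per-cycle data.
const stateKey framework.StateKey = "ResourcePolicy"

// preFilterState records, for the single matched policy, which units still
// have room for the pod; Filter and Score read it back from the CycleState.
type preFilterState struct {
	policy         *ResourcePolicy
	availableUnits map[int]bool // unit index -> below maxCount (or no maxCount set)
}

func (s *preFilterState) Clone() framework.StateData { return s }

// writePreFilterState is called at the end of PreFilter, after the pod has
// been confirmed to match exactly one policy and the pods per unit counted.
func writePreFilterState(state *framework.CycleState, rp *ResourcePolicy, podsPerUnit []int) {
	available := make(map[int]bool, len(rp.Spec.Units))
	for i, unit := range rp.Spec.Units {
		available[i] = unit.MaxCount == nil || podsPerUnit[i] < int(*unit.MaxCount)
	}
	state.Write(stateKey, &preFilterState{policy: rp, availableUnits: available})
}
```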

#### Filter
Filter checks whether the node belongs to an available unit. If the node doesn't belong to any unit, we return success when `.spec.strategy` is `prefer`; otherwise we return unschedulable.

Besides, Filter checks whether the pods already scheduled on the unit violate the quantity constraint. If the number of pods has reached `.spec.units[].maxCount`, all nodes in the unit are marked unschedulable.
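
The sketch below illustrates this per-node check; `unitForNode` is a helper written here only for the example and continues the sketches above.

```go
// unitForNode returns the index of the first unit whose NodeSelector matches
// the node's labels, if any.
func unitForNode(rp *ResourcePolicy, node *corev1.Node) (int, bool) {
	for i := range rp.Spec.Units {
		selector, err := metav1.LabelSelectorAsSelector(&rp.Spec.Units[i].NodeSelector)
		if err != nil {
			continue
		}
		if selector.Matches(labels.Set(node.Labels)) {
			return i, true
		}
	}
	return -1, false
}

// filterNode sketches the Filter logic for one node: nodes outside every unit
// pass only with the "prefer" strategy, and nodes in a unit that has reached
// maxCount are rejected.
func filterNode(s *preFilterState, node *corev1.Node) *framework.Status {
	unitIdx, matched := unitForNode(s.policy, node)
	if !matched {
		if s.policy.Spec.Strategy == "prefer" {
			return nil // allowed, but scored below nodes that match a unit
		}
		return framework.NewStatus(framework.Unschedulable, "node does not match any unit of the ResourcePolicy")
	}
	if !s.availableUnits[unitIdx] {
		return framework.NewStatus(framework.Unschedulable, "unit has reached maxCount")
	}
	return nil
}
```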

#### Score
If `.spec.units[].priority` is set in the resource policy, we schedule pods based on `.spec.units[].priority`. The default priority is 0, which is also the minimum.

Score calculation details:

1. calculate the priority score, `scorePriority = priority * 20`, to make sure nodes without a priority still get a minimum score.
2. normalize the scores, as sketched below.
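
The following sketch shows one way both steps could look, continuing the sketches above; the proportional normalization onto the framework's 0-100 range is an assumption, since the proposal only states that scores are normalized.

```go
// scoreNode sketches step 1: a node gets 20 points per priority level of its
// unit; nodes outside any unit, or in a unit without a priority, get 0.
func scoreNode(s *preFilterState, node *corev1.Node) int64 {
	unitIdx, matched := unitForNode(s.policy, node)
	if !matched || s.policy.Spec.Units[unitIdx].Priority == nil {
		return 0 // default and minimum priority is 0
	}
	return int64(*s.policy.Spec.Units[unitIdx].Priority) * 20
}

// normalizeScores sketches step 2, scaling the raw scores onto the
// framework's 0-100 range.
func normalizeScores(scores framework.NodeScoreList) {
	var maxScore int64
	for _, s := range scores {
		if s.Score > maxScore {
			maxScore = s.Score
		}
	}
	if maxScore == 0 {
		return // every node scored 0; nothing to scale
	}
	for i := range scores {
		scores[i].Score = scores[i].Score * framework.MaxNodeScore / maxScore
	}
}
```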

#### Resource Policy Controller