
Commit 431c2e5

Prepare release v0.14 (#7091)
1 parent 1667bcb commit 431c2e5

13 files changed, +232 -29 lines

CHANGELOG/CHANGELOG-0.14.md

Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
## v0.14.0

Changes since `v0.13.0`:

## Urgent Upgrade Notes

### (No, really, you MUST read this before you upgrade)

- ProvisioningRequest: Remove setting the deprecated ProvisioningRequest annotations on Kueue-managed Pods:
  - cluster-autoscaler.kubernetes.io/consume-provisioning-request
  - cluster-autoscaler.kubernetes.io/provisioning-class-name

  If you are implementing a ProvisioningRequest reconciler used by Kueue, make sure the new annotations are supported (see the sketch after this list):
  - autoscaling.x-k8s.io/consume-provisioning-request
  - autoscaling.x-k8s.io/provisioning-class-name (#6381, @kannon92)
- Rename the kueue-metrics-certs cert-manager.io/v1 Certificate to kueue-metrics-cert in the cert-manager manifests when installing Kueue using the Kustomize configuration.

  If you're using cert-manager and have deployed Kueue using the Kustomize configuration, you must delete the existing kueue-metrics-certs cert-manager.io/v1 Certificate before applying the new changes to avoid conflicts (see the sketch after this list). (#6345, @mbobrovskyi)
- Replace "DeactivatedXYZ" "reason" label values with "Deactivated" and introduce "underlying_cause" label to the following metrics:
21+
- "pods_ready_to_evicted_time_seconds"
22+
- "evicted_workloads_total"
23+
- "local_queue_evicted_workloads_total"
24+
- "evicted_workloads_once_total"
25+
26+
If you rely on the "DeactivatedXYZ" "reason" label values, you can migrate to the "Deactivated" "reason" label value and the following "underlying_cause" label values:
27+
- ""
28+
- "WaitForStart"
29+
- "WaitForRecovery"
30+
- "AdmissionCheck"
31+
- "MaximumExecutionTimeExceeded"
32+
- "RequeuingLimitExceeded" (#6590, @mykysha)
33+
- TAS: Enforce a stricter value of the `kueue.x-k8s.io/podset-group-name` annotation in the creation webhook.

  Make sure the values of the `kueue.x-k8s.io/podset-group-name` annotation are not numbers. (#6708, @kshalot)

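Below is a minimal, hypothetical pre-upgrade check for the ProvisioningRequest and cert-manager notes above. The `kueue-system` namespace and the use of `jq` are assumptions; adjust to your installation.

```shell
# List Pods that still carry the deprecated ProvisioningRequest annotation
# (Kueue v0.14 sets the autoscaling.x-k8s.io/* annotations instead). Requires jq.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["cluster-autoscaler.kubernetes.io/consume-provisioning-request"] != null)
      | "\(.metadata.namespace)/\(.metadata.name)"'

# If you deployed with Kustomize and cert-manager, delete the old Certificate
# before applying the v0.14 manifests (the namespace is an assumption).
kubectl delete certificates.cert-manager.io kueue-metrics-certs -n kueue-system
```

For the metrics relabeling, a hypothetical query against the new labels (the Prometheus address is a placeholder; the metric name is taken from the list above):

```shell
# Old: kueue_evicted_workloads_total{reason="DeactivatedDueToAdmissionCheck"}
# New: reason="Deactivated" combined with the matching underlying_cause value.
promtool query instant http://prometheus.example:9090 \
  'sum(kueue_evicted_workloads_total{reason="Deactivated", underlying_cause="AdmissionCheck"})'
```
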
## Upgrading steps

### 1. Back Up Topology Resources (skip if you are not using the Topology API):

    kubectl get topologies.kueue.x-k8s.io -o yaml > topologies.yaml

### 2. Update apiVersion in the Backup File (skip if you are not using the Topology API):

Replace `v1alpha1` with `v1beta1` in topologies.yaml for all resources:

    sed -i -e 's/v1alpha1/v1beta1/g' topologies.yaml

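An optional sanity check after the substitution (a minimal sketch):

```shell
# All apiVersion lines in the backup should now read v1beta1.
grep -n 'apiVersion:' topologies.yaml
```
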
### 3. Delete the Old Topology CRD:

    kubectl delete crd topologies.kueue.x-k8s.io

### 4. Remove Finalizers from Topologies (skip if you are not using the Topology API):

    kubectl get topology.kueue.x-k8s.io -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | while read -r name; do
      kubectl patch topology.kueue.x-k8s.io "$name" -p '{"metadata":{"finalizers":[]}}' --type='merge'
    done

### 5. Install Kueue v0.14.0:

Follow the instructions [here](https://kueue.sigs.k8s.io/docs/installation/#install-a-released-version) to install.

### 6. Restore Topology Resources (skip if you are not using the Topology API):

    kubectl apply -f topologies.yaml

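Optionally, confirm the upgrade took effect before moving on (a minimal sketch; skip if you are not using the Topology API):

```shell
# The Topology CRD should now be served as v1beta1 and the resources restored.
kubectl api-resources --api-group=kueue.x-k8s.io | grep -i topology
kubectl get topologies.kueue.x-k8s.io
```
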
## Changes by Kind

### Deprecation

- Stop serving the QueueVisibility feature, but keep the APIs (`.status.pendingWorkloadsStatus`) to avoid breaking changes.

  If you rely on the QueueVisibility feature (`.status.pendingWorkloadsStatus` in the ClusterQueue), you must migrate to VisibilityOnDemand
  (https://kueue.sigs.k8s.io/docs/tasks/manage/monitor_pending_workloads/pending_workloads_on_demand); see the query sketch below. (#6631, @vladikkuzn)

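A hypothetical on-demand query to replace reads of `.status.pendingWorkloadsStatus`; the API version and the `cluster-queue` name here are assumptions, so check the linked documentation for the exact endpoint:

```shell
# Fetch pending workloads for a ClusterQueue via the visibility endpoint.
kubectl get --raw "/apis/visibility.kueue.x-k8s.io/v1beta1/clusterqueues/cluster-queue/pendingworkloads"
```
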
### API Change

- TAS: Graduated TopologyAwareScheduling to Beta. (#6830, @mbobrovskyi)
- TAS: Support multiple nodes for failure handling via ".status.unhealthyNodes" in Workload. The "alpha.kueue.x-k8s.io/node-to-replace" annotation is no longer used. (#6648, @pajakd)

### Feature
80+
81+
- Add an alpha integration for Kubeflow Trainer to Kueue. (#6597, @kaisoz)
82+
- Add an exponential backoff for the TAS scheduler second pass. (#6753, @mykysha)
83+
- Added priority_class label for kueue_local_queue_admitted_workloads_total metric. (#6845, @vladikkuzn)
84+
- Added priority_class label for kueue_local_queue_evicted_workloads_total metric (#6898, @vladikkuzn)
85+
- Added priority_class label for kueue_local_queue_quota_reserved_workloads_total metric. (#6897, @vladikkuzn)
86+
- Added priority_class label for the following metrics:
87+
- kueue_admitted_workloads_total
88+
- kueue_evicted_workloads_total
89+
- kueue_evicted_workloads_once_total
90+
- kueue_quota_reserved_workloads_total
91+
- kueue_admission_wait_time_seconds
92+
- kueue_quota_reserved_wait_time_seconds
93+
- kueue_admission_checks_wait_time_seconds (#6951, @mbobrovskyi)
94+
- Added priority_class to kueue_local_queue_admission_checks_wait_time_seconds (#6902, @vladikkuzn)
95+
- Added priority_class to kueue_local_queue_admission_wait_time_seconds (#6899, @vladikkuzn)
96+
- Added priority_class to kueue_local_queue_quota_reserved_wait_time_seconds (#6900, @vladikkuzn)
97+
- Added workload_priority_class label for optional metrics (if waitForPodsReady is enabled):
98+
99+
- kueue_ready_wait_time_seconds (Histogram)
100+
- kueue_admitted_until_ready_wait_time_seconds (Histogram)
101+
- kueue_local_queue_ready_wait_time_seconds (Histogram)
102+
- kueue_local_queue_admitted_until_ready_wait_time_seconds (Histogram) (#6944, @IrvingMg)
103+
- DRA: Alpha support for Dynamic Resource Allocation in Kueue. (#5873, @alaypatel07)
104+
- ElasticJobs: Support in-tree RayAutoscaler for RayCluster (#6662, @VassilisVassiliadis)
105+
- KueueViz: Enhancing the following endpoint customizations and optimizations:
106+
- The frontend and backend ingress no longer have hardcoded NGINX annotations. You can now set your own annotations in Helm’s values.yaml using kueueViz.backend.ingress.annotations and kueueViz.frontend.ingress.annotations
107+
- The Ingress resources for KueueViz frontend and backend no longer require hardcoded TLS. You can now choose to use HTTP only by not providing kueueViz.backend.ingress.tlsSecretName and kueueViz.frontend.ingress.tlsSecretName
108+
- You can set environment variables like KUEUEVIZ_ALLOWED_ORIGINS directly from values.yaml using kueueViz.backend.env (#6682, @Smuger)
109+
- MultiKueue: Support external frameworks.
110+
Introduced a generic MultiKueue adapter to support external, custom Job-like workloads. This allows users to integrate custom Job-like CRDs (e.g., Tekton PipelineRuns) with MultiKueue for resource management across multiple clusters. This feature is guarded by the `MultiKueueGenericJobAdapter` feature gate. (#6760, @khrm)
111+
- Multikueue × ElasticJobs: The elastic `batchv1/Job` supports MultiKueue. (#6445, @ichekrygin)
112+
- ProvisioningRequest: Graduate ProvisioningACC feature to GA (#6382, @kannon92)
113+
- TAS: Graduated to Beta the following feature gates responsible for enabling and default configuration of the Node Hot Swap mechanism:
114+
TASFailedNodeReplacement, TASFailedNodeReplacementFailFast, TASReplaceNodeOnPodTermination. (#6890, @mbobrovskyi)
115+
- TAS: Implicit mode schedules consecutive indexes as close as possible (rank-ordering). (#6615, @PBundyra)
116+
- TAS: introduce validation against using PodSet grouping and PodSet slicing for the same PodSet,
117+
which is currently not supported. More precisely the `kueue.x-k8s.io/podset-group-name` annotation
118+
cannot be set along with any of: `kueue.x-k8s.io/podset-slice-size`, `kueue.x-k8s.io/podset-slice-required-topology`. (#7051, @kshalot)
119+
- The following limits for ClusterQueue quota specification have been relaxed:
120+
- the number of Flavors per ResourceGroup is increased from 16 to 64
121+
- the number of Resources per Flavor, within a ResourceGroup, is increased from 16 to 64
122+
123+
We also provide the following additional limits:
124+
- the total number of Flavors across all ResourceGroups is <= 256
125+
- the total number of Resources across all ResourceGroups is <= 256
126+
- the total number of (Flavor, Resource) pairs within a ResourceGroup is <= 512 (#6906, @LarsSven)
127+
- Visibility API: Adds support for Securing APIService. (#6798, @MaysaMacedo)
128+
- WorkloadRequestUseMergePatch: allows switching the Status Patch type from Apply to Merge for admission-related patches. (#6765, @mszadkow)
129+
130+
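For the feature-gated items above (for example `MultiKueueGenericJobAdapter`), a hypothetical way to flip a gate on an existing installation; the deployment and namespace names assume the default manifests:

```shell
# Append a feature-gates flag to the manager container arguments.
kubectl -n kueue-system patch deployment kueue-controller-manager --type='json' \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--feature-gates=MultiKueueGenericJobAdapter=true"}]'
```

The same approach applies to the TAS Node Hot Swap gates listed above if you need to change their defaults.
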
### Bug or Regression

- AFS: Fixed a kueue-controller-manager crash that occurred when the AdmissionFairSharing feature gate was enabled without an AdmissionFairSharing config. (#6670, @mbobrovskyi)
- ElasticJobs: Fix the bug in the ElasticJobsViaWorkloadSlices feature where, in case of a Job resize followed by eviction
  of the "old" workload, the newly created workload could get admitted along with the "old" workload.
  The two workloads would overcommit the quota. (#6221, @ichekrygin)
- ElasticJobs: Fix the bug that scheduling of Pending workloads was not triggered on scale-down of a running
  elastic Job, which could result in admitting one or more of the queued workloads. (#6395, @ichekrygin)
- ElasticJobs: Workloads correctly trigger workload preemption in response to a scale-up event. (#6973, @ichekrygin)
- FS: Fix the algorithm bug for identifying preemption candidates, as it could return a different
  set of preemption target workloads (pseudo-random) in consecutive attempts in tie-break scenarios,
  resulting in excessive preemptions. (#6764, @PBundyra)
- FS: Fix the following FairSharing bugs:
  - Incorrect DominantResourceShare caused by rounding (large quotas or high FairSharing weight)
  - Preemption loop caused by zero FairSharing weight (#6925, @gabesaba)
- FS: Fix a bug where a preemptor ClusterQueue was unable to reclaim its nominal quota when the preemptee ClusterQueue can borrow a large number of resources from the parent ClusterQueue / Cohort. (#6617, @pajakd)
- FS: Validate FairSharing.Weight against small values which lose precision (0 < value <= 10^-9). (#6986, @gabesaba)
- Fix accounting for the `evicted_workloads_once_total` metric:
  - the metric wasn't incremented for workloads evicted due to a stopped LocalQueue (LocalQueueStopped reason)
  - the reason used for the metric was "Deactivated" for workloads deactivated by users and by Kueue; now the reason label can have the following values: Deactivated, DeactivatedDueToAdmissionCheck, DeactivatedDueToMaximumExecutionTimeExceeded, DeactivatedDueToRequeuingLimitExceeded. This approach aligns the metric with `evicted_workloads_total`.
  - the metric was incremented during preemption before the preemption request was issued. Thus, it could be incorrectly over-counted in case of preemption request failure.
  - the metric was not incremented for workloads evicted due to NodeFailures (TAS)

  The existing and introduced DeactivatedDueToXYZ reason label values will be replaced by the single "Deactivated" reason label value and an underlying_cause label in a future release. (#6332, @mimowo)
- Fix a bug in the workload usage removal simulation that resulted in inaccurate flavor assignment. (#7077, @gabesaba)
- Fix support for the PodGroup integration used by external controllers, which determine
  the target LocalQueue and the group size only later. In that case the hash would not be
  computed, resulting in downstream issues for ProvisioningRequest.

  Now such an external controller can indicate control over the PodGroup by adding
  the `kueue.x-k8s.io/pod-suspending-parent` annotation, and later patch the Pods by setting
  other metadata, like the kueue.x-k8s.io/queue-name label, to initiate scheduling of the PodGroup (see the sketch after this list). (#6286, @pawloch00)
- Fix the bug in the StatefulSet integration which would occasionally cause a StatefulSet
  to be stuck without a workload after renaming the "queue-name" label. (#7028, @IrvingMg)
- Fix the bug that a workload going repeatedly through the preemption and re-admission cycle would accumulate the
  "Previously" prefix in the condition message, e.g.: "Previously: Previously: Previously: Preempted to accommodate a workload ...". (#6819, @amy)
- Fix the bug which could occasionally cause workloads evicted by the built-in AdmissionChecks
  (ProvisioningRequest and MultiKueue) to get stuck in the evicted state, which didn't allow re-scheduling.
  This could happen when the AdmissionCheck controller would trigger eviction by setting the
  admission check state to "Retry". (#6283, @mimowo)
- Fix the validation messages shown when attempting to remove the queue-name label from a Deployment or StatefulSet. (#6715, @Panlq)
- Fixed a bug that prevented adding the kueue- prefix to the secretName field in the cert-manager manifests when installing Kueue using the Kustomize configuration. (#6318, @mbobrovskyi)
- HC: When multiple borrowing flavors are available, prefer the flavor which
  results in borrowing more locally (closer to the ClusterQueue, further from the root Cohort).

  This fixes the scenario where a flavor would be selected which required borrowing
  from the root Cohort, while in a second flavor, quota was
  available from the nearest parent Cohort. (#7024, @gabesaba)
- Helm: Fix a bug where the internal cert manager assumed that the Helm installation name is 'kueue'. (#6869, @cmtly)
- Helm: Fixed a bug preventing Kueue from starting after installing via Helm with a release name other than "kueue". (#6799, @mbobrovskyi)
- Helm: Fixed a bug where webhook configurations assumed the Helm install name is "kueue". (#6918, @cmtly)
- KueueViz: Fix the CORS configuration for development environments. (#6603, @yankay)
- KueueViz: Fix a bug where only localhost is an executable domain. (#7011, @kincl)
- Pod integration now correctly handles Pods stuck in the Terminating state within pod groups, preventing them from being counted as active and avoiding blocked quota release. (#6872, @ichekrygin)
- ProvisioningRequest: Fix a bug where Kueue didn't recreate the next ProvisioningRequest instance after the
  second (and consecutive) failed attempt. (#6322, @PBundyra)
- Support disabling client-side rate limiting in the Config API clientConnection.qps with a negative value (e.g., -1). (#6300, @tenzen-y)
- TAS: Fix a bug where the node failure controller tried to re-schedule Pods onto the failed node even after the Node recovered and reappeared. (#6325, @pajakd)
- TAS: Fix a bug where new Workloads starve, caused by inadmissible workloads frequently requeueing due to unrelated Node LastHeartbeatTime update events. (#6570, @utam0k)
- TAS: Fix the scenario when Node Hot Swap cannot find a replacement. In particular, if slices are used,
  an invalid assignment could be generated, resulting in a panic from the TopologyUngater.
  Now, such a workload is evicted. (#6914, @PBundyra)
- TAS: Node Hot Swap allows replacing a node for workloads using PodSet slices,
  i.e. when the `kueue.x-k8s.io/podset-slice-size` annotation is used. (#6942, @pajakd)
- TAS: Fix the bug that Kueue crashes when a PodSet has size 0, e.g. no workers in a LeaderWorkerSet instance. (#6501, @mimowo)

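For the PodGroup fix above, a rough sketch of the two-phase flow for an external controller; the Pod, controller, and queue names are placeholders, and the exact ordering plus any additional pod-group metadata depend on your controller:

```shell
# Phase 1: mark the Pods as managed by an external parent, so Kueue defers scheduling.
kubectl annotate pod my-group-pod-0 kueue.x-k8s.io/pod-suspending-parent=my-external-controller

# Phase 2: later, hand the group to Kueue by setting the queue name.
kubectl label pod my-group-pod-0 kueue.x-k8s.io/queue-name=user-queue
```
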
### Other (Cleanup or Flake)

- Promote the ConfigurableResourceTransformations feature gate to stable. (#6599, @mbobrovskyi)
- Support for Kubernetes 1.34. (#6689, @mbobrovskyi)
- TAS: Stop setting the "kueue.x-k8s.io/tas" label on Pods.

  If the implicit TAS mode is used, the `kueue.x-k8s.io/podset-unconstrained-topology=true` annotation
  is set on Pods. (#6895, @mimowo)

Makefile

Lines changed: 2 additions & 2 deletions
@@ -87,8 +87,8 @@ LD_FLAGS += -X '$(version_pkg).BuildDate=$(shell date -u +%Y-%m-%dT%H:%M:%SZ)'
 # Update these variables when preparing a new release or a release branch.
 # Then run `make prepare-release-branch`
-RELEASE_VERSION=v0.13.4
-RELEASE_BRANCH=main
+RELEASE_VERSION=v0.14.0
+RELEASE_BRANCH=release-0.14
 # Application version for Helm and npm (strips leading 'v' from RELEASE_VERSION)
 APP_VERSION := $(shell echo $(RELEASE_VERSION) | cut -c2-)

README.md

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ Read the [overview](https://kueue.sigs.k8s.io/docs/overview/) and watch the Kueu
 To install the latest release of Kueue in your cluster, run the following command:
 
 ```shell
-kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.13.4/manifests.yaml
+kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.14.0/manifests.yaml
 ```
 
 The controller runs in the `kueue-system` namespace.

charts/kueue/Chart.yaml

Lines changed: 2 additions & 2 deletions
@@ -16,9 +16,9 @@ type: application
 # NOTE: Do not modify manually. In Kueue, the version and appVersion are
 # overridden to GIT_TAG when building the artifacts, including the helm charts,
 # via Makefile.
-version: 0.13.4
+version: 0.14.0
 # This is the version number of the application being deployed. This version number should be
 # incremented each time you make changes to the application. Versions are not expected to
 # follow Semantic Versioning. They should reflect the version the application is using.
 # It is recommended to use it with quotes.
-appVersion: "v0.13.4"
+appVersion: "v0.14.0"
