|
| 1 | +## v0.10.0 |
| 2 | + |
| 3 | +Changes since `v0.9.0`: |
| 4 | + |
| 5 | +## Urgent Upgrade Notes |
| 6 | + |
| 7 | +### (No, really, you MUST read this before you upgrade) |
| 8 | + |
| 9 | +- PodSets for RayJobs now account for submitter Job when spec.submissionMode=k8sJob is used. |
| 10 | + |
| 11 | + If you use the RayJob integration, you may need to revisit your quota settings, |
| 12 | + because Kueue now accounts for the resources required by the KubeRay submitter Job |
| 13 | + when spec.submissionMode=k8sJob is used (by default 500m CPU and 200Mi memory). (#3729, @andrewsykim) |
| 14 | + - Removed the v1alpha1 Visibility API. |
| 15 | + |
| 16 | + The v1alpha1 Visibility API was deprecated and has now been removed. Please use v1beta1 instead. (#3499, @mbobrovskyi) |
| 17 | + - The InactiveWorkload reason for the Evicted condition is renamed to Deactivated. |
| 18 | + Also, the reasons for more detailed situations are renamed: |
| 19 | + - InactiveWorkloadAdmissionCheck -> DeactivatedDueToAdmissionCheck |
| 20 | + - InactiveWorkloadRequeuingLimitExceeded -> DeactivatedDueToRequeuingLimitExceeded |
| 21 | + |
| 22 | + If you were watching for the "InactiveWorkload" reason in the "Evicted" condition, you need |
| 23 | + to start watching for the "Deactivated" reason. (#3593, @mbobrovskyi) |
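
For tooling that matches on the reason string of the "Evicted" condition, the rename above can be handled with a small lookup table. This is a minimal sketch and not part of Kueue: `RENAMED_REASONS` and `normalize_reason` are hypothetical names introduced here for illustration, and the table covers only the three reasons listed above.

```python
# Hypothetical helper (not a Kueue API): maps the pre-v0.10.0 reasons of the
# Evicted condition to the names used from v0.10.0 onward.
RENAMED_REASONS = {
    "InactiveWorkload": "Deactivated",
    "InactiveWorkloadAdmissionCheck": "DeactivatedDueToAdmissionCheck",
    "InactiveWorkloadRequeuingLimitExceeded": "DeactivatedDueToRequeuingLimitExceeded",
}


def normalize_reason(reason: str) -> str:
    """Return the v0.10.0 name for a reason; unknown reasons pass through unchanged."""
    return RENAMED_REASONS.get(reason, reason)
```

Normalizing reasons this way lets a watcher handle events from clusters on either side of the upgrade with a single comparison against the new names.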
| 24 | + |
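
To get a rough sense of the quota impact of the RayJob change above, the default submitter Job resources (500m CPU and 200Mi memory) can be multiplied by the number of RayJobs expected to hold quota at the same time. This is a back-of-the-envelope sketch: the function name and the assumption that every RayJob uses spec.submissionMode=k8sJob with the default submitter resources are illustrative and not taken from Kueue.

```python
# Defaults quoted in the upgrade note above.
SUBMITTER_CPU_MILLICORES = 500
SUBMITTER_MEMORY_MIB = 200


def extra_submitter_quota(rayjob_count: int) -> tuple[int, int]:
    """Estimate the extra (CPU millicores, memory MiB) consumed by submitter Jobs.

    Assumes each RayJob uses submissionMode=k8sJob with the default submitter
    resources; customized submitter templates would change these numbers.
    """
    if rayjob_count < 0:
        raise ValueError("rayjob_count must be non-negative")
    return (
        rayjob_count * SUBMITTER_CPU_MILLICORES,
        rayjob_count * SUBMITTER_MEMORY_MIB,
    )
```

For example, ten such RayJobs would account for an additional 5000m CPU and 2000Mi memory under these assumptions.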
| 25 | +## Changes by Kind |
| 26 | + |
| 27 | +### Feature |
| 28 | + |
| 29 | +- Adds a managedJobsNamespaceSelector to the Kueue configuration that enables namespace-based control of whether Jobs submitted without a `kueue.x-k8s.io/queue-name` label are managed by Kueue for all supported Job Kinds. (#3712, @dgrove-oss) |
| 30 | +- Allow mutating the queue-name label for non-running Deployments. (#3528, @mbobrovskyi) |
| 31 | +- Allow scaling a StatefulSet down to zero and back up from zero. (#3487, @mbobrovskyi) |
| 32 | +- Extend the GenericJob interface to allow implementations of custom Job CRDs to use |
| 33 | + Topology-Aware Scheduling with rank-based ordering. (#3704, @PBundyra) |
| 34 | +- Introduce an alpha feature, behind the LocalQueueMetrics feature gate, which exposes the following Prometheus metrics for LocalQueues: |
| 35 | + local_queue_pending_workloads |
| 36 | + local_queue_quota_reserved_workloads_total |
| 37 | + local_queue_quota_reserved_wait_time_seconds |
| 38 | + local_queue_admitted_workloads_total |
| 39 | + local_queue_admission_wait_time_seconds |
| 40 | + local_queue_admission_checks_wait_time_seconds |
| 41 | + local_queue_evicted_workloads_total |
| 42 | + local_queue_reserving_active_workloads |
| 43 | + local_queue_admitted_active_workloads |
| 44 | + local_queue_status |
| 45 | + local_queue_resource_reservation |
| 46 | + local_queue_resource_usage (#3673, @KPostOffice) |
| 47 | +- Introduce LocalQueue defaulting, enabled by the LocalQueueDefaulting feature gate. |
| 48 | + When a new workload is created without the "queue-name" label, and a LocalQueue |
| 49 | + named "default" exists in the workload's namespace, then the value of the |
| 50 | + "queue-name" label is defaulted to "default". (#3610, @yaroslava-serdiuk) |
| 51 | +- Kueue-viz: A dashboard for Kueue. (#3727, @akram) |
| 52 | +- Optimize the size of the Workload object when Topology-Aware Scheduling is used, and the |
| 53 | + `kubernetes.io/hostname` is defined as the lowest Topology level. In that case the `TopologyAssignment` |
| 54 | + in the Workload's Status contains a value only for this label, rather than for all levels defined. (#3677, @PBundyra) |
| 55 | +- Promote MultiplePreemptions feature gate to stable, and drop the legacy preemption logic. (#3602, @gabesaba) |
| 56 | +- Promoted ConfigurableResourceTransformations and WorkloadResourceRequestsSummary to Beta and enabled by default. (#3616, @dgrove-oss) |
| 57 | +- ResourceFlavorSpec that defines topologyName is not immutable (#3738, @PBundyra) |
| 58 | +- Respect node taints in Topology-Aware Scheduling when the lowest topology level is kubernetes.io/hostname. (#3678, @mimowo) |
| 59 | +- Support `.featureGates` field in the configuration API to enable and disable the Kueue features (#3805, @kannon92) |
| 60 | +- Support rank-based ordering of Pods with Topology-Aware Scheduling. |
| 61 | + The Pod indexes are determined based on the "kueue.x-k8s.io/pod-group-index" label which |
| 62 | + can be set by an external controller managing the group. (#3649, @PBundyra) |
| 63 | +- TAS: Support rank-based ordering for StatefulSet. (#3751, @mbobrovskyi) |
| 64 | +- TAS: The CQ referencing a Topology is deactivated if the topology does not exist. (#3770, @mimowo) |
| 65 | +- TAS: support rank-based ordering for JobSet (#3591, @mimowo) |
| 66 | +- TAS: support rank-based ordering for Kubeflow (#3604, @mbobrovskyi) |
| 67 | +- TAS: support rank-ordering of Pods for the Kubernetes batch Job. (#3539, @mimowo) |
| 68 | +- TAS: validate that kubernetes.io/hostname can only be at the lowest level (#3714, @mbobrovskyi) |
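
The LocalQueue defaulting rule described in the Feature list above can be sketched as a small pure function. This only illustrates the rule as stated in these notes and is not Kueue's implementation: `default_queue_name` and its parameters are hypothetical, and the set of LocalQueue names is assumed to be supplied by the caller.

```python
QUEUE_NAME_LABEL = "kueue.x-k8s.io/queue-name"
DEFAULT_LOCAL_QUEUE = "default"


def default_queue_name(labels: dict[str, str], local_queues: set[str]) -> dict[str, str]:
    """Return a copy of `labels` with the queue-name label defaulted.

    `local_queues` is the set of LocalQueue names in the workload's namespace.
    The label is added only when it is missing and a LocalQueue named
    "default" exists; otherwise the labels are returned unchanged.
    """
    result = dict(labels)
    if QUEUE_NAME_LABEL not in result and DEFAULT_LOCAL_QUEUE in local_queues:
        result[QUEUE_NAME_LABEL] = DEFAULT_LOCAL_QUEUE
    return result
```

An explicitly set label is never overwritten, and nothing is added in namespaces without a LocalQueue named "default".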
| 69 | + |
| 70 | +### Bug or Regression |
| 71 | + |
| 72 | +- Added validation for Deployment queue-name to fail fast (#3555, @mbobrovskyi) |
| 73 | +- Added validation for StatefulSet queue-name to fail fast. (#3575, @mbobrovskyi) |
| 74 | +- Change, and in some scenarios fix, the status message displayed to the user when a workload doesn't fit in the available capacity. (#3536, @gabesaba) |
| 75 | +- Determine borrowing more accurately, allowing preempting workloads which fit in nominal quota to schedule faster (#3547, @gabesaba) |
| 76 | +- Fix Kueue crashing when the node for an admitted workload is deleted. (#3715, @mimowo) |
| 77 | +- Fix a bug which occasionally prevented updates to the PodTemplate of the Job on the management cluster |
| 78 | + when starting a Job (e.g. updating nodeSelectors), when the `MultiKueueBatchJobWithManagedBy` feature gate is enabled. (#3685, @IrvingMg) |
| 79 | +- Fix accounting for usage coming from TAS workloads using multiple resources. The usage was multiplied |
| 80 | + by the number of resources requested by a workload, which could result in under-utilization of the cluster. |
| 81 | + It also manifested itself in the message in the workload status which could contain negative numbers. (#3490, @mimowo) |
| 82 | +- Fix computing the topology assignment for workloads using multiple PodSets requesting the same |
| 83 | + topology. In particular, it was possible for the set of topology domains in the assignment to be empty, |
| 84 | + and as a consequence the pods would remain gated forever as the TopologyUngater would not have |
| 85 | + topology assignment information. (#3514, @mimowo) |
| 86 | +- Fix dropping of reconcile requests for non-leading replica, which was resulting in workloads |
| 87 | + getting stuck pending after the rolling restart of Kueue. (#3612, @mimowo) |
| 88 | +- Fix memory leak due to workload entries left in MultiKueue cache. The leak affects the 0.9.0 and 0.9.1 |
| 89 | + releases which enable MultiKueue by default, even if MultiKueue is not explicitly used on the cluster. (#3835, @mimowo) |
| 90 | +- Fix misleading log messages from workload_controller about a non-existent LocalQueue or |
| 91 | + ClusterQueue. For example, "LocalQueue for workload didn't exist or not active; ignored for now" |
| 92 | + could also be logged when the ClusterQueue does not exist. (#3605, @7h3-3mp7y-m4n) |
| 93 | +- Fix preemption when using Hierarchical Cohorts by considering as preemption candidates workloads |
| 94 | + from ClusterQueues located further in the hierarchy tree than direct siblings. (#3691, @gabesaba) |
| 95 | +- Fix running a Job when parallelism < completions. Before the fix, the replacement Pods for the successfully |
| 96 | + completed Pods were not ungated. (#3559, @mimowo) |
| 97 | +- Fix scheduling in TAS by considering tolerations specified in the ResourceFlavor. (#3723, @mimowo) |
| 98 | +- Fix scheduling of workload which does not include the toleration for the taint in ResourceFlavor's spec.nodeTaints, |
| 99 | + if the toleration is specified on the ResourceFlavor itself. (#3722, @PBundyra) |
| 100 | +- Fix the bug which prevented the use of MultiKueue if there is a CRD which is not installed |
| 101 | + and removed from the list of enabled integrations. (#3603, @mszadkow) |
| 102 | +- Fix the flow of deactivation for workloads due to rejected AdmissionChecks. |
| 103 | + Now, all AdmissionChecks are reset back to the Pending state on eviction (and deactivation in particular), |
| 104 | + and so an admin can easily re-activate such a workload manually without tweaking the checks. (#3350, @KPostOffice) |
| 105 | +- Fixed rolling update for StatefulSet integration (#3684, @mbobrovskyi) |
| 106 | +- Make topology levels immutable to prevent issues with inconsistent state of the TAS cache. (#3641, @mbobrovskyi) |
| 107 | +- TAS: Fixed a bug that prevented the cache from being updated when a Topology is deleted. (#3615, @mbobrovskyi) |
| 108 | + |
| 109 | +### Other (Cleanup or Flake) |
| 110 | + |
| 111 | +- Eliminate webhook validation in case Pod integration is used on 1.26 or earlier versions of Kubernetes. (#3247, @vladikkuzn) |
| 112 | +- Replace deprecated gcr.io/kubebuilder/kube-rbac-proxy with registry.k8s.io/kubebuilder/kube-rbac-proxy. (#3747, @mbobrovskyi) |