Releases: kai-scheduler/KAI-Scheduler
v0.14.2
What's Changed
Added
Changed
- Suppressed noisy Reconciler error logs and PodGrouperWarning events on transient PodGroup update conflicts. The podgrouper now treats IsConflict errors as expected and silently requeues the reconcile instead of surfacing the apiserver's "object has been modified" message.
Fixed
- Fixed kai-operator not reconciling on Prometheus and ServiceMonitor changes. The Config controller now watches owned Prometheus and ServiceMonitor resources, so deletions and drift trigger reconciliation. CRD presence is checked at startup against the API server (the scheme-only check used previously could not detect missing CRDs), and the watch is registered only when the CRDs are installed. #877
Full Changelog: v0.14.1...v0.14.2
v0.12.20
What's Changed
Added
Changed
- Suppressed noisy Reconciler error logs and PodGrouperWarning events on transient PodGroup update conflicts. The podgrouper now treats IsConflict errors as expected and silently requeues the reconcile instead of surfacing the apiserver's "object has been modified" message.
Fixed
- Fixed kai-operator not reconciling on Prometheus and ServiceMonitor changes. The Config controller now watches owned Prometheus and ServiceMonitor resources, so deletions and drift trigger reconciliation. CRD presence is checked at startup against the API server (the scheme-only check used previously could not detect missing CRDs), and the watch is registered only when the CRDs are installed. #877
Full Changelog: v0.12.19...v0.12.20
v0.6.20
What's Changed
Changed
- Suppressed noisy Reconciler error logs and PodGrouperWarning events on transient PodGroup update conflicts. The podgrouper now treats IsConflict errors as expected and silently requeues the reconcile instead of surfacing the apiserver's "object has been modified" message.
Full Changelog: v0.6.19...v0.6.20
v0.4.20
What's Changed
- fix(scheduler): bind plugin server to localhost by @gshaibi in #998
- ci: add approval gatekeeper workflow for external contributor PRs (#973) by @gshaibi in #1007
- chore: auto-resolve CHANGELOG.md merge conflicts with union strategy by @KaiPilotBot in #1056
- chore(deps): bump github.com/NVIDIA/go-nvml from 0.12.4-1 to 0.13.0-1 by @dependabot[bot] in #1065
- chore(deps): bump knative.dev/serving from 0.44.0 to 0.48.1 by @dependabot[bot] in #1071
- chore(deps): bump github.com/gin-contrib/pprof from 1.5.2 to 1.5.3 by @dependabot[bot] in #1134
- chore(deps): bump github.com/grafana/pyroscope-go from 1.2.1 to 1.2.7 by @dependabot[bot] in #1133
- chore(deps): bump github.com/onsi/gomega from 1.38.2 to 1.39.1 by @dependabot[bot] in #1137
- chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.0 to 2.28.1 by @dependabot[bot] in #1201
- chore(deps): bump google.golang.org/grpc from 1.77.0 to 1.79.2 by @dependabot[bot] in #1194
- fix(scheduler): Do not include resources with a count of 0. by @KaiPilotBot in #1141
- Add dco github action by @KaiPilotBot in #1269
- chore(deps): bump google.golang.org/grpc from 1.79.2 to 1.79.3 by @dependabot[bot] in #1262
- ci: Skip dco checkout for dependabot PRs by @KaiPilotBot in #1276
- build: upgrade Go to 1.25.6, golangci-lint to v2.11.3, controller-gen to v0.20.1 - v0.4 by @davidLif in #1284
- ci: auto-pass DCO check for dependabot on merge_group events by @KaiPilotBot in #1288
- chore(deps): bump github.com/gin-gonic/gin from 1.10.0 to 1.12.0 by @dependabot[bot] in #1259
- ci: Skip DCO check in merge queue commits, due to github shenanigans by @KaiPilotBot in #1296
- ci: Do not skip the DCO action with github action level ifs, but adjust with bash if and an exclude pattern by @KaiPilotBot in #1304
- fix: v0.4- pod invariant predicate implement by @enoodle in #1542
Full Changelog: v0.4.19...v0.4.20
v0.9.17
What's Changed
Changed
- Suppressed noisy Reconciler error logs and PodGrouperWarning events on transient PodGroup update conflicts. The podgrouper now treats IsConflict errors as expected and silently requeues the reconcile instead of surfacing the apiserver's "object has been modified" message.
Full Changelog: v0.9.16...v0.9.17
v0.14.1
What's Changed
Added
- Track memory usage and action duration in snapshot tool v0.14 by @itsomri in #1414
- Allpodsmap caching by @itsomri in #1413
- Min member override by @itsomri in #1381
Fixed
- Check active BindRequests before deleting reservation pods by @KaiPilotBot in #1363
- Account for device count in multi-device GPU memory quota check by @KaiPilotBot in #1376
- Add resourceclaims/binding RBAC for DRA granular status authorization by @KaiPilotBot in #1379
- Call SetNode once per session v0.14 by @itsomri in #1422
- Added missing PVs to snapshot v0.14 by @itsomri in #1425
- Exponential job solver v0.14 by @itsomri in #1442
- Podgroup name for distributed batch jobs v0.14 by @itsomri in #1444
- Do not assume dra claims for completed/failure pods [v0.14] by @davidLif in #1459
- imagePullSecrets fixes by @enoodle in #1470
- Fixed propagation of priorityClass and preemptibility by @KaiPilotBot in #1461
Full Changelog: v0.14.0...v0.14.1
v0.12.19
What's Changed
Fixed
- Do not include resources with a count of 0. by @KaiPilotBot in #1142
- Add resourceclaims/binding RBAC for DRA granular status authorization by @KaiPilotBot in #1377
- Fixed account for device count in multi-device GPU memory quota check by @enoodle in #1391
- Check active BindRequests before deleting reservation pods by @enoodle in #1387
- Do not assume dra claims for completed/failure pods by @davidLif in #1457
- imagePullSecrets fixes by @enoodle in #1468
- fix: propagate priorityClass, preemptibility by @SiorMeir in #1479
Full Changelog: v0.12.18...v0.12.19
v0.9.16
What's Changed
Added
Fixed
- Do not include resources with a count of 0. by @KaiPilotBot in #1139
- Fixed flaky subgroups e2e (v0.9) by @enoodle in #1474
- imagePullSecrets fixes by @enoodle in #1467
- Account for device count in multi-device GPU memory quota check by @enoodle in #1392
- Check active BindRequests before deleting reservation pods by @enoodle in #1388
Full Changelog: v0.9.15...v0.9.16
v0.6.19
v0.14.0
What's Changed
Added
- Added queue validation webhook to queuecontroller with optional quota validation for parent-child relationships AdheipSingh
- Added support for VPA configuration for the different components of the KAI Scheduler - jrosenboimnvidia
- Users that have VPA installed on their cluster can now utilize it for proper vertical autoscaling
- Added FOSSA scanning for the repository context. Scans will also be performed for submitted PRs. The results can be found here. #1178 - davidLif
- Added support for Ray subgroup topology-aware scheduling by specifying kai.scheduler/topology, kai.scheduler/topology-required-placement, and kai.scheduler/topology-preferred-placement annotations.
- Allow subgroups to have a 0 value for "minAvailable". This means that all pods in this subgroup are "elastic extra pods". #1216 davidLif
Changed
- Auto-enable leader election when operator.replicaCount > 1 to prevent concurrent reconciliation #1218
- Update Go version to v1.26.1, with appropriate upgrades to the base docker images, linter, and controller generator. #1222 - davidLif
Fixed
- Updated resource enumeration logic to exclude resources with count of 0. #1120
- Fixed scheduler on k8s < 1.34 with DRA disabled.
- Fixed pod group controller failing to track DRA GPU resources on Kubernetes 1.32-1.33 clusters. #1214
- Fixed scheduling-constraints signature hashing for Priority and container HostPort by encoding full int32 values, preventing byte-truncation collisions and flaky signature tests.
- Fixed rollback in scheduling simulations with DRA #1168 itsomri
- Fixed a potential state corruption in DRA scheduling simulations #1219 itsomri
- Fixed operator reconcile loop caused by status-only updates triggering re-reconciliation. #1229 cypres
- Fixed scheduler not starting on k8s clusters with DRA disabled, due to the ResourceSliceTracker not syncing. #1241 cypres
- Fixed webhook reconcile loop on AKS, by retaining the cloud-provider-injected namespaceSelector rules during reconciliation. #1292 cypres
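The signature-hashing fix listed above concerns byte truncation of int32 values. A minimal sketch of why that causes collisions follows. It assumes SHA-256 and big-endian encoding purely for illustration; the function names `hashTruncated` and `hashFull` are hypothetical, and the scheduler's real hashing code may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hashTruncated hashes only the low byte of v, mimicking an encoder that
// truncates an int32 to a single byte.
func hashTruncated(v int32) [32]byte {
	return sha256.Sum256([]byte{byte(v)})
}

// hashFull hashes all four bytes of v in big-endian order.
func hashFull(v int32) [32]byte {
	var buf [4]byte
	binary.BigEndian.PutUint32(buf[:], uint32(v))
	return sha256.Sum256(buf[:])
}

func main() {
	// 336 = 80 + 256, so both values share the same low byte.
	a, b := int32(80), int32(336)
	fmt.Println("truncated hashes equal:", hashTruncated(a) == hashTruncated(b))
	fmt.Println("full hashes equal:", hashFull(a) == hashFull(b))
}
```

Any two values that differ by a multiple of 256 are indistinguishable once truncated to one byte, which is how two distinct priorities or host ports could produce the same signature. Encoding the full four bytes removes that class of collision.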
New Contributors
- @rich7420 made their first contribution in #816
- @Ronkahn21 made their first contribution in #821
- @faizan-exe made their first contribution in #913
- @lalitadithya made their first contribution in #954
- @steved made their first contribution in #972
- @yuanchen8911 made their first contribution in #1035
- @Hagay-RunAI made their first contribution in #1115
- @dougnd made their first contribution in #1123
- @rueian made their first contribution in #1125
- @JRosenboimNVIDIA made their first contribution in #1119
- @itayvallach made their first contribution in #1176
- @david-gang made their first contribution in #1223
- @cypres made their first contribution in #1241
- @AdheipSingh made their first contribution in #857
Full Changelog: v0.13.4...v0.14.0