Releases · kai-scheduler/KAI-Scheduler

24 Jun 16:26

davidLif

v0.16.0

181e80d

v0.16.0 Latest


Added

Added a bounded scenario generator portfolio for reclaim, preempt, and consolidation search, with SchedulingShard.spec.scenarioSearchBudgets time-budget configuration and production scenario-search metrics.
Added an opt-in deviceaccess admission plugin (--block-nvidia-visible-devices, config field admission.blockNvidiaVisibleDevices, default disabled) that (1) rejects pods overriding the NVIDIA_VISIBLE_DEVICES environment variable with values other than void/none (or via a valueFrom reference), and (2) injects NVIDIA_VISIBLE_DEVICES=void into containers that do not request a GPU, blocking their access to GPUs on the node.
Added support for configuring admission Pod Disruption Budget via Helm values (admission.podDisruptionBudget) #1490 dttung2905
Added an opt-in hamicore binder plugin (depends on gpusharing) to write the HAMI-core GPU memory limit (CUDA_DEVICE_MEMORY_LIMIT) for fractional GPU pods.
Added global.podSecurityContext, global.resourceReservation.namespaceLabels, nodescaleadjuster.labels, crdupgrader.resources, topologyMigration.resources, and postCleanup.resources to the Helm chart.
Added a skill to capture and run snapshots.
Added kaiConfigDeployer.enabled Helm value (default true) to allow disabling the post-install/post-upgrade hook that applies the kai-config CR, for managing the CR outside of the chart.
Added defaultShard.enabled Helm value (default true) to allow installing KAI without deploying the chart-managed default SchedulingShard CR.
Added NUMA-aware scheduling (v1). A new numa scheduler plugin predicts the kubelet Topology Manager's admission verdict for the single-numa-node and restricted policies from NodeResourceTopology (NRT) data, to prevent TopologyAffinityErrors. The plugin is opt-in per shard (enable the numa plugin in a SchedulingShard). Shipped alongside it is an optional per-node NUMA placement exporter DaemonSet that reads the kubelet podresources API and exports each pod's observed NUMA placement. This is v1: it covers feasibility filtering and per-zone correctness; node scoring/optimization is future work. See the design for details: NUMA-Aware Scheduling via NodeResourceTopology.
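The two rules of the deviceaccess admission plugin described above can be sketched as follows. This is a minimal illustration, not the plugin's code: the EnvVar and Container types and the validate and mutate functions are hypothetical stand-ins for the Kubernetes objects the real plugin handles. It also assumes that the override check applies to every container and that injection is skipped when the variable is already set; the release notes do not state either detail.

```go
package main

import "fmt"

const visibleDevicesEnv = "NVIDIA_VISIBLE_DEVICES"

// EnvVar is a simplified, hypothetical stand-in for a container env entry.
type EnvVar struct {
	Name    string
	Value   string
	FromRef bool // true when the value comes from a valueFrom reference
}

// Container is a simplified, hypothetical stand-in for a pod container.
type Container struct {
	Name        string
	Env         []EnvVar
	RequestsGPU bool
}

// validate rejects a container that sets NVIDIA_VISIBLE_DEVICES to anything
// other than void/none, or that sets it through a valueFrom reference.
func validate(c Container) error {
	for _, e := range c.Env {
		if e.Name != visibleDevicesEnv {
			continue
		}
		if e.FromRef {
			return fmt.Errorf("container %q sets %s via valueFrom", c.Name, visibleDevicesEnv)
		}
		if e.Value != "void" && e.Value != "none" {
			return fmt.Errorf("container %q sets %s=%q", c.Name, visibleDevicesEnv, e.Value)
		}
	}
	return nil
}

// mutate injects NVIDIA_VISIBLE_DEVICES=void into a container that does not
// request a GPU. Assumption: an existing entry is left untouched.
func mutate(c Container) Container {
	if c.RequestsGPU {
		return c
	}
	for _, e := range c.Env {
		if e.Name == visibleDevicesEnv {
			return c
		}
	}
	// Copy the slice so the caller's backing array is never modified.
	env := make([]EnvVar, len(c.Env), len(c.Env)+1)
	copy(env, c.Env)
	c.Env = append(env, EnvVar{Name: visibleDevicesEnv, Value: "void"})
	return c
}

func main() {
	fmt.Println(mutate(Container{Name: "sidecar"}).Env)
	bad := Container{Name: "bad", Env: []EnvVar{{Name: visibleDevicesEnv, Value: "all"}}}
	fmt.Println(validate(bad))
}
```

In this sketch validation and mutation are separate functions, so a caller can reject a pod before any container is modified.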

Changed

Scoped admission runtimeClassName injection to GPU fraction pods only; whole-GPU pods are no longer mutated. admission.gpuPodRuntimeClassName is deprecated in favor of admission.gpuFractionRuntimeClassName. Reservation pod runtimeClassName now defaults to empty. #1543 davidLif
Removed redundant PodDisruptionBudgetImplemented guard from operator PDB creation helper #1613 dttung2905
Updated Go toolchain and base build images to v1.26.3.
Breaking: The PodGroup produced for a JobSet is now a single PodGroup per JobSet with a two-level SubGroup hierarchy (one parent SubGroup per replicatedJob, one leaf SubGroup per replica), regardless of startupPolicyOrder. The kai.scheduler/batch-min-member annotation on the JobSet now overrides the root minSubGroup; the same annotation on replicatedJobs[].template.metadata.annotations overrides the leaf minMember (defaulting to template.spec.parallelism). #1617 davidLif
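The two-level JobSet hierarchy described in the breaking change above can be pictured with a short sketch. The ReplicatedJob and SubGroup types, the buildSubGroups function, and the "<replicatedJob>-<index>" leaf naming are all hypothetical. The leaf default (parallelism) and the annotation override come from the note above; setting each parent's MinSubGroup to its replica count is an assumption made only for illustration.

```go
package main

import "fmt"

// ReplicatedJob is a simplified, hypothetical view of a JobSet replicatedJob.
type ReplicatedJob struct {
	Name        string
	Replicas    int
	Parallelism int
	// MinMemberOverride models the kai.scheduler/batch-min-member annotation
	// on the replicatedJob template; nil means the annotation is absent.
	MinMemberOverride *int
}

// SubGroup is a simplified, hypothetical PodGroup SubGroup.
type SubGroup struct {
	Name        string
	Parent      string // empty for a parent SubGroup
	MinMember   int    // used by leaf SubGroups
	MinSubGroup int    // used by parent SubGroups
}

// buildSubGroups emits one parent SubGroup per replicatedJob followed by one
// leaf SubGroup per replica.
func buildSubGroups(jobs []ReplicatedJob) []SubGroup {
	var out []SubGroup
	for _, rj := range jobs {
		// Assumption: a parent requires all of its replicas.
		out = append(out, SubGroup{Name: rj.Name, MinSubGroup: rj.Replicas})

		minMember := rj.Parallelism // default stated in the release note
		if rj.MinMemberOverride != nil {
			minMember = *rj.MinMemberOverride
		}
		for i := 0; i < rj.Replicas; i++ {
			out = append(out, SubGroup{
				Name:      fmt.Sprintf("%s-%d", rj.Name, i), // hypothetical naming
				Parent:    rj.Name,
				MinMember: minMember,
			})
		}
	}
	return out
}

func main() {
	for _, sg := range buildSubGroups([]ReplicatedJob{
		{Name: "workers", Replicas: 2, Parallelism: 4},
		{Name: "driver", Replicas: 1, Parallelism: 1},
	}) {
		fmt.Printf("%+v\n", sg)
	}
}
```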

Fixed

Reduced scheduler heap retention after scheduling cycles by clearing completed session snapshots and callback references, and by releasing the node scoring pool without waiting for finalizers.
Fixed Helm chart prometheus RBAC being installed even when prometheus.enabled is false, and the kai-prometheus ClusterRoleBinding referencing the prometheus ServiceAccount in the hardcoded kai-scheduler namespace instead of the Helm release namespace #1684 dttung2905
Fixed post-delete cleanup hook hardcoding the kai-scheduler namespace instead of the Helm release namespace on helm uninstall #1619 dttung2905
Improved solver performance in some large reclaim scenarios #1627 itsomri
Grove grouper now sets minSubGroup (equal to the number of child SubGroups) instead of minMember=0 on parent SubGroups generated from topologyConstraintGroupConfigs #1639 davidLif
Fixed Helm chart not wiring podgrouper.queueLabelKey into spec.global.queueLabelKey on the Config CR, so custom queue label keys were ignored at install time #1655 dttung2905
Fixed scheduler nil-pointer panic in the preempt scenario builder when a (partial) job has no tasks to allocate (NewIdleGpusFilter dereferenced a nil scenario); added the missing nil-guard matching the sibling filters #1664 sam-huang1223
Fixed default node-scale-adjuster image name (node-scale-adjuster → nodescaleadjuster) so it matches the image published to GHCR
Fixed duplicate GPU reservation pods being created for a single gpu-group on a node (each reserving a different physical GPU), which corrupted the scheduler's fractional-GPU accounting and left devices unschedulable. Reservation pods are now named deterministically per (node, gpu-group) and treat AlreadyExists as success, so concurrent or retried binds collide on one object instead of duplicating #1673
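The duplicate-reservation fix above relies on a common idempotent-create pattern: derive the object name from its key, then treat an "already exists" response as success. The sketch below illustrates that pattern only. fakeAPI, errAlreadyExists, reservationPodName, ensureReservationPod, and the name format are hypothetical; a real client would typically check the error with apierrors.IsAlreadyExists from k8s.io/apimachinery rather than a local sentinel.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical sentinel standing in for the API
// server's AlreadyExists error.
var errAlreadyExists = errors.New("already exists")

// fakeAPI is an in-memory stand-in for the pod API, keyed by pod name.
type fakeAPI struct {
	pods map[string]bool
}

// Create fails with errAlreadyExists when a pod of that name is present.
func (f *fakeAPI) Create(name string) error {
	if f.pods[name] {
		return errAlreadyExists
	}
	f.pods[name] = true
	return nil
}

// reservationPodName derives a deterministic name from (node, gpu-group).
// The format is an illustrative assumption, not the scheduler's actual one.
func reservationPodName(node, gpuGroup string) string {
	return fmt.Sprintf("gpu-reservation-%s-%s", node, gpuGroup)
}

// ensureReservationPod creates the reservation pod and treats AlreadyExists
// as success, so concurrent or retried binds converge on a single object.
func ensureReservationPod(api *fakeAPI, node, gpuGroup string) error {
	err := api.Create(reservationPodName(node, gpuGroup))
	if errors.Is(err, errAlreadyExists) {
		return nil
	}
	return err
}

func main() {
	api := &fakeAPI{pods: map[string]bool{}}
	// A retried bind for the same (node, gpu-group) must not add a second pod.
	fmt.Println(ensureReservationPod(api, "node-a", "group-1"))
	fmt.Println(ensureReservationPod(api, "node-a", "group-1"))
	fmt.Println(len(api.pods))
}
```

With a random or generated name, each retry would create a new object. Making the name a pure function of the key lets the API server's uniqueness check do the deduplication.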


22 Jun 11:43

SiorMeir

v0.15.3

c80751e

v0.15.3

What's Changed

Fixed

Account for native sidecar requests in pod resources by @KaiPilotBot in #1633
Align node-scale-adjuster default image name with GHCR by @KaiPilotBot in #1682
Backport kai-config upgrade fix to v0.15 by @gshaibi in #1704
Backport nil-scenario preempt guard to v0.15 by @enoodle in #1699

Changed

Expand k8s support testing to 1.28+ (v0.15 backport) by @enoodle in #1686
Drop github.com/NVIDIA/gpu-operator dependency (backport) by @KaiPilotBot in #1710

Full Changelog: v0.15.2...v0.15.3

Contributors

enoodle, gshaibi, and KaiPilotBot


22 Jun 11:46

SiorMeir

v0.14.6

fb1a6fb

v0.14.6

What's Changed

Fixed

Account for native sidecar requests in pod resources by @KaiPilotBot in #1632
Align node-scale-adjuster default image name with GHCR by @KaiPilotBot in #1681
Prevent duplicate reservation pods for a single gpu-group (#1693) [v0.14] by @gshaibi in #1703
Backport nil-scenario preempt guard to v0.14 by @enoodle in #1700
Backport kai-config upgrade fix to v0.14 by @gshaibi in #1705

Changed

Expand k8s support testing to 1.28+ (v0.14 backport) by @enoodle in #1685
Drop github.com/NVIDIA/gpu-operator dependency by @KaiPilotBot in #1709

Full Changelog: v0.14.5...v0.14.6

Contributors

enoodle, gshaibi, and KaiPilotBot


22 Jun 12:07

SiorMeir

v0.12.22

bf47e10

v0.12.22

What's Changed

Fixed

Account for native sidecar requests in pod resources by @KaiPilotBot in #1631
Align node-scale-adjuster default image name with GHCR by @KaiPilotBot in #1680
Prevent duplicate reservation pods for a single gpu-group (#1693) [v0.12] by @gshaibi in #1702
Backport kai-config upgrade fix to v0.12 by @gshaibi in #1706

Changed

Drop github.com/NVIDIA/gpu-operator dependency v0.12 by @SiorMeir in #1712

Full Changelog: v0.12.21...v0.12.22

Contributors

gshaibi, SiorMeir, and KaiPilotBot


10 Jun 11:15

enoodle

v0.15.2

ffee198

v0.15.2

What's Changed

Fixed

Improved solver performance in some large reclaim scenarios #1627 itsomri
Grove grouper now sets minSubGroup (equal to the number of child SubGroups) instead of minMember=0 on parent SubGroups generated from topologyConstraintGroupConfigs #1639 davidLif

[v0.15.1] - 2026-06-01

Changed

Updated Go toolchain and base build images to v1.26.3.

Full Changelog: v0.15.1...v0.15.2


02 Jun 07:27

itsomri

v0.14.5

9b97b61

v0.14.5

What's Changed

Improved solver performance in some large reclaim scenarios #1627 itsomri
Improved reclaim, preempt, and consolidation performance by skipping solver work for jobs blocked by victim-invariant pre-predicate failures such as missing PVCs, missing required ConfigMaps, and tasks larger than any node. #1502
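The victim-invariant prefilter mentioned above skips solver work when a job would fail no matter which pods are evicted. The sketch below shows the idea for two of the listed causes (a missing PVC and a task larger than any node). Task, Cluster, victimInvariantFailure, and shouldSkipSolver are hypothetical names, and the sketch assumes every task of the job must be placed, which simplifies the real gang-scheduling semantics.

```go
package main

import "fmt"

// Task is a simplified, hypothetical pending task.
type Task struct {
	Name string
	GPUs int
	PVCs []string
}

// Cluster is a simplified, hypothetical cluster snapshot.
type Cluster struct {
	NodeGPUs     []int           // total GPU capacity of each node
	ExistingPVCs map[string]bool // PVC names present in the namespace
}

// victimInvariantFailure returns a non-empty reason when the task cannot be
// scheduled regardless of which running pods are evicted.
func victimInvariantFailure(t Task, c Cluster) string {
	for _, pvc := range t.PVCs {
		if !c.ExistingPVCs[pvc] {
			return "missing PVC " + pvc
		}
	}
	largest := 0
	for _, g := range c.NodeGPUs {
		if g > largest {
			largest = g
		}
	}
	if t.GPUs > largest {
		return fmt.Sprintf("task needs %d GPUs, largest node has %d", t.GPUs, largest)
	}
	return ""
}

// shouldSkipSolver reports whether reclaim/preempt/consolidation search can
// be skipped because eviction cannot make the job schedulable.
func shouldSkipSolver(tasks []Task, c Cluster) bool {
	for _, t := range tasks {
		if victimInvariantFailure(t, c) != "" {
			return true
		}
	}
	return false
}

func main() {
	c := Cluster{NodeGPUs: []int{4, 8}, ExistingPVCs: map[string]bool{"data": true}}
	fmt.Println(shouldSkipSolver([]Task{{Name: "big", GPUs: 16}}, c))
	fmt.Println(shouldSkipSolver([]Task{{Name: "ok", GPUs: 8, PVCs: []string{"data"}}}, c))
}
```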

Full Changelog: v0.14.4...v0.14.5


01 Jun 07:04

itsomri

v0.15.1

6069ab8

v0.15.1

What's Changed

Updated Go toolchain and base build images to v1.26.3

Full Changelog: v0.15.0...v0.15.1


01 Jun 07:00

itsomri

v0.14.4

1fb6b71

v0.14.4

What's Changed

chore: Update go version in v0.14 by @itsomri in #1611

Full Changelog: v0.14.3...v0.14.4

Contributors

itsomri


01 Jun 06:58

itsomri

v0.12.21

7291c20

v0.12.21

What's Changed

fix(scheduler): backport victim-invariant prefilter to v0.12 by @enoodle in #1569
chore: Update go version in v0.12 by @itsomri in #1609

Full Changelog: v0.12.20...v0.12.21

Contributors

enoodle and itsomri


01 Jun 06:58

itsomri

v0.9.19

c78793b

v0.9.19

What's Changed

Updated Go toolchain and base build images to v1.25.10.

Full Changelog: v0.9.18...v0.9.19

