Releases: kai-scheduler/KAI-Scheduler

v0.16.0

24 Jun 16:26
181e80d

Added

  • Added a bounded scenario generator portfolio for reclaim, preempt, and consolidation search, with SchedulingShard.spec.scenarioSearchBudgets time-budget configuration and production scenario-search metrics.
  • Added an opt-in deviceaccess admission plugin (--block-nvidia-visible-devices, config field admission.blockNvidiaVisibleDevices, default disabled) that (1) rejects pods overriding the NVIDIA_VISIBLE_DEVICES environment variable with values other than void/none (or via a valueFrom reference), and (2) injects NVIDIA_VISIBLE_DEVICES=void into containers that do not request a GPU, blocking their access to GPUs on the node.
  • Added support for configuring admission Pod Disruption Budget via Helm values (admission.podDisruptionBudget) #1490 dttung2905
  • Added an opt-in hamicore binder plugin (depends on gpusharing) to write the HAMI-core GPU memory limit (CUDA_DEVICE_MEMORY_LIMIT) for fractional GPU pods.
  • Added global.podSecurityContext, global.resourceReservation.namespaceLabels, nodescaleadjuster.labels, crdupgrader.resources, topologyMigration.resources, and postCleanup.resources to the Helm chart.
  • Added a skill to capture and run snapshots.
  • Added kaiConfigDeployer.enabled Helm value (default true) to allow disabling the post-install/post-upgrade hook that applies the kai-config CR, for managing the CR outside of the chart.
  • Added defaultShard.enabled Helm value (default true) to allow installing KAI without deploying the chart-managed default SchedulingShard CR.
  • Added NUMA-aware scheduling (v1). A new numa scheduler plugin predicts the kubelet Topology Manager's admission verdict for the single-numa-node and restricted policies from NodeResourceTopology (NRT) data, to prevent TopologyAffinityErrors. The plugin is opt-in per shard (enable the numa plugin in a SchedulingShard). Shipped alongside it is an optional per-node NUMA placement exporter DaemonSet that reads the kubelet podresources API and exports each pod's observed NUMA placement. This v1 covers feasibility filtering and per-zone correctness; node scoring/optimization is future work. See the design for details: NUMA-Aware Scheduling via NodeResourceTopology.
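
The two rules of the deviceaccess admission plugin described above can be sketched roughly as follows. This is an illustrative model only, not the plugin's code: the EnvVar and Container types, the requests_gpu flag, and the choice to leave an existing void/none value untouched are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

ENV_NAME = "NVIDIA_VISIBLE_DEVICES"
ALLOWED_VALUES = {"void", "none"}


@dataclass
class EnvVar:
    name: str
    value: Optional[str] = None
    value_from: Optional[dict] = None  # stands in for a valueFrom reference


@dataclass
class Container:
    name: str
    env: list = field(default_factory=list)
    requests_gpu: bool = False  # hypothetical flag; a real plugin inspects the pod spec


def validate(containers):
    """Rule 1: return a rejection reason, or None if the pod is admitted."""
    for c in containers:
        for e in c.env:
            if e.name != ENV_NAME:
                continue
            if e.value_from is not None:
                return f"{c.name}: {ENV_NAME} set via valueFrom"
            if e.value not in ALLOWED_VALUES:
                return f"{c.name}: {ENV_NAME}={e.value!r} is not void/none"
    return None


def mutate(containers):
    """Rule 2: inject NVIDIA_VISIBLE_DEVICES=void into containers without a GPU request."""
    for c in containers:
        if c.requests_gpu:
            continue
        if any(e.name == ENV_NAME for e in c.env):
            continue  # assumption: an already-validated void/none value is kept as-is
        c.env.append(EnvVar(ENV_NAME, "void"))
    return containers
```

In this sketch validation runs over every container, while injection only touches containers that do not request a GPU.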

Changed

  • Scoped admission runtimeClassName injection to GPU fraction pods only; whole-GPU pods are no longer mutated. admission.gpuPodRuntimeClassName is deprecated in favor of admission.gpuFractionRuntimeClassName. Reservation pod runtimeClassName now defaults to empty. #1543 davidLif
  • Removed redundant PodDisruptionBudgetImplemented guard from operator PDB creation helper #1613 dttung2905
  • Updated Go toolchain and base build images to v1.26.3.
  • Breaking: A JobSet now produces a single PodGroup with a two-level SubGroup hierarchy (one parent SubGroup per replicatedJob, one leaf SubGroup per replica), regardless of startupPolicyOrder. The kai.scheduler/batch-min-member annotation on the JobSet now overrides the root minSubGroup; the same annotation on replicatedJobs[].template.metadata.annotations overrides the leaf minMember (defaulting to template.spec.parallelism). #1617 davidLif
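
As a rough illustration of the JobSet hierarchy described above, the sketch below derives parent and leaf SubGroups from a JobSet-shaped dictionary. The SubGroup type, the SubGroup names, and the handling of an unset root minSubGroup (left as None here, since the notes do not state its default) are assumptions for the example, not the grouper's actual behavior.

```python
from dataclasses import dataclass
from typing import Optional

MIN_MEMBER_ANNOTATION = "kai.scheduler/batch-min-member"


@dataclass
class SubGroup:
    name: str                          # hypothetical naming, for illustration only
    parent: Optional[str] = None
    min_member: Optional[int] = None   # set on leaf SubGroups


def build_subgroups(jobset: dict):
    """Return (root_min_sub_group, subgroups) for a JobSet-like dict."""
    annotations = jobset.get("metadata", {}).get("annotations", {})
    root_override = annotations.get(MIN_MEMBER_ANNOTATION)
    # Assumption: without the annotation the root value is left unset here.
    root_min_sub_group = int(root_override) if root_override is not None else None

    subgroups = []
    for rj in jobset["spec"]["replicatedJobs"]:
        template = rj["template"]
        leaf_annotations = template.get("metadata", {}).get("annotations", {})
        leaf_override = leaf_annotations.get(MIN_MEMBER_ANNOTATION)
        parallelism = template["spec"].get("parallelism", 1)
        min_member = int(leaf_override) if leaf_override is not None else parallelism

        # One parent SubGroup per replicatedJob ...
        subgroups.append(SubGroup(name=rj["name"]))
        # ... and one leaf SubGroup per replica beneath it.
        for i in range(rj.get("replicas", 1)):
            subgroups.append(
                SubGroup(name=f"{rj['name']}-{i}", parent=rj["name"], min_member=min_member)
            )
    return root_min_sub_group, subgroups
```

The per-replicatedJob annotation only affects the leaves under that replicatedJob; the JobSet-level annotation only affects the root.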

Fixed

  • Reduced scheduler heap retention after scheduling cycles by clearing completed session snapshots and callback references, and by releasing the node scoring pool without waiting for finalizers.
  • Fixed Helm chart Prometheus RBAC being installed even when prometheus.enabled is false, and the kai-prometheus ClusterRoleBinding referencing the prometheus ServiceAccount in the hardcoded kai-scheduler namespace instead of the Helm release namespace #1684 dttung2905
  • Fixed post-delete cleanup hook hardcoding the kai-scheduler namespace instead of using the Helm release namespace on helm uninstall #1619 dttung2905
  • Improved solver performance in some large reclaim scenarios #1627 itsomri
  • Grove grouper now sets minSubGroup (equal to the number of child SubGroups) instead of minMember=0 on parent SubGroups generated from topologyConstraintGroupConfigs #1639 davidLif
  • Fixed Helm chart not wiring podgrouper.queueLabelKey into spec.global.queueLabelKey on the Config CR, so custom queue label keys were ignored at install time #1655 dttung2905
  • Fixed scheduler nil-pointer panic in the preempt scenario builder when a (partial) job has no tasks to allocate (NewIdleGpusFilter dereferenced a nil scenario); added the missing nil-guard matching the sibling filters #1664 sam-huang1223
  • Fixed default node-scale-adjuster image name (node-scale-adjuster → nodescaleadjuster) so it matches the image published to GHCR
  • Fixed duplicate GPU reservation pods being created for a single gpu-group on a node (each reserving a different physical GPU), which corrupted the scheduler's fractional-GPU accounting and left devices unschedulable. Reservation pods are now named deterministically per (node, gpu-group) and treat AlreadyExists as success, so concurrent or retried binds collide on one object instead of duplicating #1673
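
The reservation-pod fix above relies on two ideas: a name derived only from (node, gpu-group), and treating AlreadyExists as success. A minimal sketch of that pattern follows, using an in-memory stand-in for the API server. The naming scheme, FakeCluster, and the AlreadyExists class are hypothetical and exist only to make the example self-contained.

```python
import hashlib


class AlreadyExists(Exception):
    """Stands in for the API server's AlreadyExists error."""


class FakeCluster:
    """In-memory stand-in for pod creation; pod names are unique keys."""

    def __init__(self):
        self.pods = {}

    def create_pod(self, name, spec):
        if name in self.pods:
            raise AlreadyExists(name)
        self.pods[name] = spec


def reservation_pod_name(node: str, gpu_group: str) -> str:
    # Hypothetical naming scheme: the release notes do not give the real format.
    digest = hashlib.sha256(f"{node}/{gpu_group}".encode()).hexdigest()[:10]
    return f"gpu-reservation-{digest}"


def ensure_reservation(cluster, node: str, gpu_group: str) -> str:
    """Create the reservation pod if needed; a name collision counts as success."""
    name = reservation_pod_name(node, gpu_group)
    try:
        cluster.create_pod(name, {"node": node, "gpuGroup": gpu_group})
    except AlreadyExists:
        pass  # a concurrent or retried bind already created the same object
    return name
```

Because the name depends only on its inputs, two binds for the same gpu-group on the same node always target one object, so at most one reservation pod can exist for it.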

v0.15.3

22 Jun 11:43
c80751e

What's Changed

Fixed

Changed

  • Expand k8s support testing to 1.28+ (v0.15 backport) by @enoodle in #1686
  • Drop github.com/NVIDIA/gpu-operator dependency (backport) by @KaiPilotBot in #1710

Full Changelog: v0.15.2...v0.15.3

v0.14.6

22 Jun 11:46
fb1a6fb

What's Changed

Fixed

Changed

Full Changelog: v0.14.5...v0.14.6

v0.12.22

22 Jun 12:07
bf47e10

What's Changed

Fixed

Changed

  • Drop github.com/NVIDIA/gpu-operator dependency v0.12 by @SiorMeir in #1712

Full Changelog: v0.12.21...v0.12.22

v0.15.2

10 Jun 11:15
ffee198

What's Changed

Fixed

  • Improved solver performance in some large reclaim scenarios #1627 itsomri
  • Grove grouper now sets minSubGroup (equal to the number of child SubGroups) instead of minMember=0 on parent SubGroups generated from topologyConstraintGroupConfigs #1639 davidLif

[v0.15.1] - 2026-06-01

Changed

  • Updated Go toolchain and base build images to v1.26.3.

Full Changelog: v0.15.1...v0.15.2

v0.14.5

02 Jun 07:27
9b97b61

What's Changed

  • Improved solver performance in some large reclaim scenarios #1627 itsomri
  • Improved reclaim, preempt, and consolidation performance by skipping solver work for jobs blocked by victim-invariant pre-predicate failures such as missing PVCs, missing required ConfigMaps, and tasks larger than any node. #1502
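
The idea behind the victim-invariant check above is that some failures cannot be fixed by evicting other workloads, so running the solver for those jobs is wasted work. The sketch below models the three examples named in the notes. The data shapes (dicts of resource quantities, sets of PVC and ConfigMap names) and the function names are assumptions for illustration and do not reflect the scheduler's internal types.

```python
def fits_some_node(request, node_capacities):
    """True if at least one node's total capacity covers the request."""
    return any(
        all(capacity.get(resource, 0) >= amount for resource, amount in request.items())
        for capacity in node_capacities
    )


def victim_invariant_failure(job, node_capacities, pvcs, configmaps):
    """Return a reason that no choice of victims can fix, or None."""
    for task in job["tasks"]:
        for pvc in task.get("pvcs", []):
            if pvc not in pvcs:
                return f"missing PVC {pvc}"
        for cm in task.get("configmaps", []):
            if cm not in configmaps:
                return f"missing ConfigMap {cm}"
        # Compare against total node capacity, not free capacity: evicting
        # victims can free resources but cannot make a node larger.
        if not fits_some_node(task["request"], node_capacities):
            return "task larger than any node"
    return None


def jobs_worth_solving(jobs, node_capacities, pvcs, configmaps):
    """Keep only the jobs for which reclaim/preempt/consolidation could help."""
    return [
        job for job in jobs
        if victim_invariant_failure(job, node_capacities, pvcs, configmaps) is None
    ]
```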

Full Changelog: v0.14.4...v0.14.5

v0.15.1

01 Jun 07:04
6069ab8

What's Changed

  • Updated Go toolchain and base build images to v1.26.3

Full Changelog: v0.15.0...v0.15.1

v0.14.4

01 Jun 07:00
1fb6b71

What's Changed

Full Changelog: v0.14.3...v0.14.4

v0.12.21

01 Jun 06:58
7291c20

What's Changed

  • fix(scheduler): backport victim-invariant prefilter to v0.12 by @enoodle in #1569
  • chore: Update go version in v0.12 by @itsomri in #1609

Full Changelog: v0.12.20...v0.12.21

v0.9.19

01 Jun 06:58
c78793b

What's Changed

  • Updated Go toolchain and base build images to v1.25.10.

Full Changelog: v0.9.18...v0.9.19