Releases: kai-scheduler/KAI-Scheduler
v0.16.0
Added
- Added a bounded scenario generator portfolio for reclaim, preempt, and consolidation search, with `SchedulingShard.spec.scenarioSearchBudgets` time-budget configuration and production scenario-search metrics.
- Added an opt-in `deviceaccess` admission plugin (`--block-nvidia-visible-devices`, config field `admission.blockNvidiaVisibleDevices`, default disabled) that (1) rejects pods overriding the `NVIDIA_VISIBLE_DEVICES` environment variable with values other than `void`/`none` (or via a `valueFrom` reference), and (2) injects `NVIDIA_VISIBLE_DEVICES=void` into containers that do not request a GPU, blocking their access to GPUs on the node.
- Added support for configuring admission Pod Disruption Budget via Helm values (`admission.podDisruptionBudget`) #1490 dttung2905
- Added an opt-in `hamicore` binder plugin (depends on `gpusharing`) to write the HAMI-core GPU memory limit (`CUDA_DEVICE_MEMORY_LIMIT`) for fractional GPU pods.
- Added `global.podSecurityContext`, `global.resourceReservation.namespaceLabels`, `nodescaleadjuster.labels`, `crdupgrader.resources`, `topologyMigration.resources`, and `postCleanup.resources` to the Helm chart.
- Added a skill to capture and run snapshots.
- Added `kaiConfigDeployer.enabled` Helm value (default `true`) to allow disabling the post-install/post-upgrade hook that applies the kai-config CR, for managing the CR outside of the chart.
- Added `defaultShard.enabled` Helm value (default `true`) to allow installing KAI without deploying the chart-managed default `SchedulingShard` CR.
- Added NUMA-aware scheduling (v1). A new `numa` scheduler plugin predicts the kubelet Topology Manager's admission verdict for the `single-numa-node` and `restricted` policies from `NodeResourceTopology` (NRT) data, to prevent `TopologyAffinityError`s. The plugin is opt-in per shard (enable the `numa` plugin in a `SchedulingShard`). Shipped alongside it is an optional per-node NUMA placement exporter DaemonSet that uses the kubelet podresources API to export each pod's observed NUMA placement. This is v1: it covers feasibility filtering and per-zone correctness; node scoring/optimization is future work. See the design for details: NUMA-Aware Scheduling via NodeResourceTopology.
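The two rules described for the `deviceaccess` admission plugin can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the plugin's code: `Pod`, `Container`, `EnvVar`, `validate`, and `mutate` are hypothetical stand-ins for the Kubernetes types and webhook handlers, and skipping injection when a container already sets the variable is an assumption.

```go
package main

import "fmt"

// EnvVar, Container and Pod are hypothetical, trimmed-down stand-ins for the
// Kubernetes core/v1 types; they exist only for this sketch.
type EnvVar struct {
	Name      string
	Value     string
	ValueFrom bool // true when the value comes from a valueFrom reference
}

type Container struct {
	Name        string
	Env         []EnvVar
	RequestsGPU bool
}

type Pod struct {
	Containers []Container
}

const visibleDevices = "NVIDIA_VISIBLE_DEVICES"

// validate rejects any override of NVIDIA_VISIBLE_DEVICES that is supplied
// through valueFrom or whose literal value is neither "void" nor "none".
func validate(p Pod) error {
	for _, c := range p.Containers {
		for _, e := range c.Env {
			if e.Name != visibleDevices {
				continue
			}
			if e.ValueFrom {
				return fmt.Errorf("container %q sets %s via valueFrom", c.Name, visibleDevices)
			}
			if e.Value != "void" && e.Value != "none" {
				return fmt.Errorf("container %q sets %s=%q", c.Name, visibleDevices, e.Value)
			}
		}
	}
	return nil
}

// mutate injects NVIDIA_VISIBLE_DEVICES=void into containers that do not
// request a GPU. Leaving an existing value untouched is an assumption.
func mutate(p *Pod) {
	for i := range p.Containers {
		c := &p.Containers[i]
		if c.RequestsGPU {
			continue
		}
		alreadySet := false
		for _, e := range c.Env {
			if e.Name == visibleDevices {
				alreadySet = true
			}
		}
		if !alreadySet {
			c.Env = append(c.Env, EnvVar{Name: visibleDevices, Value: "void"})
		}
	}
}

func main() {
	pod := Pod{Containers: []Container{{Name: "cpu-only"}, {Name: "trainer", RequestsGPU: true}}}
	if err := validate(pod); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	mutate(&pod)
	for _, c := range pod.Containers {
		fmt.Println(c.Name, c.Env)
	}
}
```

In this sketch only the `cpu-only` container receives the injected variable; the GPU-requesting container is left unchanged.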
Changed
- Scoped admission `runtimeClassName` injection to GPU fraction pods only; whole-GPU pods are no longer mutated. `admission.gpuPodRuntimeClassName` is deprecated in favor of `admission.gpuFractionRuntimeClassName`. Reservation pod `runtimeClassName` now defaults to empty. #1543 davidLif
- Removed redundant `PodDisruptionBudgetImplemented` guard from operator PDB creation helper #1613 dttung2905
- Updated Go toolchain and base build images to v1.26.3.
- Breaking: The PodGroup produced for a JobSet is now a single PodGroup per JobSet with a two-level SubGroup hierarchy (one parent SubGroup per `replicatedJob`, one leaf SubGroup per replica) regardless of `startupPolicyOrder`. The `kai.scheduler/batch-min-member` annotation on the JobSet now overrides the root `minSubGroup`; the same annotation on `replicatedJobs[].template.metadata.annotations` overrides the leaf `minMember` (defaulting to `template.spec.parallelism`). #1617 davidLif
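The JobSet hierarchy in the breaking change above might be pictured with the following sketch. All types and the `buildPodGroup` helper are hypothetical, the SubGroup naming scheme is invented for illustration, and the root `minSubGroup` is left unset when the annotation is absent because the release notes do not state its default.

```go
package main

import (
	"fmt"
	"strconv"
)

const minMemberAnnotation = "kai.scheduler/batch-min-member"

// ReplicatedJob and JobSet are hypothetical, trimmed-down stand-ins for the
// JobSet API types.
type ReplicatedJob struct {
	Name        string
	Replicas    int
	Parallelism int               // template.spec.parallelism
	Annotations map[string]string // template.metadata.annotations
}

type JobSet struct {
	Name           string
	Annotations    map[string]string
	ReplicatedJobs []ReplicatedJob
}

type SubGroup struct {
	Name      string
	Parent    string // empty for parent SubGroups directly under the root
	MinMember int    // only set on leaf SubGroups in this sketch
}

type PodGroup struct {
	Name        string
	MinSubGroup *int // nil when no annotation override is present
	SubGroups   []SubGroup
}

// override parses the batch-min-member annotation, if present and numeric.
func override(annotations map[string]string) (int, bool) {
	v, ok := annotations[minMemberAnnotation]
	if !ok {
		return 0, false
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0, false
	}
	return n, true
}

// buildPodGroup creates one parent SubGroup per replicatedJob and one leaf
// SubGroup per replica, applying the annotation overrides described above.
func buildPodGroup(js JobSet) PodGroup {
	pg := PodGroup{Name: js.Name}
	if n, ok := override(js.Annotations); ok {
		pg.MinSubGroup = &n
	}
	for _, rj := range js.ReplicatedJobs {
		pg.SubGroups = append(pg.SubGroups, SubGroup{Name: rj.Name})
		leafMin := rj.Parallelism // default from template.spec.parallelism
		if n, ok := override(rj.Annotations); ok {
			leafMin = n
		}
		for i := 0; i < rj.Replicas; i++ {
			pg.SubGroups = append(pg.SubGroups, SubGroup{
				Name:      fmt.Sprintf("%s-%d", rj.Name, i), // hypothetical naming
				Parent:    rj.Name,
				MinMember: leafMin,
			})
		}
	}
	return pg
}

func main() {
	js := JobSet{
		Name:        "train",
		Annotations: map[string]string{minMemberAnnotation: "1"},
		ReplicatedJobs: []ReplicatedJob{
			{Name: "workers", Replicas: 2, Parallelism: 4},
			{Name: "driver", Replicas: 1, Parallelism: 1},
		},
	}
	pg := buildPodGroup(js)
	fmt.Println("root minSubGroup:", *pg.MinSubGroup)
	for _, sg := range pg.SubGroups {
		fmt.Printf("%s parent=%q minMember=%d\n", sg.Name, sg.Parent, sg.MinMember)
	}
}
```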
Fixed
- Reduced scheduler heap retention after scheduling cycles by clearing completed session snapshots and callback references, and by releasing the node scoring pool without waiting for finalizers.
- Fixed Helm chart prometheus RBAC always being installed even when `prometheus.enabled` is false, and the `kai-prometheus` ClusterRoleBinding referencing the `prometheus` ServiceAccount in the hardcoded `kai-scheduler` namespace instead of the Helm release namespace #1684 dttung2905
- Fixed post-delete cleanup hook hardcoding the `kai-scheduler` namespace instead of the Helm release namespace on `helm uninstall` #1619 dttung2905
- Improved solver performance in some large reclaim scenarios #1627 itsomri
- Grove grouper now sets `minSubGroup` (equal to the number of child SubGroups) instead of `minMember=0` on parent SubGroups generated from `topologyConstraintGroupConfigs` #1639 davidLif
- Fixed Helm chart not wiring `podgrouper.queueLabelKey` into `spec.global.queueLabelKey` on the Config CR, so custom queue label keys were ignored at install time #1655 dttung2905
- Fixed scheduler nil-pointer panic in the preempt scenario builder when a (partial) job has no tasks to allocate (`NewIdleGpusFilter` dereferenced a nil scenario); added the missing nil-guard matching the sibling filters #1664 sam-huang1223
- Fixed default node-scale-adjuster image name (`node-scale-adjuster` → `nodescaleadjuster`) so it matches the image published to GHCR
- Fixed duplicate GPU reservation pods being created for a single `gpu-group` on a node (each reserving a different physical GPU), which corrupted the scheduler's fractional-GPU accounting and left devices unschedulable. Reservation pods are now named deterministically per (node, gpu-group) and treat AlreadyExists as success, so concurrent or retried binds collide on one object instead of duplicating #1673
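The reservation-pod fix relies on a common idempotent-create pattern, which the sketch below illustrates. It is an assumption-laden model rather than the scheduler's code: the in-memory `store` stands in for the API server (whose create is atomic, unlike this single-goroutine map), and the hash-based naming in `reservationPodName` is hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// errAlreadyExists models the API server's AlreadyExists response.
var errAlreadyExists = errors.New("already exists")

// store is a hypothetical in-memory stand-in for the API server.
type store struct {
	pods map[string]struct{}
}

func (s *store) create(name string) error {
	if _, ok := s.pods[name]; ok {
		return errAlreadyExists
	}
	s.pods[name] = struct{}{}
	return nil
}

// reservationPodName derives the same name for the same (node, gpu-group)
// pair on every call. The exact scheme here is an illustration only.
func reservationPodName(node, gpuGroup string) string {
	sum := sha256.Sum256([]byte(node + "/" + gpuGroup))
	return "gpu-reservation-" + hex.EncodeToString(sum[:])[:10]
}

// ensureReservationPod treats AlreadyExists as success, so retried or
// concurrent binds converge on a single object.
func ensureReservationPod(s *store, node, gpuGroup string) error {
	err := s.create(reservationPodName(node, gpuGroup))
	if err != nil && !errors.Is(err, errAlreadyExists) {
		return err
	}
	return nil
}

func main() {
	s := &store{pods: map[string]struct{}{}}
	for i := 0; i < 3; i++ {
		if err := ensureReservationPod(s, "node-a", "group-1"); err != nil {
			panic(err)
		}
	}
	fmt.Println(len(s.pods)) // prints 1
}
```

With randomly generated names each retry would create a new pod; with a deterministic name the repeated creates collide on one object.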
v0.15.3
What's Changed
Fixed
- Account for native sidecar requests in pod resources by @KaiPilotBot in #1633
- Align node-scale-adjuster default image name with GHCR by @KaiPilotBot in #1682
- Backport kai-config upgrade fix to v0.15 by @gshaibi in #1704
- Backport nil-scenario preempt guard to v0.15 by @enoodle in #1699
Changed
- Expand k8s support testing to 1.28+ (v0.15 backport) by @enoodle in #1686
- Drop github.com/NVIDIA/gpu-operator dependency (backport) by @KaiPilotBot in #1710
Full Changelog: v0.15.2...v0.15.3
v0.14.6
What's Changed
Fixed
- Account for native sidecar requests in pod resources by @KaiPilotBot in #1632
- Align node-scale-adjuster default image name with GHCR by @KaiPilotBot in #1681
- Prevent duplicate reservation pods for a single gpu-group (#1693) [v0.14] by @gshaibi in #1703
- Backport nil-scenario preempt guard to v0.14 by @enoodle in #1700
- Backport kai-config upgrade fix to v0.14 by @gshaibi in #1705
Changed
- Expand k8s support testing to 1.28+ (v0.14 backport) by @enoodle in #1685
- Drop github.com/NVIDIA/gpu-operator dependency by @KaiPilotBot in #1709
Full Changelog: v0.14.5...v0.14.6
v0.12.22
What's Changed
Fixed
- Account for native sidecar requests in pod resources by @KaiPilotBot in #1631
- Align node-scale-adjuster default image name with GHCR by @KaiPilotBot in #1680
- Prevent duplicate reservation pods for a single gpu-group (#1693) [v0.12] by @gshaibi in #1702
- Backport kai-config upgrade fix to v0.12 by @gshaibi in #1706
Full Changelog: v0.12.21...v0.12.22
v0.15.2
What's Changed
Fixed
- Improved solver performance in some large reclaim scenarios #1627 itsomri
- Grove grouper now sets `minSubGroup` (equal to the number of child SubGroups) instead of `minMember=0` on parent SubGroups generated from `topologyConstraintGroupConfigs` #1639 davidLif
[v0.15.1] - 2026-06-01
Changed
- Updated Go toolchain and base build images to v1.26.3.
Full Changelog: v0.15.1...v0.15.2
v0.14.5
What's Changed
- Improved solver performance in some large reclaim scenarios #1627 itsomri
- Improved reclaim, preempt, and consolidation performance by skipping solver work for jobs blocked by victim-invariant pre-predicate failures such as missing PVCs, missing required ConfigMaps, and tasks larger than any node. #1502
Full Changelog: v0.14.4...v0.14.5
v0.15.1
What's Changed
- Updated Go toolchain and base build images to v1.26.3
Full Changelog: v0.15.0...v0.15.1
v0.14.4
v0.12.21
v0.9.19
What's Changed
- Updated Go toolchain and base build images to v1.25.10.
Full Changelog: v0.9.18...v0.9.19