All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Added `enabled` Helm values for `binder`, `podgrouper`, `podgroupcontroller`, `queuecontroller`, `admission`, and `scheduler` to allow disabling individual components from values.yaml. Previously these were hardcoded to `true` in the kai-config template.
- Added `prometheus.enabled` and `prometheus.externalPrometheusUrl` Helm values to configure Prometheus from values.yaml #907
- Added validation for subgroup name in podgroup faizanexe
- Added memory profile and run duration to snapshot tool #1411
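For illustration, a minimal values.yaml sketch combining the per-component `enabled` flags and the `prometheus.*` keys described above; the exact nesting should be checked against the chart's values.yaml, and the Prometheus URL is a placeholder:

```yaml
# Hypothetical values.yaml excerpt; key layout assumed from the entries above.
binder:
  enabled: true
podgrouper:
  enabled: true
podgroupcontroller:
  enabled: true
queuecontroller:
  enabled: true
admission:
  enabled: true
scheduler:
  enabled: false   # example: disable a single component instead of the previously hardcoded "true"

prometheus:
  enabled: false
  externalPrometheusUrl: "http://prometheus.monitoring.svc:9090"   # placeholder address
```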
- Added support for configuring pod and container security contexts on resource reservation pods via CLI flags AdheipSingh
- Added `operator.logLevel` Helm value to configure the operator log level (maps to `--zap-log-level` when set) #1446 dttung2905
- The scheduler now implements elastic PodGroups on both the subgroup level (`minSubGroup`) and the pod level (`minAvailable`), allowing elasticity across the entire PodGroup tree hierarchy. #1416 - davidLif
- Allow the configuration of plugins in the binder service. #1480 - davidLif
- Added support for configuring the scheduler log level and custom scheduler args via Helm values (`scheduler.args`) #1452 dttung2905
- Added `crdupgrader.image.registry` Helm value to override `global.registry` for the `crd-upgrader` pre-install/pre-upgrade hook image, allowing the hook image to be served from a separate mirror without redirecting all chart images. #1404
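A sketch of the log-level, scheduler-args, and crd-upgrader registry values discussed above; the key names come from the entries, while the argument list format and the mirror address are assumptions:

```yaml
# Hypothetical values.yaml excerpt; values are placeholders.
operator:
  logLevel: 3   # forwarded as --zap-log-level when set

scheduler:
  args:
    - "--v=4"   # illustrative extra scheduler argument

crdupgrader:
  image:
    registry: "mirror.example.com"   # overrides global.registry for the crd-upgrader hook image only
```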
- Breaking: JobSet PodGroups no longer auto-calculate `minAvailable` from `parallelism × replicas`. The default is now 1. Use the `kai.scheduler/batch-min-member` annotation to set a custom value.
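A sketch of overriding the gang size on a JobSet via the annotation named above; placing the annotation on the workload's metadata and the value `"4"` are illustrative:

```yaml
# Hypothetical JobSet excerpt; spec abbreviated for brevity.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example-jobset
  annotations:
    kai.scheduler/batch-min-member: "4"   # gang size is no longer derived from parallelism × replicas
spec:
  # replicatedJobs: ...
```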
- Bumped the `k8s.io/*` module group from v0.34.x to v0.35.4, `k8s.io/kubernetes` to v1.35.4, and `sigs.k8s.io/controller-runtime` to v0.23.3, enabling KEP-4671 Workload API types. #1466
- Rebuilt the `crd-upgrader` hook image on `alpine:3.20` instead of `ubi9/ubi-minimal`. Image size drops from ~165 MB to ~67 MB uncompressed (~60% reduction), shrinking cold-pull latency on ephemeral CI runners. The image is also reused by the `topology-migration` and `post-delete` hook jobs as a generic `kubectl + bash` toolbox, so bash is preserved on the runtime image. #1404
- Fixed `additionalImagePullSecrets` in the Config CR rendering as `map[name:...]` instead of plain strings by extracting `.name` from `global.imagePullSecrets` objects. Also propagated `global.imagePullSecrets` to all Helm hook jobs (`crd-upgrader`, `topology-migration`, `post-delete-cleanup`)
- Added `global.nodeSelector`, `global.tolerations`, `global.affinity`, and `global.securityContext` support to the post-delete job hook.
- Fixed the Helm template writing `imagesPullSecret` (string) instead of `additionalImagePullSecrets` (array) in the Config CR, causing image pull secrets to be silently ignored. Added a backward-compatible, deprecated `imagesPullSecret` field to the CRD schema. #942
- Fixed the `windowSize` field in the `SchedulingShard` CR to support the Prometheus duration format (e.g. `1w`, `7d`). Previously, using `windowSize: 1w` as shown in the documentation caused the kai-operator to crash-loop with `time: unknown unit "w" in duration "1w"`.
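An illustrative SchedulingShard snippet for the duration-format fix above; the apiVersion and the field's exact position within the spec are assumptions, only the duration format is the point:

```yaml
# Hypothetical SchedulingShard excerpt; apiVersion and field placement assumed.
apiVersion: kai.scheduler/v1
kind: SchedulingShard
metadata:
  name: default
spec:
  windowSize: 1w   # Prometheus duration format (e.g. 1w, 7d) is now accepted
```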
- Fixed a race condition where `SyncForGpuGroup` could prematurely delete reservation pods when the informer cache had not yet propagated GPU group labels on recently-bound fraction pods. The binder now checks for active BindRequests referencing the GPU group before deleting a reservation pod.
- Fixed non-preemptible multi-device GPU memory jobs being allowed to exceed their queue's deserved GPU quota. The per-node quota check now correctly accounts for all requested GPU devices. #1369
- Added the `resourceclaims/binding` RBAC permission to the binder ClusterRole for compatibility with Kubernetes v1.36+, where the `DRAResourceClaimGranularStatusAuthorization` feature gate requires explicit permission on the `resourceclaims/binding` subresource to modify `status.allocation` and `status.reservedFor` on ResourceClaims. #1372 praveen0raj
- Allow users to override minMember for k8s batch Jobs and JobSets using the `kai.scheduler/batch-min-member` annotation #1308 itsomri
- Fixed a bug where a nil minMember caused subgroup creation to fail in the scheduler #1407 itsomri
- Improved performance by evaluating SetNode once per session instead of on each predicate evaluation #1421 itsomri
- Added persistent volumes to cluster snapshot #1424 itsomri
- Improved scheduling performance for preempt/reclaim/consolidate actions on jobs with many tasks by replacing per-task linear probing with exponential+binary search in the job solver, reducing the number of scenario simulations from O(n) to O(log n) #1435 itsomri
- Avoid expensive solver-backed reclaim/preempt/consolidation work for jobs already blocked by victim-invariant pre-solver failures such as missing PVCs, missing required ConfigMaps, or requests larger than the maximum node size. #1502
- Fixed `skipTopOwnerGrouper` not propagating per-type defaults (priority class and preemptibility) for skipped owners (e.g. `DynamoGraphDeployment`), causing the PodGroup spec to retain stale values after defaults ConfigMap updates.
- Fixed binder DRA detection on clusters where the upstream `DynamicResourceAllocation` feature gate does not reflect server-side DRA availability. The binder now probes the API server during init (matching the scheduler) so the DRA plugin is gated on the same authoritative decision. #1481
- Suppressed noisy `Reconciler error` logs and `PodGrouperWarning` events on transient PodGroup update conflicts. The podgrouper now treats `IsConflict` errors as expected and silently requeues the reconcile instead of surfacing the apiserver's "object has been modified" message.
- Fixed kai-operator not reconciling on Prometheus and ServiceMonitor changes. The Config controller now watches owned `Prometheus` and `ServiceMonitor` resources, so deletions and drift trigger reconciliation. CRD presence is checked at startup against the API server (the scheme-only check used previously could not detect missing CRDs), and the watch is registered only when the CRDs are installed. #877
- Added `before-hook-creation` to the `crd-upgrader` Helm hook delete policy so failed hook Jobs no longer block subsequent `helm upgrade --install` retries. Aligns with the policy already used by the chart's other hook resources. #1404
- Fixed kai-operator leader-election event emission by adding RBAC permission for core `events` (`create`, `patch`, `update`) so operators can publish leadership events instead of logging `events is forbidden`. #1572 dttung2905
- Added queue validation webhook to queuecontroller with optional quota validation for parent-child relationships AdheipSingh
- Added support for VPA configuration for the different components of the KAI Scheduler - jrosenboimnvidia
- Users that have VPA installed on their cluster can now utilize it for proper vertical autoscaling
- Added FOSSA scanning for the repository context. Scans will also be performed for submitted PRs. The results can be found here. #1178 - davidLif
- Added support for Ray subgroup topology-aware scheduling by specifying the `kai.scheduler/topology`, `kai.scheduler/topology-required-placement`, and `kai.scheduler/topology-preferred-placement` annotations.
- Allow subgroups to have a 0 value for `minAvailable`. This means that all pods in this subgroup are "elastic extra pods". #1216 davidLif
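A sketch of the Ray topology annotations described above, placed on a worker group's pod template metadata; the annotation values (topology name and level label keys) are illustrative:

```yaml
# Hypothetical pod template metadata excerpt for a Ray worker group.
metadata:
  annotations:
    kai.scheduler/topology: "network-topology"                                  # topology to schedule against
    kai.scheduler/topology-required-placement: "kubernetes.io/hostname"         # level that must be satisfied
    kai.scheduler/topology-preferred-placement: "topology.kubernetes.io/zone"   # level to prefer when possible
```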
- Added a display web page for Scale test results for public viewing #1154 SiorMeir
- Auto-enable leader election when `operator.replicaCount` > 1 to prevent concurrent reconciliation #1218
- Updated the Go version to v1.26.1, with appropriate upgrades to the base Docker images, linter, and controller generator. #1222 - davidLif
- Updated resource enumeration logic to exclude resources with count of 0. #1120
- Fixed scheduler on k8s < 1.34 with DRA disabled.
- Fixed pod group controller failing to track DRA GPU resources on Kubernetes 1.32-1.33 clusters. #1214
- Fixed scheduling-constraints signature hashing for `Priority` and container `HostPort` by encoding full `int32` values, preventing byte-truncation collisions and flaky signature tests.
- Fixed rollback in scheduling simulations with DRA #1168 itsomri
- Fixed a potential state corruption in DRA scheduling simulations #1219 itsomri
- Fixed operator reconcile loop caused by status-only updates triggering re-reconciliation. #1229 cypres
- Fixed scheduler not starting on k8s clusters with DRA disabled, due to the ResourceSliceTracker not syncing. #1241 cypres
- Fixed webhook reconcile loop on AKS, by retaining the cloud-provider-injected namespaceSelector rules during reconciliation. #1292 cypres
- Added a `minSubGroup` field to the PodGroup and SubGroup API to support specifying the minimum number of child SubGroups required for elastic gang scheduling, along with validation to prevent simultaneous use of the `minSubGroup` and `minMember` fields (#TBD) by KAI Dev Agent
- Added `global.nodeSelector` propagation from Helm values to the Config CR, ensuring operator-created sub-component deployments (admission, binder, scheduler, pod-grouper, etc.) receive the configured nodeSelector #1102 yuanchen8911
- Added `plugins` and `actions` fields to the SchedulingShard spec, allowing per-shard customization of scheduler plugin/action enablement, priority, and arguments gshaibi
- Added support for Kubeflow Trainer v2 TrainJob workloads via the skipTopOwner grouper pattern
- Added a `binder.cdiEnabled` Helm value to allow explicit override of CDI auto-detection in environments without a ClusterPolicy
- Added a metric for tracking evicted pods in pod groups, including nodepool, eviction action, and gang size
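For illustration, a values.yaml sketch combining two of the keys above (`global.nodeSelector` propagation and `binder.cdiEnabled`); the selector and flag values are placeholders:

```yaml
# Hypothetical values.yaml excerpt.
global:
  nodeSelector:
    node-role.kubernetes.io/infra: ""   # propagated to operator-created sub-component deployments

binder:
  cdiEnabled: true   # explicit override of CDI auto-detection, e.g. when no ClusterPolicy exists
```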
- Block scheduling of pods with shared (non-template) DRA GPU claims that lack a queue label or have a mismatched queue label gshaibi
- Added the option to disable prometheus service monitor creation #810 itsomri
- Fixed prometheus instance deprecation - ensure single instance #779 itsomri
- Added clear error messages for jobs referencing missing or orphan queues, reporting via events and conditions #820 gshaibi
- Added rule selector for resource accounting prometheus #818 itsomri
- Made accounting labels configurable #818 itsomri
- Added support for Grove hierarchical topology constraints in PodGroup subgroups
- Added support for n-level queue hierarchies #858 gshaibi
- Added labels and annotations propagation from topOwner in SkipTopOwner grouper #861 SiorMeir
- Added scheduler name match conditions to admission webhooks to improve cluster stability
- Added GPU DRA claims and ResourceSlices accounting for the purpose of resource management and quota guarantees. Note: this change does not support shared GPU claims or GPU claims with `FirstAvailable`. #900 davidLif
- Added DRA resources recording to snapshot #830
- Temporarily prevent device-plugin GPU pods from being scheduled on DRA-only nodes, until translation between device-plugin notation and DRA is implemented
- Implemented subgroups for pytorchjobs #935 itsomri
- Made KAI images distroless #745 dttung2905
- Allow setting empty gpuPodRuntimeClassName during helm install #972 steved
- Created scale tests scenarios for running scale tests for KAI #967
- Implemented block-level segmentation for pytorchjobs #938 itsomri
- Added scale test environment setup script and updated service monitors for KAI scheduler #1031
- Implemented subgroups for leaderworkerset #1046 davidLif
- Added discovery data to snapshot for more accurate debugging #1047 itsomri
- Implemented subgroup segmentation (with topology segment definitions) for leaderworkerset #1058 davidLif
- Fixed operator status conditions to be kstatus-compatible for Helm 4 `--wait` support: added a `Ready` condition and fixed the `Reconciling` condition to properly transition to false after reconciliation completes #1060
- Fixed a bug where the node scale adjuster would not check whether a pod was unschedulable before creating a scaling pod, leading to unnecessary node scaling #1094 slaupster
- Fixed admission webhook to skip runtimeClassName injection when gpuPodRuntimeClassName is empty #1035
- Fixed the topology-migration Helm hook failing on OpenShift due to the missing `kai-topology-migration` service account in the `kai-system` SCC #1050
- Fixed a bug where queue status did not correctly reflect the resources of its podgroups #1049
- Fixed helm uninstall not removing webhooks #959 faizan-exe
- Fixed security vulnerability where PodGang could reference pods in other namespaces, preventing cross-namespace manipulation
- Fixed pod controller logging to use request namespace/name instead of empty pod object fields when pod is not found
- Fixed a bug where topology constraints with equal required and preferred levels would cause the preferred level not to be found.
- Fixed GPU memory pods Fair Share and Queue Order calculations
- Interpret negative or zero half-life value as disabled #818 itsomri
- Handle invalid CSI StorageCapacities gracefully #817 rich7420
- Embed CRD definitions in binary for env-test and time-aware-simulations to allow binary portability #818 itsomri
- Fixed missing `podGrouper` configuration in the Helm template that prevented podgrouper values from being applied #860
- Fixed rollback for failed bind attempts #847 itsomri
- Fixed missing `namespace`, `serviceAccountName`, and `appLabel` fields in the `resourceReservation` section of the kai-config Helm template #860 dttung2905
- If a preferred topology constraint is set, do not try to find a lowest common subtree (as part of the calculation optimizations) that is lower than the preferred level
- Added a dedicated `usage-prometheus` service for scheduler Prometheus access with a configurable instance name #896 itsomri
- Fixed ClusterPolicy CDI parsing for gpu-operator > v25.10.0
- Fixed missing `repository`, `tag`, and `pullPolicy` fields in the `resourceReservationImage` section of the kai-config Helm template #895 dttung2905
- Fixed a bug in Ray gang scheduling where not all worker groups' minMember would be respected #924 itsomri
- Fixed CPU-only node calculation in DRA-enabled clusters #944
- Fixed the enable-DRA flag override in the snapshot-tool #955
- Fixed ConfigMap predicate to respect the Optional field and now considers ConfigMaps in projected volumes and ephemeral containers
- Fixed simulations that failed due to pod capacity on node #969 itsomri
- Fixed a bug where some resource claims would remain marked as bound to devices forever
- Removed the constraint that prohibited direct nesting of subgroups alongside podsets within the same subgroupset.
- Fixed plugin server (snapshot and job-order endpoints) listening on all interfaces by binding to localhost only.
- Removed the redundant `connection` field from `GlobalConfig` in favor of `Prometheus.ExternalPrometheusUrl` for external Prometheus URL configuration
- Introduced native KAI Topology CRD to replace dependency on Kueue's Topology CRD, improving compatibility and simplifying installation
- Added support for having the default "preemptibility" per top-owner-type read from the default configs configmap in the pod-grouper
- Added option to profile CPU when running the snapshot tool #726 itsomri
- GPU resource bookkeeping for DRA enabled resources
- Add a "tumbling window" usage configuration - calculate a tumbling window size based on a start timne configuration and a duration config field.
- Added an option to disable prometheus persistency #764 itsomri
- If enabled, prometheus storage size is not inferred from cluster objects, but defaults to 50Gi unless explicitly set in KAI config #756 itsomri
- When prometheus is disabled, it will remain in the cluster for a grace period equal to its retention, unless re-enabled #756 itsomri
- Fixed a bug where the snapshot tool would not load topology objects #720 itsomri
- Operator to conditionally watch ClusterPolicy based on its existence, preventing errors in its absence
- Fixed confusing resource division log message #733 itsomri
- Made post-delete-cleanup resources configurable #737 dttung2905
- Fixed GPU memory pods not being reclaimed or consolidated correctly
- Added missing leases permission for the operator #753 dttung2905
- Fixed reclaim/preempt/consolidate actions for topology workloads #739 itsomri
- Fixed a bug where the scheduler would not consider topology constraints when calculating the scheduling constraints signature #761 gshaibi
- Fixed Dynamo integration by adding Dynamo GVKs to SkipTopOwner table
- Keep creating service monitors for deprecated prometheus instances #774 itsomri
- Fix retention duration parsing for deprecated prometheus instances #774 itsomri
- Renamed the previous "tumbling" option for the scheduler usage window type to "cron".
- Removed the requirement to specify container type for init container gpu fractions #684 itsomri
- When a status update for a podGroup in the scheduler is flushed due to update conflict, delete the update payload data as well #691 davidLif
- Fixed scheduler pod group status update conflict #676 davidLif
- Fixed gpu request validations for pods #660 itsomri
- Dependabot configuration to update actions in workflows #651 ScottBrenner
- optimize dependency management by using module cache instead of vendor directory #645 lokielse
- Added parent reference to SubGroup struct in PodGroup CRD to create a hierarchical SubGroup structure
- Added the option to configure the names of the webhook configuration resources.
- Option to configure reservation pods runtime class.
- Added a tool to run time-aware fairness simulations over multiple cycles (see Time-Aware Fairness Simulator)
- Added enforcement of the `nvidia` runtime class for GPU pods, with the option to enforce a custom runtime class or disable enforcement entirely.
- Added a preferred podAntiAffinity term by default for all services; it can be made required instead by setting `global.requireDefaultPodAffinityTerm`
- Added support for service-level affinities
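A minimal sketch of turning the default preferred podAntiAffinity term into a required one via the value named above:

```yaml
# Hypothetical values.yaml excerpt.
global:
  requireDefaultPodAffinityTerm: true   # default anti-affinity term becomes required instead of preferred
```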
- Added time aware scheduling capabilities
- Added option to specify container name and type for fraction containers
- (OpenShift only) Fixed high CPU usage for the operator pod caused by continuous reconciles
- Fixed a bug where the scheduler would not re-try updating podgroup status after failure
- Fixed a bug where Ray workload gang scheduling would ignore `minReplicas` if autoscaling was not set
- Fixed wrong KAI Config statuses when the prometheus operand is enabled
- GPU-Operator v25.10.0 support for CDI enabled environments
- Option to configure reservation pods runtime class.
- Fixed Helm chart compatibility with Helm 4 wait logic to prevent indefinite hangs during deployment readiness checks
- Added the option of providing the podgrouper app a scheme object to use
- config.kai.scheduler CRD that will describe the installation of all KAI-scheduler services for the operator
- Initial KAI-operator implementation for managing components
- PodGroup Controller, Queue Controller, Admission and Scale Adjuster operands to operator lifecycle management
- Deployment of operator in Helm chart alongside pod group controller
- Deploy PodGroup Controller, Queue Controller, Admission and Scale Adjuster via operator for streamlined deployment
- schedulingshards.kai.scheduler CRD that describes partitioning of the cluster nodes for different scheduling options.
- Moved the CRDs into the helm chart so that they are also installed by helm and not only by the crd-upgrader, but removed the external kueue clone of topology CRD from being automatically installed.
- Updated queue controller image name to align with current deployment standards
- Removed webhook manager component as part of operator-based refactoring
- Added configurable plugins hub for podgrouper using interface and RegisterPlugins
- Added a plugin to reflect job order in the scheduler HTTP endpoint - Contributed by Saurabh Kumar Singh singh1203.ss@gmail.com
- Fixed a bug where workload with subgroups would not consider additional tasks above minAvailable
- Removed unused code that required gpu-operator as a dependency
- Fixed wrong GPU memory unit conversion from node `nvidia.com/gpu.memory` labels
- Fixed incorrect MIG GPU usage calculation leading to wrong scheduling decisions
- Added a new scheduler flag, `--update-pod-eviction-condition`. When enabled, a DisruptionTarget condition is set on the pod before deletion
- Fixed scheduler panic in some elastic reclaim scenarios
- Added leader election configuration in all deployments and added global helm value that controls it during installation
- Separated admission webhooks from the binder service into a separate `kai-admission` service
- crd-upgrader respects global values for nodeSelector, affinity and tolerations
- kai-scheduler no longer ignores the pod `spec.overhead` field
- Fixed container env var overwrite to cover cases where an env var with Value is replaced with ValueFrom, or the other way around
- Fixed a scenario where only GPU resources were checked for job and node, causing the job to be bound instead of being pipelined
- Added GPU_PORTION env var for GPU sharing pods
- Fixed a miscalculation where cpu/memory releasing resources were considered idle when requesting GPU fraction/memory
- Changed RUNAI-VISIBLE-DEVICES key in GPU sharing configmap to NVIDIA_VISIBLE_DEVICES
- Removed GPU sharing configmap name resolution from env vars and volumes
- Added LeaderWorkerSet support in the podGrouper. Each replica will be given a separate podGroup.
- Added kueue topology CRD to kai installations
- Fixed cases where reclaim validation operated on outdated info, allowing invalid reclaim scenarios
- Added optional pod and namespace label selectors to limit the scope of monitored pods
- Added a plugin extension point for scheduler plugins to add annotations to BindRequests
- Added support for Grove
- Changed `run.ai/top-owner-metadata` to `kai.scheduler/top-owner-metadata`
- Changed the `runai-reservation` namespace to `kai-resource-reservation`. For a migration guide, refer to this doc
- Changed the `runai/queue` label key to `kai.scheduler/queue`. For a migration guide, refer to doc
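A sketch of the renamed queue label on a workload's pod metadata; the queue name is a placeholder:

```yaml
# Hypothetical pod metadata excerpt.
metadata:
  labels:
    kai.scheduler/queue: team-a   # previously runai/queue
```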
- Fixed pod status scheduled race condition between the scheduler and the pod binding
- Removed the redundant `replicas` key for the binder from `values.yaml` as it is not used and not supported
- Removed the `runai-job-id` and `runai/job-id` annotations from pods and podgroups
- Added minruntime plugin, allowing PodGroups to run for a configurable amount of time without being reclaimed/preempted.
- PodGroup Controller that will update podgroups statuses with allocation data.
- Queue Controller that will update queues statuses with allocation data.
- Added support for k8s pod scheduling gates
- nodeSelector, affinity and tolerations configurable with global value definitions
- Added `PreemptMinRuntime` and `ReclaimMinRuntime` properties to the queue CRD
- Scheduler now adds a `LastStartTimestamp` to the podgroup on allocation
- Queue order function now takes into account potential victims, resulting in better reclaim scenarios.
- Fixed preempt/reclaim of elastic workloads only taking one pod.
- Scheduler no longer sets the nodepool label on pods when the nodepool label value is empty